weekly Duplicate checking

  • Im looking at ways of keep duplicates on our system to a low. one way that I wanted to try was to match all the data entered in a week to the current data. and then score the data each time.

    For example my query below is what ive started but realise its much harder and involved at teh same time.

    SELECT CX.ROW_ID, CX.FST_NAME, CX.LAST_NAME, F.NAME ,'PARENT' =CASE WHEN CX.CREATED > '2006-02-01 00:00:00.000' THEN 'X' ELSE '' END

    FROM S_CONTACT CX

    JOIN FIRM F ON F.ID = CX.FIRM_ID

                  WHERE lower(CX.LAST_NAME) in (

                          SELECT Lower(C.LAST_NAME)

                          FROM CONTACT C

                          WHERE C.CREATED>'2006-02-01 00:00:00.000'

                          AND C.STATUS <> 'Marked For Archive')

    AND CX.STATUS <> 'Marked For Archive'

    Order by CX.LAST_NAME ASC,CX.CREATED desc, CX.FST_NAME

    This gives me a nice list of Contacts listing the newest one at the top and ordering the rest in a relatively easyt to use list

    Id like this to be a little smarter , ie showing only where

    A) the initial is the same

    B) if the firm name is similar (i wanted to do some kind of score based on the characters in the firm names

     

    any ideas on how I could do this

  •  

    To find out the duplicates names you should use this query:-

     

    select CX.FST_NAME+CX.LAST_NAME,count(*) num

    from table_name

    group by CX.FST_NAME+CX.LAST_NAME

    having count(*) > 1

  • Try looking at the SOUNDEX and DIFFERENCE functions.

    --Jeff

  • Thanks Amit unfortunately it never is that easy, suppose you have two people entered at the Smith & Co Company

    The First is Abbey Smith, the second is A B Smith

    This would mean that your query wouln't see that

    Further to the query I posted though I was looking at Just joining as a subquery but I can't do that either as if someone is entered on to the database and someone else with the surname in the week, the report will return 2x the number of people sharing that name

    Jeff, thanks for your note, on a much early post I did include Soundex and difference as part of the query but that seemed to hit the CPU much harder as the Names will have non Alpha numeric characters and inorder for the Query to work positively I would first strip down all the Characters in the Last/First name

    and then when it scans the whole contact table I will need to remove the characters from each of the names in turn inorder to ensure that the SOUNDEX / DIFFERENCE worked

  • One thing to do is to store the SOUNDEX of each row as a column.  Also, store the stripped down names as colums.  It's easier to amortize the cost of manipulating each name over each insert/update than doing it everytime a query is done for all records.

    --Jeff

  • Intresting, how do you mean store the value of soundex as a column ?(or did you mean store the value in a column)

    unfortunately our DBA doesn't like us writing to the database unless its through the front end so just adding a column would take at least 2 months before it actually gets done.

    time to create everything in excel then

     

  • What I mean is that as part of the code that inserts a new customer, populate a column with the SOUNDEX of the stripped down customer name.  An example script would be:

    create table Customer

    (

    CustomerID int identity(1,1),

    --REST OF CUSTOMER FIELDS

    CustomerSoundex char(4)

    )

    On the insert:

    insert customer (...column declaration...)

    values (...rest of customer data..., SOUNDEX(dbo.fnStripCustomerName(...))

    You would also have to write the function dbo.fnStripCustomerName() to take the customer name arguements and strip and concatentate them according to the rule you mentioned.

     

Viewing 7 posts - 1 through 6 (of 6 total)

You must be logged in to reply to this topic. Login to reply