I would like to get the distinct value from specified column. I would appreciate if any one provide the answer of this

  • I have table A_ having near 55 million rows. Name column having some duplicates, i would like to remove it.

    Ex.

    1. Dr Vasudev Narasima MD Vasudev Narasima MBBS

    output

    Dr Vasudev Narasima MD MBBS

    2. Ruban s Soundrapandian Ruban Soundrpandian

    output

    Ruban s Soundrapandian

    I would appreciate if any one provide the script for this.

    Regards

    Ruban

  • Please provide sample data (insert statements) for all your problem cases and the required output for all of them.

  • Here is the script

    --INPUT

    CREATE TABLE #TMP (NAME_1 VARCHAR(300))

    INSERT INTO #TMP VALUES ('DR AARON SASAKI SASAKI AARON MD')

    INSERT INTO #TMP VALUES ('DR ALBERT LUGO ALBERTO LUGO DMD PC')

    --OUTPUT

    I would like the output in following format

    DR AARON SASAKI MD

    DR ALBERT LUGO DMD PC

  • Those are all the possible cases of duplicates??? Are you sure you parsed all the data yet?

  • It's not yet all 55 million rows, it occurs near 100 K records.

    I want to remove duplicates from this kind of data (100K) out of 55 millions.

  • "This kind of data" does not really cut it as far as a spec is concerned. Also, this is not strictly duplicate data.

    So far, my spec would be along the lines of

    "Break the field into space-delimited words and remove the second and subsequent occurrences of any repeated words"

    Does that describe it?

    Of course, names such as Bruce Bruce (any Iron Maiden fans out there?) Would be corrupted by this.


  • Phil Parkin (5/26/2011)


    "This kind of data" does not really cut it as far as a spec is concerned. Also, this is not strictly duplicate data.

    So far, my spec would be along the lines of

    "Break the field into space-delimited words and remove the second and subsequent occurrences of any repeated words"

    Does that describe it?

    Of course, names such as Bruce Bruce (any Iron Maiden fans out there?) Would be corrupted by this.

    Exactly my point.

  • Ninja's_RGR'us (5/26/2011)


    Please provide sample data (insert statements) for all your problem cases and the required output for all of them.

    Heh... if there are thousands, wouldn't the script necessary to do that be the solution itself? 😉

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Sounds like what he wants is a way to find duplicate patterns of strings in a field and remove the second occurrence. Of course this would require parsing the field on each row into columns and keep the order so you can do about a zillion different combinations to see IF there are any matching patterns. Even doing it for each word would be painful and hideously slow with 100k records. With 55M it almost seems worthwhile to hire some temps and have them manually fix the records. Even IF this could be scripted, it would take forever to run and you would have to audit each record and manually review them to make sure the new values are correct. A replacement this "loose" will catch all sorts of outliers (Bruce Bruce for example) and these will have to be manually corrected again after the script incorrectly corrects them.

    I don't envy you on this project as it is going to be painful.

    _______________________________________________________________

    Need help? Help us help you.

    Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

    Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

    Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
    Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
    Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
    Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

  • Rani-434268 (5/26/2011)


    I have table A_ having near 55 million rows. Name column having some duplicates, i would like to remove it.

    Ex.

    1. Dr Vasudev Narasima MD Vasudev Narasima MBBS

    output

    Dr Vasudev Narasima MD MBBS

    2. Ruban s Soundrapandian Ruban Soundrpandian

    output

    Ruban s Soundrapandian

    I would appreciate if any one provide the script for this.

    Regards

    Ruban

    This sounds like fun. The script to do this (Phil described it best) actually isn't that complicated. Sean pointed out the tough problem... finding out if something does need to be fixed.

    With that in mind, please describe what the PK on this table looks like because the method I'm thinking of is going to need to use the PK.

    I also need to know... are there ANY triggers on the table?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (5/27/2011)


    Rani-434268 (5/26/2011)


    I have table A_ having near 55 million rows. Name column having some duplicates, i would like to remove it.

    Ex.

    1. Dr Vasudev Narasima MD Vasudev Narasima MBBS

    output

    Dr Vasudev Narasima MD MBBS

    2. Ruban s Soundrapandian Ruban Soundrpandian

    output

    Ruban s Soundrapandian

    I would appreciate if any one provide the script for this.

    Regards

    Ruban

    This sounds like fun. The script to do this (Phil described it best) actually isn't that complicated. Sean pointed out the tough problem... finding out if something does need to be fixed.

    With that in mind, please describe what the PK on this table looks like because the method I'm thinking of is going to need to use the PK.

    I also need to know... are there ANY triggers on the table?

    Really? You're going that route? Hmm, time to get busy on this myself then - I have a different idea that still needs a little fleshing out.

    Wayne
    Microsoft Certified Master: SQL Server 2008
    Author - SQL Server T-SQL Recipes


    If you can't explain to another person how the code that you're copying from the internet works, then DON'T USE IT on a production system! After all, you will be the one supporting it!
    Links:
    For better assistance in answering your questions
    Performance Problems
    Common date/time routines
    Understanding and Using APPLY Part 1 & Part 2

  • WayneS (5/27/2011)


    Jeff Moden (5/27/2011)


    Rani-434268 (5/26/2011)


    I have table A_ having near 55 million rows. Name column having some duplicates, i would like to remove it.

    Ex.

    1. Dr Vasudev Narasima MD Vasudev Narasima MBBS

    output

    Dr Vasudev Narasima MD MBBS

    2. Ruban s Soundrapandian Ruban Soundrpandian

    output

    Ruban s Soundrapandian

    I would appreciate if any one provide the script for this.

    Regards

    Ruban

    This sounds like fun. The script to do this (Phil described it best) actually isn't that complicated. Sean pointed out the tough problem... finding out if something does need to be fixed.

    With that in mind, please describe what the PK on this table looks like because the method I'm thinking of is going to need to use the PK.

    I also need to know... are there ANY triggers on the table?

    Really? You're going that route? Hmm, time to get busy on this myself then - I have a different idea that still needs a little fleshing out.

    ohhh this is starting to sound like a show down. This is going to be a good one to watch!!! Ruban sit back and watch the masters go at it. This has the markings of some pretty cool learning for the rest of us. 😉

    _______________________________________________________________

    Need help? Help us help you.

    Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

    Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

    Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
    Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
    Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
    Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

  • Sean Lange (5/27/2011)


    ohhh this is starting to sound like a show down. This is going to be a good one to watch!!! Ruban sit back and watch the masters go at it. This has the markings of some pretty cool learning for the rest of us. 😉

    :blush: Master? Not me... Jeff maybe.. but thanks for the vote of confidence!

    Well, as is usually the case, the performance of Jeff's code will usually blow mine out of the park. But, I thoroughly agree with the learning part!

    Anyway, here's my solution. It was only tested on the sample code (all 2 rows... I justdon't feel like generating the million rows of test data that the OP should be providing). Based on experience, it should perform pretty good (but like I said, Jeff has a way of doing the unexpected to totally amaze you), although it will hit the source table twice.

    Like others have mentioned... it needs to be thoroughly tested, and the edge cases (Duran Duran, etc.) will get messed up.

    As is becoming my norm... comments are in the code (wonder where I picked that up from???)

    ;

    WITH cte AS

    (

    -- Add a row number for each row.

    -- This step could be eliminated by adding an identity column

    -- to the source table.

    SELECT t.NAME_1,

    RN = ROW_NUMBER() OVER (ORDER BY (SELECT 1))

    FROM @tmp t

    ), cte2 AS

    (

    -- Break apart all of the names on the space.

    -- The latest version of the Delimited Split Function is at http://www.sqlservercentral.com/articles/Tally+Table/72993/.

    SELECT *

    FROM cte t -- change to @tmp if identity column available

    CROSS APPLY dbo.DelimitedSplit8K(t.name_1, ' ') ds

    ), cte3 AS

    (

    -- Group the results by the name,

    -- Get the lowest item for each name part.

    SELECT NAME_1,

    RN,

    Item,

    ItemNumber = MIN(ItemNumber)

    FROM cte2

    GROUP BY NAME_1, RN, Item

    )

    -- Put the remaining results back together

    SELECT t.NAME_1,

    ca.ColList

    FROM cte t -- change to @tmp if add identity column.

    CROSS APPLY (SELECT ColList = STUFF((SELECT ' ' + t2.Item

    FROM cte3 t2

    WHERE t.RN = t2.RN -- change t.rn to @tmp.identity column if available

    ORDER BY t2.RN, t2.ItemNumber

    FOR XML PATH(''),TYPE).value('.','varchar(max)'),1,1,'')) ca

    ORDER BY t.RN;

    Wayne
    Microsoft Certified Master: SQL Server 2008
    Author - SQL Server T-SQL Recipes


    If you can't explain to another person how the code that you're copying from the internet works, then DON'T USE IT on a production system! After all, you will be the one supporting it!
    Links:
    For better assistance in answering your questions
    Performance Problems
    Common date/time routines
    Understanding and Using APPLY Part 1 & Part 2

  • Jeff Moden (5/27/2011)


    Rani-434268 (5/26/2011)


    I have table A_ having near 55 million rows. Name column having some duplicates, i would like to remove it.

    Ex.

    1. Dr Vasudev Narasima MD Vasudev Narasima MBBS

    output

    Dr Vasudev Narasima MD MBBS

    2. Ruban s Soundrapandian Ruban Soundrpandian

    output

    Ruban s Soundrapandian

    I would appreciate if any one provide the script for this.

    Regards

    Ruban

    This sounds like fun. The script to do this (Phil described it best) actually isn't that complicated. Sean pointed out the tough problem... finding out if something does need to be fixed.

    With that in mind, please describe what the PK on this table looks like because the method I'm thinking of is going to need to use the PK.

    I also need to know... are there ANY triggers on the table?

    Hey, Ruban... howz 'bout some answers, please? My questions are important to helping you.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • While we're waiting for Ruban to actually show up for his own post, I went ahead and made the assumption that the table with the "duped names" actually does have a clustered primary key.

    I also just can't bring myself to test only on 2 rows. I didn't gen my normal million rows but the following will work as a small test.

    --===== Identify a nice safe place to play in that everyone has

    USE TempDB;

    --===== Conditionally drop the test table to make reruns in SSMS easier

    IF OBJECT_ID('TempDB.dbo.SomeTable','U') IS NOT NULL DROP TABLE TempDB.dbo.SomeTable;

    --===== Build the test table with some oddly duplicated names and some normal names

    SELECT SomeID = IDENTITY(INT,1,1),

    Name_1 =

    RTRIM(

    ISNULL(Title + ' ', '')

    + CASE

    WHEN ContactID % 13 = 0 --The "Bruce Bruce" syndrome

    THEN ISNULL(FirstName + ' ', '') + ISNULL(FirstName + ' ', '')

    WHEN ContactID % 11 = 0 --Partial dupe

    THEN ISNULL(FirstName + ' ', '') + ISNULL(FirstName + ' ', '')

    + ISNULL(LastName + ' ', '')

    WHEN ContactID % 7 = 0 --Reversed dupe

    THEN ISNULL(FirstName + ' ', '') + ISNULL(LastName + ' ', '')

    + ISNULL(LastName + ' ', '') + ISNULL(FirstName + ' ', '')

    WHEN ContactID % 5 = 0 --Simple dupe

    THEN ISNULL(FirstName + ' ', '') + ISNULL(FirstName + ' ', '')

    + ISNULL(LastName + ' ', '') + ISNULL(LastName + ' ', '')

    WHEN ContactID % 3 = 0 --Fully repeated dupe

    THEN ISNULL(FirstName + ' ', '') + ISNULL(MiddleName + ' ', '') + ISNULL(LastName + ' ', '')

    + ISNULL(FirstName + ' ', '') + ISNULL(MiddleName + ' ', '') + ISNULL(LastName + ' ', '')

    ELSE -- Normal

    ISNULL(FirstName + ' ', '') + ISNULL(MiddleName + ' ', '') + ISNULL(LastName + ' ', '')

    END

    + ISNULL(Suffix, '')

    )

    INTO dbo.SomeTable

    FROM AdventureWorks.Person.Contact

    ;

    --===== Create the expected PK on the test table

    ALTER TABLE dbo.SomeTable

    ADD PRIMARY KEY CLUSTERED (SomeID)

    ;

    Now for my first guess at how we can accomplish the problem. Like always, the details are in the comments in the code. This takes about 5 seconds on my 9 year old P4 including the time to print the result to the grid.

    ------------------------------------------------------------------------

    --===== Conditionally drop the work tables to make reruns in SSMS easier

    IF OBJECT_ID('TempDB..#MyHead','U') IS NOT NULL DROP TABLE #MyHead;

    IF OBJECT_ID('TempDB..#ForReview','U') IS NOT NULL DROP TABLE #ForReview;

    --===== First, we do pretty much what the name of the following cte is...

    -- We split the individual words out and tag the duplicate words with an Enum of the occurance

    -- of the word

    WITH

    cteSplitAndEnumerate AS

    (

    SELECT st.SomeID,

    split.ItemNumber,

    EnumOccurance = DENSE_RANK() OVER (PARTITION BY st.SomeID, split.Item ORDER BY split.ItemNumber),

    split.Item

    FROM dbo.SomeTable st

    CROSS APPLY dbo.DelimitedSplit8K(st.Name_1, ' ') split

    ) --=== This saves only the first occurance of each split item in a table for reconstitution later

    SELECT SomeID = ISNULL(SomeID,0), --ISNULL makes a NOT NULL column for #MyHead PK

    ItemNumber = ISNULL(ItemNumber,0), --ISNULL makes a NOT NULL column for #MyHead PK

    Item

    INTO #MyHead

    FROM cteSplitAndEnumerate

    WHERE EnumOccurance = 1

    ;

    --==== Add this PK shaves off about 30% on duration

    ALTER TABLE #MyHead

    ADD PRIMARY KEY CLUSTERED (SomeID, ItemNumber)

    ;

    --===== Again, we do just like the name of the following cte says... we reconstitute the data

    -- back into full names.

    WITH

    cteReconstitute AS

    (

    SELECT t1.SomeID,

    t1.Name_1,

    DeDupedName =

    STUFF( --===== This gets rid of the first blank

    ( --=== This reconstitutes several rows of data into just one.

    SELECT ' '+Item

    FROM #MyHead t2

    WHERE t1.SomeID = t2.SomeID

    ORDER BY ItemNumber

    FOR XML PATH(''),TYPE

    ).value('.','varchar(max)')

    ,1,1,'')

    FROM dbo.SomeTable t1

    ) --=== This saves only the items that will change in a separate table for review

    SELECT SomeID, Name_1, DeDupedName

    INTO #ForReview

    FROM cteReconstitute

    WHERE Name_1 <> DeDupedName

    ;

    --===== This just shows what we've got. Notice that all the "Bruce-Bruce" syndrome names are wrong.

    SELECT *

    FROM #ForReview

    ;

    As Wayne was kind enough to point out, you can get the splitter from the following URL...

    http://www.sqlservercentral.com/articles/Tally+Table/72993/

    I used the same column names as Wayne so you can use my test data for his stuff. You just need to change "@tmp" to "dbo.SomeTable" and you'll be ready to rock.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Viewing 15 posts - 1 through 15 (of 23 total)

You must be logged in to reply to this topic. Login to reply