I would like to get the distinct value from specified column. I would appreciate if any one provide the answer of this

Question

Post reply

I would like to get the distinct value from specified column. I would appreciate if any one provide the answer of this

Viewing 15 posts - 1 through 15 (of 23 total)

You must be logged in to reply to this topic. Login to reply

Ninja's_RGR'us SSC Guru Points: 294069 More actions · Answer 1

Please provide sample data (insert statements) for all your problem cases and the required output for all of them.

Ruban S SSC Veteran Points: 201 More actions · Answer 2

Here is the script

--INPUT

CREATE TABLE #TMP (NAME_1 VARCHAR(300))

INSERT INTO #TMP VALUES ('DR AARON SASAKI SASAKI AARON MD')

INSERT INTO #TMP VALUES ('DR ALBERT LUGO ALBERTO LUGO DMD PC')

--OUTPUT

I would like the output in following format

DR AARON SASAKI MD

DR ALBERT LUGO DMD PC

Ninja's_RGR'us SSC Guru Points: 294069 More actions · Answer 3

Those are all the possible cases of duplicates??? Are you sure you parsed all the data yet?

Ruban S SSC Veteran Points: 201 More actions · Answer 4

It's not yet all 55 million rows, it occurs near 100 K records.

I want to remove duplicates from this kind of data (100K) out of 55 millions.

Phil Parkin SSC Guru Points: 247183 More actions · Answer 5

"This kind of data" does not really cut it as far as a spec is concerned. Also, this is not strictly duplicate data.

So far, my spec would be along the lines of

"Break the field into space-delimited words and remove the second and subsequent occurrences of any repeated words"

Does that describe it?

Of course, names such as Bruce Bruce (any Iron Maiden fans out there?) Would be corrupted by this.

Ninja's_RGR'us SSC Guru Points: 294069 More actions · Answer 6

Phil Parkin (5/26/2011)
"This kind of data" does not really cut it as far as a spec is concerned. Also, this is not strictly duplicate data.
So far, my spec would be along the lines of
"Break the field into space-delimited words and remove the second and subsequent occurrences of any repeated words"
Does that describe it?
Of course, names such as Bruce Bruce (any Iron Maiden fans out there?) Would be corrupted by this.

Exactly my point.

Jeff Moden SSC Guru Points: 1004664 More actions · Answer 7

Ninja's_RGR'us (5/26/2011)
Please provide sample data (insert statements) for all your problem cases and the required output for all of them.

Heh... if there are thousands, wouldn't the script necessary to do that be the solution itself? 😉

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Sean Lange SSC Guru Points: 286609 More actions · Answer 8

Sounds like what he wants is a way to find duplicate patterns of strings in a field and remove the second occurrence. Of course this would require parsing the field on each row into columns and keep the order so you can do about a zillion different combinations to see IF there are any matching patterns. Even doing it for each word would be painful and hideously slow with 100k records. With 55M it almost seems worthwhile to hire some temps and have them manually fix the records. Even IF this could be scripted, it would take forever to run and you would have to audit each record and manually review them to make sure the new values are correct. A replacement this "loose" will catch all sorts of outliers (Bruce Bruce for example) and these will have to be manually corrected again after the script incorrectly corrects them.

I don't envy you on this project as it is going to be painful.

_______________________________________________________________

Need help? Help us help you.

Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

Jeff Moden SSC Guru Points: 1004664 More actions · Answer 9

Rani-434268 (5/26/2011)
I have table A_ having near 55 million rows. Name column having some duplicates, i would like to remove it.
Ex.
1. Dr Vasudev Narasima MD Vasudev Narasima MBBS
output
Dr Vasudev Narasima MD MBBS
2. Ruban s Soundrapandian Ruban Soundrpandian
output
Ruban s Soundrapandian
I would appreciate if any one provide the script for this.
Regards
Ruban

This sounds like fun. The script to do this (Phil described it best) actually isn't that complicated. Sean pointed out the tough problem... finding out if something does need to be fixed.

With that in mind, please describe what the PK on this table looks like because the method I'm thinking of is going to need to use the PK.

I also need to know... are there ANY triggers on the table?

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

WayneS SSC Guru Points: 95461 More actions · Answer 10

Jeff Moden (5/27/2011)
Rani-434268 (5/26/2011)
I have table A_ having near 55 million rows. Name column having some duplicates, i would like to remove it.
Ex.
1. Dr Vasudev Narasima MD Vasudev Narasima MBBS
output
Dr Vasudev Narasima MD MBBS
2. Ruban s Soundrapandian Ruban Soundrpandian
output
Ruban s Soundrapandian
I would appreciate if any one provide the script for this.
Regards
Ruban
This sounds like fun. The script to do this (Phil described it best) actually isn't that complicated. Sean pointed out the tough problem... finding out if something does need to be fixed.
With that in mind, please describe what the PK on this table looks like because the method I'm thinking of is going to need to use the PK.
I also need to know... are there ANY triggers on the table?

Really? You're going that route? Hmm, time to get busy on this myself then - I have a different idea that still needs a little fleshing out.

Wayne
Microsoft Certified Master: SQL Server 2008
Author - SQL Server T-SQL Recipes

If you can't explain to another person how the code that you're copying from the internet works, then DON'T USE IT on a production system! After all, you will be the one supporting it!

Links:
For better assistance in answering your questions
Performance Problems
Common date/time routines
Understanding and Using APPLY Part 1 & Part 2

Sean Lange SSC Guru Points: 286609 More actions · Answer 11

WayneS (5/27/2011)
Jeff Moden (5/27/2011)
Rani-434268 (5/26/2011)
I have table A_ having near 55 million rows. Name column having some duplicates, i would like to remove it.
Ex.
1. Dr Vasudev Narasima MD Vasudev Narasima MBBS
output
Dr Vasudev Narasima MD MBBS
2. Ruban s Soundrapandian Ruban Soundrpandian
output
Ruban s Soundrapandian
I would appreciate if any one provide the script for this.
Regards
Ruban
This sounds like fun. The script to do this (Phil described it best) actually isn't that complicated. Sean pointed out the tough problem... finding out if something does need to be fixed.
With that in mind, please describe what the PK on this table looks like because the method I'm thinking of is going to need to use the PK.
I also need to know... are there ANY triggers on the table?
Really? You're going that route? Hmm, time to get busy on this myself then - I have a different idea that still needs a little fleshing out.

ohhh this is starting to sound like a show down. This is going to be a good one to watch!!! Ruban sit back and watch the masters go at it. This has the markings of some pretty cool learning for the rest of us. 😉

_______________________________________________________________

Need help? Help us help you.

Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

WayneS SSC Guru Points: 95461 More actions · Answer 12

Sean Lange (5/27/2011)
ohhh this is starting to sound like a show down. This is going to be a good one to watch!!! Ruban sit back and watch the masters go at it. This has the markings of some pretty cool learning for the rest of us. 😉

:blush: Master? Not me... Jeff maybe.. but thanks for the vote of confidence!

Well, as is usually the case, the performance of Jeff's code will usually blow mine out of the park. But, I thoroughly agree with the learning part!

Anyway, here's my solution. It was only tested on the sample code (all 2 rows... I justdon't feel like generating the million rows of test data that the OP should be providing). Based on experience, it should perform pretty good (but like I said, Jeff has a way of doing the unexpected to totally amaze you), although it will hit the source table twice.

Like others have mentioned... it needs to be thoroughly tested, and the edge cases (Duran Duran, etc.) will get messed up.

As is becoming my norm... comments are in the code (wonder where I picked that up from???)

;

WITH cte AS

(

-- Add a row number for each row.

-- This step could be eliminated by adding an identity column

-- to the source table.

SELECT t.NAME_1,

RN = ROW_NUMBER() OVER (ORDER BY (SELECT 1))

FROM @tmp t

), cte2 AS

(

-- Break apart all of the names on the space.

-- The latest version of the Delimited Split Function is at http://www.sqlservercentral.com/articles/Tally+Table/72993/.

SELECT *

FROM cte t -- change to @tmp if identity column available

CROSS APPLY dbo.DelimitedSplit8K(t.name_1, ' ') ds

), cte3 AS

(

-- Group the results by the name,

-- Get the lowest item for each name part.

SELECT NAME_1,

RN,

Item,

ItemNumber = MIN(ItemNumber)

FROM cte2

GROUP BY NAME_1, RN, Item

)

-- Put the remaining results back together

SELECT t.NAME_1,

ca.ColList

FROM cte t -- change to @tmp if add identity column.

CROSS APPLY (SELECT ColList = STUFF((SELECT ' ' + t2.Item

FROM cte3 t2

WHERE t.RN = t2.RN -- change t.rn to @tmp.identity column if available

ORDER BY t2.RN, t2.ItemNumber

FOR XML PATH(''),TYPE).value('.','varchar(max)'),1,1,'')) ca

ORDER BY t.RN;

Wayne
Microsoft Certified Master: SQL Server 2008
Author - SQL Server T-SQL Recipes

If you can't explain to another person how the code that you're copying from the internet works, then DON'T USE IT on a production system! After all, you will be the one supporting it!

Links:
For better assistance in answering your questions
Performance Problems
Common date/time routines
Understanding and Using APPLY Part 1 & Part 2

Jeff Moden SSC Guru Points: 1004664 More actions · Answer 13

Jeff Moden (5/27/2011)
Rani-434268 (5/26/2011)
I have table A_ having near 55 million rows. Name column having some duplicates, i would like to remove it.
Ex.
1. Dr Vasudev Narasima MD Vasudev Narasima MBBS
output
Dr Vasudev Narasima MD MBBS
2. Ruban s Soundrapandian Ruban Soundrpandian
output
Ruban s Soundrapandian
I would appreciate if any one provide the script for this.
Regards
Ruban
This sounds like fun. The script to do this (Phil described it best) actually isn't that complicated. Sean pointed out the tough problem... finding out if something does need to be fixed.
With that in mind, please describe what the PK on this table looks like because the method I'm thinking of is going to need to use the PK.
I also need to know... are there ANY triggers on the table?

Hey, Ruban... howz 'bout some answers, please? My questions are important to helping you.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Jeff Moden SSC Guru Points: 1004664 More actions · Answer 14

While we're waiting for Ruban to actually show up for his own post, I went ahead and made the assumption that the table with the "duped names" actually does have a clustered primary key.

I also just can't bring myself to test only on 2 rows. I didn't gen my normal million rows but the following will work as a small test.

--===== Identify a nice safe place to play in that everyone has

USE TempDB;

--===== Conditionally drop the test table to make reruns in SSMS easier

IF OBJECT_ID('TempDB.dbo.SomeTable','U') IS NOT NULL DROP TABLE TempDB.dbo.SomeTable;

--===== Build the test table with some oddly duplicated names and some normal names

SELECT SomeID = IDENTITY(INT,1,1),

Name_1 =

RTRIM(

ISNULL(Title + ' ', '')

+ CASE

WHEN ContactID % 13 = 0 --The "Bruce Bruce" syndrome

THEN ISNULL(FirstName + ' ', '') + ISNULL(FirstName + ' ', '')

WHEN ContactID % 11 = 0 --Partial dupe

THEN ISNULL(FirstName + ' ', '') + ISNULL(FirstName + ' ', '')

+ ISNULL(LastName + ' ', '')

WHEN ContactID % 7 = 0 --Reversed dupe

THEN ISNULL(FirstName + ' ', '') + ISNULL(LastName + ' ', '')

+ ISNULL(LastName + ' ', '') + ISNULL(FirstName + ' ', '')

WHEN ContactID % 5 = 0 --Simple dupe

THEN ISNULL(FirstName + ' ', '') + ISNULL(FirstName + ' ', '')

+ ISNULL(LastName + ' ', '') + ISNULL(LastName + ' ', '')

WHEN ContactID % 3 = 0 --Fully repeated dupe

THEN ISNULL(FirstName + ' ', '') + ISNULL(MiddleName + ' ', '') + ISNULL(LastName + ' ', '')

+ ISNULL(FirstName + ' ', '') + ISNULL(MiddleName + ' ', '') + ISNULL(LastName + ' ', '')

ELSE -- Normal

ISNULL(FirstName + ' ', '') + ISNULL(MiddleName + ' ', '') + ISNULL(LastName + ' ', '')

END

+ ISNULL(Suffix, '')

)

INTO dbo.SomeTable

FROM AdventureWorks.Person.Contact

;

--===== Create the expected PK on the test table

ALTER TABLE dbo.SomeTable

ADD PRIMARY KEY CLUSTERED (SomeID)

;

Now for my first guess at how we can accomplish the problem. Like always, the details are in the comments in the code. This takes about 5 seconds on my 9 year old P4 including the time to print the result to the grid.

------------------------------------------------------------------------

--===== Conditionally drop the work tables to make reruns in SSMS easier

IF OBJECT_ID('TempDB..#MyHead','U') IS NOT NULL DROP TABLE #MyHead;

IF OBJECT_ID('TempDB..#ForReview','U') IS NOT NULL DROP TABLE #ForReview;

--===== First, we do pretty much what the name of the following cte is...

-- We split the individual words out and tag the duplicate words with an Enum of the occurance

-- of the word

WITH

cteSplitAndEnumerate AS

(

SELECT st.SomeID,

split.ItemNumber,

EnumOccurance = DENSE_RANK() OVER (PARTITION BY st.SomeID, split.Item ORDER BY split.ItemNumber),

split.Item

FROM dbo.SomeTable st

CROSS APPLY dbo.DelimitedSplit8K(st.Name_1, ' ') split

) --=== This saves only the first occurance of each split item in a table for reconstitution later

SELECT SomeID = ISNULL(SomeID,0), --ISNULL makes a NOT NULL column for #MyHead PK

ItemNumber = ISNULL(ItemNumber,0), --ISNULL makes a NOT NULL column for #MyHead PK

Item

INTO #MyHead

FROM cteSplitAndEnumerate

WHERE EnumOccurance = 1

;

--==== Add this PK shaves off about 30% on duration

ALTER TABLE #MyHead

ADD PRIMARY KEY CLUSTERED (SomeID, ItemNumber)

;

--===== Again, we do just like the name of the following cte says... we reconstitute the data

-- back into full names.

WITH

cteReconstitute AS

(

SELECT t1.SomeID,

t1.Name_1,

DeDupedName =

STUFF( --===== This gets rid of the first blank

( --=== This reconstitutes several rows of data into just one.

SELECT ' '+Item

FROM #MyHead t2

WHERE t1.SomeID = t2.SomeID

ORDER BY ItemNumber

FOR XML PATH(''),TYPE

).value('.','varchar(max)')

,1,1,'')

FROM dbo.SomeTable t1

) --=== This saves only the items that will change in a separate table for review

SELECT SomeID, Name_1, DeDupedName

INTO #ForReview

FROM cteReconstitute

WHERE Name_1 <> DeDupedName

;

--===== This just shows what we've got. Notice that all the "Bruce-Bruce" syndrome names are wrong.

SELECT *

FROM #ForReview

;

As Wayne was kind enough to point out, you can get the splitter from the following URL...

http://www.sqlservercentral.com/articles/Tally+Table/72993/

I used the same column names as Wayne so you can use my test data for his stuff. You just need to change "@tmp" to "dbo.SomeTable" and you'll be ready to rock.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Jeff Moden SSC Guru Points: 1004664 More actions · Answer 15

As a bit of a sidebar, there IS a problem with CTE's which are really just "inline views". Sometimes there's a little "positive feedback" formed by CTE's and you end up with what I refer to as an "accidental Cross Join". That happens in Wayne's code and, if I convert mine to a single chain of CTE's, also happens in my code.

An easy solution to such a problem is to "Divide'n'Conquer" by using a Temp Table to hold interim results like I did in my code example.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)