Need help on performance and integrity

Question

Need help on performance and integrity

Dan Segalles

Old Hand

Points: 392
More actions
October 16, 2008 at 8:37 am

#194589

Hi Everyone,
I have sample code below Item#1 which will create the sample data for testing and Item #2 which is the code to get the interleaving rows and range of unique data.
Item #1: This will create a sample table and data
use tempdb;
SET NOCOUNT ON;
GO
IF OBJECT_ID ('#TempTable','U') IS NOT NULL
BEGIN
DROP TABLE #TempTable
END
CREATE TABLE #TempTable
(
RowID int PRIMARY KEY CLUSTERED,
RowKey varchar(1000),
Locale varchar(10),
Code char(1)
)
GO
INSERT #TempTable
SELECT 1,'1','en','C' UNION ALL
SELECT 2,'1','de','C' UNION ALL
SELECT 3,'5','zh','A' UNION ALL --/first(1) batch of A
SELECT 4,'4','en','A' UNION ALL --/first(1) batch of A
SELECT 5,'3','en','A' UNION ALL --/first(1) batch of A
SELECT 10,'2','zh','A' UNION ALL --/first(1) batch of A
SELECT 11,'2','zh','A' UNION ALL --/first(1) batch of A
SELECT 12,'1','zh','C' UNION ALL
SELECT 13,'1','de','C' UNION ALL
SELECT 14,'1','en','A' UNION ALL --/second(2) batch of A
SELECT 15,'3','en','A' UNION ALL --/second(2) batch of A
SELECT 16,'3','de','A' UNION ALL --/second(2) batch of A
SELECT 18,'1','en','A' UNION ALL --/second(2) batch of A
SELECT 19,'3','de','A' UNION ALL --/second(2) batch of A
SELECT 20,'1','en','C' UNION ALL
SELECT 21,'1','en','C'
GO
Item #2: This will get the interleaving rows of duplicates resulting to a unique range result.
SELECT * FROM #TempTable
DECLARE @RecordCount int;
DECLARE @MaxRowID int; -- Stores the overall maximum row id at the beginning.
-- Initialize the MaxROwID to the max row id number + 1
-- So we can set this value as the last value in the range
SET @MaxRowID = (SELECT MAX(RowID) FROM #TempTable) +1; -- Store this before we delete rows
SET @RecordCount = 0;
SELECT @RecordCount = COUNT(*) FROM #TempTable
IF @RecordCount > 0
BEGIN;
DECLARE @Results TABLE(RowID int); -- Here we assemble the range numbers (result)
DECLARE @RowID int; -- Stores the first rowid which is a duplicate
DECLARE @OptRows int; -- variable which records the optimum number of rows to scan
DECLARE @DeletedRowID int; -- Track the deleted rowids.
SET @DeletedRowID = -1
SET @OptRows = 100;
WHILE 1=1
BEGIN;
SET @RowID = (
SELECT MIN(RowID) FROM (
SELECT RowID,RowKey,
row_number() OVER (PARTITION BY RowKey,Locale ORDER BY RowID) AS DupRank
FROM (SELECT TOP (@OptRows) *
FROM #TempTable
WHERE RowID > @DeletedRowID
ORDER BY RowID ASC
) SubsetRows
) [Aggregate]
WHERE DupRank > 1
);
RAISERROR('RowID: %d, DeletedRowID: %d, OptRows: %d', 10, 1, @RowID, @DeletedRowID, @OptRows);
IF @RowID IS NULL BEGIN;
IF @OptRows >= 2*@MaxRowID BEGIN;
INSERT INTO @Results SELECT @MaxRowID;
BREAK;
END;
ELSE
BEGIN;
SET @OptRows = @OptRows + @OptRows;
CONTINUE;
END;
END;
ELSE BEGIN;
INSERT INTO @Results SELECT @RowID;
SET @DeletedRowID = @RowID - 1;
SET @OptRows = @OptRows - 1;
END;
END;
END;
SELECT * FROM @Results;
GO
My Objectives:
1.) Using Item #2, I need to process data for every batch of A (see Item #1 table data per batch), the problem with Item #2 is it will process the whole data without looking on every A batch.
2.) Aside from processing data for every batch of A, I need to make (Item #2) more faster on certain patterns of data for every batch of A. Since it is much slower when there are consecutive concentrated duplicates in the data like for example pattern (Item #3) below:
Item#3:
INSERT #TempTable
SELECT 1,'1','en','C' UNION ALL
SELECT 2,'1','de','C' UNION ALL
SELECT 3,'1','en','A' UNION ALL --/first(1) batch of A
SELECT 4,'1','en','A' UNION ALL --/first(1) batch of A
SELECT 5,'1','en','A' UNION ALL --/first(1) batch of A
SELECT 10,'1','en','A' UNION ALL --/first(1) batch of A
SELECT 11,'1','en','A' UNION ALL --/first(1) batch of A
SELECT 12,'1','zh','C' UNION ALL
SELECT 13,'1','de','C' UNION ALL
SELECT 14,'2','en','A' UNION ALL --/second(2) batch of A
SELECT 15,'2','en','A' UNION ALL --/second(2) batch of A
SELECT 16,'2','en','A' UNION ALL --/second(2) batch of A
SELECT 18,'2','en','A' UNION ALL --/second(2) batch of A
SELECT 19,'2','en','A' UNION ALL --/second(2) batch of A
SELECT 20,'1','en','C' UNION ALL
SELECT 21,'1','en','C'
Low performance numbers will be seen if tested on 750,000 records or 1 Million records using the pattern above. It will actually take about 2 hours+ by actually processing only 200k using the pattern (Item#3) above but If there aren't that much consecutive duplicates like (Item #1), It would only take 28 minutes to process the 1 million data.
Thank you for your time and help.

Viewing 15 posts - 1 through 15 (of 115 total)

You must be logged in to reply to this topic. Login to reply

Gary Johnson-259336 SSCommitted Points: 1673 More actions · Answer 1

Thanks for posting the sample data and DML.

I'm not clear on what you want the output to be. If you can post that maybe we can come up with a way to do this without having to do this in a loop and instead do it set based.

Gary Johnson
Sr Database Engineer

Dan Segalles Old Hand Points: 392 More actions · Answer 2

Gary Johnson (10/16/2008)
Thanks for posting the sample data and DML.
I'm not clear on what you want the output to be. If you can post that maybe we can come up with a way to do this without having to do this in a loop and instead do it set based.

Thank you for your prompt response Gary. Sorry I forgot to show you the output for the following data sample.

For this data sample, Item #1:

INSERT #TempTable

SELECT 1,'1','en','C' UNION ALL

SELECT 2,'1','de','C' UNION ALL

SELECT 3,'5','zh','A' UNION ALL --/first(1) batch of A

SELECT 4,'4','en','A' UNION ALL --/first(1) batch of A

SELECT 5,'3','en','A' UNION ALL --/first(1) batch of A

SELECT 10,'2','zh','A' UNION ALL --/first(1) batch of A

SELECT 11,'2','zh','A' UNION ALL --/first(1) batch of A <- dup on RowID 10 (checked)

SELECT 12,'1','zh','C' UNION ALL

SELECT 13,'1','de','C' UNION ALL

SELECT 14,'1','en','A' UNION ALL --/second(2) batch of A

SELECT 15,'3','en','A' UNION ALL --/second(2) batch of A

SELECT 16,'3','de','A' UNION ALL --/second(2) batch of A

SELECT 18,'1','en','A' UNION ALL --/second(2) batch of A <- dup on RowID 14 (checked)

SELECT 19,'3','de','A' UNION ALL --/second(2) batch of A <- I don't need to include this even if there is dup on RowID 16 since it is already covered in the range from RowID 14-18

SELECT 20,'1','en','C' UNION ALL

SELECT 21,'1','en','C'

Final Output on Item #1 should be:

SELECT 3 as RowID, 'A' as Code UNION ALL

SELECT 11 as RowID, 'A' as Code UNION ALL

SELECT 12 as RowID, 'C' as Code UNION ALL

SELECT 14 as RowID, 'A' as Code UNION ALL

SELECT 18 as RowID, 'A' as Code UNION ALL

SELECT 20 as RowID, 'C'

To explain on how did I come up with the final output on Item #1.

First, I need to get the interleaving rows per Code so it should give me the output of:

SELECT 3 as RowID, 'A' as Code UNION ALL

SELECT 12 as RowID, 'C' as Code UNION ALL

SELECT 14 as RowID, 'A' as Code UNION ALL

SELECT 20 as RowID, 'C'

Next is I need to process the interleaving rows of duplicates per batch of A's that would result on a unique range result. So for the first batch of A, it should give me the output.

SELECT 11 as RowID, 'A'

and the second batch of A will have an output:

SELECT 18 as RowID, 'A'

So that is how I came up with the final output. I put some comments on Item #1 hoping it would help understand the process.

Thank you for your time and help.

Dan Segalles Old Hand Points: 392 More actions · Answer 3

Hi,

I also have an attachment here named demo to help understand on what I'm about to achieve or help better support my objectives.

-dans

Garadin One Orange Chip Points: 29613 More actions · Answer 4

Great image. That goes a lot farther towards explaining what you are after. This seems like it'll be a very interesting query to write. I haven't the slightest idea of *why* you would want something like that, but at least now I think I grasp what you're looking for.

Seth Phelabaum

Consistency is only a virtue if you're not a screwup. 😉

Links: How to Post Sample Data[/url] :: Running Totals[/url] :: Tally Table[/url] :: Cross Tabs/Pivots[/url] :: String Concatenation[/url]

Garadin One Orange Chip Points: 29613 More actions · Answer 5

OK. This will be attempt #1 at a set based solution. I'm not 100% sure at the speed of this, as the index does not fully cover the subquery in the second part. I'll need you to check that out and get back with us on it if possible, as it's an issue we're currently looking into. If it still runs fairly slow(in this case, > 5 minutes), I'm pretty sure we can speed it up.

The logic of this is simple. I scan through your table and break things into groups by finding changes of Code.

I then go back through, and analyze that group for any duplicates.

I take those duplicates and the first member of each group and make it the output.

If I'm wrong on any part of that logic, please correct me. (I may not answer right away, as I'll probably be out tonight, but someone else here can probably take this one from here, or I'll answer when I get back). This solution is based in part on the running total technique described here:

http://www.sqlservercentral.com/articles/Advanced+Querying/61716/ (Yeah, I know, I kinda get fixated on certain methods and use them everywhere for a while, sorry Jeff 😉 )

As part of this solution, I've added two fields to your table, GroupID and Dupe, both INT's.

[font="Courier New"]------ Declare Variables ----------------------------------

DECLARE @GroupID INT,

@PrevCode CHAR(5),

@RowID INT

SET @RowID = 0

SET @PrevCode = (SELECT Code FROM #TempTable WHERE RowID = 1)

SET @GroupID = 1

-----------------------------------------------------------

-- Breaks your data out into groups so it can be scanned --

UPDATE #TempTable

SET @GROUPID = CASE WHEN Code <> @PrevCode THEN @GroupID + 1 ELSE @GroupID END,

GroupID = @GroupID,

@PrevCode = Code

FROM #TempTable WITH (INDEX(0))

-----------------------------------------------------------

------------- Goes through your data and flags Dupes ------

UPDATE T1

SET Dupe = CASE WHEN (SELECT COUNT(*) FROM #TempTable T2

WHERE T2.GroupID = T1.GroupID AND

T2.Locale = T1.Locale AND

T2.RowKey = T1.RowKey AND

T2.RowID < T1.RowID) > 0 THEN 1 ELSE 0 END,

@RowID = RowID

FROM #Temptable T1 WITH INDEX(0)

-----------------------------------------------------------

---------- Selects The first member of each group (which is what I took as process 1) and

---------- any rows marked as duplicates and displays them

SELECT T1.RowID, T1.RowKey, T1.Locale, T1.Code

FROM #TempTable T1

LEFT JOIN #TempTable T2 ON T1.RowID = T2.RowID AND T2.Dupe = 1

LEFT JOIN (SELECT GroupID, MIN(RowID) FirstGroupRow FROM #TempTable WHERE GROUPID > 1 GROUP BY GroupID) T3 ON T1.RowID = T3.FirstGroupRow

WHERE T2.Locale IS NOT NULL OR T3.FirstGroupRow IS NOT NULL

[/font]

Seth Phelabaum

Consistency is only a virtue if you're not a screwup. 😉

Links: How to Post Sample Data[/url] :: Running Totals[/url] :: Tally Table[/url] :: Cross Tabs/Pivots[/url] :: String Concatenation[/url]

The Dixie Flatline SSC Guru Points: 53354 More actions · Answer 6

Here's a set based solution using CTEs. Please let us see the execution plans and performance statistics when you have tested them. Best of luck.

Bob

------------------------------------------------

-- CTE Solution

------------------------------------------------

;with cte1 as -- identify breaks

(select code,rowid

from #temptable t1

where t1.code <> (select top 1 code

from #temptable t2

where t2.rowID < t1.rowID

order by t2.rowID desc)

)

,cte2 as -- identify ranges for 'A' codes only

(select c1.code,c1.rowid as minRow,min(c1a.rowID) as maxRow

from cte1 c1

join cte1 c1a on c1a.code <> c1.code and c1a.rowID > c1.rowID

where c1.code = 'A'

group by c1.code,c1.rowID

)

,cte3 as -- identify duplicates in 'A' ranges

(select c.minrow,max(rowid) as dupID

from #temptable t

cross join cte2 c

where t.rowid >=minRow and t.rowID < maxRow

group by c.minRow,t.rowKey,t.locale,c.code

having count(*) > 1

)

,cte4 as -- just get the *first* duplicates

(select min(dupID) as rowID

from cte3

group by minRow

)

,cte5 as -- put the breaks and the first "A" dups together

(select rowID from cte1

union all

select rowID from cte4)

select t.*

from cte5 c

join #temptable t on t.rowID = c.rowID

order by c.rowID

__________________________________________________

Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

Dan Segalles Old Hand Points: 392 More actions · Answer 7

Garadin (10/17/2008)
OK. This will be attempt #1 at a set based solution. I'm not 100% sure at the speed of this, as the index does not fully cover the subquery in the second part. I'll need you to check that out and get back with us on it if possible, as it's an issue we're currently looking into. If it still runs fairly slow(in this case, > 5 minutes), I'm pretty sure we can speed it up....

I can't seem to test your solution since it gave me this error...

Msg 1018, Level 15, State 1, Line 28

Incorrect syntax near 'INDEX'. If this is intended as a part of a table hint, A WITH keyword and parenthesis are now required. See SQL Server Books Online for proper syntax.

Any idea? I'm sure i am missing something here..:(

Dan Segalles Old Hand Points: 392 More actions · Answer 8

bhovious (10/17/2008)
Here's a set based solution using CTEs. Please let us see the execution plans and performance statistics when you have tested them. Best of luck.
Bob

I tested your solution for data integrity but it seems it is giving me the wrong output on some other patterns. It is correct on Item #1 data that I supplied but for some other patterns it isn't, there is probably a problem with the logic. The image that I attached on my previous reply would explain how to get the result, I also tried the sample data on that image presentation (attached) and the output is lacking using your suggested solution :(.

Thank you so much for the help.

Here are some other patterns that might help on testing data integrity, logic in getting the output is still the same as stated in the attached image on my previous post.

Pattern #1

INSERT #TempTable

SELECT 1,'1','en','C' UNION ALL

SELECT 2,'1','de','C' UNION ALL

SELECT 3,'1','zh','A' UNION ALL

SELECT 4,'2','en','A' UNION ALL

SELECT 5,'2','en','A' UNION ALL

SELECT 10,'1','zh','A' UNION ALL

SELECT 11,'1','zh','A' UNION ALL

SELECT 15,'1','zh','C' UNION ALL

SELECT 16,'1','de','C' UNION ALL

SELECT 20,'1','en','A' UNION ALL

SELECT 21,'3','en','A' UNION ALL

SELECT 22,'3','de','A' UNION ALL

SELECT 23,'3','en','A' UNION ALL

SELECT 24,'3','de','A' UNION ALL

SELECT 30,'1','en','C' UNION ALL

SELECT 31,'1','en','C'

select * from #temptable

--//Output

SELECT 3 as RowID,'1' as RowKey,'zh' as Locale,'A' as [Action] UNION ALL

SELECT 5,'2','en','A' UNION ALL

SELECT 11,'1','zh','A' UNION ALL

SELECT 15,'1','zh','C' UNION ALL

SELECT 20,'1','en','A' UNION ALL

SELECT 23,'3','en','A' UNION ALL

SELECT 30,'1','en','C'

Pattern #2

INSERT #TempTable

SELECT 1,'1','en','D' UNION ALL

SELECT 2,'1','de','C' UNION ALL

SELECT 3,'1','zh','A' UNION ALL

SELECT 4,'1','zh','A' UNION ALL

SELECT 5,'1','zh','A' UNION ALL

SELECT 10,'1','zh','A' UNION ALL

SELECT 11,'1','zh','C' UNION ALL

SELECT 15,'1','zh','C' UNION ALL

SELECT 16,'1','de','C' UNION ALL

SELECT 20,'3','en','A' UNION ALL

SELECT 21,'3','de','A' UNION ALL

SELECT 22,'3','de','A' UNION ALL

SELECT 23,'3','en','A' UNION ALL

SELECT 24,'3','de','D' UNION ALL

SELECT 30,'1','en','C' UNION ALL

SELECT 31,'1','en','C'

select * from #temptable

--//Output

SELECT 2 as RowID,'1' as RowKey,'de' as Locale,'C' as [Action] UNION ALL

SELECT 3,'1','zh','A' UNION ALL

SELECT 4,'1','zh','A' UNION ALL

SELECT 5,'1','zh','A' UNION ALL

SELECT 10,'1','zh','A' UNION ALL

SELECT 11,'1','zh','C' UNION ALL

SELECT 20,'3','en','A' UNION ALL

SELECT 22,'3','de','A' UNION ALL

SELECT 24,'3','de','D' UNION ALL

SELECT 30,'1','en','C'

Pattern #3

INSERT #TempTable

SELECT 1,'5','en','D' UNION ALL

SELECT 2,'5','de','C' UNION ALL

SELECT 3,'4','zh','C' UNION ALL

SELECT 5,'5','zh','A' UNION ALL

SELECT 6,'4','zh','A' UNION ALL

SELECT 7,'3','zh','A' UNION ALL

SELECT 11,'2','zh','A' UNION ALL

SELECT 15,'1','zh','A' UNION ALL

SELECT 16,'1','zh','A' UNION ALL

SELECT 20,'2','en','C' UNION ALL

SELECT 21,'3','de','C' UNION ALL

SELECT 22,'4','de','A' UNION ALL

SELECT 23,'2','en','A' UNION ALL

SELECT 24,'4','de','A' UNION ALL

SELECT 30,'1','en','C' UNION ALL

SELECT 31,'1','en','C'

select * from #temptable

--//Output

SELECT 2 as RowID,'5' as RowKey,'de' as Locale,'C' as [Action] UNION ALL

SELECT 5,'5','zh','A' UNION ALL

SELECT 16,'1','zh','A' UNION ALL

SELECT 20,'2','en','C' UNION ALL

SELECT 22,'4','de','A' UNION ALL

SELECT 24,'4','de','A' UNION ALL

SELECT 30,'1','en','C'

The Dixie Flatline SSC Guru Points: 53354 More actions · Answer 9

Using the data at the bottom of this message, the CTE based code produces the following:

------------------------------------------------------------------------------------------------

3 5 zh A

112zhA

121zhC

141enA

181enA

201enC

------------------------------------------------------------------------------------------------

This corresponds to your expected output as specified in your second post of this thread. All you have to do is remove rowKey and locale from the final select statement.

SELECT 3 as RowID, 'A' as Code UNION ALL

SELECT 11 as RowID, 'A' as Code UNION ALL

SELECT 12 as RowID, 'C' as Code UNION ALL

SELECT 14 as RowID, 'A' as Code UNION ALL

SELECT 18 as RowID, 'A' as Code UNION ALL

SELECT 20 as RowID, 'C'

I have tested it against one of the patterns you just sent and found a discrepancy there which can be fixed, but first let me ask: Will you ever have THREE matching rows in a series of A's like this:

11 2 zh A

12 2 zh A

13 2 zh A

If so, which rowID should be returned, 12 or 13 ??

-----------------------------------------------------------------------

rowID rowKey locale code

11enC

21deC

35zhA

44enA

53enA

102zhA

112zhA

121zhC

131deC

141enA

153enA

163deA

181enA

193deA

201enC

211enC

__________________________________________________

Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

Dan Segalles Old Hand Points: 392 More actions · Answer 10

bhovious (10/17/2008)
I have tested it against one of the patterns you just sent and found a discrepancy there which can be fixed, but first let me ask: Will you ever have THREE matching rows in a series of A's like this:
11 2 zh A
12 2 zh A
13 2 zh A
If so, which rowID should be returned, 12 or 13 ??

Yes it is possible, so if you have that kind of combination inside A range then it should return both 12 & 13. It can even have more than that, its what I called a concentrated consecutive duplicates which causes my solution previously (Item #2 - original post) to slow down on 1 million or more data. So this may happen in a pattern:

11 1 de A

13 1 de A

15 1 de A

16 1 de A

20 1 zh A

21 1 de A

22 1 zh A

which will have an output of 13,15,16, and 21.

thanks for your help.

Garadin One Orange Chip Points: 29613 More actions · Answer 11

Interesting... I swore it worked when I ran it at the office. I don't even remember adding the WITH INDEX clause to the second piece. In any case, it needs to be wrapped in an additional set of parens.

WITH INDEX(0)

Needs to be:

WITH (INDEX(0))

Seth Phelabaum

Consistency is only a virtue if you're not a screwup. 😉

Links: How to Post Sample Data[/url] :: Running Totals[/url] :: Tally Table[/url] :: Cross Tabs/Pivots[/url] :: String Concatenation[/url]

Dan Segalles Old Hand Points: 392 More actions · Answer 12

Garadin (10/18/2008)
Interesting... I swore it worked when I ran it at the office. I don't even remember adding the WITH INDEX clause to the second piece. In any case, it needs to be wrapped in an additional set of parens.
WITH INDEX(0)
Needs to be:
WITH (INDEX(0))

You're right Garadin, there was one line on your code which has:

WITH INDEX(0)

Needs to be:

WITH (INDEX(0))

I tested it for data integrity but it's not giving me the correct result, there some discrepancies.

I have this example which has an explanation on the attachment (demo.gif) on

how to get its final result.

Here is the sample running code which I tested.

use tempdb;

SET NOCOUNT ON;

GO

IF OBJECT_ID ('#TempTable','U') IS NOT NULL

BEGIN

DROP TABLE #TempTable

END

CREATE TABLE #TempTable

(

RowID int PRIMARY KEY CLUSTERED,

RowKey varchar(1000),

Locale varchar(10),

Code char(1),

GroupID int,

Dupe int

)

GO

INSERT #TempTable(RowID,RowKey,Locale,Code)

SELECT 1,'1','en','C' UNION ALL

SELECT 2,'1','de','C' UNION ALL

SELECT 3,'1','zh','A' UNION ALL

SELECT 4,'2','en','A' UNION ALL

SELECT 5,'2','en','A' UNION ALL

SELECT 10,'1','zh','A' UNION ALL

SELECT 11,'1','zh','A' UNION ALL

SELECT 15,'1','zh','C' UNION ALL

SELECT 16,'1','de','C' UNION ALL

SELECT 20,'1','en','A' UNION ALL

SELECT 21,'3','en','A' UNION ALL

SELECT 22,'3','de','A' UNION ALL

SELECT 23,'3','en','A' UNION ALL

SELECT 24,'3','de','A' UNION ALL

SELECT 30,'1','en','C' UNION ALL

SELECT 31,'1','en','C'

GO

SELECT * FROM #TempTable

------ Declare Variables ----------------------------------

DECLARE @GroupID INT,

@PrevCode CHAR(5),

@RowID INT

SET @RowID = 0

SET @PrevCode = (SELECT Code FROM #TempTable WHERE RowID = 1)

SET @GroupID = 1

-----------------------------------------------------------

-- Breaks your data out into groups so it can be scanned --

UPDATE #TempTable

SET @GROUPID = CASE WHEN Code <> @PrevCode THEN @GroupID + 1 ELSE @GroupID END,

GroupID = @GroupID,

@PrevCode = Code

FROM #TempTable WITH (INDEX(0))

-----------------------------------------------------------

------------- Goes through your data and flags Dupes ------

UPDATE T1

SET Dupe = CASE WHEN (SELECT COUNT(*) FROM #TempTable T2

WHERE T2.GroupID = T1.GroupID AND

T2.Locale = T1.Locale AND

T2.RowKey = T1.RowKey AND

T2.RowID 0 THEN 1 ELSE 0 END,

@RowID = RowID

FROM #Temptable T1 WITH (INDEX(0))

-----------------------------------------------------------

---------- Selects The first member of each group (which is what I took as process 1) and

---------- any rows marked as duplicates and displays them

SELECT T1.RowID, T1.RowKey, T1.Locale, T1.Code

FROM #TempTable T1

LEFT JOIN #TempTable T2 ON T1.RowID = T2.RowID AND T2.Dupe = 1

LEFT JOIN (SELECT GroupID, MIN(RowID) FirstGroupRow FROM #TempTable WHERE GROUPID > 1 GROUP BY GroupID) T3 ON T1.RowID = T3.FirstGroupRow

WHERE T2.Locale IS NOT NULL OR T3.FirstGroupRow IS NOT NULL

The idea in getting the result inside A range is, lets say I have this sample

1 2 en A

2 2 en A

3 2 en A

5 2 zh A

6 3 de A

10 3 zh A

11 3 de A

12 3 en A

13 3 zh A

It should have the final output:

2 2 en A -- This is a dupe from (1 2 en A)

3 2 en A -- This is a dupe from (2 2 en A)

11 3 de A -- This is a dupe from (6 3 de A)

Probably you might be wondering why (13 3 zh A) wasn't included in the final result.

That is because there is no duplicate of (13 3 zh A), starting from the last dupe found which is (11 3 de A)

to (13 3 zh A). It should always start looking for duplicates from the last duplicate found onwards.

Hope this helps.

Thank you for your help.

Garadin One Orange Chip Points: 29613 More actions · Answer 13

Ok, so a slight change in logic. How about this one?

[font="Courier New"]

------ Declare Variables ----------------------------------

DECLARE @GroupID INT,

@PrevCode CHAR(5),

@RowID INT,

@LastDupe INT,

@Dupe TinyInt

SET @PrevCode = (SELECT Code FROM #TempTable WHERE RowID = 1)

SET @GroupID = 1

SET @LastDupe = 1

-----------------------------------------------------------

-- Breaks your data out into groups so it can be scanned --

UPDATE #TempTable

SET @GROUPID = CASE WHEN Code <> @PrevCode THEN @GroupID + 1 ELSE @GroupID END,

GroupID = @GroupID,

@PrevCode = Code

FROM #TempTable WITH (INDEX(0))

-----------------------------------------------------------

------------- Goes through your data and flags Dupes ------

UPDATE T1

SET @Dupe = CASE WHEN (SELECT COUNT(*) FROM #TempTable T2

WHERE T2.GroupID = T1.GroupID AND

T2.Locale = T1.Locale AND

T2.RowKey = T1.RowKey AND

T2.RowID < T1.RowID AND

T2.RowID > @LastDupe) > 0 THEN 1 ELSE 0 END,

Dupe = @Dupe,

@LastDupe = CASE WHEN @Dupe = 1 THEN RowID ELSE @LastDupe END,

@RowID = RowID

FROM #Temptable T1 WITH (INDEX(0))

-----------------------------------------------------------

---------- Selects The first member of each group (which is what I took as process 1) and

---------- any rows marked as duplicates and displays them

SELECT T1.RowID, T1.RowKey, T1.Locale, T1.Code, T1.GroupID

FROM #TempTable T1

LEFT JOIN #TempTable T2 ON T1.RowID = T2.RowID AND T2.Dupe = 1

LEFT JOIN (SELECT GroupID, MIN(RowID) FirstGroupRow FROM #TempTable WHERE GROUPID > 1 GROUP BY GroupID) T3 ON T1.RowID = T3.FirstGroupRow

WHERE T2.Locale IS NOT NULL OR T3.FirstGroupRow IS NOT NULL[/font]

Using that code gives me the following results on your test data:

RESULTS

-----------------

35zhA2

112zhA2

121zhC3

141enA4

181enA4

193deA4

201enC5

211enC5

If these are incorrect, point out which ones should or should not be there / why.

Seth Phelabaum

Consistency is only a virtue if you're not a screwup. 😉

Links: How to Post Sample Data[/url] :: Running Totals[/url] :: Tally Table[/url] :: Cross Tabs/Pivots[/url] :: String Concatenation[/url]

Dan Segalles Old Hand Points: 392 More actions · Answer 14

Garadin (10/20/2008)
Using that code gives me the following results on your test data:
RESULTS
-----------------
35zhA2
112zhA2
121zhC3
141enA4
181enA4
193deA4
201enC5
211enC5
If these are incorrect, point out which ones should or should not be there / why.

I think you're using this data, the first post that I gave:

INSERT #TempTable(RowID,RowKey,Locale,Code)

SELECT 1,'1','en','C' UNION ALL

SELECT 2,'1','de','C' UNION ALL

SELECT 3,'5','zh','A' UNION ALL --/first(1) batch of A

SELECT 4,'4','en','A' UNION ALL --/first(1) batch of A

SELECT 5,'3','en','A' UNION ALL --/first(1) batch of A

SELECT 10,'2','zh','A' UNION ALL --/first(1) batch of A

SELECT 11,'2','zh','A' UNION ALL --/first(1) batch of A

SELECT 12,'1','zh','C' UNION ALL

SELECT 13,'1','de','C' UNION ALL

SELECT 14,'1','en','A' UNION ALL --/second(2) batch of A

SELECT 15,'3','en','A' UNION ALL --/second(2) batch of A

SELECT 16,'3','de','A' UNION ALL --/second(2) batch of A

SELECT 18,'1','en','A' UNION ALL --/second(2) batch of A

SELECT 19,'3','de','A' UNION ALL --/second(2) batch of A

SELECT 20,'1','en','C' UNION ALL

SELECT 21,'1','en','C'

The results are incorrect, it should be:

CORRECT RESULTS

------------------

35zhA2

112zhA2

121zhC3

141enA4

181enA4

201enC5

These two should not be included because...

19 3deA4

-- you should look for duplicates from the last duplicate found and in this case the last duplicate found before (193deA4) is (18

1enA4)

21 1enC5

-- I just want to look at duplicates on range of A's and not C's as explained on the attached image and previous post.