Need help on performance and integrity

  • Hi Everyone,

    I have sample code below Item#1 which will create the sample data for testing and Item #2 which is the code to get the interleaving rows and range of unique data.

    Item #1: This will create a sample table and data

    use tempdb;

    SET NOCOUNT ON;

    GO

    IF OBJECT_ID ('#TempTable','U') IS NOT NULL

    BEGIN

    DROP TABLE #TempTable

    END

    CREATE TABLE #TempTable

    (

    RowID int PRIMARY KEY CLUSTERED,

    RowKey varchar(1000),

    Locale varchar(10),

    Code char(1)

    )

    GO

    INSERT #TempTable

    SELECT 1,'1','en','C' UNION ALL

    SELECT 2,'1','de','C' UNION ALL

    SELECT 3,'5','zh','A' UNION ALL --/first(1) batch of A

    SELECT 4,'4','en','A' UNION ALL --/first(1) batch of A

    SELECT 5,'3','en','A' UNION ALL --/first(1) batch of A

    SELECT 10,'2','zh','A' UNION ALL --/first(1) batch of A

    SELECT 11,'2','zh','A' UNION ALL --/first(1) batch of A

    SELECT 12,'1','zh','C' UNION ALL

    SELECT 13,'1','de','C' UNION ALL

    SELECT 14,'1','en','A' UNION ALL --/second(2) batch of A

    SELECT 15,'3','en','A' UNION ALL --/second(2) batch of A

    SELECT 16,'3','de','A' UNION ALL --/second(2) batch of A

    SELECT 18,'1','en','A' UNION ALL --/second(2) batch of A

    SELECT 19,'3','de','A' UNION ALL --/second(2) batch of A

    SELECT 20,'1','en','C' UNION ALL

    SELECT 21,'1','en','C'

    GO

    Item #2: This will get the interleaving rows of duplicates resulting to a unique range result.

    SELECT * FROM #TempTable

    DECLARE @RecordCount int;

    DECLARE @MaxRowID int; -- Stores the overall maximum row id at the beginning.

    -- Initialize the MaxROwID to the max row id number + 1

    -- So we can set this value as the last value in the range

    SET @MaxRowID = (SELECT MAX(RowID) FROM #TempTable) +1; -- Store this before we delete rows

    SET @RecordCount = 0;

    SELECT @RecordCount = COUNT(*) FROM #TempTable

    IF @RecordCount > 0

    BEGIN;

    DECLARE @Results TABLE(RowID int); -- Here we assemble the range numbers (result)

    DECLARE @RowID int; -- Stores the first rowid which is a duplicate

    DECLARE @OptRows int; -- variable which records the optimum number of rows to scan

    DECLARE @DeletedRowID int; -- Track the deleted rowids.

    SET @DeletedRowID = -1

    SET @OptRows = 100;

    WHILE 1=1

    BEGIN;

    SET @RowID = (

    SELECT MIN(RowID) FROM (

    SELECT RowID,RowKey,

    row_number() OVER (PARTITION BY RowKey,Locale ORDER BY RowID) AS DupRank

    FROM (SELECT TOP (@OptRows) *

    FROM #TempTable

    WHERE RowID > @DeletedRowID

    ORDER BY RowID ASC

    ) SubsetRows

    ) [Aggregate]

    WHERE DupRank > 1

    );

    RAISERROR('RowID: %d, DeletedRowID: %d, OptRows: %d', 10, 1, @RowID, @DeletedRowID, @OptRows);

    IF @RowID IS NULL BEGIN;

    IF @OptRows >= 2*@MaxRowID BEGIN;

    INSERT INTO @Results SELECT @MaxRowID;

    BREAK;

    END;

    ELSE

    BEGIN;

    SET @OptRows = @OptRows + @OptRows;

    CONTINUE;

    END;

    END;

    ELSE BEGIN;

    INSERT INTO @Results SELECT @RowID;

    SET @DeletedRowID = @RowID - 1;

    SET @OptRows = @OptRows - 1;

    END;

    END;

    END;

    SELECT * FROM @Results;

    GO

    My Objectives:

    1.) Using Item #2, I need to process data for every batch of A (see Item #1 table data per batch), the problem with Item #2 is it will process the whole data without looking on every A batch.

    2.) Aside from processing data for every batch of A, I need to make (Item #2) more faster on certain patterns of data for every batch of A. Since it is much slower when there are consecutive concentrated duplicates in the data like for example pattern (Item #3) below:

    Item#3:

    INSERT #TempTable

    SELECT 1,'1','en','C' UNION ALL

    SELECT 2,'1','de','C' UNION ALL

    SELECT 3,'1','en','A' UNION ALL --/first(1) batch of A

    SELECT 4,'1','en','A' UNION ALL --/first(1) batch of A

    SELECT 5,'1','en','A' UNION ALL --/first(1) batch of A

    SELECT 10,'1','en','A' UNION ALL --/first(1) batch of A

    SELECT 11,'1','en','A' UNION ALL --/first(1) batch of A

    SELECT 12,'1','zh','C' UNION ALL

    SELECT 13,'1','de','C' UNION ALL

    SELECT 14,'2','en','A' UNION ALL --/second(2) batch of A

    SELECT 15,'2','en','A' UNION ALL --/second(2) batch of A

    SELECT 16,'2','en','A' UNION ALL --/second(2) batch of A

    SELECT 18,'2','en','A' UNION ALL --/second(2) batch of A

    SELECT 19,'2','en','A' UNION ALL --/second(2) batch of A

    SELECT 20,'1','en','C' UNION ALL

    SELECT 21,'1','en','C'

    Low performance numbers will be seen if tested on 750,000 records or 1 Million records using the pattern above. It will actually take about 2 hours+ by actually processing only 200k using the pattern (Item#3) above but If there aren't that much consecutive duplicates like (Item #1), It would only take 28 minutes to process the 1 million data.

    Thank you for your time and help.

  • Thanks for posting the sample data and DML.

    I'm not clear on what you want the output to be. If you can post that maybe we can come up with a way to do this without having to do this in a loop and instead do it set based.

    Gary Johnson
    Sr Database Engineer

  • Gary Johnson (10/16/2008)


    Thanks for posting the sample data and DML.

    I'm not clear on what you want the output to be. If you can post that maybe we can come up with a way to do this without having to do this in a loop and instead do it set based.

    Thank you for your prompt response Gary. Sorry I forgot to show you the output for the following data sample.

    For this data sample, Item #1:

    INSERT #TempTable

    SELECT 1,'1','en','C' UNION ALL

    SELECT 2,'1','de','C' UNION ALL

    SELECT 3,'5','zh','A' UNION ALL --/first(1) batch of A

    SELECT 4,'4','en','A' UNION ALL --/first(1) batch of A

    SELECT 5,'3','en','A' UNION ALL --/first(1) batch of A

    SELECT 10,'2','zh','A' UNION ALL --/first(1) batch of A

    SELECT 11,'2','zh','A' UNION ALL --/first(1) batch of A <- dup on RowID 10 (checked)

    SELECT 12,'1','zh','C' UNION ALL

    SELECT 13,'1','de','C' UNION ALL

    SELECT 14,'1','en','A' UNION ALL --/second(2) batch of A

    SELECT 15,'3','en','A' UNION ALL --/second(2) batch of A

    SELECT 16,'3','de','A' UNION ALL --/second(2) batch of A

    SELECT 18,'1','en','A' UNION ALL --/second(2) batch of A <- dup on RowID 14 (checked)

    SELECT 19,'3','de','A' UNION ALL --/second(2) batch of A <- I don't need to include this even if there is dup on RowID 16 since it is already covered in the range from RowID 14-18

    SELECT 20,'1','en','C' UNION ALL

    SELECT 21,'1','en','C'

    Final Output on Item #1 should be:

    SELECT 3 as RowID, 'A' as Code UNION ALL

    SELECT 11 as RowID, 'A' as Code UNION ALL

    SELECT 12 as RowID, 'C' as Code UNION ALL

    SELECT 14 as RowID, 'A' as Code UNION ALL

    SELECT 18 as RowID, 'A' as Code UNION ALL

    SELECT 20 as RowID, 'C'

    To explain on how did I come up with the final output on Item #1.

    First, I need to get the interleaving rows per Code so it should give me the output of:

    SELECT 3 as RowID, 'A' as Code UNION ALL

    SELECT 12 as RowID, 'C' as Code UNION ALL

    SELECT 14 as RowID, 'A' as Code UNION ALL

    SELECT 20 as RowID, 'C'

    Next is I need to process the interleaving rows of duplicates per batch of A's that would result on a unique range result. So for the first batch of A, it should give me the output.

    SELECT 11 as RowID, 'A'

    and the second batch of A will have an output:

    SELECT 18 as RowID, 'A'

    So that is how I came up with the final output. I put some comments on Item #1 hoping it would help understand the process.

    Thank you for your time and help.

  • Hi,

    I also have an attachment here named demo to help understand on what I'm about to achieve or help better support my objectives.

    -dans

  • Great image. That goes a lot farther towards explaining what you are after. This seems like it'll be a very interesting query to write. I haven't the slightest idea of *why* you would want something like that, but at least now I think I grasp what you're looking for.

    Seth Phelabaum


    Consistency is only a virtue if you're not a screwup. 😉

    Links: How to Post Sample Data[/url] :: Running Totals[/url] :: Tally Table[/url] :: Cross Tabs/Pivots[/url] :: String Concatenation[/url]

  • OK. This will be attempt #1 at a set based solution. I'm not 100% sure at the speed of this, as the index does not fully cover the subquery in the second part. I'll need you to check that out and get back with us on it if possible, as it's an issue we're currently looking into. If it still runs fairly slow(in this case, > 5 minutes), I'm pretty sure we can speed it up.

    The logic of this is simple. I scan through your table and break things into groups by finding changes of Code.

    I then go back through, and analyze that group for any duplicates.

    I take those duplicates and the first member of each group and make it the output.

    If I'm wrong on any part of that logic, please correct me. (I may not answer right away, as I'll probably be out tonight, but someone else here can probably take this one from here, or I'll answer when I get back). This solution is based in part on the running total technique described here:

    http://www.sqlservercentral.com/articles/Advanced+Querying/61716/ (Yeah, I know, I kinda get fixated on certain methods and use them everywhere for a while, sorry Jeff 😉 )

    As part of this solution, I've added two fields to your table, GroupID and Dupe, both INT's.

    [font="Courier New"]------ Declare Variables ----------------------------------

    DECLARE @GroupID    INT,

    @PrevCode      CHAR(5),

    @RowID                  INT

    SET @RowID = 0

    SET @PrevCode = (SELECT Code FROM #TempTable WHERE RowID = 1)

    SET @GroupID = 1

    -----------------------------------------------------------

    -- Breaks your data out into groups so it can be scanned --

    UPDATE #TempTable

    SET @GROUPID = CASE WHEN Code <> @PrevCode THEN @GroupID + 1 ELSE @GroupID END,

       GroupID = @GroupID,

       @PrevCode = Code

    FROM #TempTable WITH (INDEX(0))

    -----------------------------------------------------------

    ------------- Goes through your data and flags Dupes ------

    UPDATE T1

    SET Dupe = CASE WHEN (SELECT COUNT(*) FROM #TempTable T2

                           WHERE   T2.GroupID = T1.GroupID AND

                                   T2.Locale = T1.Locale AND

                                   T2.RowKey = T1.RowKey AND

                                   T2.RowID < T1.RowID) > 0 THEN 1 ELSE 0 END,

       @RowID = RowID                    

    FROM #Temptable T1 WITH INDEX(0)

    -----------------------------------------------------------

    ---------- Selects The first member of each group (which is what I took as process 1) and

    ---------- any rows marked as duplicates and displays them

    SELECT T1.RowID, T1.RowKey, T1.Locale, T1.Code

    FROM #TempTable T1

           LEFT JOIN #TempTable T2 ON T1.RowID = T2.RowID AND T2.Dupe = 1

           LEFT JOIN (SELECT GroupID, MIN(RowID) FirstGroupRow FROM #TempTable WHERE GROUPID > 1 GROUP BY GroupID) T3 ON T1.RowID = T3.FirstGroupRow

    WHERE T2.Locale IS NOT NULL OR T3.FirstGroupRow IS NOT NULL

    [/font]

    Seth Phelabaum


    Consistency is only a virtue if you're not a screwup. 😉

    Links: How to Post Sample Data[/url] :: Running Totals[/url] :: Tally Table[/url] :: Cross Tabs/Pivots[/url] :: String Concatenation[/url]

  • Here's a set based solution using CTEs. Please let us see the execution plans and performance statistics when you have tested them. Best of luck.

    Bob

    ------------------------------------------------

    -- CTE Solution

    ------------------------------------------------

    ;with cte1 as -- identify breaks

    (select code,rowid

    from #temptable t1

    where t1.code <> (select top 1 code

    from #temptable t2

    where t2.rowID < t1.rowID

    order by t2.rowID desc)

    )

    ,cte2 as -- identify ranges for 'A' codes only

    (select c1.code,c1.rowid as minRow,min(c1a.rowID) as maxRow

    from cte1 c1

    join cte1 c1a on c1a.code <> c1.code and c1a.rowID > c1.rowID

    where c1.code = 'A'

    group by c1.code,c1.rowID

    )

    ,cte3 as -- identify duplicates in 'A' ranges

    (select c.minrow,max(rowid) as dupID

    from #temptable t

    cross join cte2 c

    where t.rowid >=minRow and t.rowID < maxRow

    group by c.minRow,t.rowKey,t.locale,c.code

    having count(*) > 1

    )

    ,cte4 as -- just get the *first* duplicates

    (select min(dupID) as rowID

    from cte3

    group by minRow

    )

    ,cte5 as -- put the breaks and the first "A" dups together

    (select rowID from cte1

    union all

    select rowID from cte4)

    select t.*

    from cte5 c

    join #temptable t on t.rowID = c.rowID

    order by c.rowID

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • Garadin (10/17/2008)


    OK. This will be attempt #1 at a set based solution. I'm not 100% sure at the speed of this, as the index does not fully cover the subquery in the second part. I'll need you to check that out and get back with us on it if possible, as it's an issue we're currently looking into. If it still runs fairly slow(in this case, > 5 minutes), I'm pretty sure we can speed it up....

    I can't seem to test your solution since it gave me this error...

    Msg 1018, Level 15, State 1, Line 28

    Incorrect syntax near 'INDEX'. If this is intended as a part of a table hint, A WITH keyword and parenthesis are now required. See SQL Server Books Online for proper syntax.

    Any idea? I'm sure i am missing something here..:(

  • bhovious (10/17/2008)


    Here's a set based solution using CTEs. Please let us see the execution plans and performance statistics when you have tested them. Best of luck.

    Bob

    I tested your solution for data integrity but it seems it is giving me the wrong output on some other patterns. It is correct on Item #1 data that I supplied but for some other patterns it isn't, there is probably a problem with the logic. The image that I attached on my previous reply would explain how to get the result, I also tried the sample data on that image presentation (attached) and the output is lacking using your suggested solution :(.

    Thank you so much for the help.

    Here are some other patterns that might help on testing data integrity, logic in getting the output is still the same as stated in the attached image on my previous post.

    Pattern #1

    INSERT #TempTable

    SELECT 1,'1','en','C' UNION ALL

    SELECT 2,'1','de','C' UNION ALL

    SELECT 3,'1','zh','A' UNION ALL

    SELECT 4,'2','en','A' UNION ALL

    SELECT 5,'2','en','A' UNION ALL

    SELECT 10,'1','zh','A' UNION ALL

    SELECT 11,'1','zh','A' UNION ALL

    SELECT 15,'1','zh','C' UNION ALL

    SELECT 16,'1','de','C' UNION ALL

    SELECT 20,'1','en','A' UNION ALL

    SELECT 21,'3','en','A' UNION ALL

    SELECT 22,'3','de','A' UNION ALL

    SELECT 23,'3','en','A' UNION ALL

    SELECT 24,'3','de','A' UNION ALL

    SELECT 30,'1','en','C' UNION ALL

    SELECT 31,'1','en','C'

    select * from #temptable

    --//Output

    SELECT 3 as RowID,'1' as RowKey,'zh' as Locale,'A' as [Action] UNION ALL

    SELECT 5,'2','en','A' UNION ALL

    SELECT 11,'1','zh','A' UNION ALL

    SELECT 15,'1','zh','C' UNION ALL

    SELECT 20,'1','en','A' UNION ALL

    SELECT 23,'3','en','A' UNION ALL

    SELECT 30,'1','en','C'

    Pattern #2

    INSERT #TempTable

    SELECT 1,'1','en','D' UNION ALL

    SELECT 2,'1','de','C' UNION ALL

    SELECT 3,'1','zh','A' UNION ALL

    SELECT 4,'1','zh','A' UNION ALL

    SELECT 5,'1','zh','A' UNION ALL

    SELECT 10,'1','zh','A' UNION ALL

    SELECT 11,'1','zh','C' UNION ALL

    SELECT 15,'1','zh','C' UNION ALL

    SELECT 16,'1','de','C' UNION ALL

    SELECT 20,'3','en','A' UNION ALL

    SELECT 21,'3','de','A' UNION ALL

    SELECT 22,'3','de','A' UNION ALL

    SELECT 23,'3','en','A' UNION ALL

    SELECT 24,'3','de','D' UNION ALL

    SELECT 30,'1','en','C' UNION ALL

    SELECT 31,'1','en','C'

    select * from #temptable

    --//Output

    SELECT 2 as RowID,'1' as RowKey,'de' as Locale,'C' as [Action] UNION ALL

    SELECT 3,'1','zh','A' UNION ALL

    SELECT 4,'1','zh','A' UNION ALL

    SELECT 5,'1','zh','A' UNION ALL

    SELECT 10,'1','zh','A' UNION ALL

    SELECT 11,'1','zh','C' UNION ALL

    SELECT 20,'3','en','A' UNION ALL

    SELECT 22,'3','de','A' UNION ALL

    SELECT 24,'3','de','D' UNION ALL

    SELECT 30,'1','en','C'

    Pattern #3

    INSERT #TempTable

    SELECT 1,'5','en','D' UNION ALL

    SELECT 2,'5','de','C' UNION ALL

    SELECT 3,'4','zh','C' UNION ALL

    SELECT 5,'5','zh','A' UNION ALL

    SELECT 6,'4','zh','A' UNION ALL

    SELECT 7,'3','zh','A' UNION ALL

    SELECT 11,'2','zh','A' UNION ALL

    SELECT 15,'1','zh','A' UNION ALL

    SELECT 16,'1','zh','A' UNION ALL

    SELECT 20,'2','en','C' UNION ALL

    SELECT 21,'3','de','C' UNION ALL

    SELECT 22,'4','de','A' UNION ALL

    SELECT 23,'2','en','A' UNION ALL

    SELECT 24,'4','de','A' UNION ALL

    SELECT 30,'1','en','C' UNION ALL

    SELECT 31,'1','en','C'

    select * from #temptable

    --//Output

    SELECT 2 as RowID,'5' as RowKey,'de' as Locale,'C' as [Action] UNION ALL

    SELECT 5,'5','zh','A' UNION ALL

    SELECT 16,'1','zh','A' UNION ALL

    SELECT 20,'2','en','C' UNION ALL

    SELECT 22,'4','de','A' UNION ALL

    SELECT 24,'4','de','A' UNION ALL

    SELECT 30,'1','en','C'

  • Using the data at the bottom of this message, the CTE based code produces the following:

    ------------------------------------------------------------------------------------------------

    3 5 zh A

    112zhA

    121zhC

    141enA

    181enA

    201enC

    ------------------------------------------------------------------------------------------------

    This corresponds to your expected output as specified in your second post of this thread. All you have to do is remove rowKey and locale from the final select statement.

    SELECT 3 as RowID, 'A' as Code UNION ALL

    SELECT 11 as RowID, 'A' as Code UNION ALL

    SELECT 12 as RowID, 'C' as Code UNION ALL

    SELECT 14 as RowID, 'A' as Code UNION ALL

    SELECT 18 as RowID, 'A' as Code UNION ALL

    SELECT 20 as RowID, 'C'

    I have tested it against one of the patterns you just sent and found a discrepancy there which can be fixed, but first let me ask: Will you ever have THREE matching rows in a series of A's like this:

    11 2 zh A

    12 2 zh A

    13 2 zh A

    If so, which rowID should be returned, 12 or 13 ??

    -----------------------------------------------------------------------

    rowID rowKey locale code

    11enC

    21deC

    35zhA

    44enA

    53enA

    102zhA

    112zhA

    121zhC

    131deC

    141enA

    153enA

    163deA

    181enA

    193deA

    201enC

    211enC

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • bhovious (10/17/2008)

    I have tested it against one of the patterns you just sent and found a discrepancy there which can be fixed, but first let me ask: Will you ever have THREE matching rows in a series of A's like this:

    11 2 zh A

    12 2 zh A

    13 2 zh A

    If so, which rowID should be returned, 12 or 13 ??

    Yes it is possible, so if you have that kind of combination inside A range then it should return both 12 & 13. It can even have more than that, its what I called a concentrated consecutive duplicates which causes my solution previously (Item #2 - original post) to slow down on 1 million or more data. So this may happen in a pattern:

    11 1 de A

    13 1 de A

    15 1 de A

    16 1 de A

    20 1 zh A

    21 1 de A

    22 1 zh A

    which will have an output of 13,15,16, and 21.

    thanks for your help.

  • Interesting... I swore it worked when I ran it at the office. I don't even remember adding the WITH INDEX clause to the second piece. In any case, it needs to be wrapped in an additional set of parens.

    WITH INDEX(0)

    Needs to be:

    WITH (INDEX(0))

    Seth Phelabaum


    Consistency is only a virtue if you're not a screwup. 😉

    Links: How to Post Sample Data[/url] :: Running Totals[/url] :: Tally Table[/url] :: Cross Tabs/Pivots[/url] :: String Concatenation[/url]

  • Garadin (10/18/2008)


    Interesting... I swore it worked when I ran it at the office. I don't even remember adding the WITH INDEX clause to the second piece. In any case, it needs to be wrapped in an additional set of parens.

    WITH INDEX(0)

    Needs to be:

    WITH (INDEX(0))

    You're right Garadin, there was one line on your code which has:

    WITH INDEX(0)

    Needs to be:

    WITH (INDEX(0))

    I tested it for data integrity but it's not giving me the correct result, there some discrepancies.

    I have this example which has an explanation on the attachment (demo.gif) on

    how to get its final result.

    Here is the sample running code which I tested.

    use tempdb;

    SET NOCOUNT ON;

    GO

    IF OBJECT_ID ('#TempTable','U') IS NOT NULL

    BEGIN

    DROP TABLE #TempTable

    END

    CREATE TABLE #TempTable

    (

    RowID int PRIMARY KEY CLUSTERED,

    RowKey varchar(1000),

    Locale varchar(10),

    Code char(1),

    GroupID int,

    Dupe int

    )

    GO

    INSERT #TempTable(RowID,RowKey,Locale,Code)

    SELECT 1,'1','en','C' UNION ALL

    SELECT 2,'1','de','C' UNION ALL

    SELECT 3,'1','zh','A' UNION ALL

    SELECT 4,'2','en','A' UNION ALL

    SELECT 5,'2','en','A' UNION ALL

    SELECT 10,'1','zh','A' UNION ALL

    SELECT 11,'1','zh','A' UNION ALL

    SELECT 15,'1','zh','C' UNION ALL

    SELECT 16,'1','de','C' UNION ALL

    SELECT 20,'1','en','A' UNION ALL

    SELECT 21,'3','en','A' UNION ALL

    SELECT 22,'3','de','A' UNION ALL

    SELECT 23,'3','en','A' UNION ALL

    SELECT 24,'3','de','A' UNION ALL

    SELECT 30,'1','en','C' UNION ALL

    SELECT 31,'1','en','C'

    GO

    SELECT * FROM #TempTable

    ------ Declare Variables ----------------------------------

    DECLARE @GroupID INT,

    @PrevCode CHAR(5),

    @RowID INT

    SET @RowID = 0

    SET @PrevCode = (SELECT Code FROM #TempTable WHERE RowID = 1)

    SET @GroupID = 1

    -----------------------------------------------------------

    -- Breaks your data out into groups so it can be scanned --

    UPDATE #TempTable

    SET @GROUPID = CASE WHEN Code <> @PrevCode THEN @GroupID + 1 ELSE @GroupID END,

    GroupID = @GroupID,

    @PrevCode = Code

    FROM #TempTable WITH (INDEX(0))

    -----------------------------------------------------------

    ------------- Goes through your data and flags Dupes ------

    UPDATE T1

    SET Dupe = CASE WHEN (SELECT COUNT(*) FROM #TempTable T2

    WHERE T2.GroupID = T1.GroupID AND

    T2.Locale = T1.Locale AND

    T2.RowKey = T1.RowKey AND

    T2.RowID 0 THEN 1 ELSE 0 END,

    @RowID = RowID

    FROM #Temptable T1 WITH (INDEX(0))

    -----------------------------------------------------------

    ---------- Selects The first member of each group (which is what I took as process 1) and

    ---------- any rows marked as duplicates and displays them

    SELECT T1.RowID, T1.RowKey, T1.Locale, T1.Code

    FROM #TempTable T1

    LEFT JOIN #TempTable T2 ON T1.RowID = T2.RowID AND T2.Dupe = 1

    LEFT JOIN (SELECT GroupID, MIN(RowID) FirstGroupRow FROM #TempTable WHERE GROUPID > 1 GROUP BY GroupID) T3 ON T1.RowID = T3.FirstGroupRow

    WHERE T2.Locale IS NOT NULL OR T3.FirstGroupRow IS NOT NULL

    The idea in getting the result inside A range is, lets say I have this sample

    1 2 en A

    2 2 en A

    3 2 en A

    5 2 zh A

    6 3 de A

    10 3 zh A

    11 3 de A

    12 3 en A

    13 3 zh A

    It should have the final output:

    2 2 en A -- This is a dupe from (1 2 en A)

    3 2 en A -- This is a dupe from (2 2 en A)

    11 3 de A -- This is a dupe from (6 3 de A)

    Probably you might be wondering why (13 3 zh A) wasn't included in the final result.

    That is because there is no duplicate of (13 3 zh A), starting from the last dupe found which is (11 3 de A)

    to (13 3 zh A). It should always start looking for duplicates from the last duplicate found onwards.

    Hope this helps.

    Thank you for your help.

  • Ok, so a slight change in logic. How about this one?

    [font="Courier New"]

    ------ Declare Variables ----------------------------------

    DECLARE @GroupID    INT,

    @PrevCode          CHAR(5),

    @RowID             INT,

    @LastDupe          INT,

    @Dupe              TinyInt

    SET @PrevCode = (SELECT Code FROM #TempTable WHERE RowID = 1)

    SET @GroupID = 1

    SET @LastDupe = 1

    -----------------------------------------------------------

    -- Breaks your data out into groups so it can be scanned --

    UPDATE #TempTable

    SET @GROUPID = CASE WHEN Code <> @PrevCode THEN @GroupID + 1 ELSE @GroupID END,

       GroupID = @GroupID,

       @PrevCode = Code

    FROM #TempTable WITH (INDEX(0))

    -----------------------------------------------------------

    ------------- Goes through your data and flags Dupes ------

    UPDATE T1

    SET @Dupe = CASE WHEN (SELECT COUNT(*) FROM #TempTable T2

                           WHERE   T2.GroupID = T1.GroupID AND

                                   T2.Locale = T1.Locale AND

                                   T2.RowKey = T1.RowKey AND

                                   T2.RowID < T1.RowID AND

                                   T2.RowID > @LastDupe) > 0 THEN 1 ELSE 0 END,

       Dupe = @Dupe,

       @LastDupe = CASE WHEN @Dupe = 1 THEN RowID ELSE @LastDupe END,

       @RowID = RowID            

    FROM #Temptable T1 WITH (INDEX(0))

    -----------------------------------------------------------

    ---------- Selects The first member of each group (which is what I took as process 1) and

    ---------- any rows marked as duplicates and displays them

    SELECT T1.RowID, T1.RowKey, T1.Locale, T1.Code, T1.GroupID

    FROM #TempTable T1

           LEFT JOIN #TempTable T2 ON T1.RowID = T2.RowID AND T2.Dupe = 1

           LEFT JOIN (SELECT GroupID, MIN(RowID) FirstGroupRow FROM #TempTable WHERE GROUPID > 1 GROUP BY GroupID) T3 ON T1.RowID = T3.FirstGroupRow

    WHERE T2.Locale IS NOT NULL OR T3.FirstGroupRow IS NOT NULL[/font]

    Using that code gives me the following results on your test data:

    RESULTS

    -----------------

    35zhA2

    112zhA2

    121zhC3

    141enA4

    181enA4

    193deA4

    201enC5

    211enC5

    If these are incorrect, point out which ones should or should not be there / why.

    Seth Phelabaum


    Consistency is only a virtue if you're not a screwup. 😉

    Links: How to Post Sample Data[/url] :: Running Totals[/url] :: Tally Table[/url] :: Cross Tabs/Pivots[/url] :: String Concatenation[/url]

  • Garadin (10/20/2008)

    Using that code gives me the following results on your test data:

    RESULTS

    -----------------

    35zhA2

    112zhA2

    121zhC3

    141enA4

    181enA4

    193deA4

    201enC5

    211enC5

    If these are incorrect, point out which ones should or should not be there / why.

    I think you're using this data, the first post that I gave:

    INSERT #TempTable(RowID,RowKey,Locale,Code)

    SELECT 1,'1','en','C' UNION ALL

    SELECT 2,'1','de','C' UNION ALL

    SELECT 3,'5','zh','A' UNION ALL --/first(1) batch of A

    SELECT 4,'4','en','A' UNION ALL --/first(1) batch of A

    SELECT 5,'3','en','A' UNION ALL --/first(1) batch of A

    SELECT 10,'2','zh','A' UNION ALL --/first(1) batch of A

    SELECT 11,'2','zh','A' UNION ALL --/first(1) batch of A

    SELECT 12,'1','zh','C' UNION ALL

    SELECT 13,'1','de','C' UNION ALL

    SELECT 14,'1','en','A' UNION ALL --/second(2) batch of A

    SELECT 15,'3','en','A' UNION ALL --/second(2) batch of A

    SELECT 16,'3','de','A' UNION ALL --/second(2) batch of A

    SELECT 18,'1','en','A' UNION ALL --/second(2) batch of A

    SELECT 19,'3','de','A' UNION ALL --/second(2) batch of A

    SELECT 20,'1','en','C' UNION ALL

    SELECT 21,'1','en','C'

    The results are incorrect, it should be:

    CORRECT RESULTS

    ------------------

    35zhA2

    112zhA2

    121zhC3

    141enA4

    181enA4

    201enC5

    These two should not be included because...

    19 3deA4

    -- you should look for duplicates from the last duplicate found and in this case the last duplicate found before (193deA4) is (18

    1enA4)

    21 1enC5

    -- I just want to look at duplicates on range of A's and not C's as explained on the attached image and previous post.

Viewing 15 posts - 1 through 15 (of 115 total)

You must be logged in to reply to this topic. Login to reply