Simple query, iteratively - RBAR argh...

  • I have a test procedure that selects rows based on value uniqueness within each column. It is simple to process with a loop, recursion, etc., but doing it without recursive table constructors or a loop is proving quite challenging.

    Here's some sample data:

    [font="Courier New"]create TABLE t_input(

    c1INT,

    c2 INT,

    c3 INT,

    sq INT IDENTITY (1,1) ) ;

    insert t_input ( c1, c2, c3) values

    ( 1, 2, 3 ),

    ( 4, 1, 6),

    ( 1, 7, 8),

    ( 9, 2, 10),

    (11,12, 6),

    (13,14, 15),

    ( 9,20, 21),

    (22,23, 3);[/font]

    Here's the trivial proc:

    ...

    BEGIN
        DECLARE @zzz INT;
        DECLARE @cnt INT;

        select @cnt = count(*) FROM t_input;
        set @zzz = 1;

        create TABLE #t1(
            c1 INT,
            c2 INT,
            c3 INT,
            sq INT );

        CREATE UNIQUE INDEX ix1 on #t1 ( c1 ) ;
        CREATE UNIQUE INDEX ix2 on #t1 ( c2 ) ;
        CREATE UNIQUE INDEX ix3 on #t1 ( c3 ) ;

        while ( @zzz <= @cnt )
        BEGIN
            insert into #t1 ( c1, c2, c3, sq )
            select z.c1, z.c2, z.c3, z.sq
            FROM t_input as z
            WHERE not exists ( SELECT * FROM #t1 WHERE #t1.c1 = z.c1 ) and
                  not exists ( SELECT * FROM #t1 WHERE #t1.c2 = z.c2 ) and
                  not exists ( SELECT * FROM #t1 WHERE #t1.c3 = z.c3 ) and
                  z.sq = @zzz;

            set @zzz = @zzz + 1;
        end;

        select * FROM #t1;

    ...

    Expected output:

    [font="Courier New"]c1c2c3sq

    -----------------------------

    1231

    4162

    1314156

    920217

    [/font]

    It should: select row #1, as it is first; there is nothing to stop it.

    select #2; the "1" in the second column is unique to column 2.

    discard #3; the "1" in c1 is a dupe of #1.

    discard #4; the "2" in c2 is a dupe of #1.

    discard #5; the "6" in c3 is a dupe of #2.

    select #6.

    select #7; though the "9" in c1 is a dupe of #4, we discarded #4, so we don't consider it a dupe.

    discard #8; the "3" in c3 is a dupe of #1.
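
    To make testing alternatives easier, those four expected rows can be captured as data. The #expected table below is my own hypothetical helper, built straight from the walkthrough above (it is not part of the original proc), so any candidate rewrite can simply be diffed against it:

    -- Hypothetical helper: the expected result captured as data, so a
    -- candidate query can be compared against it. Values come straight
    -- from the expected output above; #expected is not part of the proc.
    -- Single-row INSERTs are used so this also runs on SQL Server 2000.
    CREATE TABLE #expected (c1 INT, c2 INT, c3 INT, sq INT);

    INSERT #expected (c1, c2, c3, sq) VALUES ( 1,  2,  3, 1);
    INSERT #expected (c1, c2, c3, sq) VALUES ( 4,  1,  6, 2);
    INSERT #expected (c1, c2, c3, sq) VALUES (13, 14, 15, 6);
    INSERT #expected (c1, c2, c3, sq) VALUES ( 9, 20, 21, 7);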

    I've been trying to come up with a single statement that answers this. Problem is, it must run on versions back to SQL Server 2000.

    Closest I've come is:

    select q.* FROM t_input as q
    where q.sq in (
        select sq FROM (
            select c1, MIN(sq) as sq from t_input as t1
            where not exists ( select sq FROM t_input as t2 WHERE t2.c2 = t1.c2 and t2.sq < t1.sq )
            group by c1
            union all
            select c2, MIN(sq) as sq from t_input as t1
            where not exists ( select sq FROM t_input as t2 WHERE t2.c3 = t1.c3 and t2.sq < t1.sq )
            group by c2
            union all
            select c3, MIN(sq) as sq from t_input as t1
            group by c3
        ) as z
        group by sq having COUNT(*)=3 );

    But that fails. You're tempted to remedy it by putting a set of "where not exists" clauses inside each inner select, but that only carries the logic one more step; you then need "where not exists" clauses inside those as well, ad infinitum.

    Honor Super Omnia-
    Jason Miller

  • This can be done as a set based operation. It is not trivial but this looks like a homework problem so I think you should try to work it out on your own.

    The probability of survival is inversely proportional to the angle of arrival.

  • A few things in this make me question the proc in general.

    Jason Miller-476791 (11/17/2010)


    CREATE UNIQUE INDEX ix1 on #t1 ( c1 ) ;

    CREATE UNIQUE INDEX ix2 on #t1 ( c2 ) ;

    CREATE UNIQUE INDEX ix3 on #t1 ( c3 ) ;

    I realize this is a temp table, but every column being individually unique (rather than unique in combination) makes the data look nonrelational. So what's the point of the data?

    Moving on...

    insert into #t1 ( c1, c2, c3, sq )
    select z.c1, z.c2, z.c3, z.sq
    FROM t_input as z
    WHERE not exists ( SELECT * FROM #t1 WHERE #t1.c1 = z.c1 ) and
          not exists ( SELECT * FROM #t1 WHERE #t1.c2 = z.c2 ) and
          not exists ( SELECT * FROM #t1 WHERE #t1.c3 = z.c3 ) and
          z.sq = @zzz;

    set @zzz = @zzz + 1;

    What business rule are you trying to handle here? This is quite odd.

    I've been trying to come up with a single statement that answers this. Problem is, it must run on versions back to SQL Server 2000.

    Which can be done. The only way I know to approach it is insanely complex, but it can be done.

    So, to bring us back to the beginning... what is the business purpose of this? It almost seems like the approach itself is poor, and should be rebuilt from the ground up.


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    For better assistance in answering your questions | Forum Netiquette
    For index/tuning help, follow these directions. | Tally Tables

    Twitter: @AnyWayDBA

  • Craig Farrell (11/17/2010)


    A few things in this make me question the proc in general.

    Jason Miller-476791 (11/17/2010)


    CREATE UNIQUE INDEX ix1 on #t1 ( c1 ) ;

    CREATE UNIQUE INDEX ix2 on #t1 ( c2 ) ;

    CREATE UNIQUE INDEX ix3 on #t1 ( c3 ) ;

    What business rule are you trying to handle here? This is quite odd.

    Basically the columns are order numbers that come in from different streams. This process removes dupes for a matching process.

    Craig Farrell (11/17/2010)

    It almost seems like the approach itself is poor, and should be rebuilt from the ground up.

    I tend to agree with you. Unfortunately, I don't have that luxury. I am given the set and told it takes too long (believe it or not, my solution there is much quicker).

    Honor Super Omnia-
    Jason Miller

  • Alright, another business logic question, then.

    The way you're approaching it, with this dataset:

    ( 1,  2,  3),
    ( 4,  1,  6),
    ( 1,  7,  8),
    ( 9,  2, 10),
    (11, 12,  6),
    (13, 14, 15),
    ( 9, 20, 21),
    (22, 23,  3);

    The '1' in rows one and two repeats in different columns. This is acceptable, I assume.

    However, the 7 in column two will never be reported; it's completely removed from the result set and will never appear. Is this desired?

    I'm still curious why these are all on the same row if they've got no relation in the first place.

    Anyway, the method to deal with this your way, in a single statement, would be like the query that follows. Please note, this result is still a hair off; I'm just presenting the core of the methodology. The problem is that in your data row 1 excludes row 4 on the second column. However, this process doesn't realize that directly, so row 4 excludes row 7 (9, 20, 21).

    To deal with that, you'd have to fold the logic of each earlier step into the subselect beneath it. It won't be pretty, and if I find some more spare time I'll try to include that, but it'll get a bit wild. I want to make sure I have the rest of the assumptions accurate.

    SELECT
        t2.*
    FROM
        #t_input AS t2
    JOIN
        (SELECT i1, MIN(z) AS MinRow
         FROM #t_input AS t
         GROUP BY i1
        ) AS i1drv
        ON  t2.i1 = i1drv.i1
        AND t2.z  = i1drv.MinRow
    JOIN
        (SELECT i2, MIN(z) AS MinRow
         FROM #t_input AS t
         GROUP BY i2
        ) AS i2drv
        ON  t2.i2 = i2drv.i2
        AND t2.z  = i2drv.MinRow
    JOIN
        (SELECT i3, MIN(z) AS MinRow
         FROM #t_input AS t
         GROUP BY i3
        ) AS i3drv
        ON  t2.i3 = i3drv.i3
        AND t2.z  = i3drv.MinRow
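
    To run this against the posted sample data, #t_input needs to exist first. Here's a minimal setup sketch, assuming it simply mirrors t_input with c1/c2/c3/sq renamed to i1/i2/i3/z (an assumption, since the table definition isn't shown above):

    -- Hypothetical setup, assuming #t_input mirrors t_input with the
    -- columns renamed c1 -> i1, c2 -> i2, c3 -> i3, sq -> z (my assumption).
    SELECT c1 AS i1, c2 AS i2, c3 AS i3, sq AS z
    INTO #t_input
    FROM t_input;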


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    For better assistance in answering your questions | Forum Netiquette
    For index/tuning help, follow these directions. | Tally Tables

    Twitter: @AnyWayDBA

    >The '1' in rows one and two repeats in different columns. This is acceptable, I assume.

    Yes, that is fine.

    >However, the 7 in column two will never be reported; it's completely removed from the result set and will never appear. Is this desired?

    Yes, that is what I'm told.

    >I'm still curious why these are all on the same row if they've got no relation in the first place.

    There is a reason that comes from the market streams, but I don't know that much of the puzzle.

    >Anyway, ...To deal with that, you'd have to include the logic of the process before it into the subselect under it.

    I think that'll fail: you're tempted to remedy it by putting where/join clauses inside each inner select, but that only carries the logic one more step; you then need the same clauses inside those as well, ad infinitum.

    (I have to get the most mileage out of that statement; credit goes to the author, jw.)

    The process is trivial by eye, but this is happening on a larger scale... 10-20 million rows every hour. And this is just one small part of a larger process.

    Honor Super Omnia-
    Jason Miller

  • CELKO (11/17/2010)


    How about this version?

    SELECT * FROM ThreeStreams;

    1     2     3     1
    4     1     6     2
    NULL  7     8     3
    9     NULL  10    4
    11    12    NULL  5
    13    14    15    6
    NULL  20    21    7
    22    23    NULL  8

    I just nulled out the later dupes in each column, based on the sequence number.

    Close.

    The 7th row is excluded/nulled. It should not be. The "9" in the first column of row 7 is acceptable because we are not considering row 4: it was excluded for duplicating the "2" in column 2.

    Honor Super Omnia-
    Jason Miller

  • Jason Miller-476791 (11/17/2010)


    I think that'll fail: you're tempted to remedy it by putting where/join clauses inside each inner select, but that only carries the logic one more step; you then need the same clauses inside those as well, ad infinitum.

    (I have to get the most mileage out of that statement; credit goes to the author, jw.)

    I believe you're correct. The requirements here fall into the rare 1% of genuinely procedural loop work. Each next row must be computed from the results of all the work done previously, with each next row possibly working against a different dataset, which breaks the set-based rules. It's the very definition of when you need iterative processing.

    Someone may have a tricksy way out of this, but I don't see it. If it weren't for those cross-column rule modifications I'd have a way to deal with this, but not bidirectionally, where colA affects colB's results and colB affects colA's results.

    Sorry. Good luck with this one.

    I have an idea involving a serial update, but I can't remember the mechanics of whether a previous row's edit to a self-joined table is re-read as each subsequent row is evaluated. It would basically just be RBAR for you anyway, with less control, so in this case I wouldn't suggest it, but it might be a fun intellectual exercise.
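
    For completeness, here is a minimal sketch of what the plain RBAR route looks like with a cursor instead of the counter-driven WHILE loop. This is just an illustration of the iterative approach being discussed, not the serial-update idea above; the table and column names come straight from the original post, and it should run on SQL Server 2000.

    -- Minimal cursor sketch of the same row-by-row rule (assumes t_input
    -- from the original post). #kept plays the role of #t1.
    DECLARE @c1 INT, @c2 INT, @c3 INT, @sq INT;

    CREATE TABLE #kept (c1 INT, c2 INT, c3 INT, sq INT);

    DECLARE cur CURSOR LOCAL FAST_FORWARD FOR
        SELECT c1, c2, c3, sq FROM t_input ORDER BY sq;

    OPEN cur;
    FETCH NEXT FROM cur INTO @c1, @c2, @c3, @sq;

    WHILE @@FETCH_STATUS = 0
    BEGIN
        -- Keep the row only if none of its three values was already kept
        -- in the same column by an earlier, kept row.
        IF NOT EXISTS (SELECT * FROM #kept
                       WHERE c1 = @c1 OR c2 = @c2 OR c3 = @c3)
            INSERT #kept (c1, c2, c3, sq) VALUES (@c1, @c2, @c3, @sq);

        FETCH NEXT FROM cur INTO @c1, @c2, @c3, @sq;
    END;

    CLOSE cur;
    DEALLOCATE cur;

    SELECT * FROM #kept ORDER BY sq;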


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    For better assistance in answering your questions | Forum Netiquette
    For index/tuning help, follow these directions. | Tally Tables

    Twitter: @AnyWayDBA

  • Not very efficient, and SQL Server 2005/2008 only:

    WITH CTE AS (
        SELECT c1,
               c2,
               c3,
               ROW_NUMBER() OVER(PARTITION BY c1 ORDER BY sq) AS c1_rn,
               ROW_NUMBER() OVER(PARTITION BY c2 ORDER BY sq) AS c2_rn,
               ROW_NUMBER() OVER(PARTITION BY c3 ORDER BY sq) AS c3_rn,
               sq
        FROM t_input)
    SELECT x.sq,
           x.c1,
           x.c2,
           x.c3
    FROM CTE x
    WHERE NOT EXISTS (SELECT * FROM CTE y1
                      WHERE y1.sq < x.sq
                        AND y1.c1_rn = 1 AND y1.c2_rn = 1 AND y1.c3_rn = 1
                        AND (y1.c1 = x.c1 OR y1.c2 = x.c2 OR y1.c3 = x.c3))
    ORDER BY x.sq;
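
    A quick way to sanity-check this against the posted sample data is to diff its output with the expected rows. The #result table below, and the #expected helper defined earlier in the thread, are hypothetical additions for testing only; EXCEPT is fine here since the query above already needs 2005+. On the eight sample rows both checks should come back empty.

    -- Hypothetical verification sketch. Assumes the query above was rerun
    -- with its final SELECT changed to insert into #result, and that the
    -- #expected helper table from earlier in the thread exists.
    CREATE TABLE #result (sq INT, c1 INT, c2 INT, c3 INT);

    -- Expected rows the query missed (should return nothing):
    SELECT c1, c2, c3, sq FROM #expected
    EXCEPT
    SELECT c1, c2, c3, sq FROM #result;

    -- Rows the query returned that weren't expected (should also return nothing):
    SELECT c1, c2, c3, sq FROM #result
    EXCEPT
    SELECT c1, c2, c3, sq FROM #expected;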

    ____________________________________________________

    Deja View - The strange feeling that somewhere, sometime you've optimised this query before

    How to get the best help on a forum

    http://www.sqlservercentral.com/articles/Best+Practices/61537
