Parsing paired relationships.

  • Hmmm... OK, I can do this but only with some really expensive RBAR table operations, the script I've written takes 10 minutes to run (longer with cursors) and I can't help thinking I'm being a bit thick.

    So I have a simple paired list which contains relationships between two products, column A is paired with Column B.

    Column A ColumnB

    ProdA --> ProdB

    ProdB --> ProdA

    ProdB --> ProdC

    ProdC --> ProdB

    ProdD --> ProdE

    ProdE --> ProdD

    I don't know how many items are in a group (could be up to 20), and I also can't guarantee that every relationship is expressed explicitly which means chaining with self joins doesn't necessarily work, so for example in the above set of records there is no explicit relationship between ProdC and ProdA.

    Say what you like about the modelling of this, web programmers, don't get me started.

    Anyway, I have to group these records together - the above data creates two groups :

    Group 1 --> ProdA, ProdB, ProdC

    Group 2 --> ProdD, ProdE

    The only way I've got it working is:

    SELECT TOP 1 from column A and it's corresponding value from column B into variables

    SELECT columns A, B where variable A in column A or column B or variable B in column A or column B

    Dump results into a temp table

    Union both columns to create a single list, select distinct from there,

    Increment a counter, dump it into a grouping table with the counter value to create a group number,

    Delete everything from the product table where the value in A or B is in either column of the grouping table.

    Repeat until the product table is empty.

    But like I say, I think my solution is not at all smart.

    Any thoughts appreciated.

  • Recursion would usually be used for this Richard.

    Try the following:

    IF 0 = 1 BEGIN; -- Knock up some sample data

    WITH MyData (A, B) AS (

    SELECT 'ProdA', 'ProdB' UNION ALL

    SELECT 'ProdB', 'ProdA' UNION ALL

    SELECT 'ProdB', 'ProdC' UNION ALL

    SELECT 'ProdC', 'ProdB' UNION ALL

    SELECT 'ProdD', 'ProdE' UNION ALL

    SELECT 'ProdE', 'ProdD')

    SELECT * INTO #MyData FROM MyData; END;

    -- Get rid of dupe pairs

    WITH DedupedSet AS (

    SELECT DISTINCT x.*

    FROM #MyData

    CROSS APPLY (

    SELECT

    A = CASE WHEN A < B THEN A ELSE B END,

    B = CASE WHEN A < B THEN B ELSE A END

    ) x

    -- use recursion to resolve the hierarchy

    ), rCTE AS (

    SELECT d.A, d.B, Seq = CAST(A+'>'+B AS VARCHAR(50)), [Level] = 0

    FROM DedupedSet d

    WHERE NOT EXISTS (SELECT 1 FROM DedupedSet di WHERE di.B = d.A)

    UNION ALL

    SELECT nl.A, nl.B, Seq = CAST(ll.Seq +'>'+ nl.B AS VARCHAR(50)), [Level] = ll.[Level]+1

    FROM rCTE ll

    INNER JOIN DedupedSet nl

    ON nl.A = ll.B

    )

    SELECT Seq

    FROM rCTE ro

    WHERE NOT EXISTS (SELECT 1 FROM rCTE ri WHERE ri.[Level] > ro.[Level] AND ri.Seq LIKE ro.Seq+'%')

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • Thanks for that Chris....

    Right, so now I understand how a recursive CTE works. Doh! There's a certain point when I'm reading into things that my brain just gives up and I just hack and slash until stuff works :-D.

    Logically this is more or less what I'm already doing, but surprisingly it actually takes longer than my existing query, I think because I'm identifying and hiving off smaller record sets and deleting from the source table then doing a select top 1 to prime the next pass, rather than using 2 Exists checks (I started with a similar construct to your "WHERE not EXISTS" and dropped it).

    So, technically superior but the idiot seems to have the edge on performance.....

  • You could accelerate it by saving off DedupedSet as a temp table and indexing it - a clustered index on column A, possibly.

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • Does this help?

    http://stackoverflow.com/questions/2076715/challenge-how-to-implement-an-algorithm-for-six-degree-of-separation

    It sounds like the same root problem. If you can solve it efficiently, I am sure that Facebook would love to hear from you 🙂

  • Just curious if this is any easier?

    WITH MyData (A, B) AS (

    SELECT 'ProdA', 'ProdB' UNION ALL

    SELECT 'ProdB', 'ProdA' UNION ALL

    SELECT 'ProdB', 'ProdC' UNION ALL

    SELECT 'ProdC', 'ProdB' UNION ALL

    SELECT 'ProdD', 'ProdE' UNION ALL

    SELECT 'ProdE', 'ProdD'

    )

    SELECT *

    INTO #MyData

    FROM MyData

    WHERE A <> B;

    CREATE TABLE #TEMP_DATA (

    RN int IDENTITY(1, 1) NOT NULL PRIMARY KEY CLUSTERED,

    A varchar(10),

    B varchar(10)

    );

    CREATE NONCLUSTERED INDEX IX_TEMP_DATA_A_B ON #TEMP_DATA (

    A ASC,

    B ASC

    );

    CREATE NONCLUSTERED INDEX IX_TEMP_DATA_B_A ON #TEMP_DATA (

    B ASC,

    A ASC

    );

    INSERT INTO #TEMP_DATA (A, B)

    SELECT DISTINCT

    CASE WHEN A < B THEN A ELSE B END AS A,

    CASE WHEN B > A THEN B ELSE A END AS B

    FROM #MyData

    ORDER BY A, B;

    /*

    SELECT *

    FROM #TEMP_DATA

    ORDER BY A, B;

    */

    WITH CONNECTED_SETS AS (

    SELECT T.A, T.B,

    CASE

    WHEN T.RN = 1 THEN 1

    WHEN LAG(T.B, 1, NULL) OVER(ORDER BY T.RN) <> T.A THEN 1

    ELSE 0

    END AS GROUP_START,

    T.RN

    FROM #TEMP_DATA AS T

    ),

    GROUPS AS (

    SELECT CS.A, CS.B, CS.GROUP_START, CS.RN,

    SUM(CS.GROUP_START) OVER(ORDER BY CS.RN ROWS UNBOUNDED PRECEDING) AS GROUP_NUMBER

    FROM CONNECTED_SETS AS CS

    )

    SELECT GRP.GROUP_NUMBER, X.VALUE

    FROM (

    SELECT G.GROUP_NUMBER

    FROM GROUPS AS G

    GROUP BY G.GROUP_NUMBER

    ) AS GRP

    CROSS APPLY (

    SELECT DISTINCT G2.A AS VALUE

    FROM GROUPS AS G2

    WHERE G2.GROUP_NUMBER = GRP.GROUP_NUMBER

    UNION

    SELECT DISTINCT G3.B AS VALUE

    FROM GROUPS AS G3

    WHERE G3.GROUP_NUMBER = GRP.GROUP_NUMBER

    ) AS X

    ORDER BY GRP.GROUP_NUMBER, X.VALUE;

    DROP TABLE #MyData;

    DROP TABLE #TEMP_DATA;

    See the attached sqlplan file for performance comparison purposes.

    Steve (aka sgmunson) 🙂 🙂 🙂
    Rent Servers for Income (picks and shovels strategy)

  • richard.gardner 6009 (12/9/2016)


    Hmmm... OK, I can do this but only with some really expensive RBAR table operations, the script I've written takes 10 minutes to run (longer with cursors) and I can't help thinking I'm being a bit thick.

    So I have a simple paired list which contains relationships between two products, column A is paired with Column B.

    Column A ColumnB

    ProdA --> ProdB

    ProdB --> ProdA

    ProdB --> ProdC

    ProdC --> ProdB

    ProdD --> ProdE

    ProdE --> ProdD

    I don't know how many items are in a group (could be up to 20), and I also can't guarantee that every relationship is expressed explicitly which means chaining with self joins doesn't necessarily work, so for example in the above set of records there is no explicit relationship between ProdC and ProdA.

    Say what you like about the modelling of this, web programmers, don't get me started.

    Anyway, I have to group these records together - the above data creates two groups :

    Group 1 --> ProdA, ProdB, ProdC

    Group 2 --> ProdD, ProdE

    The only way I've got it working is:

    SELECT TOP 1 from column A and it's corresponding value from column B into variables

    SELECT columns A, B where variable A in column A or column B or variable B in column A or column B

    Dump results into a temp table

    Union both columns to create a single list, select distinct from there,

    Increment a counter, dump it into a grouping table with the counter value to create a group number,

    Delete everything from the product table where the value in A or B is in either column of the grouping table.

    Repeat until the product table is empty.

    But like I say, I think my solution is not at all smart.

    Any thoughts appreciated.

    How many rows in this table? Also, are products unique to their individual chain or can they appear in more than one chain?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Hey, Jeff Moden, you've taught me a lot over the years (you might not believe it looking at my code :-))...

    OK, so we're only looking at about 89,000 records, so I'm a little confused as to why we can't improve on my bludgeoning, Steve's solution ran in similar time to mine, but it only produces pairs of records. Because there didn't seem to be a time advantage I didn't trouble shoot it further. Chris's solution does work, I can see it dumping records out, but I think there's something askance in the recursion which is hitting performance, I cut the query after 15 minutes.

    By the nature of things if there is a link between two numbers both of those numbers are in the same group, so all groups are by definition mutually exclusive.

    So here's the dirty version which works (it takes 10 - 12 minutes), I've commented my second pass, it was the only way I could figure out how to resolve incomplete links. Note I've moved through CTES --> Temp tables --> actual tables to try and squeeze some performance out of it, but I can't polish it, I'm just rolling it in glitter 😉

    DECLARE @a int

    DECLARE @b-2 int

    DECLARE @grouper INT

    SET @a = NULL

    SET @b-2 = NULL

    SET @grouper = 1

    TRUNCATE TABLE HoldLinks

    TRUNCATE TABLE Holding

    WHILE EXISTS

    (

    SELECT TOP 1 product_id as A, linked_product_id as B

    FROM [ProductLinks]

    )

    BEGIN

    --Prime the procedure with the first two records from the recordset

    SELECT TOP 1 product_id as A, linked_product_id as B

    INTO #Primer

    FROM [ProductLinks]

    SET @a = (SELECT A FROM #Primer)

    SET @b-2 = (SELECT B FROM #Primer)

    --Insert all relevant records into holding table

    INSERT INTO Holding

    SELECT product_id as A, linked_product_id as B

    FROM ProductLinks

    WHERE product_id IN (@A, @b-2) OR linked_product_id IN (@A, @b-2)

    ;

    --Do a second pass with all the records currently in Holding to resolve situations where links are not complete

    --i.e. Record A is connected to B and C, C is connected to D but D is not connected to A or B, therefore single pass will not pick up D

    WITH Uniques

    as

    (

    SELECT DISTINCT A FROM Holding

    UNION

    SELECT DISTINCT B FROM Holding

    )

    INSERT INTO Holding

    SELECT U.A, P.linked_product_id as B

    FROM Uniques U

    LEFT JOIN ProductLinks P ON U.A = P.product_id

    INSERT INTO HoldLinks

    SELECT DISTINCT A, @grouper as [SwatchGroup] FROM Holding

    UNION

    SELECT DISTINCT B, @grouper as [SwatchGroup] FROM Holding

    DELETE FROM [ProductLinks] WHERE product_id IN (SELECT A FROM HoldLinks)

    DROP TABLE #Primer

    TRUNCATE TABLE Holding

    SET @grouper = @grouper + 1

    END

  • Hey Richard, if you post up your rCTE version, I'll figure out an index or two for you. rCTE's can be blazingly fast if the indexing is correct.

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • richard.gardner 6009 (12/21/2016)


    Hey, Jeff Moden, you've taught me a lot over the years (you might not believe it looking at my code :-))...

    Thank you for the kind feedback, Richard.

    OK, so we're only looking at about 89,000 records, so I'm a little confused as to why we can't improve on my bludgeoning, Steve's solution ran in similar time to mine, but it only produces pairs of records. Because there didn't seem to be a time advantage I didn't trouble shoot it further. Chris's solution does work, I can see it dumping records out, but I think there's something askance in the recursion which is hitting performance, I cut the query after 15 minutes.

    By the nature of things if there is a link between two numbers both of those numbers are in the same group, so all groups are by definition mutually exclusive.

    Excellent. That's the info I needed to setup for some tests... unless, with only 89K rows, you wanted to attach a ZIP file that I could import. Of course, no proprietary or PII info, please.

    So here's the dirty version which works (it takes 10 - 12 minutes), I've commented my second pass, it was the only way I could figure out how to resolve incomplete links. Note I've moved through CTES --> Temp tables --> actual tables to try and squeeze some performance out of it, but I can't polish it, I'm just rolling it in glitter 😉

    That will help as well... thanks.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Hi Jeff,

    Here's a sample dataset, my bad, 89,000 is a different number, it's just 50,000 pairs of internal keys.

    @chris-2 - Thanks for the offer, appreciated, I'll have another look at it, I did run it a couple of times and the first time I got an error >100 levels of recursion, second time it ran until I killed it.

    Regards

    Richard

  • Hi Richard, how many rows are you expecting back from this query?

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • Well I can't be precise because I reloaded the source data this morning and I haven't run it today, but yesterday it was 15,760 with maximum group sizes of between 2 and 20 (this is with the data organised as below):

    ID Group

    3998 1

    1409121

    7972 2

    263742

    263772

    263802

    266863

    266903

    266943

    266983

    9769 3

    270033

    270113

  • I also started with a recursive CTE but it never came to an end. A cursor seems to be reasonably fast in this case. On my laptop your script runs in 4:30 min, the script below took 2 seconds with your test dataset. I almost never use cursors so I'm not sure if LOCAL FAST_FORWARD is the fastest possible cursor in this case.

    CREATE TABLE #ProductSets

    (

    set_id INT NOT NULL,

    product_id INT NOT NULL

    );

    CREATE INDEX IX_ProductSet_product_id ON #ProductSets(product_id);

    DECLARE PL CURSOR LOCAL FAST_FORWARD FOR

    SELECT DISTINCT

    IIF(product_id < linked_product_id, product_id, linked_product_id) AS product_id,

    IIF(product_id < linked_product_id, linked_product_id, product_id) AS linked_product_id

    FROM

    dbo.ProductLinks

    ORDER BY

    product_id, linked_product_id;

    DECLARE @product_id INT;

    DECLARE @linked_product_id INT;

    DECLARE @set_counter INT = 0;

    DECLARE @set_id INT;

    OPEN PL

    FETCH NEXT FROM PL

    INTO @product_id, @linked_product_id;

    WHILE @@FETCH_STATUS = 0

    BEGIN

    SET @set_id = NULL;

    SELECT @set_id = set_id FROM #ProductSets WHERE product_id = @product_id

    IF (@set_id IS NULL)

    BEGIN

    SET @set_counter += 1;

    SET @set_id = @set_counter;

    INSERT INTO

    #ProductSets(set_id, product_id)

    VALUES

    (@set_id, @product_id),

    (@set_id, @linked_product_id)

    END

    ELSE

    BEGIN

    INSERT INTO

    #ProductSets(set_id, product_id)

    VALUES

    (@set_id, @linked_product_id)

    END

    FETCH NEXT FROM PL

    INTO @product_id, @linked_product_id;

    END;

    CLOSE PL;

    DEALLOCATE PL;

    PRINT CAST(GETDATE() AS VARCHAR(100));

    SELECT DISTINCT

    set_id,

    STUFF((

    SELECT

    ', ' + CAST(Q.product_id AS VARCHAR(10))

    FROM

    (

    SELECT DISTINCT

    PS2.product_id

    FROM

    #ProductSets PS2

    WHERE

    PS2.set_id = PS1.set_id

    ) Q

    FOR XML PATH('')

    ), 1, 2, '') product_set

    FROM

    #ProductSets PS1

    ORDER BY

    set_id

    DROP TABLE #ProductSets;

  • The sample data set has a row where both columns have the same value which messes up rCTE's unless you remove it. Even then, some of the parent values have at least 21 values linked in the sequence, for instance

    179440>179479>179481>179488>179490>179492>179494>179496>179498>179500>179502>179504>179530>179556>179566>179570>179572>179576>179578>179580>179582>179784

    Using longhand code to get a decent handle on the shape of the data, I get over 50 million rows output at 11 levels of linkage...

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

Viewing 15 posts - 1 through 15 (of 25 total)

You must be logged in to reply to this topic. Login to reply