# Aggregate Query question

• ChrisM@Work (6/21/2012)

David Webb-200187 (6/19/2012)

Hmmmm....

Let's say you had 50 rows, all with a quantity of 1. The target quantity was 5.

How would you want that brought back? All the possible combinations of those rows that added to 5?

Just the rows that could possibly be a component of the 5? You'd end up bringing them all back. What would that tell you? Would you want all the combinations grouped by some artificial designator (like a solution_number)?

I agree with the idea that t-sql isn't the best vehicle for generating solution combinations like this. A number crunching package like SAS or SPSS would be a better bet ( or find someone who has an old APL compiler).

1,125,899,906,842,624 distinct combinations in 50 rows

I'm really curious why you calculated 2^50 for this.

• jeffem (6/21/2012)

ChrisM@Work (6/21/2012)

...

1,125,899,906,842,624 distinct combinations in 50 rows

I'm really curious why you calculated 2^50 for this.

UniquePerms = POWER(2,n)-1 where n = rowcount.

Easy - I determined it by testing the first four or five rows, then confirmed it by running Dwain's code and mine.

Dwain's code, incidentally, is a flash of genius. It runs 2,500 times faster than mine, because it only works with the rows it has to.

“Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
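The POWER(2,n)-1 count quoted above can be cross-checked by brute force. This is an illustrative Python sketch (not part of the thread's T-SQL) that enumerates every non-empty subset directly:

```python
from itertools import combinations

def count_nonempty_subsets(n):
    # Enumerate every non-empty subset of n distinct items directly.
    return sum(1 for r in range(1, n + 1)
               for _ in combinations(range(n), r))

# Agrees with POWER(2,n)-1 for small n; at n = 50 the closed form gives
# 2**50 - 1 = 1,125,899,906,842,623 (one less than the 2^50 figure quoted).
for n in (1, 2, 3, 4, 10):
    assert count_nonempty_subsets(n) == 2**n - 1
```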

• ChrisM@Work (6/21/2012)

jeffem (6/21/2012)

...

UniquePerms = POWER(2,n)-1 where n = rowcount.

Easy - I determined it by testing the first four or five rows, then confirmed it by running Dwain's code and mine.

Dwain's code, incidentally, is a flash of genius. It runs 2,500 times faster than mine, because it only works with the rows it has to.

I will concede that I really couldn't follow your code, and that's not a knock against your code! 🙂 I just wasn't understanding your mindset with it.

The thing is, that's not the formula for permutations or combinations. With the "50 rows, 1 distinct value, total of 5" scenario above, the formula is 50!/45! if order of results creates uniqueness (for example, "1 2 3" is distinct from "2 3 1"). If order of results does NOT create uniqueness, then you would divide the result by 5!.

Your formula (except for the subtracting 1) is for when there are 2 possible outcomes/values, for 50 specific events/positions.
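The two counting rules being contrasted here can be checked directly. A small Python sketch (illustrative only, not from the thread):

```python
from math import comb, factorial

# Ordered picks of 5 rows from 50, where "1 2 3" differs from "2 3 1": 50!/45!
ordered = factorial(50) // factorial(45)

# If order does NOT create uniqueness, divide by 5! - equivalently C(50, 5).
unordered = ordered // factorial(5)

assert unordered == comb(50, 5)  # 2,118,760 distinct 5-row selections
```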

• jeffem (6/21/2012)

... if order of results creates uniqueness (for example, "1 2 3" is distinct from "2 3 1"). If order of results does NOT create uniqueness, then you would divide the result by 5!.

Your formula (except for the subtracting 1) is for when there are 2 possible outcomes/values, for 50 specific events/positions.

Order of results is insignificant; it won't affect the sum of the quantity on each row.

Consider this:

`1 row: A`

`2 rows POWER(2,2)-1: A, AB, B`

`3 rows POWER(2,3)-1: A, AB, ABC, AC, B, BC, C`

`4 rows POWER(2,4)-1: A, AB, ABC, ABCD, ABD, AC, ACD, AD, B, BC, BCD, BD, C, CD, D`

If the expression holds true for 1, 2, 3 and 4 rows, is there any reason why it should be different for 50 rows?
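The 4-row listing above can be reproduced mechanically; this Python sketch (illustrative only) generates exactly the same 15 tuples:

```python
from itertools import combinations

rows = "ABCD"
# Every non-empty subset, rendered the way the listing above shows them.
tuples = sorted("".join(c) for r in range(1, len(rows) + 1)
                for c in combinations(rows, r))

assert len(tuples) == 2**len(rows) - 1   # POWER(2,4)-1 = 15
assert tuples[:4] == ["A", "AB", "ABC", "ABCD"]
```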


• ChrisM@Work (6/21/2012)

jeffem (6/21/2012)

...

If the expression holds true for 1, 2, 3 and 4 rows, is there any reason why it should be different for 50 rows?

Okay, now I see the issue. We were thinking of different scenarios.

What I interpreted from the 50-row scenario was that there were 50 rows, each with a value of 1, and the requirement was a total of EXACTLY 5, not a total of 5 or less.

Thanks for clarifying!

• ChrisM@Work (6/21/2012)

Dwain's code, incidentally, is a flash of genius. It runs 2,500 times faster than mine, because it only works with the rows it has to.

Wow Chris! That's pretty high praise and I'm humbled by it. :blush: After I come back down to Earth, assuming I can get my head through the door, I'll print and frame this discussion, then confirm that wire transfer I promised.

I truly hope that the OP returns to say whether or not this solution works for his case, including the running time (I figure that will be the biggest issue). When I wrote the article I mentioned, I was a bit concerned that it may be a solution seeking a problem, consequently being consigned as a cobweb-cloaked curiosity.

I also want to dispel a possible misunderstanding propagated by me regarding the similarity of this to the Knapsack problem. The KP effectively operates on 3 constraints:

1. The number of combinations of items is only bounded by the number of items.

2. The weight of the items must be <= some value.

3. The value of the items is unbounded and you seek the solution of maximum value.

This problem also has 3 constraints, that are slightly different:

1. The number of combinations of items is bounded

2. The value (equivalent to weight in the KP) must be = some value.

3. There is no equivalent to value for the KP, so that third constraint can be considered unbounded.

For #1, this is the point I'm a bit unclear on from the discussion. If it is not true, then it is likely that with 2^50 combinations the run time may exceed anything realistic. Hopefully this is not something the OP would be running in Production every night!
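To make the contrast concrete, here is a brute-force Python sketch of the exact-target variant described above (names are mine, not from Dwain's article; enumeration like this is only viable for small n, which is exactly why the combination count matters):

```python
from itertools import combinations
from math import comb

def exact_sum_tuples(qtys, target):
    """All index combinations whose quantities sum to exactly `target`."""
    return [c for r in range(1, len(qtys) + 1)
              for c in combinations(range(len(qtys)), r)
              if sum(qtys[i] for i in c) == target]

# 6 rows of quantity 1, target exactly 2: C(6,2) = 15 solutions,
# out of 2**6 - 1 = 63 non-empty subsets overall.
assert len(exact_sum_tuples([1] * 6, 2)) == comb(6, 2)
```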

My mantra: No loops! No CURSORs! No RBAR! Hoo-uh!

My thought question: Have you ever been told that your query runs too fast?

INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

Need to UNPIVOT? Why not CROSS APPLY VALUES instead?
Since random numbers are too important to be left to chance, let's generate some!
Learn to understand recursive CTEs by example.

• dwain.c (6/21/2012)

ChrisM@Work (6/21/2012)

Dwain's code, incidentally, is a flash of genius. It runs 2,500 times faster than mine, because it only works with the rows it has to.

Wow Chris! That's pretty high praise and I'm humbled by it. :blush: After I come back down to Earth, assuming I can get my head through the door, I'll print and frame this discussion, then confirm that wire transfer I promised.

Heh not so fast, dude...you get a 25% performance hike by improving the join in the recursive part

`JOIN @t t ON t.Prod < LEFT(n.Tuples,4)`

which you can only see through the noise by taking off the filter 😀
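The reason the `LEFT(n.Tuples,4)` predicate alone suffices: each tuple is built in strictly ascending Prod order, so prepending only rows that sort before the current first element can never re-admit a member. A Python sketch of the same recursion (illustrative, mirroring the rCTE's logic rather than quoting it):

```python
def unique_n_tuples(prods):
    # Anchor leg: each row on its own, like the first SELECT in the rCTE.
    level = [[p] for p in sorted(prods)]
    out = list(level)
    # Recursive leg: prepend only rows sorting strictly before the
    # tuple's current first element (t.Prod < LEFT(n.Tuples,4)).
    while level:
        level = [[p] + t for t in level for p in prods if p < t[0]]
        out.extend(level)
    return out

tuples = unique_n_tuples(["0001", "0002", "0003", "0004"])
assert len(tuples) == 2**4 - 1                      # 15, every subset once
assert len({tuple(t) for t in tuples}) == len(tuples)  # no duplicates
```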


• ChrisM@Work (6/22/2012)

dwain.c (6/21/2012)

...

Heh not so fast, dude...you get a 25% performance hike by improving the join in the recursive part

`JOIN @t t ON t.Prod < LEFT(n.Tuples,4)`

which you can only see through the noise by taking off the filter 😀

Sorry, I don't get it. Are you talking my query here or yours?


• dwain.c (6/22/2012)

...

Sorry, I don't get it. Are you talking my query here or yours?

Yours:

```sql
SELECT 1, Tuples = CAST(Prod AS VARCHAR(8000)), value
FROM @t
UNION ALL
SELECT n.n+1, t.Prod + ',' + n.Tuples, n.value + t.value
FROM UNIQUEnTuples n
JOIN @t t ON t.Prod < LEFT(n.Tuples, 4)
WHERE t.value + n.value <= 10
```


• I see you also removed CHARINDEX. However, I can't verify your results. I get this running with n <= 2 in the recursive leg:

```
-- Original
(115681 row(s) affected)
 SQL Server Execution Times:
   CPU time = 9750 ms, elapsed time = 12049 ms.

-- Remove CHARINDEX and use LEFT
(115681 row(s) affected)
 SQL Server Execution Times:
   CPU time = 26676 ms, elapsed time = 28815 ms.

-- Remove CHARINDEX only
(117455 row(s) affected)
 SQL Server Execution Times:
   CPU time = 9516 ms, elapsed time = 11449 ms.
```

Removing only the CHARINDEX improves the speed but affects the result set's row count.


• dwain.c (6/22/2012)

...

Removing only the CHARINDEX improves the speed but is affecting the results set's row count.

Removing only the CHARINDEX ruins the join - you have to use LEFT() as above.
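Why dropping CHARINDEX without LEFT() inflates the row count: comparing t.Prod against the whole Tuples string lets the tuple's own first element slip back in, because a string sorts before its own extensions. A Python illustration (for these zero-padded values, Python's string ordering appears to work out the same way as the T-SQL comparison):

```python
prod = "0002"
tuples_str = "0002,0005"

# Against the full string, "0002" sorts before "0002,0005", so the join
# would re-admit a row already in the tuple, producing duplicates.
assert prod < tuples_str

# Against LEFT(n.Tuples,4), i.e. just the first key, it is blocked.
assert not (prod < tuples_str[:4])
```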


• Note that removing the check column as you have done does get you about a 10% speed improvement.

If you were right though, does that mean my version is 3125x faster than your original? 🙂


• dwain.c (6/22/2012)

Note that removing the check column as you have done does get you about a 10% speed improvement.

If you were right though, does that mean my version is 3125x faster than your original? 🙂

Yes it does; it's amazingly fast!

```sql
-----------------------------------------------------------------------------
-- play with sample data sets
-----------------------------------------------------------------------------
IF OBJECT_ID('tempdb..#Sample') IS NOT NULL DROP TABLE #Sample  -- guard so reruns don't error

CREATE TABLE #Sample (Item INT, Qty INT)

INSERT INTO #Sample (Item, Qty)
SELECT Item = n, Qty = ABS(CHECKSUM(NEWID())) % 10 + 1
FROM (SELECT TOP 50 n = ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM sys.all_columns) d

DECLARE @t TABLE (Prod VARCHAR(4) PRIMARY KEY CLUSTERED, value INT)

INSERT INTO @t
SELECT TOP 50 RIGHT('000' + CAST(Item AS VARCHAR(4)), 4), Qty
FROM #Sample

-----------------------------------------------------------------------------
PRINT '===== Dwain''s Query 1'
-----------------------------------------------------------------------------
SET STATISTICS IO, TIME ON

;WITH UNIQUEnTuples (n, Tuples, value)
AS (
    SELECT 1, Tuples = CAST(Prod AS VARCHAR(8000)), value
    FROM @t
    UNION ALL
    SELECT n.n+1, t.Prod + ',' + n.Tuples, n.value + t.value
    FROM UNIQUEnTuples n
    JOIN @t t ON t.Prod < LEFT(n.Tuples, 4)
    WHERE t.value + n.value <= 10
)
SELECT *
FROM UNIQUEnTuples
WHERE value = 10
ORDER BY n, Tuples

PRINT '===== Dwain''s Query 1'
SET STATISTICS IO, TIME OFF

-----------------------------------------------------------------------------
PRINT '===== Dwain''s Query 2'
-----------------------------------------------------------------------------
SET STATISTICS IO, TIME ON

;WITH UNIQUEnTuples (n, Tuples, value)
AS (
    SELECT 1, Tuples = CAST(Prod AS VARCHAR(8000)), value
    FROM @t
    UNION ALL
    SELECT n.n+1, t.Prod + ',' + n.Tuples, n.value + t.value
    FROM @t t
    JOIN UNIQUEnTuples n ON t.Prod < n.Tuples
    WHERE CHARINDEX(t.Prod, n.Tuples) = 0
    AND t.value + n.value <= 10
)
SELECT *
FROM UNIQUEnTuples
WHERE value = 10
ORDER BY n, Tuples

PRINT '===== Dwain''s Query 2'
SET STATISTICS IO, TIME OFF
```

Same results; the first query is significantly faster than the second.

I'm looking forward, like you, to using this elsewhere. There are very, very few cases where a rCTE beats a non-recursive functional equivalent, and this looks like one of them. Let's see if JC or some other "SQL Smartie" can come up with anything. Problem with JC is he's too busy plugging his books to spend time on a challenge like this.


• After multiple runs, I am starting to see an improvement using your modification over the original! Great work Chris!

When I posted the article, Jeff took a shot at beating this query using a loop approach. He said that he was able to do much better on CPU and IOs but couldn't consistently beat the elapsed time.

I'm looking forward to anybody improving this approach! As you said, I think it has a lot of applications in resource allocation problems. I heard earlier on this thread folks saying how this isn't a problem SQL is designed to handle. I'm hoping that with a team effort we may be able to find new applications that SQL can address and do it reasonably well.


• dwain.c (6/22/2012)

After multiple runs, I am starting to see an improvement using your modification over the original! Great work Chris!

...

After testing several permutations of the join, the fastest I can find is this:

```sql
FROM @t t
JOIN UNIQUEnTuples n
  ON t.Prod < LEFT(n.Tuples, 4)
WHERE t.value + n.value <= 10
```

which, averaged over 20 runs, executes in about 66% of the time of the original join - and is always faster, on every run.

"this isn't a problem SQL is designed to handle"

I couldn't agree more. It translates better as "this isn't a problem I'm comfortable dealing with". Better tools enable us to tackle problems which would have been impossible or too tedious to contemplate in say SQL Server 2000.
