Set-Based Solution Possible?

  • I used

    CREATE CLUSTERED INDEX IX_Yak ON TestData (StaffName, ReviewDate)

    Please reread my previous post with timings and execution plans involved.

    My suggestion only needs 32 reads whereas Sergiy needs 4385 reads for doing the same job (137 times as much).

    My suggestion averages at 2 ms whereas Sergiy averages at 395 ms (198 times as much).

    And you still have two more things to do with Sergiy's suggestion.

    1) Format the dates (could be done at crystal or reportings services report design)

    2) Decide what to put in the columns where NULL is present (could be done at crystal or reporting services report design)


    N 56°04'39.16"
    E 12°55'05.25"

  • Interesting.

    I am going to modify your suggestion as required for my needs and run it on production data against my modified Sergiy method and see what happens.

    Null fields and Date formatting I'm not concerned with handled in Crystal Report.

    The actual data returned is a concern though. When I run the code from your original post I get the following results:

    [font="Courier New"]Blow, Joe 2006-03-23 2007-03-13 2008-01-24

    Doe, Jane 2006-05-16 2007-03-12 2008-01-22

    Dunno, I 2006-12-10 2007-07-05

    2006-07-18 2007-06-28

    2006-11-19 2007-06-11

    2007-12-18

    2007-12-21

    2007-12-16 [/font]

    Which makes it appear as if Joe Blow and Jane Doe have only 3 reviews each, and the remainder were I Dunno's. I'll check your procedure to see if can modify accordingly though.


    In the beginning the Universe was created. This has made a lot of people very angry and has been widely regarded as a bad move.   Douglas Adams (1952-2001)

  • It appears I may need to modify my production code *sigh* thanks to Peter...

    After modifying his original post for SQL2000 method to sort properly it works great!

    Here's the modification:

    [font="Courier New"]Declare @Y0 int; Set @Y0 = Year(GetDate())

    Declare @Y1 int; Set @Y1 = @Y0-1

    Declare @Y2 int; Set @Y2 = @Y0-2

    Declare @MinYear varchar(12); Set @MinYear = Convert(varchar,@Y0-2)+'-01-01'

    Declare @MaxYear varchar(12); Set @MaxYear = Convert(varchar,@Y0+1)+'-01-01'

    Select Case When t3.theIndex=0 Then t3.StaffName Else '' End As StaffName,

    t3.[Prior_Year_2], t3.[Prior_Year_1], t3.[Current_Year]

    From (

    Select Top 100 Percent

    u.theIndex,

    u.StaffName,

    Max(Case When Year(u.ReviewDate)=@Y2 Then u.ReviewDate Else '' End) As [Prior_Year_2],

    Max(Case When Year(u.ReviewDate)=@Y1 Then u.ReviewDate Else '' End) As [Prior_Year_1],

    Max(Case When Year(u.ReviewDate)=@Y0 Then u.ReviewDate Else '' End) As [Current_Year]

    FROM (

    SELECT t1.StaffName,

    CONVERT(CHAR(10), t1.ReviewDate, 120) AS ReviewDate,

    ( SELECT COUNT(ReviewDate)

    FROM #TestData AS t2

    WHERE t2.StaffName = t1.StaffName

    AND YEAR(t2.ReviewDate) = YEAR(t1.ReviewDate)

    AND t2.ReviewDate < t1.ReviewDate

    AND t2.ReviewDate >= @MinYear

    AND t2.ReviewDate < @MaxYear) As theIndex

    FROM #TestData AS t1

    ) AS u

    Group By u.StaffName, u.theIndex

    Order By u.StaffName, u.theIndex

    ) As t3

    Go[/font]


    In the beginning the Universe was created. This has made a lot of people very angry and has been widely regarded as a bad move.   Douglas Adams (1952-2001)

  • Mike Nuessler (1/31/2008)


    Interesting.

    I am going to modify your suggestion as required for my needs and run it on production data against my modified Sergiy method and see what happens.

    When you tested, did you notice a significant difference between them?


    N 56°04'39.16"
    E 12°55'05.25"

  • Peter Larsson (1/31/2008)


    That TABLE SPOOL looks nasty to me. And 10 clustered index scans?

    I did not have TABLE SPOOL in my test runs.

    Are you sure you copied my code as it was?

    _____________
    Code for TallyGenerator

  • Peter Larsson (1/31/2008)


    When you tested, did you notice a significant difference between them?

    I just created a stored procedure for Peter's method and executed the two of them.

    Sergiy's proc ran first, and the Query cost relative to the batch was 97.3%

    Peter's proc's cost relative to the batch was obviously the other 2.7%

    With the ridiculously small amount of data this is being run on either would work for my purposes.

    (for that matter so does my original While..Continue temp table methods)

    However from what I'm gathering here from you guys who know more about this than me

    and the staggering differences in query costs (97% to 3%) it seems I should go with the

    one from Peter.

    That and there's no need for tally table or extra function, or the need to increase the Tally

    table join condition (which really would "never" happen in this particular environment but still)

    I see the table spool as well. Not a clue what that is though I need to do some reading..!


    In the beginning the Universe was created. This has made a lot of people very angry and has been widely regarded as a bad move.   Douglas Adams (1952-2001)

  • Mike Nuessler (1/31/2008)I see the table spool as well.

    Did you copy HASH JOIN?

    _____________
    Code for TallyGenerator

  • Sergiy (1/31/2008)


    Mike Nuessler (1/31/2008)I see the table spool as well.

    Did you copy HASH JOIN?

    I just tried it again. Copied straight from your post, pasted into QA, and ran it.

    Got one Table Spool. Out of curiousity I removed all "HASH" and ran it again and got 4 Table Spools. What it means - couldn't tell you.

    Either way I have gained a few tricks from everything posted and I think the end result was a bit from everyone. I still have to read up on Table Spools (links anyone?) which are apparently evil, and go over a bunch of older code to try out some of Carl's, Sergiy's, and Peter's methods.

    {edit - sorry, missed Gova - still like that little trick!}


    In the beginning the Universe was created. This has made a lot of people very angry and has been widely regarded as a bad move.   Douglas Adams (1952-2001)

  • Well - while the bases seem pretty well covered, so just remember this - costs aren't always what they're cracked up to be. They represent just one component for deciding what's "best". At least in my mind - that "cost" is usually an indicator of how much CPU effort it might take, although the reads tends to balance that out somewhat. Also - certain things (like DBCC system maintenance functions) are completely not counted in the costs, so they're not always telling the truth, or at least - not the whole truth.

    Another way to get a different read on that is to turn on IO statistics, which will show you metrics on the IO resources being used.

    finally - another side would be pure duration.

    depending on what your server is lacking most at the time, it may be better to go with a more "costly" query that executes faster, or one with somewhat higher IO. So - a low cost, high IO solution would stink if the drives are getting hammered by something else: a high-cost scenario might suck if the processor is busy all of the time, etc...

    In short - "best" is in the eye of the DBA. Your mileage may vary.

    That being said - both solutions you've been given look to be plenty viable...

    (you didn't expect this would an easy answer, did you?)

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • Mike Nuessler (1/31/2008)With the ridiculously small amount of data this is being run on either would work for my purposes.

    (for that matter so does my original While..Continue temp table methods)

    However from what I'm gathering here from you guys who know more about this than me

    and the staggering differences in query costs (97% to 3%) it seems I should go with the

    one from Peter.

    Key word is ridiculously small amount of data.

    My script is designed to be scalable up to 5k entries per person per year.

    Performance for hundreds reviews will be about the same as for 5.

    Peter's query builds "tally table" on fly and limits it to the actual number of reviews for the Staff Member. So, it's better for "ridiculously small amount of data".

    But problem with this approach is it uses correlated subquery.

    They name it "hidden cursor".

    As for any cursor performance is quite OK on beginning stage, when there are up to hundreds of rows.

    But expense of such query increases exponentially with increasing number of rows.

    Fortunately for those contractors they are far away from the company by that time (problems appear usually after 2-3 years) and it's not their problem. Current DBA is the one to blame for not keeping performance of good working system on required level.

    Yes, I agree, for small amounts those 5000 spare lines look like some kind of wasting resources.

    But I use to have this Tally table pinned in memory, so no actual reads happened here.

    And if you keen to remove those extra lines you may use this:

    [Code]DECLARE @N int

    SELECT @N = MAX (CNT)

    FROM (select COUNT(*) CNT

    FROM dbo.tTestData

    GROUP BY StaffName

    ) DT

    select CASE WHEN T.N = 1 THEN S.StaffName ELSE '' END,

    T.N, Y2006.ReviewDate [2006_Dates], Y2007.ReviewDate [2007_Dates], Y2008.ReviewDate [2008_Dates]

    from dbo.tTestData S

    INNER JOIN dbo.Tally T ON T.N > 0 AND T.N < @N

    -- Now number of rows retrieved from Tally table is limited to the total number of reviews for every staff member

    LEFT HASH JOIN dbo.tReviewsOfYear ('2006-01-01') Y2006 ON S.StaffName = Y2006.StaffName AND T.N = Y2006.ReviewNo

    LEFT HASH JOIN dbo.tReviewsOfYear ('2007-01-01') Y2007 ON S.StaffName = Y2007.StaffName AND T.N = Y2007.ReviewNo

    LEFT HASH JOIN dbo.tReviewsOfYear ('2008-01-01') Y2008 ON S.StaffName = Y2008.StaffName AND T.N = Y2008.ReviewNo

    WHERE Y2006.ReviewNo IS NOT NULL OR Y2007.ReviewNo IS NOT NULL OR Y2008.ReviewNo IS NOT NULL

    GROUP BY S.StaffName, T.N, Y2006.ReviewDate , Y2007.ReviewDate , Y2008.ReviewDate

    ORDER BY S.StaffName, T.N

    [/Code]

    P.S. And I checked execution plan.

    When I tested my script it was limitation to 1000 entries, not 5000. I did not check it for 5k thoroughly, and it appears there are some hidden reasons why table spool happens for 5000 lines and does not happen for 1000.

    _____________
    Code for TallyGenerator

  • Here is another approach with no ridiculus tally table pinned into memory, where more vital information can be stored and used.

    With the original sample data, there is now only 67 reads. It is not as good as 32 reads but still much better than 4385 reads.

    My original solution has a hidden triangular join (which Sergiy writes), which makes the algorithm equal to n*n + n internal iterations.

    This algorithm has 3*n internal iterations. It's quite better for a large number of sample data.

    I tested with 365 records per Staffname per year, 100 StaffNames per year and 20 years in total.

    It works VERY nice!

    Sergiy's solution is somewhat better than my original. It has n * m internal iterations.

    m is maximum number of entries for a staff and n is number of staff. So both his suggestion and my original tends to be power of 2 internal iterations.

    But this is about linear, 3 * n.

    And it is about 5 * n if you want to include all staff even if they have no reviews in the wanted date range.

    CREATE PROCEDURE dbo.uspGetStatistics

    (

    @BaseYear SMALLINT = NULL,

    @IncludeAllStaff BIT = 0

    )

    AS

    SET NOCOUNT ON

    DECLARE@MinDate SMALLDATETIME,

    @MaxDate SMALLDATETIME

    IF @BaseYear IS NULL

    SET @BaseYear = DATEPART(YEAR, GETDATE())

    SELECT@MaxDate = DATEADD(YEAR, @BaseYear - 1899, '19000101'),

    @MinDate = DATEADD(YEAR, -3, @MaxDate)

    CREATE TABLE#Stage

    (

    RowID INT IDENTITY(1, 1) PRIMARY KEY CLUSTERED,

    StaffName VARCHAR(40) NOT NULL,

    ReviewDate VARCHAR(10) NOT NULL

    )

    INSERT#Stage

    (

    StaffName,

    ReviewDate

    )

    SELECT DISTINCTStaffName,

    CONVERT(CHAR(10), ReviewDate, 120)

    FROMTestData

    WHEREReviewDate >= @MinDate

    AND ReviewDate < @MaxDate

    ORDER BYStaffName,

    CONVERT(CHAR(10), ReviewDate, 120) DESC

    IF @IncludeAllStaff = 1

    INSERT#Stage

    (

    StaffName,

    ReviewDate

    )

    SELECT DISTINCTt.StaffName,

    ''

    FROMTestData AS t

    LEFT JOIN#Stage AS s ON s.StaffName = t.StaffName

    WHEREs.StaffName IS NULL

    SELECTCASE WHEN u.theIndex = 0 THEN u.StafName ELSE '' END AS StaffName,

    MAX(CASE WHEN u.theYear = @BaseYear - 2 THEN u.ReviewDate ELSE '' END) AS y2,

    MAX(CASE WHEN u.theYear = @BaseYear - 1 THEN u.ReviewDate ELSE '' END) AS y1,

    MAX(CASE WHEN u.theYear = @BaseYear THEN u.ReviewDate ELSE '' END) AS y0

    FROM(

    SELECTs.StaffName AS StafName,

    m.theYear,

    s.ReviewDate,

    s.RowID - m.minRowID AS theIndex

    FROM#Stage AS s

    INNER JOIN(

    SELECTStaffName,

    LEFT(ReviewDate, 4) AS theYear,

    MIN(RowID) AS minRowID

    FROM#Stage

    GROUP BYStaffName,

    LEFT(ReviewDate, 4)

    ) AS m ON m.StaffName = s.StaffName

    WHEREs.ReviewDate LIKE m.theYear + '%'

    ) AS u

    GROUP BYu.StafName,

    u.theIndex

    ORDER BYu.StafName,

    u.theIndex

    DROP TABLE#Stage


    N 56°04'39.16"
    E 12°55'05.25"

  • Peter, my solution has 3 internal iterations: one per year.

    It does not iterate neither over staff members, nor over reviews.

    It builds 3 temp tables for each year and joins them together.

    _____________
    Code for TallyGenerator

  • These are my favourite kind of discussion. I'd be happy to say that 99% of SQL Server users probably couldn't have come up with either approach. The use of a tally table opens all kinds of possibilities and it often pays to consider it to solve this sort of problem. Having said that the other solution presented also looks elegant. If I had the time I'd try to do some benchmarking and do a thorough comparison of the approaches - but I don't. I'll certainly keep an eye on the thread though! 😀

    It's unfortunate that you can't get your reporting tool to do some of the heavy lifting to lay out the data? Had you considered making the report side of things slightly more complex to allow for easy SQL? As it stands now though you'd be crazy to make the report more complex given the nice SQL output you're now getting, but for a more complex problem in future sometimes it's easier/faster to fiddle with the reporting tool rather than getting SQL to do some of the formatting.... my 2c

  • And Mike, don't trust much those "Query cost" numbers in QA.

    Every time verify it with simple but 100% reliable check:

    [Code]

    DECLARE @Time datetime

    SET @Time = GETDATE()

    Set of queries No.1 here

    PRINT CONVERT(varchar(20), GETDATE() - @Time, 114)

    SET @Time = GETDATE()

    Set of queries No.2 here

    PRINT CONVERT(varchar(20), GETDATE() - @Time, 114)

    [/Code]

    I ran my code in line with Peter's code.

    In Execution plan my query took 99.69% of total cost.

    Guess what time meter indicated?

    00:00:00:093 - Sergiy's

    vs.

    00:00:00:157 - Peter's

    Here is the code I executed (just in case you can see a mistake I made somewhere):

    [Code]

    DBCC FREEPROCCACHE

    DECLARE @Time datetime

    SET @Time = GETDATE()

    DECLARE @N int

    SELECT @N = MAX (CNT)

    FROM (select COUNT(*) CNT

    FROM dbo.tTestData

    GROUP BY StaffName

    ) DT

    select CASE WHEN T.N = 1 THEN S.StaffName ELSE '' END,

    T.N, Y2006.ReviewDate [2006_Dates], Y2007.ReviewDate [2007_Dates], Y2008.ReviewDate [2008_Dates]

    from dbo.tTestData S

    INNER JOIN dbo.Tally T ON T.N > 0 AND T.N < @N

    -- If you expect more than 5000 reviews per year per customer increase this number

    LEFT HASH JOIN dbo.tReviewsOfYear ('2006-11-01') Y2006 ON S.StaffName = Y2006.StaffName AND T.N = Y2006.ReviewNo

    LEFT HASH JOIN dbo.tReviewsOfYear ('2007-01-01') Y2007 ON S.StaffName = Y2007.StaffName AND T.N = Y2007.ReviewNo

    LEFT HASH JOIN dbo.tReviewsOfYear ('2008-01-01') Y2008 ON S.StaffName = Y2008.StaffName AND T.N = Y2008.ReviewNo

    WHERE Y2006.ReviewNo IS NOT NULL OR Y2007.ReviewNo IS NOT NULL OR Y2008.ReviewNo IS NOT NULL

    GROUP BY S.StaffName, T.N, Y2006.ReviewDate , Y2007.ReviewDate , Y2008.ReviewDate

    ORDER BY S.StaffName, T.N

    PRINT CONVERT(varchar(20), GETDATE() - @Time, 114)

    SET @Time = GETDATE()

    -- CREATE PROCEDURE dbo.uspGetStatistics

    -- (

    -- @BaseYear SMALLINT = NULL,

    -- @IncludeAllStaff BIT = 0

    -- )

    -- AS

    --

    -- SET NOCOUNT ON

    DECLARE @BaseYear SMALLINT , @IncludeAllStaff BIT

    DECLARE@MinDate SMALLDATETIME,

    @MaxDate SMALLDATETIME

    IF @BaseYear IS NULL

    SET @BaseYear = DATEPART(YEAR, GETDATE())

    SELECT@MaxDate = DATEADD(YEAR, @BaseYear - 1899, '19000101'),

    @MinDate = DATEADD(YEAR, -3, @MaxDate)

    CREATE TABLE#Stage

    (

    RowID INT IDENTITY(1, 1) PRIMARY KEY CLUSTERED,

    StaffName VARCHAR(40) NOT NULL,

    ReviewDate VARCHAR(10) NOT NULL

    )

    INSERT#Stage

    (

    StaffName,

    ReviewDate

    )

    SELECT DISTINCTStaffName,

    CONVERT(CHAR(10), ReviewDate, 120)

    FROMtTestData

    WHEREReviewDate >= @MinDate

    AND ReviewDate < @MaxDate

    ORDER BYStaffName,

    CONVERT(CHAR(10), ReviewDate, 120) DESC

    IF @IncludeAllStaff = 1

    INSERT#Stage

    (

    StaffName,

    ReviewDate

    )

    SELECT DISTINCTt.StaffName,

    ''

    FROMtTestData AS t

    LEFT JOIN#Stage AS s ON s.StaffName = t.StaffName

    WHEREs.StaffName IS NULL

    SELECTCASE WHEN u.theIndex = 0 THEN u.StafName ELSE '' END AS StaffName,

    MAX(CASE WHEN u.theYear = @BaseYear - 2 THEN u.ReviewDate ELSE '' END) AS y2,

    MAX(CASE WHEN u.theYear = @BaseYear - 1 THEN u.ReviewDate ELSE '' END) AS y1,

    MAX(CASE WHEN u.theYear = @BaseYear THEN u.ReviewDate ELSE '' END) AS y0

    FROM(

    SELECTs.StaffName AS StafName,

    m.theYear,

    s.ReviewDate,

    s.RowID - m.minRowID AS theIndex

    FROM#Stage AS s

    INNER JOIN(

    SELECTStaffName,

    LEFT(ReviewDate, 4) AS theYear,

    MIN(RowID) AS minRowID

    FROM#Stage

    GROUP BYStaffName,

    LEFT(ReviewDate, 4)

    ) AS m ON m.StaffName = s.StaffName

    WHEREs.ReviewDate LIKE m.theYear + '%'

    ) AS u

    GROUP BYu.StafName,

    u.theIndex

    ORDER BYu.StafName,

    u.theIndex

    DROP TABLE#Stage

    PRINT CONVERT(varchar(20), GETDATE() - @Time, 114)

    SET @Time = GETDATE()

    [/Code]

    _____________
    Code for TallyGenerator

  • Sergiy (1/31/2008)


    Peter, my solution has 3 internal iterations: one per year.

    It does not iterate neither over staff members, nor over reviews.

    It builds 3 temp tables for each year and joins them together.

    from dbo.tTestData SINNER JOIN dbo.Tally T ON T.N > 0 AND T.N < @N


    N 56°04'39.16"
    E 12°55'05.25"

Viewing 15 posts - 16 through 30 (of 38 total)

You must be logged in to reply to this topic. Login to reply