Set-Based Solution Possible?

  • Here is how I both created the sample data and time tested the suggestions...

    SET NOCOUNT ON

    -- Create sample data
    CREATE TABLE TestData
    (
        StaffName VARCHAR(40),
        ReviewDate SMALLDATETIME
    )
    GO

    -- Populate the table with 512000 records (100 * 256 * 20)
    INSERT TestData
    (
        StaffName,
        ReviewDate
    )
    SELECT sn.Number,
           36524 + ABS(CHECKSUM(NEWID())) % 7304
    FROM master..spt_values AS sn
    INNER JOIN master..spt_values AS rd ON rd.Type = 'p'
    INNER JOIN master..spt_values AS y ON y.Type = 'p'
    WHERE sn.Type = 'p'
        AND sn.Number < 100
        AND rd.Number < 256
        AND y.Number < 20
    GO

    CREATE CLUSTERED INDEX IX_TestData_StaffName_ReviewDate ON TestData (StaffName, ReviewDate)
    GO
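    -- Side note (not part of the timed tests): a quick sanity check of the random-date
    -- expression above. An integer converts to SMALLDATETIME as days since 1900-01-01,
    -- so day 36524 is 2000-01-01 and the modulo 7304 spreads the dates over roughly
    -- the following 20 years.
    SELECT CAST(36524 AS SMALLDATETIME) AS EarliestReviewDate,
           CAST(36524 + 7303 AS SMALLDATETIME) AS LatestReviewDate
    GO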

    -- Create Sergiy's tally table with 65536 records
    CREATE TABLE Tally
    (
        N INT PRIMARY KEY CLUSTERED
    )

    INSERT Tally
    (
        N
    )
    SELECT DISTINCT 256 * v1.Number + v2.Number
    FROM master..spt_values AS v1
    INNER JOIN master..spt_values AS v2 ON v2.Type = 'p'
    WHERE v1.Type = 'p'
        AND v1.Number < 256
        AND v2.Number < 256
    GO
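    -- Side note (not part of the timed tests): verify the tally table. 256 * 256
    -- combinations should give exactly 65536 rows numbered 0 through 65535.
    SELECT COUNT(*) AS Cnt, MIN(N) AS MinN, MAX(N) AS MaxN
    FROM Tally
    GO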

    -- Create Sergiy's function
    CREATE FUNCTION dbo.tReviewsOfYear
    (
        @ReviewYear SMALLDATETIME
    )
    RETURNS @reviews TABLE (
        ID INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
        StaffName VARCHAR(40),
        ReviewNo INT NULL,
        ReviewDate SMALLDATETIME NOT NULL
    )
    AS
    BEGIN
        -- To make sure we deal with a "year only" value
        SET @ReviewYear = DATEADD(YY, DATEDIFF(YY, 0, @ReviewYear), 0)

        DECLARE @StaffName VARCHAR(200)
        DECLARE @Count INT

        INSERT INTO @reviews (StaffName, ReviewDate)
        SELECT StaffName, ReviewDate
        FROM TestData
        WHERE ReviewDate >= @ReviewYear
            AND ReviewDate < DATEADD(YY, 1, @ReviewYear)
        ORDER BY StaffName, ReviewDate

        -- Sequential update driven by PRIMARY KEY
        UPDATE R
        SET @Count = ReviewNo = CASE WHEN @StaffName = StaffName THEN @Count + 1 ELSE 1 END,
            @StaffName = StaffName
        FROM @reviews R

        RETURN
    END
    GO
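    -- Side note (not part of the timed tests): this is how the function is called on
    -- its own. For each staff member it returns that year's reviews numbered 1, 2, 3...
    -- in date order, produced by the sequential UPDATE above.
    SELECT StaffName, ReviewNo, ReviewDate
    FROM dbo.tReviewsOfYear('2007-01-01')
    ORDER BY StaffName, ReviewNo
    GO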

    -- Prepare time testing
    DBCC FREEPROCCACHE
    DBCC DROPCLEANBUFFERS

    DECLARE @Time DATETIME
    SET @Time = GETDATE()

    -- Now we time test Sergiy's suggestion
    DECLARE @N INT

    SELECT @N = MAX(CNT)
    FROM (
        SELECT COUNT(*) AS CNT
        FROM TestData
        GROUP BY StaffName
    ) DT

    SELECT CASE WHEN T.N = 1 THEN S.StaffName ELSE '' END,
           T.N, Y2006.ReviewDate AS [2006_Dates], Y2007.ReviewDate AS [2007_Dates], Y2008.ReviewDate AS [2008_Dates]
    FROM TestData S
    INNER JOIN Tally T ON T.N > 0 AND T.N < @N
    -- If you expect more than 5000 reviews per year per customer, increase this number
    LEFT HASH JOIN dbo.tReviewsOfYear('2006-11-01') Y2006 ON S.StaffName = Y2006.StaffName AND T.N = Y2006.ReviewNo
    LEFT HASH JOIN dbo.tReviewsOfYear('2007-01-01') Y2007 ON S.StaffName = Y2007.StaffName AND T.N = Y2007.ReviewNo
    LEFT HASH JOIN dbo.tReviewsOfYear('2008-01-01') Y2008 ON S.StaffName = Y2008.StaffName AND T.N = Y2008.ReviewNo
    WHERE Y2006.ReviewNo IS NOT NULL OR Y2007.ReviewNo IS NOT NULL OR Y2008.ReviewNo IS NOT NULL
    GROUP BY S.StaffName, T.N, Y2006.ReviewDate, Y2007.ReviewDate, Y2008.ReviewDate
    ORDER BY S.StaffName, T.N

    -- Now we display the time for Sergiy's suggestion
    PRINT 'Sergiy time taken ' + CONVERT(VARCHAR(20), DATEDIFF(MILLISECOND, @Time, GETDATE())) + ' milliseconds.'

    -- Prepare time testing
    DBCC FREEPROCCACHE
    DBCC DROPCLEANBUFFERS

    -- Now we time test Peso's suggestion
    SET @Time = GETDATE()

    DECLARE @BaseYear SMALLINT, @IncludeAllStaff BIT
    DECLARE @MinDate SMALLDATETIME,
            @MaxDate SMALLDATETIME

    IF @BaseYear IS NULL
        SET @BaseYear = DATEPART(YEAR, GETDATE())

    SELECT @MaxDate = DATEADD(YEAR, @BaseYear - 1899, '19000101'),
           @MinDate = DATEADD(YEAR, -3, @MaxDate)

    CREATE TABLE #Stage
    (
        RowID INT IDENTITY(1, 1) PRIMARY KEY CLUSTERED,
        StaffName VARCHAR(40) NOT NULL,
        ReviewDate VARCHAR(10) NOT NULL
    )

    INSERT #Stage
    (
        StaffName,
        ReviewDate
    )
    SELECT DISTINCT StaffName,
           CONVERT(CHAR(10), ReviewDate, 120)
    FROM TestData
    WHERE ReviewDate >= @MinDate
        AND ReviewDate < @MaxDate
    ORDER BY StaffName,
             CONVERT(CHAR(10), ReviewDate, 120) DESC

    IF @IncludeAllStaff = 1
        INSERT #Stage
        (
            StaffName,
            ReviewDate
        )
        SELECT DISTINCT t.StaffName,
               ''
        FROM TestData AS t
        LEFT JOIN #Stage AS s ON s.StaffName = t.StaffName
        WHERE s.StaffName IS NULL

    SELECT CASE WHEN u.theIndex = 0 THEN u.StaffName ELSE '' END AS StaffName,
           MAX(CASE WHEN u.theYear = @BaseYear - 2 THEN u.ReviewDate ELSE '' END) AS y2,
           MAX(CASE WHEN u.theYear = @BaseYear - 1 THEN u.ReviewDate ELSE '' END) AS y1,
           MAX(CASE WHEN u.theYear = @BaseYear THEN u.ReviewDate ELSE '' END) AS y0
    FROM (
        SELECT s.StaffName,
               m.theYear,
               s.ReviewDate,
               s.RowID - m.minRowID AS theIndex
        FROM #Stage AS s
        INNER JOIN (
            SELECT StaffName,
                   LEFT(ReviewDate, 4) AS theYear,
                   MIN(RowID) AS minRowID
            FROM #Stage
            GROUP BY StaffName,
                     LEFT(ReviewDate, 4)
        ) AS m ON m.StaffName = s.StaffName
        WHERE s.ReviewDate LIKE m.theYear + '%'
    ) AS u
    GROUP BY u.StaffName,
             u.theIndex
    ORDER BY u.StaffName,
             u.theIndex

    DROP TABLE #Stage

    -- Now we display the time for Peso's suggestion
    PRINT 'Peso time taken ' + CONVERT(VARCHAR(20), DATEDIFF(MILLISECOND, @Time, GETDATE())) + ' milliseconds.'
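    As an aside: on SQL Server 2005 or later, the per-staff, per-year numbering that both suggestions build by hand can be written directly with ROW_NUMBER(). A minimal sketch against the same TestData table, shown for comparison only and not part of the timed tests:

    SELECT StaffName,
           YEAR(ReviewDate) AS theYear,
           ROW_NUMBER() OVER (PARTITION BY StaffName, YEAR(ReviewDate)
                              ORDER BY ReviewDate) AS ReviewNo,
           ReviewDate
    FROM TestData
    ORDER BY StaffName, theYear, ReviewNo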

    Well... how can I put this nicely?

    I ran my code 60 times, and the average time on my computer was 2,200 milliseconds (yes, only 2.2 seconds!).

    The average number of reads was 115,000.

    Meanwhile, Sergiy's suggestion is still on its first run. I promise I will get back and post the time taken and average reads.



  • I have to break my promise.

    After more than 30 minutes on the first run of Sergiy's suggestion, I had to cancel the query.

    My Profiler trace then showed this result for Sergiy:

    Duration - 1,874,907
    Reads - 10,773,782



  • I took the liberty of optimizing Sergiy's suggestion a little. Two things:

    SELECT @N = MAX(CNT)
    FROM (
        SELECT COUNT(*) AS CNT
        FROM TestData
        GROUP BY StaffName, YEAR(ReviewDate)
    ) DT

    and

    INNER JOIN Tally T ON T.N > 0 AND T.N <= @N

    Grouping by year as well keeps @N down to the largest number of reviews per staff member and year, and the <= makes sure the last review number is not dropped. Not much improvement, though. Only 10%.



  • I did some tests with SET STATISTICS IO ON and got these results for the original 15 records.

    Sergiy

    Table 'TestData'. Scan count 1, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'TestData'. Scan count 1, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'Worktable'. Scan count 2, logical reads 48, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'Tally'. Scan count 2, logical reads 6, physical reads 1, read-ahead reads 7, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table '#336AA144'. Scan count 1, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table '#318258D2'. Scan count 1, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table '#2F9A1060'. Scan count 1, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    Sergiy time taken 63 milliseconds.

    Peso

    Table '#Stage'. Scan count 0, logical reads 31, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'TestData'. Scan count 1, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table '#Stage'. Scan count 2, logical reads 4, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    Peso time taken 46 milliseconds.

    My STATISTICS IO results for 512,000 records of sample data are:

    Table 'TestData'. Scan count 3, logical reads 1543, physical reads 0, read-ahead reads 16, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table '#Stage'. Scan count 0, logical reads 111250, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table '#Stage'. Scan count 2, logical reads 424, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    Peso time taken 2250 milliseconds.



  • And just for the fun of it (to prove Sergiy wrong about my original suggestion with the hidden triangular join, which he didn't like and wasn't afraid to say so), I time tested that too.

    Over 40 runs, the time taken averaged 12.1 seconds and the reads averaged 820,000.

    That is still a hell of a lot better than Sergiy's suggestion with its tally table, function, and join hints!

    The STATISTICS IO results for my original suggestion with 512,000 sample records were:

    Table 'TestData'. Scan count 55103, logical reads 295419, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'Worktable'. Scan count 131558, logical reads 520202, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.



  • I tried to rerun Sergiy's suggestion, and this time my server died after 2 hours and 7 minutes.

    There was no longer any space left to expand my tempdb!

    sp_spaceused testdata

    name      rows     reserved   data       index_size   unused
    --------  -------  ---------  ---------  -----------  ------
    TestData  512000   11272 KB   11160 KB   80 KB        32 KB

    When I checked the size of tempdb, I realized it was now 124,000 MB!

    For a table 11 MB in size.
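    For anyone who wants to watch tempdb balloon like this, one way to check its size (a sketch assuming SQL Server 2005 or later, where tempdb.sys.database_files is available) is:

    SELECT name,
           size * 8 / 1024 AS size_mb  -- size is stored as a count of 8 KB pages
    FROM tempdb.sys.database_files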



  • I have now come to the conclusion that I must create less sample data, so I started over with 4,000 records (20 × 20 × 10):

    INSERT TestData
    (
        StaffName,
        ReviewDate
    )
    SELECT sn.Number,
           36524 + ABS(CHECKSUM(NEWID())) % 7304
    FROM master..spt_values AS sn
    INNER JOIN master..spt_values AS rd ON rd.Type = 'p'
    INNER JOIN master..spt_values AS y ON y.Type = 'p'
    WHERE sn.Type = 'p'
        AND sn.Number < 20
        AND rd.Number < 20
        AND y.Number < 10

    Now Sergiy's suggestion finished with an average of 5.1 s and 15,000 reads.

    My suggestion averaged 0.2 s and 1,600 reads.

    Then I went on to create 18,000 sample records (30 × 30 × 20) like this:

    INSERT TestData
    (
        StaffName,
        ReviewDate
    )
    SELECT sn.Number,
           36524 + ABS(CHECKSUM(NEWID())) % 7304
    FROM master..spt_values AS sn
    INNER JOIN master..spt_values AS rd ON rd.Type = 'p'
    INNER JOIN master..spt_values AS y ON y.Type = 'p'
    WHERE sn.Type = 'p'
        AND sn.Number < 30
        AND rd.Number < 30
        AND y.Number < 20

    Now Sergiy's suggestion averaged 120 s and 143,000 reads.

    My suggestion averaged 0.3 seconds and 5,600 reads.

    As you can see, Sergiy's suggestion doesn't scale well. Far from it.

    4.5 times the sample data (from 4,000 to 18,000 records) makes his suggestion need 9.5 times the reads and 24 times the time taken.

    With 512,000 sample records (128 times more than 4,000 records), my suggestion took 115,000 reads and 2.2 seconds. That is only 72 times more reads and 11 times the time taken.

    Compared to 18,000 records (512,000 is about 28 times more), that is only 20 times more reads and 7 times the time taken.

    So my suggestion scales very well! In practice it scales nearly linearly: double the sample data, double the time!

    What if we could estimate the time Sergiy's suggestion would take with 512,000 sample records?

    To do this, we need a predictable set of sample data. I created it with:

    INSERT TestData
    (
        StaffName,
        ReviewDate
    )
    SELECT sn.Number,
           36524 + ABS(CHECKSUM(NEWID())) % 7304
    FROM master..spt_values AS sn
    INNER JOIN master..spt_values AS rd ON rd.Type = 'p'
    INNER JOIN master..spt_values AS y ON y.Type = 'p'
    WHERE sn.Type = 'p'
        AND sn.Number < 50
        AND rd.Number < 40
        AND y.Number < 20

    That is a little more than double the previous number of sample records (50 × 40 × 20 = 40,000 records, and that made tempdb expand to 2,300 MB!). And as I suspected, Sergiy's suggestion now took 308 s and used 400,000 reads.

    So 40,000 sample records (just over 2 times the 18,000 records) means 2.5 times the time taken and about 3 times the reads.

    Extrapolating, for a sample set of 512,000 records Sergiy's suggestion would take about 23 hours...

    Now I am just waiting for Sergiy to tell me I am using "a ridiculously large amount of sample data".

    Peso algorithm

    Sample records   Time taken   Used reads
    ------------------------------------------------
    4,000            0.2 s        1,600 reads
    18,000           0.3 s        5,600 reads
    512,000          2.2 s        115,000 reads
    1,600,000        5.6 s        235,000 reads     (same time as Sergiy had, but for 4,000 records only)
    40,000,000       126.0 s      3,266,000 reads   (same time as Sergiy had, but for 18,000 records only)



  • Sergiy, I think you are reasonably skilled at many things in Microsoft SQL Server.

    But on this specific topic, I think I have proved that it is not one of your strengths.



  • I think Peter has proven his case here.

    Thank you to everyone for the suggestions; in the long run, every one of them is valuable for improving my SQL skills.

    Cheers, Mike


    In the beginning the Universe was created. This has made a lot of people very angry and has been widely regarded as a bad move.   Douglas Adams (1952-2001)
