Getting the closest date to a given date

  • Hi all,

    Got the following problem situation:

    CREATE TABLE #Test

    (

    ID INT,

    Date DATETIME

    )

    CREATE TABLE #Test2

    (

    ID INT,

    Date DATETIME,

    Val INT

    )

    INSERT INTO #Test (ID, Date) VALUES (1, '2012-09-01')

    INSERT INTO #Test (ID, Date) VALUES (1, '2012-09-02')

    INSERT INTO #Test (ID, Date) VALUES (1, '2012-09-04')

    INSERT INTO #Test (ID, Date) VALUES (2, '2012-09-05')

    INSERT INTO #Test (ID, Date) VALUES (2, '2012-09-08')

    INSERT INTO #Test (ID, Date) VALUES (3, '2012-09-15')

    INSERT INTO #Test2 (ID, Date, Val) VALUES (1, '2012-09-01', 2)

    INSERT INTO #Test2 (ID, Date, Val) VALUES (1, '2012-09-02', 2)

    INSERT INTO #Test2 (ID, Date, Val) VALUES (1, '2012-09-03', 2)

    INSERT INTO #Test2 (ID, Date, Val) VALUES (1, '2012-09-04', 3)

    INSERT INTO #Test2 (ID, Date, Val) VALUES (1, '2012-09-05', 3)

    INSERT INTO #Test2 (ID, Date, Val) VALUES (2, '2012-09-05', 2)

    INSERT INTO #Test2 (ID, Date, Val) VALUES (2, '2012-09-06', 4)

    INSERT INTO #Test2 (ID, Date, Val) VALUES (3, '2012-09-08', 2)

    INSERT INTO #Test2 (ID, Date, Val) VALUES (3, '2012-09-09', 5)

    INSERT INTO #Test2 (ID, Date, Val) VALUES (3, '2012-09-20', 5)

    Basically, I have one table that has a value and a date. For each record in that table, I want to find the value in the second table which corresponds to the date that is closest to the date from the first table.

    Here's the solution that I've employed:

    WITH cte AS

    (

    SELECT

    t.ID,

    t.Date,

    p.Val,

    ROW_NUMBER() OVER (PARTITION BY t.ID, t.Date ORDER BY ABS(DATEDIFF(dd, t.Date, p.Date))) AS rowNum

    FROM #Test t

    LEFT JOIN #Test2 p ON t.ID = p.ID

    )

    SELECT

    ID,

    Date,

    Val

    FROM cte

    WHERE rowNum = 1

    This worked fine when I had only a few hundred records in my first table. The problem is that when the number of records goes up, the amount of time increases very sharply. Right now, that cte is returning me about a million records, and is taking around 30 seconds to run. I need to bring it down to around 2-3 seconds ideally.

    Any suggestions? As it stands, there is an index on the second table, on the field ID, which includes Date and Val as included columns. On the first table, there's a clustered index on the ID field.

  • ... Right now, that cte is returning me about a million records, and is taking around 30 seconds to run. I need to bring it down to around 2-3 seconds ideally.

    ...

    You need to find where the most of time is spent. I would guess that is spend on creating product of JOIN between your two tables on ID's.

    Can you measure how long is takes just to join your two tables?

    SELECT t1.*, t2.*

    FROM T1 JOIN T2 ON T1.ID=T2.ID

    _____________________________________________
    "The only true wisdom is in knowing you know nothing"
    "O skol'ko nam otkrytiy chudnyh prevnosit microsofta duh!":-D
    (So many miracle inventions provided by MS to us...)

    How to post your question to get the best and quick help[/url]

  • It's certainly that part which is taking the most time. I just don't know how else to go about it other than using the approach I've listed.

  • You can do it with a subquery but I tried that and it's slower. Your query is pretty good.

    But there is another approach if you're running more than one processor and all you are concerned with is elapsed time. Try this:

    SELECT ID, Date, Val

    FROM #Test a

    CROSS APPLY (

    SELECT TOP 1 Val

    FROM #Test2 b

    WHERE a.ID = b.ID

    ORDER BY ABS(DATEDIFF(day, a.Date, b.Date))) b

    I put it into a test harness with ~3,000,000 rows in #Test and ~5,000,000 rows in #Test2:

    CREATE TABLE #Test

    (

    ID INT,

    Date DATETIME

    )

    CREATE CLUSTERED INDEX TEST ON #Test (ID, Date);

    CREATE TABLE #Test2

    (

    ID INT,

    Date DATETIME,

    Val INT

    )

    CREATE INDEX TEST2 ON #Test2 (ID) INCLUDE(Date, Val)

    INSERT INTO #Test (ID, Date) VALUES (1, '2012-09-01')

    INSERT INTO #Test (ID, Date) VALUES (1, '2012-09-02')

    INSERT INTO #Test (ID, Date) VALUES (1, '2012-09-04')

    INSERT INTO #Test (ID, Date) VALUES (2, '2012-09-05')

    INSERT INTO #Test (ID, Date) VALUES (2, '2012-09-08')

    INSERT INTO #Test (ID, Date) VALUES (3, '2012-09-15')

    INSERT INTO #Test2 (ID, Date, Val) VALUES (1, '2012-09-01', 2)

    INSERT INTO #Test2 (ID, Date, Val) VALUES (1, '2012-09-02', 2)

    INSERT INTO #Test2 (ID, Date, Val) VALUES (1, '2012-09-03', 2)

    INSERT INTO #Test2 (ID, Date, Val) VALUES (1, '2012-09-04', 3)

    INSERT INTO #Test2 (ID, Date, Val) VALUES (1, '2012-09-05', 3)

    INSERT INTO #Test2 (ID, Date, Val) VALUES (2, '2012-09-05', 2)

    INSERT INTO #Test2 (ID, Date, Val) VALUES (2, '2012-09-06', 4)

    INSERT INTO #Test2 (ID, Date, Val) VALUES (3, '2012-09-08', 2)

    INSERT INTO #Test2 (ID, Date, Val) VALUES (3, '2012-09-09', 5)

    INSERT INTO #Test2 (ID, Date, Val) VALUES (3, '2012-09-20', 5)

    ;WITH Tally (n) AS (

    SELECT TOP 500000 ROW_NUMBER() OVER (ORDER BY (SELECT NULL))

    FROM sys.all_columns a CROSS JOIN sys.all_columns b)

    INSERT INTO #Test

    SELECT ID + 10*n, Date

    FROM #Test

    CROSS APPLY Tally

    ;WITH Tally (n) AS (

    SELECT TOP 500000 ROW_NUMBER() OVER (ORDER BY (SELECT NULL))

    FROM sys.all_columns a CROSS JOIN sys.all_columns b)

    INSERT INTO #Test2

    SELECT ID + 10*n, Date, VAL + n

    FROM #Test2

    CROSS APPLY Tally

    PRINT '-- CTE query'

    SET STATISTICS TIME ON

    ;WITH cte AS

    (

    SELECT

    t.ID,

    t.Date,

    p.Val,

    ROW_NUMBER() OVER (PARTITION BY t.ID, t.Date ORDER BY ABS(DATEDIFF(dd, t.Date, p.Date))) AS rowNum

    FROM #Test t

    LEFT JOIN #Test2 p ON t.ID = p.ID

    )

    SELECT

    ID,

    Date,

    Val

    FROM cte

    WHERE rowNum = 1

    SET STATISTICS TIME OFF

    PRINT '-- CROSS APPLY query'

    SET STATISTICS TIME ON

    SELECT ID, Date, Val

    FROM #Test a

    CROSS APPLY (

    SELECT TOP 1 Val

    FROM #Test2 b

    WHERE a.ID = b.ID

    ORDER BY ABS(DATEDIFF(day, a.Date, b.Date))) b

    SET STATISTICS TIME OFF

    DROP TABLE #Test, #Test2

    And these are the results I got:

    (3000000 row(s) affected)

    (5000000 row(s) affected)

    -- CTE query

    (3000006 row(s) affected)

    SQL Server Execution Times:

    CPU time = 35474 ms, elapsed time = 79574 ms.

    -- CROSS APPLY query

    (3000006 row(s) affected)

    SQL Server Execution Times:

    CPU time = 77875 ms, elapsed time = 52099 ms.

    Notice how CPU is higher than elapsed time? That's because SQL is parallelizing the query so you end up burning more CPU but cutting down your elapsed time.

    Perhaps this will work for you.


    My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]

    My thought question: Have you ever been told that your query runs too fast?

    My advice:
    INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
    The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

    Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
    Since random numbers are too important to be left to chance, let's generate some![/url]
    Learn to understand recursive CTEs by example.[/url]
    [url url=http://www.sqlservercentral.com/articles/St

Viewing 4 posts - 1 through 3 (of 3 total)

You must be logged in to reply to this topic. Login to reply