Getting the closest date to a given date

Question

Getting the closest date to a given date

kramaswamy

SSCoach

Points: 18181
More actions
October 18, 2012 at 9:44 am

#283169

Hi all,
Got the following problem situation:
CREATE TABLE #Test
(
ID INT,
Date DATETIME
)
CREATE TABLE #Test2
(
ID INT,
Date DATETIME,
Val INT
)
INSERT INTO #Test (ID, Date) VALUES (1, '2012-09-01')
INSERT INTO #Test (ID, Date) VALUES (1, '2012-09-02')
INSERT INTO #Test (ID, Date) VALUES (1, '2012-09-04')
INSERT INTO #Test (ID, Date) VALUES (2, '2012-09-05')
INSERT INTO #Test (ID, Date) VALUES (2, '2012-09-08')
INSERT INTO #Test (ID, Date) VALUES (3, '2012-09-15')
INSERT INTO #Test2 (ID, Date, Val) VALUES (1, '2012-09-01', 2)
INSERT INTO #Test2 (ID, Date, Val) VALUES (1, '2012-09-02', 2)
INSERT INTO #Test2 (ID, Date, Val) VALUES (1, '2012-09-03', 2)
INSERT INTO #Test2 (ID, Date, Val) VALUES (1, '2012-09-04', 3)
INSERT INTO #Test2 (ID, Date, Val) VALUES (1, '2012-09-05', 3)
INSERT INTO #Test2 (ID, Date, Val) VALUES (2, '2012-09-05', 2)
INSERT INTO #Test2 (ID, Date, Val) VALUES (2, '2012-09-06', 4)
INSERT INTO #Test2 (ID, Date, Val) VALUES (3, '2012-09-08', 2)
INSERT INTO #Test2 (ID, Date, Val) VALUES (3, '2012-09-09', 5)
INSERT INTO #Test2 (ID, Date, Val) VALUES (3, '2012-09-20', 5)
Basically, I have one table that has a value and a date. For each record in that table, I want to find the value in the second table which corresponds to the date that is closest to the date from the first table.
Here's the solution that I've employed:
WITH cte AS
(
SELECT
t.ID,
t.Date,
p.Val,
ROW_NUMBER() OVER (PARTITION BY t.ID, t.Date ORDER BY ABS(DATEDIFF(dd, t.Date, p.Date))) AS rowNum
FROM #Test t
LEFT JOIN #Test2 p ON t.ID = p.ID
)
SELECT
ID,
Date,
Val
FROM cte
WHERE rowNum = 1
This worked fine when I had only a few hundred records in my first table. The problem is that when the number of records goes up, the amount of time increases very sharply. Right now, that cte is returning me about a million records, and is taking around 30 seconds to run. I need to bring it down to around 2-3 seconds ideally.
Any suggestions? As it stands, there is an index on the second table, on the field ID, which includes Date and Val as included columns. On the first table, there's a clustered index on the ID field.

Viewing 4 posts - 1 through 4 (of 4 total)

You must be logged in to reply to this topic. Login to reply

Eugene Elutin SSC Guru Points: 59322 More actions · Answer 1

... Right now, that cte is returning me about a million records, and is taking around 30 seconds to run. I need to bring it down to around 2-3 seconds ideally.
...

You need to find where the most of time is spent. I would guess that is spend on creating product of JOIN between your two tables on ID's.

Can you measure how long is takes just to join your two tables?

SELECT t1.*, t2.*

FROM T1 JOIN T2 ON T1.ID=T2.ID

_____________________________________________
"The only true wisdom is in knowing you know nothing"
"O skol'ko nam otkrytiy chudnyh prevnosit microsofta duh!":-D
(So many miracle inventions provided by MS to us...)

How to post your question to get the best and quick help[/url]

kramaswamy SSCoach Points: 18181 More actions · Answer 2

It's certainly that part which is taking the most time. I just don't know how else to go about it other than using the approach I've listed.

Dwain Camps SSC Guru Points: 86978 More actions · Answer 3

You can do it with a subquery but I tried that and it's slower. Your query is pretty good.

But there is another approach if you're running more than one processor and all you are concerned with is elapsed time. Try this:

SELECT ID, Date, Val

FROM #Test a

CROSS APPLY (

SELECT TOP 1 Val

FROM #Test2 b

WHERE a.ID = b.ID

ORDER BY ABS(DATEDIFF(day, a.Date, b.Date))) b

I put it into a test harness with ~3,000,000 rows in #Test and ~5,000,000 rows in #Test2:

CREATE TABLE #Test

(

ID INT,

Date DATETIME

)

CREATE CLUSTERED INDEX TEST ON #Test (ID, Date);

CREATE TABLE #Test2

(

ID INT,

Date DATETIME,

Val INT

)

CREATE INDEX TEST2 ON #Test2 (ID) INCLUDE(Date, Val)

INSERT INTO #Test (ID, Date) VALUES (1, '2012-09-01')

INSERT INTO #Test (ID, Date) VALUES (1, '2012-09-02')

INSERT INTO #Test (ID, Date) VALUES (1, '2012-09-04')

INSERT INTO #Test (ID, Date) VALUES (2, '2012-09-05')

INSERT INTO #Test (ID, Date) VALUES (2, '2012-09-08')

INSERT INTO #Test (ID, Date) VALUES (3, '2012-09-15')

INSERT INTO #Test2 (ID, Date, Val) VALUES (1, '2012-09-01', 2)

INSERT INTO #Test2 (ID, Date, Val) VALUES (1, '2012-09-02', 2)

INSERT INTO #Test2 (ID, Date, Val) VALUES (1, '2012-09-03', 2)

INSERT INTO #Test2 (ID, Date, Val) VALUES (1, '2012-09-04', 3)

INSERT INTO #Test2 (ID, Date, Val) VALUES (1, '2012-09-05', 3)

INSERT INTO #Test2 (ID, Date, Val) VALUES (2, '2012-09-05', 2)

INSERT INTO #Test2 (ID, Date, Val) VALUES (2, '2012-09-06', 4)

INSERT INTO #Test2 (ID, Date, Val) VALUES (3, '2012-09-08', 2)

INSERT INTO #Test2 (ID, Date, Val) VALUES (3, '2012-09-09', 5)

INSERT INTO #Test2 (ID, Date, Val) VALUES (3, '2012-09-20', 5)

;WITH Tally (n) AS (

SELECT TOP 500000 ROW_NUMBER() OVER (ORDER BY (SELECT NULL))

FROM sys.all_columns a CROSS JOIN sys.all_columns b)

INSERT INTO #Test

SELECT ID + 10*n, Date

FROM #Test

CROSS APPLY Tally

;WITH Tally (n) AS (

SELECT TOP 500000 ROW_NUMBER() OVER (ORDER BY (SELECT NULL))

FROM sys.all_columns a CROSS JOIN sys.all_columns b)

INSERT INTO #Test2

SELECT ID + 10*n, Date, VAL + n

FROM #Test2

CROSS APPLY Tally

PRINT '-- CTE query'

SET STATISTICS TIME ON

;WITH cte AS

(

SELECT

t.ID,

t.Date,

p.Val,

ROW_NUMBER() OVER (PARTITION BY t.ID, t.Date ORDER BY ABS(DATEDIFF(dd, t.Date, p.Date))) AS rowNum

FROM #Test t

LEFT JOIN #Test2 p ON t.ID = p.ID

)

SELECT

ID,

Date,

Val

FROM cte

WHERE rowNum = 1

SET STATISTICS TIME OFF

PRINT '-- CROSS APPLY query'

SET STATISTICS TIME ON

SELECT ID, Date, Val

FROM #Test a

CROSS APPLY (

SELECT TOP 1 Val

FROM #Test2 b

WHERE a.ID = b.ID

ORDER BY ABS(DATEDIFF(day, a.Date, b.Date))) b

SET STATISTICS TIME OFF

DROP TABLE #Test, #Test2

And these are the results I got:

(3000000 row(s) affected)

(5000000 row(s) affected)

-- CTE query

(3000006 row(s) affected)

SQL Server Execution Times:

CPU time = 35474 ms, elapsed time = 79574 ms.

-- CROSS APPLY query

(3000006 row(s) affected)

SQL Server Execution Times:

CPU time = 77875 ms, elapsed time = 52099 ms.

Notice how CPU is higher than elapsed time? That's because SQL is parallelizing the query so you end up burning more CPU but cutting down your elapsed time.

Perhaps this will work for you.

My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]
My thought question: Have you ever been told that your query runs too fast?

My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
Since random numbers are too important to be left to chance, let's generate some![/url]
Learn to understand recursive CTEs by example.[/url]
[url url=http://www.sqlservercentral.com/articles/St