A Faster BETWEEN Dates

  • Although I like the idea of being creative on 'older' hardware (my far-from-new portable runs the test queries in sub-second time frames), I too think there are a few issues with the article's content:

    * as pointed out already, the LEFT OUTER JOIN has been 'transformed' into a 'disguised' INNER JOIN

    * the use of table variables for performance testing is a bit of a no-no in my book; they behave strangely too often. Additionally, you can't quite index them the way you want, and they don't have statistics (that I know of)

    * there are no indexes on the @ProjectList table, nor on the @WorkTable ... probably worth a try (using temp tables; sequence: put the data in first, then add the indexes; see the sketch at the end of this post)

    * @CalTable does have indexes, but for reasons I don't understand the CLUSTERED PRIMARY KEY gets another UNIQUE INDEX on the same field. (**)

    I would suggest looking into indexing the working tables first, if that's not already the case, and steering clear of table variables for this kind of tuning. In my personal opinion, table variables are useful for VERY small datasets, for recursive stuff (no clashes when dealing with "shared" temp tables), and when you want to work around the transaction model. Otherwise I'm quite happy with 'ordinary' temp tables, as in my experience (***) they seem to result in much better query plans. That said, YMMV of course... if this change seems to get you better performance then it might be the way forward on your particular system, but frankly I think you'll be back in the near future once the unexplainable benefit wears off...

    That said, although I do not quite agree with the contents of the article, I do applaud the way it was written: nicely structured, clear language and clean example code! Please do not be put off by the fact that people have doubts about your findings. Finding the reason behind your observations might show you're in some kind of borderline situation; but as no reason is given, it's hard to accept a counter-intuitive claim that turns out to be false when tested by other people.

    **: the query plan indicates that the optimizer uses the clustered index in both situations.

    ***: I tend to work on largish data sets most of the time; for small things, table variables might work as lightweight temp tables and do show some performance benefit... then again, if it's small it should be fast by definition =)
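
    As a minimal sketch of the temp-table approach (load the data first, add the indexes afterwards), something like the following; the column and source-table names here are assumptions for illustration, not taken from the article:

    -- Load the working data into a temp table first...
    CREATE TABLE #WorkTable
    (
        Proj_ID    int      NOT NULL,
        Proj_Start datetime NOT NULL,
        Proj_End   datetime NULL
    );

    INSERT INTO #WorkTable (Proj_ID, Proj_Start, Proj_End)
    SELECT Proj_ID, Proj_Start, Proj_End
    FROM dbo.Projects;   -- hypothetical source table

    -- ...then add the indexes, so they are built in a single pass and the
    -- temp table (unlike a table variable) gets real statistics.
    CREATE CLUSTERED INDEX IX_WorkTable_Start ON #WorkTable (Proj_Start);
    CREATE NONCLUSTERED INDEX IX_WorkTable_End ON #WorkTable (Proj_End);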

  • Similarly, to find all events that overlapped with a period of time, you might go:

    select blah
    from events ev
    where ev.start_date_time <= '20100701'
      and (ev.end_date_time > '20100101' or ev.end_date_time is null)

    If it was your job to design these tables, would you be tempted to have an end_date_time column that could not be null, and instead substitute a ridiculous future date for all events that have not yet ended (like 22000101)? Are there indexing or other performance advantages to making your end_date_time column not null?

    Absolutely would put a ridiculous end date in place. NULL values are stored as a bitmap at the leaf level of an index (clustered or otherwise), but are not used in the B-tree portion of an index. In some cases you will get seek-like behavior out of NULLs, because the optimizer can basically do an inverse search. Dates are fixed-length fields, so NULL or not, you are still taking the same amount of physical space. By specifying a ridiculous date as the End_Date_Time, your query eliminates the NULL check and can always rely on the B-tree portion of an index that uses End_Date_Time in the key and perform a SEEK.
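
    For illustration, the sentinel-date version of the overlap query above might look like the one below (the sentinel value is an assumption; whether this actually buys a seek is debated later in the thread):

    -- Assumes end_date_time is NOT NULL and open-ended events are stored
    -- with a far-future sentinel such as '99991231'.
    select blah
    from events ev
    where ev.start_date_time <= '20100701'
      and ev.end_date_time > '20100101'   -- no NULL check needed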

  • ebaya (11/1/2010)


    jpenniman (11/1/2010)


    Did you clear the buffers between tests? Using: dbcc dropcleanbuffers. If not, it would explain your "performance gain"

    His illusory performance gain is explained entirely by the fact that it's a different query that will return different results. You don't need to invoke any buffering or anything else to explain it.

    ...as the result is already in memory. Your "faster" query would perform a table scan every time in real life, as your condition is an expression... and a case statement at that.

    Eh? You can buffer an expression as you can anything else. You simply can't index off it (in technical terms, it's not SARGable).

    Not sure what I was thinking there... agreed the plan wouldn't be in memory. Still in shock I guess.

    Agreed it's not SARGable... that's why I noted the scan.
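
    A minimal sketch of the SARGability point, reusing the hypothetical events table from earlier in the thread:

    -- SARGable: the column is compared directly to a value, so an index
    -- on end_date_time can be used for a seek.
    select blah
    from events ev
    where ev.end_date_time >= '20100101'

    -- Not SARGable: the column is buried inside an expression (a CASE here),
    -- so the same index can only be scanned.
    select blah
    from events ev
    where case when ev.end_date_time is null then 1
               when ev.end_date_time >= '20100101' then 1
               else 0
          end = 1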

  • ebaya (11/1/2010)


    david.avraamides (11/1/2010)


    I would consider using a sentinel date such as the maximum T-SQL datetime (9999-12-31) for two reasons: 1) your queries become simpler since you don't always need that "or end_date is null" part of the clause, and 2) you can create an index on end_date since it is not nullable.

    Eh? You can index on a nullable column.

    You can, but NULLs are stored as a bitmap at the leaf level, and are not part of the organization of the B-tree itself, so they tend not to optimize well.
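
    For what it's worth, a plain index on a nullable column is perfectly legal; a minimal sketch (table and index names assumed):

    -- An ordinary nonclustered index on a nullable column; NULL rows are
    -- simply included in the index.
    CREATE INDEX IX_events_end ON dbo.events (end_date_time);

    -- On SQL Server 2008 and later, a filtered index can exclude the NULLs
    -- entirely, which is one way to sidestep the concern raised above.
    CREATE INDEX IX_events_end_notnull
        ON dbo.events (end_date_time)
        WHERE end_date_time IS NOT NULL;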

  • Paul Muharsky-474732 (11/1/2010)


    Absolutely would put a ridiculous end date in place. NULL values are stored as a bitmap at the leaf level of an index (clustered or otherwise), but are not used in the B-tree portion of an index. In some cases you will get seek-like behavior out of NULLs, because the optimizer can basically do an inverse search. Dates are fixed-length fields, so NULL or not, you are still taking the same amount of physical space. By specifying a ridiculous date as the End_Date_Time, your query eliminates the NULL check and can always rely on the B-tree portion of an index that uses End_Date_Time in the key and perform a SEEK.

    Several misconceptions.

    a) All data is stored in the leaf level of an index. It's not just NULLs.

    b) If by "the B-tree part of the index", you mean non-leaf pages, then yes, NULL values are not stored there. But why would you want them there? Intermediate levels are to allow quick location of a specific value ... but we already know where the NULLS in the index are.

    c) The belief that NULLs always hurt performance is wrong. It depends on several factors, including cardinality/null ratio of the column, the query structure, etc. Many times they can increase performance. Unless you know the specific cases where eliminating NULLS can give you a large performance boost (such as left-join elimination), it's better to use them where they appropriately model your data.

    d) The belief that NULLs always complicate queries is wrong. Since they match nothing, they can simplify many queries. For instance, this query asking which events have not yet ended:

    WHERE EndDate > GETDATE() OR EndDate IS NULL

    Would indeed be simplified by a sentinel value instead of a NULL. But using a sentinel means this query to retrieve all events that end after next year now becomes more complex:

    WHERE EndDate > '20110101' AND EndDate <> @SentinelDate

    When using aggregates in particular, sentinel values can be a real nightmare.
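
    As a small illustration of the aggregate problem (the sentinel value '99991231' and the table name are assumptions):

    -- With NULL end dates, MAX() simply ignores open-ended events:
    SELECT MAX(EndDate) AS LastKnownEnd
    FROM dbo.Events;

    -- With a sentinel, every open-ended row "wins" the MAX, so the sentinel
    -- has to be filtered out (or NULLIF'd away) explicitly:
    SELECT MAX(NULLIF(EndDate, '99991231')) AS LastKnownEnd
    FROM dbo.Events;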

  • Which is more efficient, individual indexes on Proj_Start and Proj_End or a composite index on Proj_Start and Proj_End?
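
    For reference, the two alternatives being asked about would look roughly like this (table and index names are assumptions); Adam Machanic's post below discusses why neither column order in the composite index fully solves the BETWEEN predicate:

    -- Two single-column indexes:
    CREATE INDEX IX_Proj_Start ON dbo.Projects (Proj_Start);
    CREATE INDEX IX_Proj_End   ON dbo.Projects (Proj_End);

    -- One composite index (or its mirror image):
    CREATE INDEX IX_Proj_Start_End ON dbo.Projects (Proj_Start, Proj_End);
    -- CREATE INDEX IX_Proj_End_Start ON dbo.Projects (Proj_End, Proj_Start);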

  • david.avraamides (11/1/2010)


    I would consider using a sentinel date such as the maximum T-SQL datetime (9999-12-31) for two reasons: 1) your queries become simpler since you don't always need that "or end_date is null" part of the clause, and 2) you can create an index on end_date since it is not nullable.

    Agreed, 9999-12-31 or a similar future day is preferable to a NULL end date. The queries are much simpler to write that way. However, you can create an index either way; NULLs are perfectly acceptable in all non-primary key indexes (with some restrictions on unique keys).

    The problem with start date/end date schemes is that creating a perfect index is nearly impossible with or without the NULLs. Let's say that you want all rows that were "active" as of 2010-11-02:

    SELECT *
    FROM YourTbl
    WHERE '2010-11-02' BETWEEN startDate AND endDate

    How should you index for this query? There are basically two options: (startDate, endDate) and (endDate, startDate); both are sketched after this post. Given the first index, the QP has to find the first row where startDate <= '2010-11-02', then it must do a scan from there to evaluate all of the endDate values. The reverse is true if you have endDate first.

    So which index is better? The best index for a given query is generally ordered by the most selective columns. So if you have fewer rows that start before 2010-11-02 than rows that end after 2010-11-02, startDate should be first. Otherwise, endDate should be first. But what if you have an equal number of rows that meet each criterion? And what if there are 50,000,000,000 rows that start before 2010-11-02 but end earlier, and 50,000,000,000 rows that both start and end after 2010-11-02? Now you're stuck scanning 50,000,000,000 rows either way, even if there is only a single row that's actually valid for that day.

    Solution? I haven't figured it out yet. Some other DBMS products have multidimensional index types (R and Quad trees) and I suspect that these indexes can be exploited to solve these problems, but for now, in SQL Server, we're in a bit of a bind... Unless someone out there has a solution? Speak up if so 🙂

    --
    Adam Machanic
    whoisactive
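
    A sketch of the two index shapes described above (the table name comes from the example query; the index names are assumptions):

    -- Option 1: range-seek on startDate <= '2010-11-02', then scan that
    -- range evaluating endDate row by row.
    CREATE INDEX IX_YourTbl_Start_End ON YourTbl (startDate, endDate);

    -- Option 2: the mirror image; range-seek on endDate >= '2010-11-02',
    -- then scan that range evaluating startDate.
    CREATE INDEX IX_YourTbl_End_Start ON YourTbl (endDate, startDate);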

  • Adam Machanic (11/2/2010)


    david.avraamides (11/1/2010)


    I would consider using a sentinel date such as the maximum T-SQL datetime (9999-12-31) for two reasons: 1) your queries become simpler since you don't always need that "or end_date is null"....

    Sometimes. Sometimes they become more complex, since you need to add an "...or date is not (sentinel)" to them. The overarching concern in data modelling should be correctness of expression, not how to simplify a query.

    How should you index for this query? There are basically two options: (startDate, endDate), and (endDate, startDate).

    Without a two-dimensional index such as a Quadtree, you can't do fully-indexed 2D range searches. However, there are a few tricks you can do to get close, by exploiting information the QO doesn't know about your data. I should probably write an article about it, as it's a bit more complex than one can easily cover in a post.

    Also, you *can* implement a quadtree index in SQL Server, with a little work.
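
    One trick along these lines (not necessarily the one meant above) is to feed the optimizer a fact it cannot know on its own, such as a maximum event duration; the 90-day bound below is purely an assumption for illustration:

    -- If we know (out of band) that no event lasts longer than 90 days,
    -- adding a redundant lower bound on startDate lets an index on
    -- (startDate, endDate) seek a narrow range instead of scanning every
    -- row that started before the target date.
    DECLARE @d date = '2010-11-02';

    SELECT *
    FROM YourTbl
    WHERE startDate <= @d
      AND startDate >= DATEADD(DAY, -90, @d)   -- redundant but seekable
      AND endDate >= @d;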

  • Interesting approach and thanks for sharing.

    Jason...AKA CirqueDeSQLeil
    _______________________________________________
    I have given a name to my pain...MCM SQL Server, MVP
    SQL RNNR
    Posting Performance Based Questions - Gail Shaw
    Learn Extended Events

  • Thank you to all of you who have contributed to this discussion. I am kinda glad that I was too busy to read the article and the discussions until now.

    Not only is the original article of interest (and it is great to have someone prepared to write about their findings...thanks Terry) but the discussions show that this topic is of great interest to many of us, myself included.

  • happycat59 (11/3/2010)


    Not only is the original article of interest (and it is great to have someone prepared to write about their findings...thanks Terry)

    Does no one actually care the entire article is wrong, top to bottom?

  • Michael Ebaya (11/3/2010)


    happycat59 (11/3/2010)


    Not only is the original article of interest (and it is great to have someone prepared to write about their findings...thanks Terry)

    Does no one actually care the entire article is wrong, top to bottom?

    The speed increase is probably caused by the inadvertent change in the meaning of the query; that has been shown painfully clearly. The other discussion that ensued taught me a trick or two. So thanks, Terry!

    But the core of the article, the technique of using CASE expressions to tame NULL values, is sometimes very handy, especially when either side of the comparison may be NULL.

    The CASE expression makes the intent crystal clear to a first-time reader of the code, or to the writer after a few years... make that weeks. Changing the polarity of the comparison and switching the "Boolean" values 0 and 1 makes it easy to change which way NULLs lean.

    The major caveat in the discussion, to put the CASE expressions in the OUTER JOIN conditions rather than in WHERE clauses (where they easily bite with their INNER nature), was spot on. My own bite marks? Much better now, thank you.
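
    A minimal sketch of that caveat, using hypothetical tables; the point is that the CASE lives in the ON clause, so unmatched rows are preserved instead of being silently filtered out:

    -- The NULL-taming CASE stays in the ON clause, keeping the join OUTER:
    SELECT p.Proj_ID, w.Work_Date
    FROM dbo.Projects p
    LEFT OUTER JOIN dbo.WorkItems w
        ON w.Proj_ID = p.Proj_ID
       AND CASE WHEN p.Proj_End IS NULL        THEN 1
                WHEN w.Work_Date <= p.Proj_End THEN 1
                ELSE 0
           END = 1;

    -- The same CASE moved into the WHERE clause would discard the unmatched
    -- (all-NULL) rows and quietly turn this into an INNER join.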

  • I have an unindexed date column in a table of billions of rows

    I collect the lowest and highest primary keys between those dates into variables

    I set two variables

    declare @min int
    declare @max int

    select @min = min(primarykey) from table where datecol >= 'begindate'

    select @max = max(primarykey) from table where datecol <= 'enddate'

    select primarykey, datecol, x, y, z from table where primarykey between @min and @max

    works for me

  • TheSQLGuru (11/1/2010)


    2) Personally I like dynamic SQL (when available) for widely varying date inputs, because the optimizer always gets the best possible chance to create the optimum plan for each execution. This is even better if the instance has the 'optimize for ad hoc workloads' setting enabled.

    The query optimizer also gets the opportunity to dynamically screw up with a wacky execution plan at an inopportune time, while you're asleep dreaming of wacky ways of writing wacky code; processing stops, the business stops functioning, and the poor old production DBA is woken from his sleep again to trawl through your wacky code and fix it, again. The way you write code today may take your business down tomorrow, because it's not usually possible to test for all scenarios. Datasets grow and execution plans change, so keep things simple and predictable, I say. Don't try to be too clever.
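
    For context, a minimal sketch of the dynamic-SQL pattern being described above (table and column names are assumptions); embedding the dates as literals lets the optimizer build a fresh plan for those specific values, at the cost of the plan-cache churn warned about here:

    DECLARE @begin date = '20100101',
            @end   date = '20100701',
            @sql   nvarchar(max);

    SET @sql = N'SELECT Proj_ID, Proj_Start, Proj_End
                 FROM dbo.Projects
                 WHERE Proj_Start <= ''' + CONVERT(char(8), @end, 112) + N'''
                   AND Proj_End >= ''' + CONVERT(char(8), @begin, 112) + N''';';

    EXEC sys.sp_executesql @sql;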

  • sql.monkey (11/4/2010)


    select @min = min(primarykey) from table where datecol >= 'begindate'

    select @max = max(primarykey) from table where datecol <= 'enddate'

    select primarykey, datecol, x, y, z from table where primarykey between @min and @max

    works for me

    Yes, because you've missed the entire point, and your query is nothing close to what we're even discussing here. You're finding ONE date column between TWO static values. A B-tree index works fine for that. Finding one STATIC value between two date columns is an entirely different problem.

    Further, if your column isn't indexed and you have "billions of rows", you're going to be table scanning, which means performance is not going to be acceptable. Either you have an index on the column that you don't know about, or the table is orders of magnitude smaller than "billions", or the entire scenario was fabricated to make a nice-sounding post.
