why no Binary Search operator?

  • doobya (7/12/2010)


    When you look up a number in the phone book - you know whether to binary search left or right because the alphabet is increasing

    You are aware that SQL doesn't use binary trees for indexes? Depending on the index, the 'middle' value is not necessarily going to be in the root page of the index. The root page only stores the lowest index key value for each page in the level below.

    Personally, I can't see all that much value in this. Any function that's simple enough that the query processor can do this (and it would be the QP doing this, not the optimiser), should be simple enough that it can be rewritten in a SARGable form.

    Did you post on Connect?

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • You are aware that SQL doesn't use binary trees for indexes? Depending on the index, the 'middle' value is not necessarily going to be in the root page of the index. The root page only stores the lowest index key value for each page in the level below.

    Personally, I can't see all that much value in this. Any function that's simple enough that the query processor can do this (and it would be the QP doing this, not the optimiser), should be simple enough that it can be rewritten in a SARGable form.

    Did you post on Connect?

    Index aware 'middle' value - Yes - I figured you can have a "loose binary search" that doesn't have to split the set exactly in two every time ...

    QP / optimizer - I was thinking that the optimizer creates the QP - and that if a binary search operator were available - it could create a QP that used it to avoid a scan

    SARGable - I suspect you are correct - and there is little point adding complexity to the engine just to help numpties

  • doobya (7/12/2010)


    Index aware 'middle' value - Yes - I figured you can have a "loose binary search" that doesn't have to split the set exactly in two every time ...

    I still don't know how one would do a binary-type search when what's available are b-trees (balanced trees), not binary trees. Searches are certainly possible, not so much binary ones though, unless you want to go and generate a binary tree at query execution time.
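The earlier point about root pages can be illustrated with a toy two-level structure (invented keys and layout, purely a sketch - not SQL Server's actual page format):

```python
import bisect

# Leaf pages hold the keys; the root page stores only the lowest key of each
# child page, as described above.
leaf_pages = [[10, 20, 30], [40, 50, 60], [70, 80, 90]]
root = [page[0] for page in leaf_pages]   # lowest key per child: [10, 40, 70]

def seek(key):
    # Descend by picking the rightmost root entry <= key, then search that leaf.
    child = max(bisect.bisect_right(root, key) - 1, 0)
    return key in leaf_pages[child]
```

Note that the 'middle' key of the whole range (50 here) does not appear on the root page at all - the root only knows the page boundaries - which is why a classic binary-search-style probe of the root is not available.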

    QP / optimizer - I was thinking that the optimizer creates the QP - and that if a binary search operator were available - it could create a QP that used it to avoid a scan

    QP standing for Query Processor, not query plan. Optimiser creates plan, passes it to query processor which executes it. Query Optimiser (QO) would have to know the option exists and when it's both optimal and safe to use, Query Processor (QP) would then have to do all the work to execute it.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • I posted a note on the Connect item as I'm with Gail. I'm not sure this helps with a balanced tree, and would like some more commentary and a better set of examples.

    There has been a lot of work on the QP and how it examines plans. It goes through many, though not all, possibilities and, based on what has been learned, chooses a good one. I've wondered why we don't have more partial index scans, which seem like what you'd want, but apparently it's not cost effective for the algorithms SQL Server uses.

    I will be interested to see what Microsoft says.

  • doobya (7/7/2010)


    where datediff(day, [field], getdate()) < 240 will force a scan (instead of a seek), but if the optimizer knew that the output of datediff decreases as [field] increases

    (which is always the case, as getdate() is constant for this statement) it could reduce IO by using a binary search. It can easily determine the min and max values of [field] if there is a suitable index available. The property of increasing / decreasing is deterministic for many functions and mathematical expressions, so I am surprised there is no "binary search" operator in SQL Server - or am I missing something?
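The quoted idea can be sketched outside SQL (a hypothetical Python illustration, with `bisect` standing in for index probes; the dates and key spacing are made up):

```python
import bisect
from datetime import date, timedelta

today = date(2010, 7, 7)  # stand-in for getdate(), constant for the statement
# A sorted set of index keys, one every 7 days going back ~3 years.
keys = sorted(today - timedelta(days=d) for d in range(0, 1000, 7))

def datediff_days(field):
    # datediff(day, field, getdate()) for whole dates
    return (today - field).days

# Because datediff decreases monotonically as [field] increases,
# "datediff(day, [field], getdate()) < 240" is the range "[field] > today - 240 days",
# and the boundary can be found with O(log n) probes instead of a full scan.
boundary = today - timedelta(days=240)
first_match = bisect.bisect_right(keys, boundary)
matches = keys[first_match:]
```

Every key in `matches` satisfies the original predicate, and everything before `first_match` fails it - exactly the partition a seek would need.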

    On the face of it, it's an interesting question - why does SQL Server choose to scan every row in the index, rather than setting up a binary search?

    One thing to consider is that an index scan typically results in sequential IO. In Enterprise Edition, a single read-ahead IO can be up to 1MB in size, so one IO can fetch 128 pages. Let's say the index is on a single datetime column, which is 8 bytes wide, and there's an extra byte of index overhead per row - 910 rows per leaf-level index page. That single IO brings in 910 * 128 = 116,480 rows. Perhaps more importantly, the storage engine looks ahead in the upper levels of the index (assuming an ordered scan) and aims to keep the read-ahead around 500 pages ahead of the needs of the query processor. The situation with an unordered scan is similar, but the storage engine also has the option of an allocation-order scan (driven from the IAM pages) in suitable circumstances.

    Ok, so taking the example of 16,777,216 rows, doesn't that still mean that 145 IOs will be needed - more than the 24 claimed for the binary search? Yes, but each seek performed for the binary search will be random IO - and non-SSD disk sub-systems perform much less well with random IO. Another consideration is that each index seek must navigate down the index levels. With a typical index depth of 4, the number of IOs required for the binary search is around 100. The question then becomes, are 100 random IOs likely to be more or less efficient than 150ish sequential IOs? Of course, with an inequality predicate (as in the example) the engine would then need to go on to read all the pages prior to the point found with the binary search...
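The arithmetic above can be checked with a quick back-of-envelope script (the page size, row width, and index depth are the assumptions stated in the text, not measurements):

```python
import math

rows = 16_777_216
rows_per_page = 910          # 8-byte datetime key + ~1 byte row overhead, per the text
pages_per_readahead = 128    # one 1MB read-ahead = 128 x 8KB pages

rows_per_io = rows_per_page * pages_per_readahead   # rows fetched per sequential IO
scan_ios = math.ceil(rows / rows_per_io)            # sequential IOs for a full scan

probes = math.ceil(math.log2(rows))                 # binary search probes
index_depth = 4                                     # typical depth assumed above
seek_ios = probes * index_depth                     # random IOs for the binary search
```

This reproduces the figures in the text: ~145 sequential IOs for the scan versus roughly 100 random IOs for the binary search, before considering that random IO is far slower on spinning disks.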

    There are other questions around how generally useful such an optimisation might be, how easily it would fit within the existing framework, and whether a more general optimisation exists.

    As far as general application is concerned, you'd have to say that a binary search optimisation would be better suited to singleton seeks, or small range seeks on a large enough index, if indeed it turns out to be an IO optimisation at all. It also seems likely to be restricted to use with built-in functions on a single column, or very simple expressions thereon (perhaps to a similar level of complexity achieved by the current constant-folding implementation). The more I think about that, the smaller the range of applicable scenarios seems.

    Let's think about incorporating the idea within the existing optimiser, processor, and storage engine framework.

    We would definitely need to put some work into determining the behaviour of built-in functions with respect to increasing or decreasing uniformly with an index key of a particular type. We might also need to extend the statistics system to help decide when binary search was a good optimisation or not - and to enable us to safely cache a plan containing a binary search, knowing that an optimality-based recompilation would be triggered if statistics changed enough to make the original decision to binary search over a straight scan unsound. We would also need to decide whether binary search was a query processor responsibility, or whether it should be implemented in the storage engine, in a similar way to the ordered/unordered scan mentioned earlier. Finally, we would need to work the necessary implementation rules into the optimiser's search strategy, in a way that preserves correctness, but does not result in any significant extension to compilation time.

    Does a more general, or more generally useful optimisation exist? Well, in this specific case, there is certainly an argument for suggesting that the optimiser could consider rewriting the query into the obviously equivalent SARGable form. That's actually quite a bit more difficult to do in general than you might think, but I do know it is an area that has been examined with a view to incorporating it in a future release.
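The SARGable rewrite being described can be illustrated with a small sketch (Python as a stand-in for T-SQL; `today` plays the role of getdate()):

```python
from datetime import date, timedelta

today = date(2010, 7, 7)  # stand-in for getdate(); constant within the statement

def non_sargable(field):
    # WHERE datediff(day, [field], getdate()) < 240 -- function applied to the column
    return (today - field).days < 240

def sargable(field):
    # WHERE [field] > dateadd(day, -240, getdate()) -- bare column vs a constant
    return field > today - timedelta(days=240)
```

The two predicates agree for every value, but only the second compares the bare column against a constant, so an index seek can satisfy it. (With whole dates they agree exactly; with time-of-day components, DATEDIFF's boundary-counting semantics make the real T-SQL rewrite slightly subtler, which hints at why automating it is harder than it looks.)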

    Another more general optimisation might be to at least stop the scan once all possibly-matching records have been found using a forward or backward scan. That's a small modification to the original idea, but it also seems unlikely to save enough to warrant the effort, except in edge cases.

    There is never a shortage of ideas for additional optimisations that could be included in the product, but it seems unlikely to me that this kind of binary search would meet the bar, all things considered.

    Paul

  • Interesting points, Paul.

    One thing does come to mind. Has MS really done enough research on alternatives? The read-ahead mechanism seems designed to pre-fetch data and prevent the CPU from stalling while waiting on IO. Just like branch prediction in CPUs, it seems this is a somewhat poor fix for the problem of keeping a long pipeline fed. In this case, because we don't have better searching, we anticipate needing more data and just read ahead.

  • Steve Jones - Editor (7/17/2010)


    Has MS really done enough research on alternatives? The read-ahead mechanism seems designed to pre-fetch data and prevent the CPU from stalling while waiting on IO. Just like branch prediction in CPUs, it seems this is a somewhat poor fix for the problem of keeping a long pipeline fed. In this case, because we don't have better searching, we anticipate needing more data and just read ahead.

    Judging by some of the patent applications and technical papers produced by the SQL Server team as a whole, and the optimisation/query processor people in particular, I would say probably yes to the first question.

    And yes, the whole thing is very IO-centric. The optimiser in particular makes a number of assumptions that are unlikely to be true in many cases: that all queries start with a cold cache is my favourite there. I think there is too heavy a focus on IO, particularly sequential versus random - especially given the way technology is moving to close that gap.

    The problem is that optimising an enormous range of possible queries well, in reasonable time, is a deeply non-trivial task. There are many compromises and potentially unsafe assumptions, but you can't deny that SQL Server does an amazing job, overall. Its ability to transform complex T-SQL into elegant execution plans still surprises me on a regular basis. What we have is not perfect, but it's a whole lot better than many of the alternatives.

    I don't know whether it came across well in my first post, but I tend to agree that scanning a whole index just because the predicate is a bit non-SARGable seems a bit dumb in many cases. The trick is to find something better, that works as well and as reliably as the current approach, without breaking anything else, and without tying up MSFT's dev and test for ten years!

    Maybe we will see some limited query transformations in the next release that will be able to produce SARGable predicates from some common non-SARGable constructions. Maybe we will see 'smarter' partial scans in some cases. Maybe binary search or linear interpolation will break out from its current place in index key seeking...who knows?

    I am sure they have done some work here, but I wonder if they've done any recently. The work Peter Spiro did on 7, rewriting all kinds of stuff, has improved over the years, but I haven't heard of any revisiting of things. My concern is that it's viewed as a dead subject and "columnar DBs", "Azure", etc. are being worked on instead.

    I don't want people tied up, but this might be a good research project for a small team. I too think partial scans of some sort should be doable, but I have no idea if they can be made efficient enough to implement.

  • Steve Jones - Editor (7/17/2010)


    I am sure they have done some work here, but I wonder if they've done any recently. The work Peter Spiro did on 7, rewriting all kinds of stuff, has improved over the years, but I haven't heard of any revisiting of things. My concern is that it's viewed as a dead subject and "columnar DBs", "Azure", etc. are being worked on instead. I don't want people tied up, but this might be a good research project for a small team. I too think partial scans of some sort should be doable, but I have no idea if they can be made efficient enough to implement.

    I think there has to be a limit on how many times you revisit the same idea. I think it more likely that we will see other improvements that provide us with better ways of achieving the same task. So, we may not see binary search or partial scans per se, but we might see SARG transforms, more powerful indexed views (they've definitely done work on incorporating MIN and MAX and outer joins). We already have indexable functions and computed columns, and a couple of partitioning tools - all of which narrow the use cases for binary search or partial scans, at least with deterministic predicates.

    SARGability is at the root of the problem really, since a partial scan is just a seek + partial scan anyway! Some might argue that slow full scans are necessary to motivate the query writer to improve their design 😉

Viewing 9 posts - 16 through 23 (of 23 total)
