A quick search without an index

  • Comments posted to this topic are about the item A quick search without an index

  • Thanks for sharing!  A successful developer is one who is skilled in the requisite tools AND is confident in his/her creative ability.  A developer is a creator.

     

  • 5 Stars for realizing that you were stuck in a box and then for thinking outside the box.  I wonder how many people would have actually thought of using a Binary Search like this? Nice job!

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Brilliant - thanks for sharing! That's a great example of creatively using an existing index rather than adding another index to the table. I'll definitely add the binary search technique to my SQL toolkit.

  • Alex, just a couple things from an old-timer for you to consider.

    First, if you are serious about that number of rows, you need to fix that.  What are you doing, documenting the stars in the Milky Way?  In my active years as a DBA I was always advocating for archiving data.  You need to consider how old the history is and how often you need to access it, versus the time and cost to store it online, back it up,  etc.

    At least be sure it is NOT in a database with transactional data, preferably on a different server.  Better yet, move anything more than the last year or so to something like a dismountable NAS drive and put it 'in the basement'.  Break the data up into separate tables, such as by year; get it organized, indexed, and ready to access.  Create summary data, which might even suffice for many demands.  On the rare occasion you need to access the detail, you can plug the drive in and query it at will with simple SQL code, instead of spending time on complex workarounds just to look at it.

    In the old days of removable 'disk packs', we had archive disks that we could drop in and fire up as needed.  Today I use a NAS device with internal drives that idle down when not active, and I'm thinking about getting a second one.  I can archive my data to it, and it's great for backups too, so they're not taking space on my active server.

    It's very easy for companies to get carried away with demands for accessible data, and it's up to us to figure out how it gets done.  Those of us who have been around for decades had to learn all the tricks because budgets were a lot smaller then, both for hardware and personnel.  Even back when we had to use 1/2-inch magnetic tape, we did weekly tape-to-tape merges of detail into archived data and only kept summary data online.

    We even traded storage with other companies close by where we kept copies of our data archives for security and protection from disasters.  Was that the original version of 'the cloud'?  Before that, several of us would actually carry home backups and archives for off-site storage.

    I have applications that use historical data as old as 35 years, with a small quantity going all the way back to 1944, but I only keep the most recent 5 or 6 years on my SQL Server; the rest is on a RAID NAS device that I can access if and when I need to.

    So, instead of indexing, consider summarizing and archiving.

    Rick
    Disaster Recovery = Backup ( Backup ( Your Backup ) )

  • BTW... You already know this, but you didn't write it quite right in the article.  To determine the max number of iterations in a binary search, you have to do a bit more than just take the LOG of the number of values.  The full formula is CEILING(LOG(N)/LOG(2)), where "N" is the number of rows in a gap-less situation, or MAX(ID)-MIN(ID)+1 for something like an IDENTITY column.

    Additional kudos for explaining that the IDENTITY column is probably not gap-less and what you did to compensate.
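    The CEILING(LOG(N)/LOG(2)) formula is easy to sanity-check outside of SQL. A quick Python sketch of the same arithmetic (the function name is mine, not from the article):

```python
import math

def max_seeks(n: int) -> int:
    # CEILING(LOG(N)/LOG(2)): worst-case number of probes a binary
    # search needs to narrow n candidate values down to one.
    return math.ceil(math.log(n) / math.log(2))

print(max_seeks(1_000_000))      # 20 probes for a million values
print(max_seeks(1_000_000_000))  # 30 probes for a billion
```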

    --Jeff Moden



  • Just a bit of caution for people reading this... if your DATETIME column only contains whole dates (no times) and you tell it to search for a date with no time or a 00:00:00.000 time, it'll find the correct ID.  If, however, you give it any time later than 12:00:00.000, it'll find the next day.  I've not tested what happens if you tell it to find a whole date that's not there, but it seems like it will find whichever row contains the value closest to the target value.  That means it might give you some extra rows you hadn't anticipated, and that's my point... once you've used a script like this, you need to double-check the data you retrieve.

    None of this is a fault of this code.  It's just one of the devils in the data that you have to watch for.
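    The nearest-neighbor behavior is easy to reproduce outside SQL Server. A minimal Python sketch, assuming a hypothetical table with one midnight-stamped row per day (the names and data are mine, not the article's):

```python
from datetime import datetime, timedelta

# Hypothetical table: one row per day, all stamped at midnight.
rows = [datetime(2024, 1, 1) + timedelta(days=d) for d in range(10)]

def closest(target):
    # Mimics the script's final step: pick the row nearest the target.
    return min(rows, key=lambda r: abs(r - target))

print(closest(datetime(2024, 1, 5, 11, 0)))  # 2024-01-05 00:00:00 (before noon)
print(closest(datetime(2024, 1, 5, 13, 0)))  # 2024-01-06 00:00:00 (after noon)
```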

    --Jeff Moden



  • I believe the issue where a time later than 12:00:00.000 picks the next day comes from the last part of the script, which selects the closest neighbor.

  • Yep... I got that.  It's not a problem if you know about it.  I just wanted to be sure that people knew about it.  Perhaps stripping the time would eliminate that problem if one were looking for whole date ranges, which I believe would be the case for the original task presented in the article.
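    Stripping the time is a one-liner; in T-SQL the equivalent would be something like CONVERT(DATETIME, CAST(@Target AS DATE)). A Python sketch of the same idea (names are mine):

```python
from datetime import datetime, time

# A target with a time past noon, which would otherwise round to the
# next day's row in the nearest-neighbor search.
target = datetime(2024, 1, 5, 13, 0)

# Keep the date, reset the clock to midnight
# (T-SQL equivalent: CONVERT(DATETIME, CAST(@Target AS DATE))).
stripped = datetime.combine(target.date(), time.min)

print(stripped)  # 2024-01-05 00:00:00
```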

    --Jeff Moden



  • I remember doing this many years ago, but it didn't get coded up at the time. The same in principle but not quite the same code. Just for S&G I coded up what I think it would have looked like:

     

    DECLARE
        @DateToFind DATE     = '20150210',
        @DateFound  DATE     = '21000101',
        @ID         BIGINT   = NULL,
        @MAXID      BIGINT   = (SELECT MAX(ID) FROM MyTrans) / 67136,
        @Factor     INT      = 67136, -- 16 seeks max
        @Bitty      INT      = 67136 / 2,
        @SeekCount  SMALLINT = 0;

    WHILE @Bitty >= 1
    BEGIN
        -- Step the probe point up or down by the current half-step.
        SET @Factor = CASE WHEN @DateFound < @DateToFind THEN @Factor + @Bitty ELSE @Factor - @Bitty END;
        SET @ID     = @Factor * @MAXID;

        -- Seek to the first row at or beyond the probe point.
        SELECT TOP (1) @ID = ID, @DateFound = [Date]
        FROM MyTrans
        WHERE ID >= @ID
        ORDER BY ID;

        IF @DateToFind = @DateFound BREAK;

        -- Halve the step and count the seek.
        SELECT @Bitty = @Bitty / 2, @SeekCount = @SeekCount + 1;
    END;

    SELECT DateToFind = @DateToFind, DateFound = @DateFound, SeekCount = @SeekCount;
    SELECT * FROM MyTrans WHERE ID = @ID;

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • Thank you VERY MUCH for this article! I have already used the algorithm to greatly improve the performance of a poorly written query (from another developer).

    Interestingly, this developer needed to be able to run the query from within Tableau, which apparently does not allow DECLARE statements from its interface.  Given that we (the owners of the target database) very rarely write custom views for external access (it's an access control system), I decided to rewrite your logic as a series of CTEs, with a recursive CTE serving as the binary search engine in place of the iterative loop (loops are also not allowed in Tableau).  I know recursion can get expensive and slow over time; however, given that most cases were resolved in fewer than 30 calls, I wasn't too worried about it.

    Here is the result:

    -- Sample from Tbl

    ;WITH CurrentDate AS (
    SELECT CONVERT(DATE,GETDATE()) AS Today
    ), StartOfMonth AS (
    SELECT CONVERT(DATETIME,DATEADD(dd,-(DAY(cd.Today)-1),cd.Today)) AS [Value]
    FROM CurrentDate cd
    ), TargetDate AS (
    SELECT DATEADD(mm, -13, som.[Value]) AS [Value]
    FROM StartOfMonth som
    ), initSplit AS (
    SELECT MIN(tbl.TblIDColumn) AS low, MAX(tbl.TblIDColumn) AS high,
    (SELECT TOP 1 [Value] FROM TargetDate td) AS tgtDate
    FROM Tbl tbl
    ), firstLvl AS (
    SELECT isp.low, split.mid, --tbl.CreateDate,
    isp.high, split.CreateDate,
    CASE WHEN split.CreateDate < isp.tgtDate
    THEN (SELECT ntbl.TblIDColumn FROM Tbl ntbl WHERE ntbl.TblIDColumn >= split.mid + 1
    ORDER BY ntbl.TblIDColumn OFFSET 0 ROWS FETCH FIRST 1 ROWS ONLY)
    ELSE isp.low END AS newLow,
    CASE WHEN split.CreateDate >= isp.tgtDate
    THEN split.mid
    ELSE isp.high END AS newHigh,
    isp.tgtDate, lvl = 0
    FROM initSplit isp
    CROSS APPLY (SELECT TOP 1 midtbl.TblIDColumn AS mid, midtbl.CreateDate
    FROM Tbl midtbl
    WHERE midtbl.TblIDColumn >= isp.low + FLOOR( (isp.high - isp.low)/2 )
    ORDER BY midtbl.TblIDColumn) split

    UNION ALL

    SELECT fl1.newLow, split.mid,
    fl1.newHigh AS high, split.CreateDate,
    CASE WHEN split.CreateDate < fl1.tgtDate
    THEN (SELECT ntbl.TblIDColumn
    FROM Tbl ntbl
    INNER JOIN (
    SELECT ROW_NUMBER() OVER (ORDER BY ftbl.TblIDColumn) AS [rank], ftbl.TblIDColumn
    FROM Tbl ftbl
    WHERE ftbl.TblIDColumn >= split.mid + 1) nextId ON ntbl.TblIDColumn = nextId.TblIDColumn
    AND nextId.[rank] = 1
    )
    ELSE fl1.newLow END AS newLow,
    CASE WHEN split.CreateDate >= fl1.tgtDate
    THEN split.mid
    ELSE fl1.newHigh END AS newHigh,
    fl1.tgtDate, fl1.lvl + 1 AS lvl
    FROM firstLvl fl1
    CROSS APPLY (SELECT midtbl.TblIDColumn AS mid, midtbl.CreateDate
    FROM Tbl midtbl
    INNER JOIN (
    SELECT ROW_NUMBER() OVER (ORDER BY tmp.TblIDColumn) AS [rank], tmp.TblIDColumn
    FROM Tbl tmp
    WHERE tmp.TblIDColumn >= (fl1.newLow + FLOOR( (fl1.newHigh- fl1.newLow)/2 ))
    ) sub ON midtbl.TblIDColumn = sub.TblIDColumn
    AND sub.[rank] = 1
    ) split
    WHERE 1 = 1
    AND (split.mid <> fl1.mid)
    ), tgtId AS (
    SELECT fl.mid AS TblIDColumn
    FROM firstLvl fl
    WHERE fl.low = fl.high
    )

    SELECT *
    FROM firstLvl fl

    /*

    SELECT *
    FROM tgtId ti

    */
  • There is a small error in the code above in the tgtId CTE. The fl.low will not always match fl.high. Changing the WHERE clause to the following should fix that issue:

    WHERE fl.newLow = fl.mid
    AND fl.newHigh = fl.mid

     

Viewing 12 posts - 1 through 11 (of 11 total)
