A quick search without an index

  • Comments posted to this topic are about the item A quick search without an index

  • Thanks for sharing!  A successful developer is one who is skilled in the requisite tools AND is confident in his/her creative ability.  A developer is a creator.

     

  • 5 Stars for realizing that you were stuck in a box and then for thinking outside the box.  I wonder how many people would have actually thought of using a Binary Search like this? Nice job!

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Brilliant - thanks for sharing! That's a great example of creatively using an existing index rather than adding another index to the table. I'll definitely add the binary search technique to my SQL toolkit.

  • Alex, just a couple things from an old-timer for you to consider.

    First, if you are serious about that number of rows, you need to fix that.  What are you doing, documenting the stars in the Milky Way?  In my active years as a DBA I was always advocating for archiving data.  You need to consider how old the history is and how often you need to access it, versus the time and cost to store it online, back it up,  etc.

    At least be sure it is NOT in a database with transactional data, preferably on a different server.  Better yet, move anything more than the last year or so to something like a dismountable NAS drive and put it 'in the basement'.  Break the data up into separate tables, such as by year; get it organized, indexed, and ready to access.  Create summary data, which might even suffice for many demands.  On the rare occasion you need to access the detail, you can plug the drive in and query it at will with simple SQL code, instead of spending time on complex workarounds just to look at it.

    In the old days of removable 'disk packs', we had archive disks that we could drop in and fire up as needed.  Today I use a NAS device with internal drives that idle down when not active, and I'm thinking about getting a second one.  I can archive my data to it, and it's great for backups too, so they're not taking space on my active server.

    It's very easy for companies to get carried away with demands for accessible data, and it's up to us to figure out how it gets done.  Those of us who have been around for decades had to learn all the tricks because budgets were a lot smaller then, both for hardware and personnel.  Even back when we had to use 1/2-inch magnetic tape, we did weekly tape-to-tape merges of detail into archived data and only kept summary data online.

    We even traded storage with other companies close by where we kept copies of our data archives for security and protection from disasters.  Was that the original version of 'the cloud'?  Before that, several of us would actually carry home backups and archives for off-site storage.

    I have applications that use historical data as old as 35 years, with a small quantity going all the way back to 1944, but I only keep the most recent 5 or 6 years on my SQL Server; the rest is on a RAID NAS device that I can access if and when I need to.

    So, instead of indexing, consider summarizing and archiving.

    Rick
    Disaster Recovery = Backup ( Backup ( Your Backup ) )

  • BTW... You already know this, but you didn't write it quite right in the article.  To determine the max number of iterations in a binary search, you have to do a bit more than just take the LOG of the number of values.  The full formula is CEILING(LOG(N)/LOG(2)), where "N" is the number of rows in a gap-less situation, or MAX(ID)-MIN(ID)+1 for something like an IDENTITY column.

    Additional kudos for explaining that the IDENTITY column is probably not gap-less and what you did to compensate.
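    The CEILING(LOG(N)/LOG(2)) formula is easy to sanity-check outside of SQL. A quick Python sketch of the same arithmetic (the function name is mine, not from the article):

```python
import math

def max_seeks(n: int) -> int:
    # CEILING(LOG(N)/LOG(2)): worst-case number of probes a binary
    # search needs to narrow n candidate values down to one.
    return math.ceil(math.log(n) / math.log(2))

print(max_seeks(1_000_000))      # 20 probes for a million values
print(max_seeks(1_000_000_000))  # 30 probes for a billion
```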

    --Jeff Moden



  • Just a bit of caution for people reading this... if your DATETIME column only contains whole dates (no times) and you tell it to search for a date with no time or a 00:00:00.000 time, it'll find the correct ID.  If, however, you give it any time later than 12:00:00.000, it'll find the next day.  I've not tested what happens if you tell it to find a whole date that's not there, but it seems like it will find whichever row contains the value closest to the target value.  That means it might give you some extra rows you hadn't anticipated, and that's my point... once you've used a script like this, you need to double-check the data you retrieve.

    None of this is a fault of this code.  It's just one of the devils in the data that you have to watch for.
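    The nearest-neighbor behavior is easy to reproduce outside SQL Server. A minimal Python sketch, assuming a hypothetical table with one midnight-stamped row per day (the names and data are mine, not the article's):

```python
from datetime import datetime, timedelta

# Hypothetical table: one row per day, all stamped at midnight.
rows = [datetime(2024, 1, 1) + timedelta(days=d) for d in range(10)]

def closest(target):
    # Mimics the script's final step: pick the row nearest the target.
    return min(rows, key=lambda r: abs(r - target))

print(closest(datetime(2024, 1, 5, 11, 0)))  # 2024-01-05 00:00:00 (before noon)
print(closest(datetime(2024, 1, 5, 13, 0)))  # 2024-01-06 00:00:00 (after noon)
```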

    --Jeff Moden



  • I believe the issue where a time later than 12:00:00.000 picks the next day comes from the last part of the script, which selects the closest neighbor.

  • Yep... I got that.  It's not a problem if you know about it.  I just wanted to be sure that people knew about it.  Perhaps stripping the time would eliminate that problem if one were looking for whole date ranges, which I believe would be the case for the original task presented in the article.
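    Stripping the time is a one-liner; in T-SQL the equivalent would be something like CONVERT(DATETIME, CAST(@Target AS DATE)). A Python sketch of the same idea (names are mine):

```python
from datetime import datetime, time

# A target with a time past noon, which would otherwise round to the
# next day's row in the nearest-neighbor search.
target = datetime(2024, 1, 5, 13, 0)

# Keep the date, reset the clock to midnight
# (T-SQL equivalent: CONVERT(DATETIME, CAST(@Target AS DATE))).
stripped = datetime.combine(target.date(), time.min)

print(stripped)  # 2024-01-05 00:00:00
```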

    --Jeff Moden



  • I remember doing this many years ago, but it didn't get coded up at the time. The same in principle but not quite the same code. Just for S&G I coded up what I think it would have looked like:

     

    DECLARE
        @DateToFind DATE     = '20150210',
        @DateFound  DATE     = '21000101',
        @ID         BIGINT   = NULL,
        @MAXID      BIGINT   = (SELECT MAX(ID) FROM MyTrans) / 67136,
        @Factor     INT      = 67136, -- 16 seeks max
        @Bitty      INT      = 67136 / 2,
        @SeekCount  SMALLINT = 0;

    WHILE @Bitty >= 1
    BEGIN
        -- Step the probe point up or down by the current half-step.
        SET @Factor = CASE WHEN @DateFound < @DateToFind THEN @Factor + @Bitty ELSE @Factor - @Bitty END;
        SET @ID     = @Factor * @MAXID;

        -- Seek to the first row at or beyond the probe point.
        SELECT TOP (1) @ID = ID, @DateFound = [Date]
        FROM MyTrans
        WHERE ID >= @ID
        ORDER BY ID;

        IF @DateToFind = @DateFound BREAK;

        -- Halve the step and count the seek.
        SELECT @Bitty = @Bitty / 2, @SeekCount = @SeekCount + 1;
    END;

    SELECT DateToFind = @DateToFind, DateFound = @DateFound, SeekCount = @SeekCount;
    SELECT * FROM MyTrans WHERE ID = @ID;

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • Thank you VERY MUCH for this article! I have already used the algorithm to greatly improve the performance of a poorly written query (from another developer).

    Interestingly, this developer needed to be able to run the query from within Tableau, which apparently does not allow DECLARE statements from its interface.  Given that we (the owners of the target database) very rarely write custom views for external access (it's an access control system), I decided to rewrite your logic as a series of CTEs, with a recursive CTE serving as the binary search engine in place of the iterative loop (loops are also not allowed in Tableau).  I know recursion can get expensive and slow over time; however, given that most cases were resolved in fewer than 30 calls, I wasn't too worried about it.

    Here is the result:

    -- Sample from Tbl

    ;WITH CurrentDate AS (
    SELECT CONVERT(DATE,GETDATE()) AS Today
    ), StartOfMonth AS (
    SELECT CONVERT(DATETIME,DATEADD(dd,-(DAY(cd.Today)-1),cd.Today)) AS [Value]
    FROM CurrentDate cd
    ), TargetDate AS (
    SELECT DATEADD(mm, -13, som.[Value]) AS [Value]
    FROM StartOfMonth som
    ), initSplit AS (
    SELECT MIN(tbl.TblIDColumn) AS low, MAX(tbl.TblIDColumn) AS high,
    (SELECT TOP 1 [Value] FROM TargetDate td) AS tgtDate
    FROM Tbl tbl
    ), firstLvl AS (
    SELECT isp.low, split.mid, --tbl.CreateDate,
    isp.high, split.CreateDate,
    CASE WHEN split.CreateDate < isp.tgtDate
    THEN (SELECT ntbl.TblIDColumn FROM Tbl ntbl WHERE ntbl.TblIDColumn >= split.mid + 1
    ORDER BY ntbl.TblIDColumn OFFSET 0 ROWS FETCH FIRST 1 ROWS ONLY)
    ELSE isp.low END AS newLow,
    CASE WHEN split.CreateDate >= isp.tgtDate
    THEN split.mid
    ELSE isp.high END AS newHigh,
    isp.tgtDate, lvl = 0
    FROM initSplit isp
    CROSS APPLY (SELECT TOP 1 midtbl.TblIDColumn AS mid, midtbl.CreateDate
    FROM Tbl midtbl
    WHERE midtbl.TblIDColumn >= isp.low + FLOOR( (isp.high - isp.low)/2 )
    ORDER BY midtbl.TblIDColumn) split

    UNION ALL

    SELECT fl1.newLow, split.mid,
    fl1.newHigh AS high, split.CreateDate,
    CASE WHEN split.CreateDate < fl1.tgtDate
    THEN (SELECT ntbl.TblIDColumn
    FROM Tbl ntbl
    INNER JOIN (
    SELECT ROW_NUMBER() OVER (ORDER BY ftbl.TblIDColumn) AS [rank], ftbl.TblIDColumn
    FROM Tbl ftbl
    WHERE ftbl.TblIDColumn >= split.mid + 1) nextId ON ntbl.TblIDColumn = nextId.TblIDColumn
    AND nextId.[rank] = 1
    )
    ELSE fl1.newLow END AS newLow,
    CASE WHEN split.CreateDate >= fl1.tgtDate
    THEN split.mid
    ELSE fl1.newHigh END AS newHigh,
    fl1.tgtDate, fl1.lvl + 1 AS lvl
    FROM firstLvl fl1
    CROSS APPLY (SELECT midtbl.TblIDColumn AS mid, midtbl.CreateDate
    FROM Tbl midtbl
    INNER JOIN (
    SELECT ROW_NUMBER() OVER (ORDER BY tmp.TblIDColumn) AS [rank], tmp.TblIDColumn
    FROM Tbl tmp
    WHERE tmp.TblIDColumn >= (fl1.newLow + FLOOR( (fl1.newHigh- fl1.newLow)/2 ))
    ) sub ON midtbl.TblIDColumn = sub.TblIDColumn
    AND sub.[rank] = 1
    ) split
    WHERE 1 = 1
    AND (split.mid <> fl1.mid)
    ), tgtId AS (
    SELECT fl.mid AS TblIDColumn
    FROM firstLvl fl
    WHERE fl.low = fl.high
    )

    SELECT *
    FROM firstLvl fl

    /*

    SELECT *
    FROM tgtId ti

    */
  • There is a small error in the code above in the tgtId CTE. The fl.low will not always match fl.high. Changing the WHERE clause to the following should fix that issue:

    WHERE fl.newLow = fl.mid
    AND fl.newHigh = fl.mid

     

Viewing 12 posts - 1 through 11 (of 11 total)
