Blog Post

Moving 1/3 of a heap

,

Watch this week’s video on YouTube.

A Giant Heap

Recently I had to filter out 1.2 billion records from a 3.5 billion row heap. Don’t ask me why this 3.5 billion row table is stored as a heap.

If the lack of a clustered index wasn’t bad enough, I also had some other restrictions:

  • I couldn’t add a clustered index (or any index for that matter) sorted on the key I needed to filter on. It wasn’t my system, and I needed to access the 1.2 billion records sooner than it would take to get a clustered index approved and added. Even then, maybe the lack of a clustered index on this table is a feature and my request would be denied. Who knows.
  • I didn’t have a server that could store all of the data. At first I thought of copying all 3.5 billion rows to my own server and indexing it how I needed, but I didn’t have enough storage space anywhere to do that.
  • My connection to the server had a relatively short timeout set on it. This also couldn’t be changed. If I couldn’t copy all 3.5 billion rows because of storage, I also couldn’t copy all 1.2 billion records in one fell swoop because the connection would timeout.

Iterative Process

Scanning the table hundreds of times…

I struggled with this problem for a little bit. My deadline clock was ticking and I was stuck as to how I could copy and subsequently query the 1.2 billion rows of data I needed. My focus transitioned from “what is the best way to do this” to “how do I do this”.

The solution that ended up working for me was to query the table hundreds of times, each time filtering out and copying only 1 week of data by running a query similar to this:

SELECT *
FROM dbo.MyBigHeap
WHERE
CreateDate >= @StartDate
AND CreateDate < @StopDate

Yes, this did cause me to scan the entire table hundreds of times, but in the end it was the right amount of data that I could copy at a time before the connection timed out.

Eventually I had the 1.2 billion rows I needed copied to my own server. I had a clustered column store index on the table (primarily for the compression savings) and some nonclustered indexes to support the queries I would need to run on it. Was this the best solution? I don’t know. But it worked for me given the constraints and deadline I had to meet.

Moral of the Story

Always put a clustered index on your tables. Even if you don’t have a use case to sort/filter them immediately, you will be creating a world of pain when someone comes along who does need to query that data.

Original post (opens in new tab)
View comments in original post (opens in new tab)

Rate

You rated this post out of 5. Change rating

Share

Share

Rate

You rated this post out of 5. Change rating