Quality data is a goal we all strive to attain. Nobody wants to have bad data. Yet, despite our best efforts, the problem of bad data has its way of presenting itself over and over again.
One fairly common indicator of bad data is maybe an absence of data. Maybe you have a table in the database with what is expected to be contiguous data, yet somehow there appears to be an absence of data here or there. Maybe there is an absence of the contiguous data and it hasn’t been noticed.
This phenomenon is also known as the gaps and islands issue in the data realm. How do you know if you have an issue or not? This article will help you with a sample query that can find those gaps of missing data. From there, your journey is just beginning.
Finding the Gaps
I have this database that I use for mentoring DBAs of various levels (help desk on up). In the database, I have a schema dedicated for daily trivia. Since the intent is for it to be a daily trivia question, I expect to have a question for every day during the time interval in question. Every now and again, however, I forget to upload a question or I mistakenly believe I have already loaded a question for the day. Typically, I try to do this in advance so as to not be scrambling the last minute to get a question built for the email to pick it up.
Due to the factors just explained, I end up with gaps in my daily question data. The question now becomes how to find that I have gaps. I have a couple of queries here that will save you plenty of time if you follow the same principles applied here.
First let’s start with a bit of a tricky version of the query that builds a numbers table and then spits out all of the dates that are missing a question (only for the range of data within the table that is).
DECLARE @MinDate DATE , @MaxDate DATE; SELECT @MinDate = MIN(PublishDate) , @MaxDate = MAX(PublishDate) FROM DBA.QOTD.DailyQuestion; WITH Tens (N) AS (SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1) , Hundreds (N) AS (SELECT 1 FROM Tens t1 , Tens t2) , Millions (N) AS (SELECT 1 FROM Hundreds t1 , Hundreds t2 , Hundreds t3) , Tally (N) AS (SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM Millions) , Dates AS (SELECT TOP (DATEDIFF(DAY, @MinDate, @MaxDate) + 1) ca.Dates FROM Tally CROSS APPLY ( VALUES (DATEADD(DAY, N - 1, @MinDate)) ) ca (Dates) ) SELECT Dates.Dates FROM Dates LEFT OUTER JOIN DBA.QOTD.DailyQuestion dq ON Dates.Dates = dq.PublishDate WHERE dq.PublishDate IS NULL ORDER BY Dates.Dates;
This particular numbers table trick has been around for quite a while and runs fast enough. However, if you are performance nut, and for this particular case, there is a method that is a wee bit better (usually the dynamic numbers table just shown is sufficiently fast).
DECLARE @MinDate DATE ,@MaxDate DATE; SELECT @MinDate = MIN(dq.PublishDate), @MaxDate = MAX(dq.PublishDate) FROM DBA.QOTD.DailyQuestion dq; WITH StaticRange ([Date]) AS (SELECT @MinDate AS [Date] -- Put the start date here UNION ALL SELECT DATEADD(DAY, 1, [Date]) FROM StaticRange WHERE [Date] <= @MaxDate -- Put the end date here ) SELECT d.[Date] , dq.PublishDate , DATENAME(dw, d.[Date]) AS DayofWk FROM StaticRange d LEFT OUTER JOIN DBA.QOTD.DailyQuestion dq ON d.[Date] = dq.PublishDate WHERE dq.PublishDate IS NULL OPTION (MAXRECURSION 32767); -- 32767 allows infinite loop GO
This method is the much maligned recursive CTE method. In my testing it runs consistently faster with a lower memory grant but does cause a bit more IO to be performed. Some trade-off to be considered there. Both queries are returning the desired data-set which happens to be my missing question days. Only, I have added an extra output in the second query to let me know the day of the week that the missing question occurred on. Maybe I forgot to enter it because it was a weekend day or maybe I opted to not create one at all because the day lands on a Holiday. Let’s take a small peek at the results.
That is a pretty good swath of missing data there. If you look closely, not all days from November 10th through December 1st are missing, just a handful. Alas, it looks like I have some work cut out for me because I have some weekdays, weekends, and Holidays in there as well as nearly the entire month of December. Finding the gaps is the easy part once you have a solid query to start with and I believe this query is just the right tool for that.
What if I wanted to check for a rolling period of time instead of just checking for valid data between a min and max range of data already in the table? That is easy enough as well with some minor tweaks. Let’s look at one more query option here.
--TVC method with rolling year DECLARE @MinDate DATE; SELECT @MinDate = MIN(PublishDate) FROM DBA.QOTD.DailyQuestion; WITH RollingYear (n) AS ( -- 1000 rows SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) a(n) CROSS JOIN (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) b(n) CROSS JOIN (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) c(n) ) , Dates AS (SELECT TOP (DATEDIFF(DAY, @MinDate, DATEADD(YEAR, 1, @MinDate)) + 1) ca.Dates FROM RollingYear CROSS APPLY ( VALUES (DATEADD(DAY, N - 1, @MinDate)) ) ca (Dates) ) SELECT d.Dates , dq.PublishDate , DATENAME(dw, d.Dates) AS DayofWk FROM Dates d LEFT OUTER JOIN DBA.QOTD.DailyQuestion dq ON d.Dates = dq.PublishDate WHERE dq.PublishDate IS NULL ORDER BY d.Dates;
This third code snippet illustrates a slight variation over the previous two in that it utilizes the Table Value Constructor to build the numbers table. All in all, it runs with blazing high speed and produces the results I am seeking – but for a rolling year instead of a static range of data.
Gaps and islands will occur in your data. When it does, you have your choice of three high performing scripts to help you investigate where the gaps and islands exist in your data.