I tend to "over-engineer" a bit when it comes to the quality of data. Yep... we need to know the hourly stuff but I'd also want to know how many rows are bad (not of the expected type). With that, here's a bit of a demonstration to produce the desired output plus a little bit more...
First, we need some test data. I never mess around with a small handful of rows because I also test for performance at the same time. With that in mind, the following will generate a million rows of data with some good data, some bad data, and some unexpected data that's still "good enough". As usual, details are in the comments in the code. It also adds an index to simulate the expected form of the table.
/**********************************************************************************************************************
Create a test table to simulate what may really happen.
Nothing in this section is a part of the solution. We're just building test data.
Most of the data will follow the form of mm-dd-yyyy, hh:mi AM (or PM).
Some of it will be a GUID to simulate bad data.
Some of it will be in the correct form but missing the comma and the space just before the AM/PM indicator.
This whole thing takes about 18 seconds to generate because 444-byte rows have been simulated as well.
**********************************************************************************************************************/
--===== If the test table already exists, drop it to make reruns in SSMS easier
IF OBJECT_ID('tempdb..#TestTable','U') IS NOT NULL
DROP TABLE #TestTable
;
GO
--===== Identify the range of dates we want to use for our test
DECLARE @StartDate DATETIME = '2010' --Inclusive, same as 2010-01-01
,@EndDate DATETIME = GETDATE() --Exclusive
,@NumberOfRows INT = 1000000
,@BadRows INT = 1000
;
--===== Create a test table with simulated "other" columns
CREATE TABLE #TestTable
(
RowNum INT IDENTITY(1,1) PRIMARY KEY CLUSTERED
,SimColA NCHAR(100) DEFAULT 'SimColA' --NCHAR(100) used to simply occupy the space of multiple columns
,SomeDate NVARCHAR(40) --This is your column of dates and times
,SimColb NCHAR(100) DEFAULT 'SimColB' --NCHAR(100) used to simply occupy the space of multiple columns
)
;
--===== Populate the test table with randomly constrained dates and times in the format they claim to be in.
-- This also uses minimal logging to save on time and disk space.
-- The dates are in random order just to simulate worst case.
WITH
cteGenDates AS
(
SELECT TOP (@NumberOfRows) --Not to worry. Only takes 14 seconds/Million rows even though each row is 444 bytes wide.
RandomDT = RAND(CHECKSUM(NEWID()))*DATEDIFF(dd,@StartDate,@EndDate)+@StartDate
FROM sys.all_columns ac1
CROSS JOIN sys.all_columns ac2
)
INSERT INTO #TestTable WITH (TABLOCK)
(SomeDate)
SELECT SomeDate = CONVERT(NVARCHAR(40),RandomDT,110) --The Date part
+', ' --The comma and space
+STUFF(REPLACE(RIGHT(CONVERT(NVARCHAR(40),RandomDT,100),7),' ','0'),6,0,' ') --The Time part
FROM cteGenDates
OPTION (RECOMPILE)
;
--===== Add a non-Clustered Index to the table to prevent the overhead of having to look through the wide
-- Clustered Index. The incorrect datatype is going to cause an index scan no matter what, so the
-- narrow index keeps that scan as cheap as possible. This takes only about 2 seconds.
CREATE INDEX By_SomeDate ON #TestTable (SomeDate ASC)
;
--===== "Wound" a bunch of rows with some bad, non-date data so that we can test error handling
WITH
cteBadData AS
(
SELECT TOP (@BadRows)
BadDate = CONVERT(NVARCHAR(40),NEWID())
,SomeDate
FROM #TestTable
ORDER BY NEWID()
)
UPDATE cteBadData
SET SomeDate = BadDate
;
--===== Change a bunch of rows to be nearly correct in form but missing the comma after the date and
-- the space between the time and the AM/PM indicator.
WITH
cteChangeDate AS
(
SELECT TOP (@BadRows)
ChangedDate = STUFF(STUFF(SomeDate,18,1,''),11,1,'')
,SomeDate
FROM #TestTable
ORDER BY NEWID()
)
UPDATE cteChangeDate
SET SomeDate = ChangedDate
;
GO
When we're writing the code to do math and unusual formatting, it's important to remember to do the same thing in code that we've been taught for application/system design: keep the data layer and the presentation layer separate. This not only simplifies the coding but it also really helps performance. Since formatting is expensive, the "pre-aggregation" (a term coined by good friend and fellow MVP, Peter "Peso" Larsson) done prior to any formatting means you only need to format a very small handful of rows rather than all the data in the table.
And, again... I "over-engineer" for safety and with the thought that I don't want something to fail... I want it to succeed AND tell me that there's some bad data involved. Here's what I'd end up doing if I weren't allowed to fix the table or add a Persisted Computed Column to validate each date. Again, details are in the comments.
WITH
ctePreValidate AS
(--==== This not only starts the check for "date" data but it also strips out any commas.
SELECT SomeDate = REPLACE(SomeDate,',','')
,IsADate = ISDATE(REPLACE(SomeDate,',',''))
FROM #TestTable
--WHERE --Note: If you want to filter on a date range, do it here.
)
,
cteValidate AS
(--==== This fine-tunes the validation to prevent accidental forms that may evaluate to a date.
-- For example, '2016' would evaluate to a date as 2016-01-01.
SELECT SomeDate
,IsADate = CASE
WHEN IsADate = 1
AND SomeDate LIKE '[01][0-9]-[0-3][0-9]-[1-2][09][0-9][0-9]%'
THEN 1
ELSE 0
END
FROM ctePreValidate
)
,
ctePreAggregate AS
(--==== This pre-aggregates the data so that we don't have so much to work with for final formatting.
-- This limits it all to 26 Rows instead of a million (for example) in about 4 seconds.
-- Imagine how fast it will run on smaller date ranges.
SELECT HourOfPosting = CASE WHEN IsADate = 1 THEN DATEPART(hh,SomeDate) ELSE -1 END
,HourCount = COUNT(*)
FROM cteValidate
GROUP BY CASE WHEN IsADate = 1 THEN DATEPART(hh,SomeDate) ELSE -1 END
WITH ROLLUP
)
--===== This does the final formatting and the % calculation on only 26 rows.
SELECT HourOfPosting = CASE
WHEN HourOfPosting >= 0
THEN RIGHT(100+HourOfPosting,2)+'-'+RIGHT(100+(HourOfPosting+1)%24,2)
WHEN HourOfPosting IS NULL
THEN 'Total'
ELSE 'BadDate'
END
,HourCount
,PercentageOfTotal = CONVERT(DECIMAL(6,2),HourCount*200.0/SUM(HourCount) OVER ()) --200.0 because Total is included
FROM ctePreAggregate
ORDER BY HourOfPosting
;
The final output looks like this... The HourCount, the Total, and the count of bad rows give the "Bean Counters" the nice, warm fuzzies. 😉
HourOfPosting HourCount PercentageOfTotal
------------- ----------- ---------------------------------------
00-01 41707 4.17
01-02 41489 4.15
02-03 41236 4.12
03-04 41557 4.16
04-05 41755 4.18
05-06 41804 4.18
06-07 42041 4.20
07-08 41711 4.17
08-09 41193 4.12
09-10 41686 4.17
10-11 41623 4.16
11-12 41719 4.17
12-13 42107 4.21
13-14 41795 4.18
14-15 41319 4.13
15-16 41467 4.15
16-17 41721 4.17
17-18 41639 4.16
18-19 41639 4.16
19-20 41472 4.15
20-21 41893 4.19
21-22 41494 4.15
22-23 41249 4.12
23-00 41684 4.17
BadDate 1000 0.10
Total 1000000 100.00
(26 row(s) affected)
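As an aside, if changes to the real table WERE allowed, the Persisted Computed Column I mentioned might look something like the following sketch. To be clear, the table, column, and index names below are made up for this example and I've not tested it against your actual table, but the idea is to let each row resolve itself to a real DATETIME once, at write time, instead of revalidating a million rows on every query.

```sql
--===== Hypothetical sketch only... the names here are examples, not your real objects.
-- TRY_CONVERT (SQL Server 2012+) returns NULL instead of an error for bad rows, and the
-- explicit style 110 (mm-dd-yyyy) keeps the expression deterministic so it can be PERSISTED.
ALTER TABLE dbo.RealTable
ADD SomeDateDT AS (TRY_CONVERT(DATETIME,REPLACE(SomeDate,',',''),110)) PERSISTED
;
--===== Now bad rows are simply the NULLs and the column can be indexed and searched as a real date.
CREATE INDEX By_SomeDateDT ON dbo.RealTable (SomeDateDT)
;
```

With that in place, "WHERE SomeDateDT IS NULL" finds the bad rows and the hourly counts can use DATEPART directly with no string validation at all.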
--Jeff Moden
Change is inevitable... Change for the better is not.