
Outlier Detection with SQL Server, part 2.1: Z-Scores


By Steve Bolton

…………Using SQL Server to ferret out those aberrant data points we call outliers may call for some complex T-SQL, Multidimensional Expressions (MDX) or Common Language Runtime (CLR) code. Yet thankfully, the logic and math that underpin the standard means of outlier detection I'll delve into in this series are sometimes mercifully simple. That was not the case in the introductory article in this amateur series of self-tutorials, in which I tackled the fascinating mystery of Benford's Law, a means of finding outliers that is often used to expose fraud. I also used the occasion to bring up the topic of potential misuses of outlier detection itself, including some duplicitous practices that are nonetheless frighteningly frequent in academic studies – particularly in the health care field, where a lot of money is at stake and where poorly handled stats can do the most damage to ordinary people. I cannot say enough about how critical it is to moor statistics and data mining solidly in reason, because all it takes is a single fallacy to render the conclusions drawn from them useless, misleading or downright false; this point is worth reiterating throughout this blog series, given that it dwarfs all of the technical issues that mathematicians, statisticians, data miners and the field of computing spend much more time on. Our algorithms may go on correctly generating numbers, but if they're not firmly embedded in reason, they may be deceptively erroneous. Some of the saving graces of this week's method of outlier detection are that it is well-understood and simple to implement, both of which mean there is less room for fallacies to worm their way into Z-Score calculations than with other methods.
…………Z-Scores may be more run-of-the-mill than cutting-edge methods like Benford's Law, but the technique is ubiquitous throughout the field of statistics precisely because it is a reliable workhorse. In fact, it is used as a building block in many other higher-order statistics, many of which are in turn used in the edifices of today's sophisticated data mining algorithms. When I wrote my series of self-tutorials on SQL Server Data Mining (SSDM) I merely set out to prove that this neglected component could be used by inexperienced amateurs like myself for practical purposes, but I didn't yet grasp the mechanics of why the constituent algorithms worked. In this series and the next several I hope to write, I plan to rectify that defect by looking more under the hood. I'm still hardly qualified to write about statistics, but I have found it is much easier to grasp the material by categorizing various stats by their use cases. It is easiest to sort through the confusing welter of numbers and algorithms by viewing the use cases as a function of the questions one wants to ask of the data, the number and type of inputs an algorithm requires, the number and type of outputs it returns, and the mathematical properties associated with those inputs and outputs. For example, if you choose to ask a particular question of a dataset but don't have the correct data types, number of input parameters or sample sizes, your choices will quickly be narrowed down to a few stats and algorithms. If you require the outputs to have specific mathematical properties, such as positivity or homoscedasticity, your choices will be further constrained, because the formulas are also highly specific about the kind of data they spit back out. Will G. Hopkins, the author of an excellent plain English guide to stats, likewise writes that he had a difficult time sorting out the various types of statistical models until he categorized them by the types of comparisons being made, such as numeric vs. nominal data, or numeric vs. numeric, etc.[i] Those categories are basically equivalent to the column Content types I discussed in the series on SSDM, where they represent an important consideration in the design of mining models. It might be helpful in the future to develop a matrix of use cases for all permutations of Content types, data types, numbers of inputs, properties of outputs and perhaps even the performance requirements for the inner workings of the algorithms in between the input and output stages. For now, however, we'll just use the concept to illustrate what we're doing with Z-Scores.
…………This particular measure is merely a comparison of a data point to the mean and standard deviation of the dataset it belongs to. The formula for Z-Scores is fairly simple: subtract the mean from the data point, then divide by the standard deviation. What's the purpose in doing this though? I was able to grope towards a better understanding by resorting to categorization again. Averages are merely the most popular instance of a broader classification of measures of central tendency or location, which identify the center of a dataset; the mean really represents the most primordial form of clustering we know of. If we want to tell how close a particular data point is to that center point, we also need a yardstick to measure the distance by. This is where standard deviation, the most basic metric of dispersion, comes in handy. Rather than pinpointing the single center of the dataset, the deviation is a single measure of how diffuse or spread out all the data points are. Like the mean, standard deviation is a fundamental building block of higher statistics, one which also gives us a convenient means of gauging how far a particular data point is from the center point identified by the mean. In plain English, a Z-Score essentially tells us how many standard deviations (i.e. units of dispersion) there are between a given data point and the center. Many other calculations we'll encounter in the next few tutorial series are either derived directly from Z-Scores, or resemble them in their use of the mean and standard deviation. To someone lost in the swirl of numbers and equations surrounding statistics, it may seem that there is no rhyme or reason to any of them, but there is a method behind the madness. In each Z-Score calculation, we're not plugging in just anything, like a whole probability distribution or a sequence or a set, but a single data point – which matches our question, "Is this an outlier?" More sophisticated calculations may require us to further limit our choices by such considerations as data types, Content types, the internal performance requirements of the algorithm, the number of inputs, the sample size and the desired mathematical properties of the output. In the case of Z-Scores, all we really have to make sure of is that we're inputting one of SQL Server's numeric data types. We obviously can't plug text or dates into a Z-Score equation, although we could perform calculations on such fields and then plug the results in as needed. We also need to know the mean and standard deviation for the entire dataset, rather than a mere sample; in some situations it might be impractical to calculate them due to resource constraints, but DBAs usually have one up on those engaged in scientific research, in that they usually have populations of millions of records to draw from if they choose to. Sometimes researchers only have access to small samples taken from unknown populations, in which case it may not be possible to apply Z-Scores at all.
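…………Before wading into the dynamic SQL of Figure 1, it may help to see the bare arithmetic in isolation. The following is only a minimal sketch, not the procedure itself: the table dbo.SampleTable and its columns ID and MeasuredValue are hypothetical stand-ins, and the zero-deviation guard mirrors the check used in the full procedure below.

SELECT ID, MeasuredValue,
       CASE WHEN STDEV(MeasuredValue) OVER () = 0 THEN 0 -- avoid divide-by-zero when every value is identical
            ELSE (MeasuredValue - AVG(MeasuredValue) OVER ()) / STDEV(MeasuredValue) OVER ()
       END AS ZScore
FROM dbo.SampleTable
ORDER BY ZScore DESC -- the largest positive deviations from the mean float to the top

The window aggregates compute the mean and standard deviation across the whole column once per query, so each row's Z-Score is simply its distance from the center expressed in units of dispersion.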
…………Fortunately, coding a Z-Score is also a lot less taxing on the brain than the subject of the last post, Benford's Law. That is why the stored procedure in Figure 1 is a lot easier to follow. The first three parameters allow you to specify a table in any database for which you have privileges, while the fourth identifies the column to be sampled and the fifth is the name of the table's primary key. Keep in mind that I don't allow spaces in my object names, so if you're going to be operating on objects that have them, you're going to have to add the requisite brackets to this code yourself. Enter the @DecimalPrecision parameter with care, since an incorrect setting will result in arithmetic overflows; leaving that setting up to the end user was a lot easier to code than a workaround that would fit all use cases. The most difficult part of the code to grasp may be the @OrderByCode, which allows you to sort the results by 1) the primary key values ascending; 2) the primary key values descending; 3) the column values ascending; 4) the column values descending; 5) the Z-Score ascending and 6) the Z-Score descending. I'm a stickler about giving credit where it is due, so I'll point out that I've used ORDER BY CASE before, but double-checked the syntax against a thread by one of the greatest assets of the SQL Server community, Pinal Dave.[ii] Uncommenting the next-to-last line will allow you to debug the procedure as needed by checking the dynamic SQL. Also be aware that I haven't taken any steps to proof this against SQL injection attacks, so be ready to program your own security requirements into it; one possible approach is sketched after Figure 1. In addition, the procedure is created in a schema called Calculations that I will be using frequently throughout this series, so be prepared to add it to your own database or change the code.

Figure 1: Code for the Z-Score Stored Procedure
CREATE PROCEDURE [Calculations].[ZScoreSP]
@DatabaseName as nvarchar(128) = NULL, @SchemaName as nvarchar(128), @TableName as nvarchar(128), @ColumnName AS nvarchar(128), @PrimaryKeyName as nvarchar(400), @OrderByCode as tinyint = 1, @DecimalPrecision AS nvarchar(50)
AS

SET @DatabaseName = @DatabaseName + '.'
DECLARE @SchemaAndTableName nvarchar(400)
SET @SchemaAndTableName = ISNull(@DatabaseName, '') + @SchemaName + '.' + @TableName -- I'll change this value one time, mainly for legibility purposes

DECLARE @SQLString nvarchar(max)
SET @SQLString =
'DECLARE @OrderByCode as tinyint, -- pass the outer value like a parameter of sorts
@StDev AS decimal(' + @DecimalPrecision + '),
@Mean AS decimal(' + @DecimalPrecision + ')

-- precalculating these not only makes the code more legible, but is more efficient because it is a one-time operation
SELECT @StDev = StDev(' + @ColumnName + ') FROM ' + @SchemaAndTableName + '
SELECT @Mean = Avg(' + @ColumnName + ') FROM ' + @SchemaAndTableName + '

--SELECT @StDev -- uncomment these two lines to debug value errors
--SELECT @Mean

SET @OrderByCode = ' + CAST(@OrderByCode AS nvarchar(50)) + '

SELECT ' + @PrimaryKeyName + ', ' + @ColumnName + ', ZScore, DENSE_RANK () OVER (ORDER BY ZScore) AS GroupRank
FROM
       (SELECT ' + @PrimaryKeyName + ', ' + @ColumnName + ', ''ZScore'' = CASE WHEN @StDev = 0 THEN 0
            ELSE (' + @ColumnName + ' - @Mean) / @StDev
            END
        FROM ' + @SchemaAndTableName + '
        GROUP BY ' + @PrimaryKeyName + ', ' + @ColumnName + ') AS T1 -- the purpose of the inner query is to allow us to order by the ZScore

ORDER BY
CASE WHEN @OrderByCode = 1 THEN ' + @PrimaryKeyName + ' END ASC,
CASE WHEN @OrderByCode = 2 THEN ' + @PrimaryKeyName + ' END DESC,
CASE WHEN @OrderByCode = 3 THEN ' + @ColumnName + ' END ASC,
CASE WHEN @OrderByCode = 4 THEN ' + @ColumnName + ' END DESC,
CASE WHEN @OrderByCode = 5 THEN ZScore END ASC,
CASE WHEN @OrderByCode = 6 THEN ZScore END DESC'

--SELECT @SQLString -- uncomment this to debug dynamic SQL string errors
EXEC (@SQLString)
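…………One possible way to address the SQL injection caveat mentioned above, offered only as a hedged sketch rather than as part of the original procedure, is to wrap the identifier parameters in QUOTENAME before they are concatenated into the dynamic SQL. That would also take care of object names containing spaces, at the cost of requiring each name to fit within QUOTENAME's 128-character limit.

-- a possible hardening sketch; it would replace the first two SET statements of the
-- procedure rather than supplement them, and it assumes each name fits QUOTENAME's limit
SET @SchemaAndTableName = ISNull(QUOTENAME(@DatabaseName) + '.', '')
       + QUOTENAME(@SchemaName) + '.' + QUOTENAME(@TableName)
SET @ColumnName = QUOTENAME(@ColumnName) -- brackets the name and escapes any embedded brackets
SET @PrimaryKeyName = QUOTENAME(@PrimaryKeyName)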

Figure 2: Sample Results from Column1 of the HiggsBosonTable

EXEC [Calculations].[ZScoreSP]
             @DatabaseName = 'DataMiningProjects',
             @SchemaName = 'Physics',
             @TableName = N'HiggsBosonTable',
             @ColumnName = N'Column1',
             @PrimaryKeyName = N'ID',
             @OrderByCode = 6,
             @DecimalPrecision = '38,28'

[Screenshot: Z-Score result set for Column1]

…………As discussed in my last few posts, I'll be using three publicly available practice datasets for my next three or four series of tutorials, beginning in last week's post with a dataset on the Duchennes form of muscular dystrophy provided by Vanderbilt University's Department of Biostatistics. In time, I will probably also perform outlier detection and other data mining calculations on the Voynich Manuscript, an inscrutable medieval tome with an encryption scheme so obscure that no one has been able to crack it for more than five centuries, including the National Security Agency (NSA). The best of the three datasets for stress-testing this procedure is the data on the Higgs Boson made available by the University of California at Irvine's Machine Learning Repository, given that its single table, at nearly 6 gigabytes, dwarfs the tiny 9-kilobyte Duchennes table. It also consists entirely of numeric data, unlike the transcriptions of the Voynich Manuscript I've imported, which are mainly appropriate for text mining. To date, I have yet to find an explanation of what the 28 float columns actually measure, although there's an outside chance I might be able to interpret an explanation if I find one, given that I understood particle physics too well for my own good back in fourth grade.[iii] Figure 2 depicts a sample query against the first float column in the HiggsBosonTable of my DataMiningProjects database, which includes all three datasets. Note that it's in descending order by Z-Score. The GroupRank column separates identical Z-Score values into distinct groups through the DENSE_RANK windowing function; feel free to eliminate it from the code if it drags down performance on your databases. I find it handy when running the procedure against tables with small ranges of distinct and frequently duplicated values. This logic may also be enhanced by intrepid programmers to handle bucketing and banding of contiguous but not quite identical values, perhaps using fuzzy sets with graded memberships. The interpretation is not difficult at all: the further away Z-Scores are in either direction from zero, the more likely the corresponding values are to be outliers. It only becomes difficult once we compare the results to particular probability distributions, which expect certain percentages of their values to occur in specific ranges and therefore make the definition of an outlier less arbitrary in that context; for example, the Gaussian or "normal" distribution, i.e. the bell curve, expects about 68 percent of the values to fall within one standard deviation of the mean, 95 percent within two and 99.7 percent within three.
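…………For readers who want a hard cut-off in spite of that caveat, the fragment below shows one way it might be layered on top of the procedure's output. It is a sketch under a few assumptions of my own: the result set is captured in a temp table whose column types (float for the measurements, bigint for the keys and ranks) are rough guesses rather than anything taken from the original table definitions, and the three-standard-deviation threshold borrowed from the normal distribution is only meaningful if the column is roughly Gaussian.

-- capture the procedure's four-column result set, then flag anything more than three
-- standard deviations from the mean; the data types below are guesses
CREATE TABLE #ZScoreResults (ID bigint, Column1 float, ZScore float, GroupRank bigint)

INSERT INTO #ZScoreResults (ID, Column1, ZScore, GroupRank)
EXEC [Calculations].[ZScoreSP]
       @DatabaseName = 'DataMiningProjects', @SchemaName = 'Physics', @TableName = N'HiggsBosonTable',
       @ColumnName = N'Column1', @PrimaryKeyName = N'ID', @OrderByCode = 6, @DecimalPrecision = '38,28'

SELECT ID, Column1, ZScore,
       CASE WHEN ABS(ZScore) > 3 THEN 1 ELSE 0 END AS IsProbableOutlier
FROM #ZScoreResults
ORDER BY ABS(ZScore) DESC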
…………The procedure took much less time than I expected to execute on the 11 million rows of the HiggsBosonTable (the seven at the tail end are accidental duplicates I've been procrastinating on removing, but they're inconsequential for today's article), given that my poor beat-up development machine is hardly a match for the servers most DBAs use every day. Unfortunately, I was unable to test it with a columnstore index, because I haven't been able to afford to upgrade from SQL Server 2012 Developer Edition to 2014, in which Microsoft lifted many of the restrictions that made columnstore indexes impossible to apply in many user scenarios. The Column1 I was testing in the HiggsBosonTable has a precision of 33 and a scale of 29, so I naturally received this error when trying to create one: "CREATE INDEX statement failed. A columnstore index cannot include a decimal or numeric data type with a precision greater than 18. Reduce the precision of column 'Column1' to 18 or omit column 'Column1'. (Microsoft SQL Server, Error: 35341)." I was, however, able to reduce the client processing time by adding a regular nonclustered index to Column1. The total execution time was higher with the index in place, but only because of an increased Wait Time on Server Replies, which was probably due to pressure on the server from other unrelated tasks. If you click on the pictures of the execution plans to enlarge them, you'll see that the index was used when calculating the standard deviation, the average and the Z-Scores alike.

Figure 3: Client Statistics and Execution Plans for the Z-Score Procedure on the HiggsBosonTable
[Screenshots: client statistics and two execution plans]
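…………For reference, the nonclustered index described above can be recreated with something along these lines; the index name is my own invention, while the schema, table and column names match the Figure 2 parameters.

-- a plain nonclustered index on the column being scored; the name is hypothetical
CREATE NONCLUSTERED INDEX IX_HiggsBosonTable_Column1
ON Physics.HiggsBosonTable (Column1)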

…………In my last post, I cited many articles by professionals who gave a long laundry list of use cases, limitations and nuances of interpretation for Benford's Law. I searched for similar materials on Z-Scores but essentially came up empty, perhaps because they're so well-established, well-understood and trivial to calculate. The one clear exception I saw mentioned in the literature is that fat-tailed or significantly skewed distributions may require the use of less outlier-sensitive techniques. This particular outlier identification method is used most often in conjunction with the normal distribution, but doesn't have to be; out of all the techniques I'll survey in this series, it is perhaps the most flexible and suited to the widest range of use cases. It is fairly easy to understand and interpret, performs well and isn't dependent on any particular cut-off criteria for defining outliers, unlike many others that force unnecessary all-or-nothing choices. This means we can use it to ask how much of an outlier a record is, which is a much more informative question than simply asking whether a record falls inside or outside a single outlier bucket.
…………In the next installment of this series I'll briefly touch on Iglewicz and Hoaglin's Modified Z-Scores, which are one of many twists that can be applied to Z-Scores depending on the use cases at hand. That material should be fairly light and easy, as should some of the posts that follow on stats like Grubbs' Test and the Tietjen-Moore Test that have deceptively scary names. Interquartile Range and Dixon's Q-Test should also be trivial to handle. I'm not sure yet how difficult it will be to explain and code lesser-known measures like the Generalized Extreme Studentized Deviate (ESD) Test, Cook's Distance, Peirce's Criterion, Chauvenet's Criterion and the Modified Thompson Tau Test. Eventually I'll also be doing a recap of how to use SSDM Clustering for finding aberrant data and writing a post tentatively titled Visual Outlier Detection with Reporting Services, which will be full of eye candy rather than T-SQL and equations like the others. I'm not yet sure what order I'll be tackling them all in, except for Mahalanobis Distance, which is apparently a quite sophisticated and useful method that unfortunately has math of commensurate difficulty. If all goes according to plan, climbing that final hurdle will propel me into a much more difficult but useful series, Information Measurement with SQL Server. We will see references to Z-Scores and formulas like them throughout that series, as well as in the remainder of this one.

[i] See Hopkins, Will G., 2001, A New View of Statistics website. The home page for the site is http://www.sportsci.org/resource/stats/index.html.

[ii] Pinal Dave, 2007, "SQL SERVER – CASE Statement in ORDER BY Clause – ORDER BY Using Variable," published July 17, 2007 at the Journey to SQL Authority with Pinal Dave website. Available online at http://blog.sqlauthority.com/2007/07/17/sql-server-case-statement-in-order-by-clause-order-by-using-variable/

[iii] When my family moved out of the home we had in my elementary school days, we never did find the uranium and radium samples that came with my do-it-yourself cloud chamber kit. Oh well, I suppose the new owners either sold it to some North Koreans, or their kids all have seven toes. I suppose I shouldn’t lose hope of finding them again, given that their half-lives were several million years apiece.
