<p><strong>Implementing Fuzzy Sets in SQL Server, Part 1: Membership Functions and the Fuzzy Taxonomy</strong><br />
Multidimensional Mayhem, SQLServerCentral<br />
http://www.sqlservercentral.com/blogs/multidimensionalmayhem/2016/07/06/implementing-fuzzy-sets-in-sql-server-part-1-membership-functions-and-the-fuzzy-taxonomy/<br />
Wed, 06 Jul 2016</p>
<p><strong>By Steve Bolton</strong></p>
<p><span style="font-size:10pt;color:white;">…………</span>In the first installment of this amateur self-tutorial series on applying fuzzy set theory to SQL Server databases, I discussed how neatly it dovetails with Behavior-Driven Development (BDD) principles and user stories. This is another compelling reason to take notice of fuzzy sets, beyond the advantages of using a set-based language like T-SQL to implement them, which will become obvious as this series progresses. There aren’t any taxing mental gymnastics involved in flagging imprecision in natural language statements like “hot,” “cloudy” or “wide,” which is strikingly similar to the way user stories are handled in BDD. What fuzzy sets bring to the table is the ability to handle imprecise data that resides in the no-man’s land between ordinal and continuous Content types. In addition to flagging imprecision in natural language and domain knowledge that is difficult to pin down, it may be helpful to look for attributes which represent categories that are ranked in some way (which sets them apart from nominal data, which is not ordered on any scale) but which it would be beneficial to express on a continuous numerical scale, even at the cost of inexactness. Thankfully, mathematicians have already hashed out a whole framework for modeling this notoriously tricky class of data, even though it is as underused as <a href="https://multidimensionalmayhem.wordpress.com/category/a-rickety-stairway-to-sql-server-data-mining/">the SQL Server Data Mining (SSDM) components I tried to publicize in a previous mistutorial series</a>. It is also fortunate that we already have an ideal tool to implement it with in T-SQL, which can already handle most of the mathematical formulas devised over the last few decades. As I’ll demonstrate in this article, it only takes a few minutes to implement simple membership functions that grade records based on how much they belong to a particular set. 
It is only when we begin combining different types of imprecision together and assigning more nuanced interpretations to the grading systems that complexity quickly arises and the math becomes challenging. Although I’m still learning the topic as I go – I find it is much easier to absorb this kind of material by writing about it – I hope to reduce the challenge involved by taking a stab at explaining it, which will at least help readers avoid repeating my inevitable mistakes.<br />
<span style="font-size:10pt;color:white;">…………</span>The first challenge to overcome is intimidation, because the underlying concepts don’t even require a college education to grasp; in fact, some DBAs have probably already worked with forerunners of fuzzy sets unwittingly, on occasions where they’ve added columns that rate a row’s inclusion in a particular set. It doesn’t take much mental juggling to start thinking explicitly about such attributes as measures of membership in a particular set. Perhaps the simplest forms of membership functions are single columns filled with data that has been assigned that kind of meaning, which can even be derived from such sources as subjective grades assigned by end users in exactly the same manner as movie or restaurant ratings. The data can even be permanently static. At the next level of complexity, we could of course store such data in the form of computed columns, regardless of whether it is read-only or not.<br />
<span style="font-size:10pt;color:white;">…………</span>A couple of really simple restrictions are needed to bring this kind of data into line with fuzzy set theory, though. First, since the whole object is to treat ordinal data as if it were continuous, we’d normally use T-SQL data types like float, numeric and decimal – which are the closest we can get, considering that our finite computers can’t truly handle infinitesimal scales. Furthermore, it is probably wise to stick with the convention of using a scale between 0 and 1, since this enables us to integrate it seamlessly with evidence theory, stochastics, decision theory, control theory and neural net weights, all of which are also typically bounded in the same range or quite similar ones; some of the theoretical resources I consulted mentioned in an offhand way that it is possible to use other scales, but I haven’t seen a single instance of it yet in the literature. Ordinal categories are often modeled in SQL Server in text data types like nvarchar, tinyint codes or some type of foreign key, which might have to be retained alongside the membership function storage column; in other instances, our membership function may be scoring on the basis of several attributes in a table or view, or perhaps all of them. Of course, in many use cases we won’t need to store the membership function value at all, so it will simply be calculated on the fly. If we’re simply storing a subjective rating or whatever, we might only need some sort of interface to allow end users to enter their own numbers (on a continuous scale of 0 to 1), in which case there is no need for a membership function per se. If a table or view participates in different types of fuzzy sets, it may be necessary to add more of these membership columns for each of them, unless you want to calculate the values as you go. 
Simply apply the usual rules of data modeling and principles of performance maximization to determine the strategies that fit your use cases best.</p>
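<p>To make those storage strategies concrete, here is a minimal sketch of my own (the table and column names are hypothetical illustrations, not drawn from any real dataset): a plain column for subjective end-user ratings constrained to the 0-to-1 convention, alongside a computed column that derives a membership grade from an ordinal category.</p>

```sql
-- Hypothetical illustration of the simplest membership storage strategies.
-- UserRating holds a subjective score entered on a continuous 0-to-1 scale;
-- StarMembership derives a membership grade from an ordinal tinyint category.
CREATE TABLE dbo.RestaurantRatings
(ID bigint IDENTITY(1,1) PRIMARY KEY,
RestaurantName nvarchar(50),
StarCount tinyint NOT NULL,                                  -- ordinal category, 0 to 5
UserRating decimal(38,6) CHECK (UserRating BETWEEN 0 AND 1), -- subjective membership score
StarMembership AS CAST(StarCount AS decimal(38,6)) / 5.0     -- computed membership on the 0-to-1 scale
)
```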
<p style="text-align:center;"><strong>Selecting Membership Functions</strong></p>
<p>That is all kid stuff for most DBAs. The challenges begin when we try to identify which membership function would be ideal for the use cases at hand. Since the questions being asked of the data vary from one problem to the next, I cannot possibly answer that. I suppose that you could say the general rule of thumb with membership functions is that the sky’s the limit, as long as we stay at an altitude between 0 and 1. Later in this series I’ll demonstrate how to use particular classes of functions called T-norms and T-conorms, since various mathematical theorems demonstrate that they’re ideal for implementing unions and intersections, but even in these cases, there are so many available to us that the difficulty consists chiefly in selecting an appropriate match to the problem you’re trying to solve. There might be more detailed guidelines available from more recent sources, but my favorite reference for the math formulas, George J. Klir and Bo Yuan’s classic <em>Fuzzy Sets and Fuzzy Logic: Theory and Applications</em>, provides some suggestions. For example, membership values can be derived from sample data through Lagrange interpolation and two methods I have used before, least-squares curve fitting and neural networks.[1] They also discuss how to aggregate the opinions of multiple experts using both direct and indirect methods of collection, in order to ascertain the meaning of fuzzy language terms. The specifics they provide get kind of involved, but it is once again not at all difficult to implement the premises in a basic way; a development team could, for example, reach a definition of the inherently fuzzy term “performance” by scoring their opinions, then weighting them by the authority of their positions.[2] The trick is to pick a mathematical operation that pools them all together into a single value that stays on a scale of 0 to 1, while still capturing the meaning in a way that is relevant to the problem at hand.<br />
<span style="font-size:10pt;color:white;">…………</span>Klir and Yuan refer to this as an application of the newborn field of “knowledge engineering,”[3] which has obvious connections to expert systems. Since fuzzy set theory is still a wide-open field, there’s a lot of latitude for inventing your own functions; there might be an optimal function that matches the problem at hand, but no one may have discovered it yet. In situations like these, my first choice would be neural nets, since I saw spectacular evidence long ago of how they can be ideal for modeling unknown functions (which pretty much sparked my interest in data mining). Before trying one of these advanced approaches, however, it might be wise to think hard about what mathematical properties you require of your outputs and then consult a calculus book or other math reference to try to find a matching function. While trying to teach myself calculus all over again recently, I was reintroduced to the whole smorgasbord of properties that distinguish mathematical functions from each other, like differentiability, integrability, monotonicity, analyticity, concavity, subadditivity, superadditivity, discontinuity, splines, super- and subidempotence and the like. You’ll encounter these terms on every other page in fuzzy set math references, which can be differentiated (pun intended?) into broad categories like function magnitude, result, shape and mapping properties. One thing I can help with is to caution that it’s often difficult or even impossible to implement ones (like the popular gamma function) which require calculations of permutations or combinations. It doesn’t matter whether you’re talking about T-SQL, Visual Basic, C# or some computer language implemented outside of the Microsoft ecosystem: it only takes very small input values before you reach the boundaries of the highest data types. This renders certain otherwise useful data mining and statistical algorithms essentially useless in the era of Big Data. 
An exclamation point in a math formula ought to elicit a groan, because the highest value you might be able to plug into a factorial function in SQL Server is about 170.</p>
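<p>A quick T-SQL sketch of that ceiling: the float type tops out near 1.79E+308, and since 170! is roughly 7.26E+306 while 171! exceeds that range, the loop below succeeds at 170 but raises an arithmetic overflow error if @N is raised to 171.</p>

```sql
-- Computing a factorial iteratively in float arithmetic.
-- @N = 170 works; @N = 171 overflows the float data type.
DECLARE @N int = 170, @Counter int = 2, @Factorial float = 1

WHILE @Counter <= @N
BEGIN
       SET @Factorial = @Factorial * @Counter
       SET @Counter = @Counter + 1
END

SELECT @Factorial AS Factorial170 -- roughly 7.26E+306
```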
<p style="text-align:center;"><strong>A Trivial Example with Two Membership Functions</strong></p>
<p>I’ll provide an example here of moderate difficulty, in between the two extremes of advanced techniques like least squares (or God forbid, the gamma function and its relatives) on the one hand and cheesy screenshots of an ordinary table that just happens to have a float column scored between 0 and 1 on the other. As we’ll see in the next few tutorials on fuzzy complements, unions, intersections and the like, when calculating set memberships on the fly we usually end up using a lot of CASE, BETWEEN and MIN/MAX statements in T-SQL, but that won’t be the case in the example below because the values are derived from a stored procedure and stored in two table variables. To demonstrate how seamlessly fuzzy set techniques can be integrated with standard outlier detection techniques, I’ll recycle the code from my old tutorial <a href="https://multidimensionalmayhem.wordpress.com/2014/10/28/outlier-detection-with-sql-server-part-2-1-z-scores/">Outlier Detection with SQL Server, part 2.1: Z-Scores</a> and use it as my membership function.<br />
<span style="font-size:10pt;color:white;">…………</span>There’s a lot of code in Figure 1, but it’s really easy to follow, since all we’re doing is running the Z-Scores procedure on a dataset on the Duchennes form of muscular dystrophy I downloaded from <a href="http://biostat.mc.vanderbilt.edu/wiki/Main/DataSets">Vanderbilt University’s Department of Biostatistics</a> a couple of tutorial series ago, which now occupies about 9 kilobytes of space in a sham DataMiningProjects database. There’s probably a more efficient way of going about this, but the results are stored in a table variable and the @RescalingMax, @RescalingMin and @RescalingRange variables and the ReversedZScore column are then used to normalize the Z-Score on a range of 0 to 1 (the GroupRank column was needed for the stored procedure definition in the original Z-Scores tutorial, but can be ignored in this context). To illustrate how we can combine fuzzy set approaches together in myriad combinations, I added an identical table that holds Z-Scores for a second column from the same dataset, which is rescaled in exactly the same way. In the subquery SELECT I merely multiply the two membership values together to derive a CombinedMembershipScore. What this essentially does is give us a novel means of multidimensional outlier detection.</p>
<p><strong><u>Figure 1: Using Z-Scores for Membership Functions</u></strong></p>
<pre>
DECLARE @RescalingMax decimal(38,6), @RescalingMin decimal(38,6), @RescalingRange decimal(38,6)
DECLARE @ZScoreTable1 table
(ID bigint IDENTITY (1,1),
PrimaryKey sql_variant,
Value decimal(38,6),
ZScore decimal(38,6),
ReversedZScore as CAST(1 as decimal(38,6)) - ABS(ZScore),
MembershipScore decimal(38,6),
GroupRank bigint
)

DECLARE @ZScoreTable2 table
(ID bigint IDENTITY (1,1),
PrimaryKey sql_variant,
Value decimal(38,6),
ZScore decimal(38,6),
ReversedZScore as CAST(1 as decimal(38,6)) - ABS(ZScore),
MembershipScore decimal(38,6),
GroupRank bigint
)

INSERT INTO @ZScoreTable1
(PrimaryKey, Value, ZScore, GroupRank)
EXEC Calculations.ZScoreSP
       @DatabaseName = N'DataMiningProjects',
       @SchemaName = N'Health',
       @TableName = N'DuchennesTable',
       @ColumnName = N'CreatineKinase',
       @PrimaryKeyName = N'ID',
       @DecimalPrecision = '38,32',
       @OrderByCode = 8

INSERT INTO @ZScoreTable2
(PrimaryKey, Value, ZScore, GroupRank)
EXEC Calculations.ZScoreSP
       @DatabaseName = N'DataMiningProjects',
       @SchemaName = N'Health',
       @TableName = N'DuchennesTable',
       @ColumnName = N'LactateDehydrogenase',
       @PrimaryKeyName = N'ID',
       @DecimalPrecision = '38,32',
       @OrderByCode = 8

-- RESCALING FOR COLUMN 1
SELECT @RescalingMax = Max(ReversedZScore), @RescalingMin = Min(ReversedZScore) FROM @ZScoreTable1
SELECT @RescalingRange = @RescalingMax - @RescalingMin

UPDATE @ZScoreTable1
SET MembershipScore = (ReversedZScore - @RescalingMin) / @RescalingRange

-- RESCALING FOR COLUMN 2
SELECT @RescalingMax = Max(ReversedZScore), @RescalingMin = Min(ReversedZScore) FROM @ZScoreTable2
SELECT @RescalingRange = @RescalingMax - @RescalingMin

UPDATE @ZScoreTable2
SET MembershipScore = (ReversedZScore - @RescalingMin) / @RescalingRange

SELECT ID, PrimaryKey, Value, ZScore1, ZScore2, MembershipScore1, MembershipScore2, CombinedMembershipScore
FROM (SELECT T1.ID, T1.PrimaryKey, T1.Value, T1.ZScore AS ZScore1, T2.ZScore AS ZScore2,
       T1.MembershipScore AS MembershipScore1, T2.MembershipScore AS MembershipScore2,
       T1.MembershipScore * T2.MembershipScore AS CombinedMembershipScore
       FROM @ZScoreTable1 AS T1
       INNER JOIN @ZScoreTable2 AS T2
       ON T1.ID = T2.ID) AS T3
WHERE CombinedMembershipScore IS NOT NULL
</pre>
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:blue;">ORDER</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:blue;">BY</span> <span class="SpellE"><span style="color:teal;">CombinedMembershipScore </span></span><span style="color:blue;">DESC</span></span></p>
<p class="MsoNormal"><span style="font-size:9.5pt;font-family:Consolas;color:green;">— <span class="GramE">if</span> we want to store the values in the original table, we can use code like this:<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:fuchsia;">UPDATE</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:teal;">T4<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:blue;">SET</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:teal;">T4</span><span style="color:gray;">.</span><span style="color:teal;">MembershipScore1</span> <span style="color:gray;">=</span> <span style="color:teal;">T3</span><span style="color:gray;">.</span><span style="color:teal;">MembershipScore1</span><span style="color:gray;">,</span> <span style="color:teal;">T4</span><span style="color:gray;">.</span><span style="color:teal;">MembershipScore2</span> <span style="color:gray;">=</span> <span style="color:teal;">T3</span><span style="color:gray;">.</span><span style="color:teal;">MembershipScore2</span><span style="color:gray;">,</span> <span style="color:teal;">T4</span><span style="color:gray;">.</span><span style="color:teal;">CombinedMembershipScore</span> <span style="color:gray;">=</span><br />
<span style="color:teal;">T3</span><span style="color:gray;">.</span><span style="color:teal;">CombinedMembershipScore<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:blue;">FROM</span><span style="font-size:9.5pt;font-family:Consolas;"> <span class="SpellE"><span style="color:teal;">DataMiningProjects</span><span style="color:gray;">.</span><span style="color:teal;">Health</span><span style="color:gray;">.</span><span style="color:teal;">DuchennesTable </span></span><span style="color:blue;">AS</span> <span style="color:teal;">T4<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:gray;">INNER</span> <span style="color:gray;">JOIN</span> </span><span style="font-size:9.5pt;font-family:Consolas;color:blue;"> </span><span style="font-size:9.5pt;font-family:Consolas;color:gray;">(</span><span style="font-size:9.5pt;font-family:Consolas;color:blue;">SELECT</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:teal;">T1</span><span style="color:gray;">.</span><span style="color:teal;">PrimaryKey</span><span style="color:gray;">,</span> <span style="color:teal;">T1</span><span style="color:gray;">.</span><span style="color:teal;">MembershipScore</span> <span style="color:blue;">AS</span> <span style="color:teal;">MembershipScore1</span><span style="color:gray;">,</span> <span style="color:teal;">T2</span><span style="color:gray;">.</span><span style="color:teal;">MembershipScore</span> <span style="color:blue;">AS</span> <span style="color:teal;">MembershipScore2</span><span style="color:gray;">,</span> <span style="color:teal;">T1</span><span style="color:gray;">.</span><span style="color:teal;">MembershipScore</span> <span style="color:gray;">*</span> <span style="color:teal;">T2</span><span style="color:gray;">.</span><span style="color:teal;">MembershipScore</span> <span style="color:blue;">AS</span> <span class="SpellE"><span style="color:teal;">CombinedMembershipScore<br />
</span></span></span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:blue;">FROM</span> <span style="color:teal;">@ZScoreTable1</span> <span style="color:blue;">AS</span> <span style="color:teal;">T1<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:gray;">INNER </span><span style="color:gray;">JOIN</span> <span style="color:teal;">@ZScoreTable2 </span><span style="color:blue;">AS</span> <span style="color:teal;">T2<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:blue;">ON </span><span style="color:teal;">T1</span><span style="color:gray;">.</span><span style="color:teal;">ID</span> <span style="color:gray;">=</span> <span style="color:teal;">T2</span><span style="color:gray;">.</span><span style="color:teal;">ID</span><span style="color:gray;">)</span> <span style="color:blue;">AS</span> <span style="color:teal;">T3<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:blue;">ON</span> <span style="color:teal;">T4</span><span style="color:gray;">.</span><span style="color:teal;">ID</span> <span style="color:gray;">=</span> <span style="color:teal;">T3</span><span style="color:gray;">.</span><span style="color:teal;">PrimaryKey</span></span></p>
<p class="MsoNormal"><span style="font-size:9.5pt;font-family:Consolas;color:red;"> </span><strong><u>Figure 2: Sample Results from the Duchennes Practice Data<br />
<a href="https://multidimensionalmayhem.wordpress.com/2016/07/06/implementing-fuzzy-sets-in-sql-server-part-1-membership-functions-and-the-fuzzy-taxonomy/combined-membership-function-example/" rel="attachment wp-att-602"><img class="alignnone size-full wp-image-602" src="https://multidimensionalmayhem.files.wordpress.com/2016/07/combined-membership-function-example.jpg?w=604&h=204" alt="Combined Membership Function Example" width="604" height="204" /></a></u></strong></p>
<p><span style="font-size:10pt;color:white;">…………</span>Figure 2 gives a glimpse of what the original DuchennesTable might look like if we wanted to store these values rather than calculate them on the fly, which can be accomplished by adding the three float columns on the right to the table definition and executing the UPDATE code at the end of Figure 1. In natural language, we might say that “the first record is 0.941446<sup>th</sup> of a member in the set around the average Creatine Kinase value” but “the fifth record is only 0.764556<sup>th</sup> of a member of the set near the mean Lactate Dehydrogenase value.” We could even model deeper levels of imprecision by creating categories like “near” for the high membership values in each column and “outlier” for the lowest ones, then define their boundaries in terms of fuzzy sets. This might be an ideal use for triangular and trapezoidal numbers, which can be worth the expense in extra code, as I’ll explain a few articles from now. We’re also modeling a different type of imprecision in another sense, because we know instinctively that there ought to be some way of gauging whether or not a record’s an outlier when both columns are taken into account; perhaps nobody knows precisely what the rules for constructing such a metric might be, but the CombinedMembershipScore at least allows us to get on the board.<br />
<span style="font-size:10pt;color:white;">…………</span>Please keep in mind that I’m only using Z-Scores here because it’s familiar to me and is ideal for illustrating how fuzzy sets can be easily adapted to one particular use case, outlier detection. If we needed to make inferences about how well the data fit a gamma or exponential distribution, we might as well have used the corresponding goodness-of-fit tests and applied some rescaling techniques to derive our membership values; if we needed to perform fuzzy clustering, we could have plugged in a Manhattan distance function or one of its relatives. Fuzzy set memberships are often completely unrelated to stochastics and should not be interpreted as probabilities unless you specific intend to model them. The usefulness of fuzzy sets is greatly augmented when we move beyond mere set membership by tweaking the meaning a little, so that they can be interpreted as degrees of evidence, reliability, risk, desirability, or the like, which allow us to plug into various other well-developed mathematical theories. All functions can be differentiated by their return types, number of return and input values, allowable data types and ranges, mathematical properties and the like (not to mention performance costs), but in fuzzy set theory the issue of meaning has a somewhat more prominent role. In some cases, it may even be desirable to use multiple membership functions to determine membership in one fuzzy set, as in my crude example above. These myriad shades of meaning and potential for combinations of them lead to a whole new level of complexity, which may nonetheless be worthwhile to wade through for certain imprecision modeling problems.</p>
<p style="text-align:center;"><strong>A Taxonomy of Fuzzy Sets (that Doesn’t Tax the Brain)</strong></p>
<p> I originally figured that I’d have to organize this series according to a taxonomy of different types of fuzzy sets, but it’s actually fairly simple to sketch the outlines of that otherwise advanced topic. Instead of delving into all of the complex math, it’s a lot easier for a layman to dream up all of the combinations of places in a set where fuzziness can be applied, the different means of encoding it and so on. The important thing to keep in mind is that there’s probably a term out there for whatever combination you’re using and that somewhere along the line, mathematicians have probably already figured out most of the logical implications decades ago (thereby sparing you a lot of the grunt work of reinventing the wheel, assuming that you can interpret their writing and the really thick formulas that often accompany them). The easiest ones to explain are real-valued and interval sets, in which the membership functions are determined on the real number line (which is all we ever encounter in SQL Server) or by a range of values on it.[4] Type-2 Fuzzy Sets illustrate the concept of tacking on further fuzziness perfectly – all we do is take an interval-valued set and then assign grades to its boundaries as well. Fuzzy set theorists Yingjie Yang and Chris Hinde state that “A type-2 fuzzy set describes its memberships using type-1 fuzzy sets, but it needs precise crisp values to describe its secondary memberships.”[5] As the levels and number of values needed to define these sets proliferate, the performance costs do as well, so one has to be sure in advance that the extra complexity is useful in modeling real-world data. As Klir and Yuan put it, “Fuzzy sets of type 2 possess a great expressive power, and, hence, are conceptually quite appealing. However, computational demands for dealing with them are even greater than those for dealing with interval-valued sets. 
This seems to be the primary reason why they have almost never been utilized in any applications.”[6] I’d wager that’s still true, given the fact that the applications of ordinary fuzzy sets to data mining, data warehousing and relational databases have barely been scratched since the mathematicians invented these things years ago.<br />
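For illustration only, an interval-valued set can be sketched in T-SQL by storing a pair of membership bounds per record rather than a single grade; the bounds below are fabricated:

```sql
-- A speculative sketch of an interval-valued fuzzy set: each record carries a
-- lower and upper membership bound instead of one crisp grade. A Type-2 set
-- would tack on a further function grading points within that interval.
SELECT Value, LowerMembership, UpperMembership,
       -- the interval width is one rough gauge of the extra uncertainty being modeled
       UpperMembership - LowerMembership AS IntervalWidth
FROM (VALUES (100.0, 0.6, 0.9), (150.0, 0.95, 1.0), (220.0, 0.1, 0.4))
     AS T(Value, LowerMembership, UpperMembership);
```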
<span style="font-size:10pt;color:white;">…………</span>Rough sets also involve fuzzy values on intervals in a sense, but they model approximate distinctions between objects. Say, for example, you classify all of the objects in a child’s bedroom and want to see which qualify as part of a set labeled Toys. A sports car might be considered an adult toy to a certain degree, depending on such factors as whether or not the owner uses it for purposes other than occasional joy rides. The plastic dinosaurs and megafauna in a <a href="http://www.dinosaurcollector.150m.com/marx.htm">Prehistoric Playset </a>are certainly toys, as are <a href="https://www.pinterest.com/mgmagnolia/fisher-price-1960s-1970s-1980s-toys/">Fisher Price’s wooden people</a> (well, cheap plastic these days). Medicine definitely wouldn’t belong to the set (at least according to <a href="https://www.youtube.com/watch?v=e3zds9zaDBc">these singing pills</a>). Would one of these <a href="http://www.clubtokyo.org/listings/itemListingRpt.php?catID=4&subCatID=93&contentID=216">classic glow-in-the-dark Godzilla models</a> from the ‘70s qualify? Well, that’s not quite clear, since it’s an object only a child would really appreciate, but they’re unlikely to actually play with it as a toy very often, since it’s designed to stay on display. They could conceivably take them off the shelf and pit them against the Fisher Price people; in this instance, the set membership might be defined by criteria as fuzzy as the whims of a child’s imagination, but we have tools to model it, if a need should arise. The definition of the attribute is in question, not whether a particular row belongs to a set, which is the case with ordinary fuzzy membership functions.<br />
<span style="font-size:10pt;color:white;">…………</span>In Soft Sets, the characteristics that define the set are themselves fuzzy. I haven’t attempted to model those yet, but I imagine it may require comparisons between tables and views and placing weights on how comparable their different columns are to each other, rather than the rows. Here’s a crude and possibly mistaken example I came up with off the top of my head: in Soft Sets you might have a table with columns for Height, Width and Age and another with columns for Height, Width and Time, in which the first two columns of each are completely related to each and therefore are assigned weights of one, whereas Age and Time are only tangentially related and therefore might be assigned a weight somewhere between 0 and 1. Near sets apparently address a problem tangential to rough and soft sets, by quantifying the quantity and quality of resemblances between objects that might belong to a fuzzy set. Once we’ve been introduced to these concepts, they can obviously be combined together into an endless array of variants, which go by such mouthfuls as “rough intuitionistic Level-2 fuzzy near sets.” Just keep in mind that it is more common to encounter such structures in the real world in everyday language than it is to know the labels and their mathematical properties. It is also easier than it sounds to implement them in practice, if we’re using set-based tools like T-SQL that are ideal for the job.<br />
<span style="font-size:10pt;color:white;">…………</span>I probably won’t spend much time in this series on even more sophisticated variants that might nonetheless be useful in modeling particular problems. Shadowed sets used multidimensional projections to qualify the lack of knowledge of whether or not a data point belongs to a fuzzy set. Neural nets are a cutting-edge topic I hope to tackle on this blog in a distant future (my interest in data mining was piqued way back in the 1990s when I saw some I cooked up at home do remarkable things) but it is fairly easy to describe Neuro-Fuzzy Sets, in which we’re merely using neural nets to perform some of the functions related to fuzzy sets. The combinations that can be derived from are limited only by one’s imagination; there are already neural nets in use in industry today that use fuzzy functions for activation and fuzzy sets whose membership values are derived from neural nets, and so forth. Undetermined and Neutrosophic Logic are variants of fuzzy logic that can be applied to fuzzy sets if we need to model different types of indeterminacy, which is a topic I’ll take up in a future article on how fuzzy sets can be put to good use in uncertainty management.<br />
<span style="font-size:10pt;color:white;">…………</span>Blurry sets are a recent innovation designed to incorporate the kind of combinations of fuzziness we’ve just mentioned, but without sacrificing the benefits of normal logic – which might be of great benefit in the long run, since the value of some of recently developed logical systems is at best unproven.[7] Some will probably be substantiated in the long run, but some seem to be motivated by the sort of attention-getting shock value that can make academicians famous overnight these days (some of them seem to be implementations and formal defenses of solipsism, i.e. one of the defining characteristics of schizophrenia). Q-Sets are apparently an even more advanced variants developed for use in the strange world of quantum physics; since making Schrödinger’s cat disappear isn’t among most SQL Server users’ daily duties, I’ll leave that one out for now. I’ll probably also steer away from discussing more advanced types of fuzzy sets that include multiple membership functions, which aren’t referenced often in the literature and apparently are implemented only in rare circumstances.. Intuitionistic Sets have two, one for membership or non-membership, while Vague Sets also use two, except in that case one assesses the truth of the evidence for a record’s membership and the other its falsehood; I presume truth tables and the like are then built from the two values. A novel twist on this theme is the use of multiple membership functions to model the fact that the programmer is uncertain of which membership functions to use in defining fuzzy sets.[8] Multisets are often lumped in with the topic of fuzzy sets, but since they’re just sets that allow duplicate values, I don’t see much benefit in discussing them here. 
Genuine sets take fuzziness to a new level in an entirely different way, by generalizing the concept of a fuzzy set in the same manner that fuzzy sets generalize ordinary “crisp” sets, but I won’t tack on another layer of mathematical complexity at this point, not when the potential for using the established methods of generalization has barely been scratched.</p>
<p style="text-align:center;"><strong>False Mysticism and the Fuzzy Mystique</strong></p>
<p> This wide-open field is paradoxically young in terms of mathematical intellectual history, but overripe for implementation, given that many productive uses for it were derived decades ago but haven’t percolated down from academia yet. Taking a long view of the history of math, it seems that new waves of innovation involve the addition of new dimensions to existing objects. Leonhard Euler brought complex numbers into the mathematical mainstream in the 18<sup>th</sup> Century, then theoreticians like Bernhard Riemann and Charles Hinton contributed the concepts of higher-dimensional space and its curvature in the 19<sup>th</sup>. Around the same time, Georg Cantor was working out set theory and such mind-blowing structures as infinities of differing cardinalities and transfinite numbers. More recently, Benoit Mandelbrot elaborated the theory of fractional dimensions, which are now cornerstones in chaos theory and modern art, where they go by the better-known term of fractals. This unifying principle of mathematical innovation stretches back as far as ancient Greece, when concepts like infinity, continuous scales and the like were still controversial; in fact, the concept of zero did not reach the West until it was imbibed from Arab sources in the Middle Ages. Given that zero was accepted so late in history, it is not at all surprising that negative numbers were often derided by Western mathematicians as absurdities well into the 18<sup>th</sup> and 19<sup>th</sup> Centuries, many centuries after their discovery by Chinese and Indian counterparts.[9] A half-conscious prejudice against the infinite regress of non-repeating digits in pi and Euler’s number is embedded in the moniker they still go by today, “irrational numbers.” The same culprit is behind the term “imaginary number” as well. 
Each of these incredibly useful innovations was powered by the extension of scales into previously uncharted territory; each was also met by derision and resistance at first, as were fuzzy sets to a certain extent after their development by 20<sup>th</sup> Century theoreticians like Max Black and Lotfi A. Zadeh.<br />
<span style="font-size:10pt;color:white;">…………</span>Many of these leaps forward were also accompanied by hype and as sort of unbalanced intellectual intoxication, which is the main risk in using these techniques. Fuzzy sets are unique, however, in that some of the pioneers were conscious of the possibility of leveraging the term “fuzzy” for attention; Zadeh openly acknowledges that the term has its uses in terms of publicity power, although he did not originally invent the term for that purpose. The strategy has backfired to a certain extent, however, by drawing the wrong kind of attention. “Fuzzy” is a term that immediately conjures up many alternative images, many of which don’t seem conducive to a high-powered, mission-critical production environment – like teddy bears, static, 1970s cop shows and something out of <a href="https://www.youtube.com/watch?v=IhfH9RA5dnM">the back of George Carlin’s fridge</a>.<br />
<span style="font-size:10pt;color:white;">…………</span>Many of the taxonomic terms listed above also carry a kind of shock value to them; in other branches of academia this usually signifies that the underlying theory is being overstated or is even the product of crackpots with tenure, but in this case there is substantial value once the advertising dross has been stripped away. In fact, I’d wager that if more neutral terms like “graded set” or “continuously-valued set” were used in place of “fuzzy,” these techniques would be commonplace today in a wide variety of industries, perhaps even database management; in this case, the hype has boomeranged by stunting the adoption of an otherwise indispensable set of tools. As McNeill points out, some of the researchers employed in implementing fuzzy sets in various industries (including the development of the space shuttle) back in the early ‘90s had to overcome significant institutional resistance from “higher-ups” who “fretted about image.”[10] They are right to fret within reason, because these tools can certainly be misapplied; in fact, I’ve seen brilliant theorists who grasp the math a lot better than I do abuse it in illogical ways (for the sake of being charitable, I don’t want to call them out by name). Some highly regarded intellectuals don’t recognize <em>any</em> boundaries to the theory, for all of reality is fuzzy in their eyes – which is the mark of fanaticism, and certain to stiffen any institutional resistance on the other side. Every mathematical innovation in history has not only been accompanied by knee-jerk opposition from Luddites on one side, but also unwarranted hype and irrational exuberance on the other; fuzzy sets are as susceptible to misuse by bad philosophers and fanatics as higher dimensions, chaos theory and information theory have been for decades, so it is not unwise to tread carefully and maintain intellectual sobriety when integrating fuzzy sets into any development process.<br />
<span style="font-size:10pt;color:white;">…………</span>Perhaps the best way to overcome this kind of institutional resistance and receive backing for these techniques is to be up front and demonstrate that you recognize the hype factor, plus have clear litmus tests for discerning when and when not to apply fuzzy set theory. Two of these are the aforementioned criteria of searching for data that resides in between ordinal and continuous data in the hierarchy of Content types and sifting through natural language terms for imprecision modeling. It is also imperative to develop clear standards for differentiating between legitimate and illegitimate uses of fuzzy sets, to prevent the main risk: “fuzzifying” data that it is inherently crisp. It is indeed possible to add graded boundaries to any mathematical objects (some of which we’ll explore later in this series), but in many cases, there is no need to bother. Fuzzy logic in the wrong doses and situations can even lead to fallacious conclusions. In fact, applying fuzziness to inherently crisp objects and vice-versa is one of the fundamental strategies human beings have employed since time immemorial to deceive both themselves and others. Here’s a case in point we’ve all seen: you tell your son or daughter they can’t have a snack, but you catch them eating crackers; invariably, their excuse involves taking advantage of the broad interval inherent in the term “snack,” a set which normally, but not always, included crackers. Of course, when people grow up they sometimes only get more skilled at blurring lines through such clever speech (in which case they often rise high in politics, Corporate America and the legal profession). Here’s an important principle to keep in mind: whenever you see a lot of mental energy expended to tamper with the definitions of things, but find the dividing lines <em>less</em> clear afterwards, then it’s time to throw a red flag. 
The whole point of fuzzy sets is not to obscure clear things, but to clear up the parts that remain obscure. Fuzziness is in exactly the same boat as mysticism, which as G.K. Chesterton once said, is only useful when it explains mysteries:</p>
<blockquote><p> “A verbal accident has confused the mystical with the mysterious. Mysticism is generally felt vaguely to be itself vague—a thing of clouds and curtains, of darkness or concealing vapours, of bewildering conspiracies or impenetrable symbols. Some quacks have indeed dealt in such things: but no true mystic ever loved darkness rather than light. No pure mystic ever loved mere mystery. The mystic does not bring doubts or riddles: the doubts and riddles exist already…The mystic is not the man who makes mysteries but the man who destroys them. The mystic is one who offers an explanation which may be true or false, but which is always comprehensible—by which I mean, not that it is always comprehended, but that it always can be comprehended, because there is always something to comprehend.”[11]</p></blockquote>
<p><span style="font-size:10pt;color:white;">…………</span>Fuzzy sets are not meant to mystify; they’re not nebulous or airy, but designed to squeeze some clarity out of apparently nebulous or airy data and logic. They are akin to spraying Windex on a streaky windshield; if you instead find your vision blocked by streaks of motor oil, it’s time to ask who smeared it there and what their motive was. Fuzziness isn’t an ingredient you add to a numerical recipe to make it better; it’s a quality inherent in the data, which is made clearer by modeling the innate imprecision that results from incomplete measurement, conflicting evidence and many other types of uncertainty. The point is not to make black and white into grey, but to shine a light on it, so that we can distinguish the individual points of black and white that make up grey, which is just a composite of them. These techniques don’t conjure up information; they only ensure that what little information is left over after we’ve defined the obvious crisp sets doesn’t go to waste. Fuzziness can actually arise from a surfeit of detail or thought, rather than a deficit or either; the definition of an object may be incomplete because so many sense impressions, images, stray thoughts, academic theories and whatnot are attached to its meaning that we can neither include them all nor leave any out.<br />
<span style="font-size:10pt;color:white;">…………</span>As we shall see in future articles on uncertainty management, the manner in which the meaning of set membership can be altered to incorporate evidence theory and the like is indeed empowering, but calls for a lot of mental rigor to resist unconscious drifts in definition. It’s an all-too human problem that can occur to anyone, particular when mind-blowing topics are under discussion; it’s even noticeable at times in the writings of brilliant quantum physicists, who sometimes unconsciously define their terms slightly differently at the beginning of a book than at the end, in ways that nonetheless make all the difference between Schrödinger’s Cat being alive or dead. “Definition drift” also seems to be a Big Problem in Big Analysis for the same reason. It likewise seems to occur in texts on fuzzy sets, where term “fuzz” is often accurately described on one page as a solution to innate imprecision, but on the next, is unconsciously treated as if it were a magic potion that ought to be poured on everything. Another pitfall is getting lost in all of the bewildering combinations of fuzziness I introduced briefly in the taxonomy, but the answer to that is probably to just think of them in terms of ordinary natural language and only use the academic names when sifting through the literature for appropriate membership functions and the like. Above all, avoiding modeling crisp sets that have inherently Boolean yes-or-no membership values as fuzzy sets, because as the saying goes, you can’t be “a little bit pregnant.” Continuous scales can certainly be added to any math object, but if the object being modeled is naturally precise, then it is at best a waste of resources that introduces the risk of fallacious reasoning and at worst, an opening for someone with an axe to grind to pretend a particular scale is much more imprecise than it really is. 
One dead giveaway is the use of short scales in comparison to the length of the original crisp version. For example, this is the culprit when quibbling erupts over such obviously crisp sets as “dead” and “alive,” on the weak grounds that brain death takes a finite amount of time, albeit just a fraction of a person’s lifespan. It might be possible to develop a Ridiculousness Score by comparing the difference in intervals between those few moments, which occur on an almost infinitesimal scale, against an “alive” state that can span 70-plus years in human beings or the “dead” state, which is always infinite. I haven’t seen that done in the literature, but in two weeks, I’ll demonstrate how the complements of fuzzy sets can be used to quantify just how imprecise our fuzzy sets are. The first two installments of this series were lengthy and heavy on text because we needed a solid grounding in the meaning of fuzzy sets before proceeding to lessons in T-SQL code, but the next few articles will be much shorter and immediately beneficial to anyone who wants to put it into action.</p>
<p>[1] pp. 290-293, Klir, George J. and Yuan, Bo, 1995, <u>Fuzzy Sets and Fuzzy Logic: Theory and Applications</u>. Prentice Hall: Upper Saddle River, N.J.</p>
<p>[2] <em>IBID.</em>, pp. 287-288, 292-293.</p>
<p>[3] <em>IBID.</em>, p. 281.</p>
<p>[4] For a quick introduction to the various fuzzy set types, see the <u>Wikipedia</u> article Fuzzy Sets at <a href="http://en.wikipedia.org/wiki/Fuzzy_set">http://en.wikipedia.org/wiki/Fuzzy_set</a>. I consulted it to make sure that I wasn’t leaving out some of the newer variants that came out since Klir and Yuan and some of the older fuzzy set literature I’ve read, much of which dates from the 1990s. I lost some of the citations to the notes I derived these three paragraphs from (so my apologies go out to anyone I might have inadvertently plagiarized) but nothing I said here can’t be looked up quickly on Wikipedia, Google or any recent reference on fuzzy sets.</p>
<p>[5] Hinde, Chris and Yang, Yingjie, 2000, A New Extension of Fuzzy Sets Using Rough Sets: R-Fuzzy Sets, pp. 354-365 in <u>Information Sciences</u>, Vol. 180, No. 3. Available online at the web address <a href="https://dspace.lboro.ac.uk/dspace-jspui/bitstream/2134/13244/3/rough_m13.pdf">https://dspace.lboro.ac.uk/dspace-jspui/bitstream/2134/13244/3/rough_m13.pdf</a></p>
<p>[6] p. 17, Klir and Yuan.</p>
<p>[7] Smith, Nicholas J. J., 2004, Vagueness and Blurry Sets, pp 165-235 in <u>Journal of Philosophical Logic</u>, April 2004. Vol. 33, No. 2. Multiple online sources are available at <a href="http://philpapers.org/rec/SMIVAB" rel="nofollow">http://philpapers.org/rec/SMIVAB</a></p>
<p>[8] See Pagola, Miguel; Lopez-Molina, Carlos; Fernandez, Javier; Barrenechea, Edurne; Bustince, Humberto, 2013, “Interval Type-2 Fuzzy Sets Constructed From Several Membership Functions: Application to the Fuzzy Thresholding Algorithm,” pp. 230-244 in <u>IEEE Transactions on Fuzzy Systems</u>, April, 2013. Vol. 21, No. 2. I haven’t read the paper yet (I simply can’t afford access to many of these sources) but know of its existence.</p>
<p>[9] See Rogers, Leo, 2014, “The History of Negative Numbers,” published online at the <u>NRICH.com</u> web address <a href="http://nrich.maths.org/5961">http://nrich.maths.org/5961</a>.</p>
<p>[10] pp. 261-262, McNeill.</p>
<p>[11] Chesterton, G.K., 1923, <u>St Francis of Assisi</u>. Published online at the <u>Project Gutenberg</u> web address <a href="http://gutenberg.net.au/ebooks09/0900611.txt" rel="nofollow">http://gutenberg.net.au/ebooks09/0900611.txt</a></p>Implementing Fuzzy Sets in SQL Server, Part 0: The Buzz About Fuzz
http://www.sqlservercentral.com/blogs/multidimensionalmayhem/2016/06/13/implementing-fuzzy-sets-in-sql-server-part-0-the-buzz-about-fuzz/
Tue, 14 Jun 2016 02:06:56 UT/blogs/multidimensionalmayhem/2016/06/13/implementing-fuzzy-sets-in-sql-server-part-0-the-buzz-about-fuzz/0http://www.sqlservercentral.com/blogs/multidimensionalmayhem/2016/06/13/implementing-fuzzy-sets-in-sql-server-part-0-the-buzz-about-fuzz/#comments<p><strong>By Steve Bolton</strong></p>
<p><span style="font-size:10pt;color:white;">…………</span>I originally planned to post a long-delayed series titled Information Measurement with SQL Server next, in which I’d like to cover scores of different metrics for quantifying the data our databases hold – such as how random, chaotic or ordered it might be, or how much information it might provide. I’m putting it off once again, however, because I stumbled onto a neglected topic that could be of immediate benefit to many DBAs: fuzzy sets and their applications in uncertainty management programs and software engineering processes. Since SQL Server is a set-based storage system, I always suspected that the topic would be directly relevant in some way, but never expected to discover just how advantageous they can be. As in my previous series on this blog, I’m posting this mistutorial series in order to introduce myself to the topic, not because I know what I’m talking about; writing about it helps reinforce what I learn along the way, which will hopefully still be of some use to others once all of the inevitable mistakes are overcome. In fact, I guarantee that every DBA and .Net programmer out there has encountered problems which could be more easily and quickly solved through these proven techniques for modeling imprecision, which is precisely what many software engineering and data modeling problems call for. Despite the fact that my own thinking on the topic is still fuzzy (as usual) I’m certain this series can be immediately helpful to many readers, since there’s such an incredible gap between the math, theory and implementations of fuzzy set techniques in other fields on the one hand, and their slow adoption in the relational and data mining markets on the other.<br />
<span style="font-size:10pt;color:white;">…………</span>Instead of beating around the bush, I’ll try to encapsulate the purposes and use cases of fuzzy sets as succinctly as possible: basically, you look for indefinite terms in ordinary speech, then squeeze what little information content you can out of them by assigning grades to records to signify how strongly they belong to a particular set. Most fuzzy set problems are modeled in terms of natural language like this. The overlap with Behavior-Driven Development (BDD) and user stories is quite obvious, but after reading up on those hot topics a year prior to learning about fuzzy sets, I was immediately struck by how little these techniques of modeling imprecision are apparently used and how easy it would be to incorporate them into database and application development processes. Uncertainty is a notorious problem in any engineering process, but sets with graded memberships can even be used to capture it and flesh it out more thoroughly, as part of one of the programs of “uncertainty management” I’ll describe later in this series.</p>
<p style="text-align:center;"><strong>From Crisp Sets to Membership Functions</strong></p>
<p> These powerful techniques arise from the quite simple premise that we can assign membership values to records, which some SQL Server users might be doing from time to time unwittingly, without realizing that they were approaching the borderlands of fuzzy set theory. Most relational and cube data is in the form of what mathematicians call “crisp sets,” which don’t require membership functions because they’re clear-cut yes-or-no decisions; to theoreticians, these are actually just a special case of a broader class of fuzzy sets, distinguished only by the fact that their membership functions are limited to values of either 0 or 1. In the relational field as it stands today, you either include a row in a set or you don’t, without any in-between. In contrast, most fuzzy membership functions assign continuous values between 0 and 1; although other scales are possible, I have yet to see an example in the literature where any other scale was used. I doubt it is wise to use any other range even if there might be a performance boost of some kind in applying larger-scale float or decimal data types, given that the 0-to-1 scale helps integrate fuzzy sets with the scales used in a lot of other hot techniques I’ll cover later, like Dempster-Shafer evidence theory, possibility theory, decision theory and my personal favorite, neural net weights. That overlap transforms fuzzy sets into an interchangeable part of sorts, in what might be termed modular knowledge discovery.<br />
<span style="font-size:10pt;color:white;">…………</span>That all sounds very grandiose, but anyone can practice picking out fuzzy sets represented in everyday speech. Artificial intelligence researchers Roger Jang and Enrique Ruspini provide a handy list of obvious ones in a set of slides reprinted by analytics consultant Piero P. Bonissone, including Height, Action Sequences, Hair Color, Sound Intensity, Money, Speed, Distance, Numbers and Decisions. Some corresponding instances of them we encounter routinely might include Tall People, Dangerous Maneuvers, Blonde Individuals, Loud Noises, Large Investments, High Speeds, Close Objects, Large Numbers and Desirable Actions.[i] The literature is replete with such simple examples, of which imprecise weather and height terms like “cloudy,” “hot” and “short” seem to be the most popular. The key thing to look for in any BDD or user story implementation is linguistic states where the speech definitely signifies something, but the meaning is not quite clear – particularly when it would still be useful to have a sharper and numerically definable definition, even when we can’t be 100 percent precise.</p>
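The kind of grading discussed above can be sketched in a few lines of T-SQL. Everything here is invented for illustration: the #WeatherReadings temp table, its sample values and the 60-to-90 degree boundaries of the “Hot” set are hypothetical stand-ins for real domain knowledge.

```sql
-- A minimal sketch of a fuzzy membership function in T-SQL. Readings at or
-- below 60 degrees get a membership grade of 0 in the "Hot" set, readings at
-- or above 90 get 1, and everything in between is graded linearly. The table,
-- its data and the boundaries are all hypothetical.
CREATE TABLE #WeatherReadings (ID int IDENTITY(1,1), Temperature float);
INSERT INTO #WeatherReadings (Temperature) VALUES (45), (72), (85), (95);

SELECT ID, Temperature,
       CASE WHEN Temperature <= 60 THEN 0
            WHEN Temperature >= 90 THEN 1
            ELSE (Temperature - 60) / 30.0
       END AS HotMembershipScore
FROM #WeatherReadings;

DROP TABLE #WeatherReadings;
```

A simple linear ramp like this is only one choice of membership function; the shape of the grading curve is itself a modeling decision.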
<p style="text-align:center;"><strong>Filling a Unique Niche in the Hierarchy of Data Types</strong></p>
<p> It may be helpful to look at fuzzy sets as a new tool occupying a new rung in the ladder of Content types we already work with routinely, especially in <a href="https://multidimensionalmayhem.wordpress.com/2012/11/15/a-rickety-stairway-to-sql-server-data-mining-part-0-0-an-introduction-to-an-introduction/">SQL Server Data Mining (SSDM)</a>. At the lowest level of data type complexity we have nominal data, which represents categories that are not ranked on any scale; these are basically equivalent to the Discrete Content type in SSDM and are often implemented in text data types or tinyint codes in T-SQL. On the next rung up the ladder we have ordinal data in which categories are assigned some rank, although the gaps may not be defined or even proportional; above that we have continuous data types (or the best approximation we can get, since modern computers can’t handle infinitesimal scales) that are often implemented in T-SQL in the float, numeric and decimal data types. Fuzzy sets represent a new bridge between the top two rungs, by providing more meaningful continuous values to ordinal data that in turn allow us to do productive things we couldn’t do with them before, like performing arithmetic, set operations or calculating stats. Any fuzzy set modeling process ought to focus on looking for data that is ordinal with an underlying scale that is not precisely discernible, but in which it would be useful to work with a continuous scale. That really isn’t much more difficult than picking imprecise terminology out of natural language, which anyone can make a game of. Given their “ability to translate imprecise/vague knowledge of human experts” we might also want to make a habit of flagging instances where we know a rule is operative, but has not yet been articulated.<br />
<span style="font-size:10pt;color:white;">…………</span>If one were to apply these techniques to database server and other computing terminologies, one of the most obvious examples of imprecise terms would be “performance.” As George J. Klir and Bo Yuan point out in their classic tome <em>Fuzzy sets and Fuzzy Logic: Theory and Applications</em>, this is actually an instance of a specific type of fuzzy set called a fuzzy number, which I will introduce later in the series.[ii] Say, for example, that you have a table full of performance data, which you’ve graded the records on scales of 0 to 1 based on whether they fall into categories like “Acceptable,” “Good” and perhaps “Outside Service Level Agreement Boundaries.” That still leaves open the question of what the term “performance” itself means, so it constitutes another level of fuzziness on top of the membership issue; in fact, it might be necessary to use some of the techniques already hashed out by mathematicians decades ago for combining the opinions of multiple experts to arrive at a common fuzzy definition of it.</p>
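To make the performance example above concrete, a hypothetical sketch of grading query durations on a continuous 0-to-1 scale might look like this; the #QueryStats table, its sample durations and the grading boundaries are all invented for illustration and would come from real measurements and domain experts in practice.

```sql
-- Hypothetical sketch: lifting an ordinal performance judgment onto a
-- continuous membership scale. Full membership in "Acceptable Performance"
-- under 100 ms, none past 5000 ms, graded linearly in between.
CREATE TABLE #QueryStats (QueryID int, DurationMs float);
INSERT INTO #QueryStats (QueryID, DurationMs)
VALUES (1, 40), (2, 350), (3, 1200), (4, 6000);

SELECT QueryID, DurationMs,
       CASE WHEN DurationMs <= 100 THEN 1
            WHEN DurationMs >= 5000 THEN 0
            ELSE (5000 - DurationMs) / 4900.0
       END AS AcceptableMembershipScore
FROM #QueryStats;

DROP TABLE #QueryStats;
```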
<p style="text-align:center;"><strong>Modeling Natural Language</strong></p>
<p> The heavy math in that resource may be too much for some readers to bear, but I highly recommend at least skimming the third section of Chapter 8, where Klir and Yuan identify many different types of fuzziness in ordinary speech. They separate them into four possible combinations of unconditional vs. conditional and unqualified vs. qualified fuzzy propositions, such as the statement “Tina is young is very true,” in which the terms “very” and “true” make it unconditional and qualified.[iii] They also delve into identifying “fuzzy quantifiers” like “about 10, much more than 100, at least about 5,” or “almost all, about half, most,” each of which is modeled by a different type of fuzzy number, which I’ll describe at a later date.[iv] Other distinct types to watch for in natural language include linguistic hedges such as “very, more, less, fairly and extremely” that are used to qualify statements of likelihood or truth and falsehood. These can be chained together in myriad ways, in statements like “Tina is very young is very true,” and the like.[v]<a href="#_edn5" name="_ednref5"><br />
</a> In a moment I’ll describe how chaining together such fuzzy terms and fleshing out other types of imprecision can lead to lesser-known but occasionally invaluable twists on fuzzy sets, but for now I just want to call attention to how quickly it added new layers of complexity to an otherwise simple topic. That is where the highly developed ideas of fuzzy set theory come in handy. The math for implementing all of these natural language concepts has existed for decades, so there’s little reason to reinvent the wheel – nor is there a need to overburden readers with all of the equations and jargon, which can look quite daunting on paper. There is a crying need in the data mining field for people willing to act as middlemen of sorts between the end users of the algorithms and their inventors, in the same way that a mechanic fills a need between automotive engineers and drivers; as I’ve pointed out before, it shouldn’t require a doctorate in artificial intelligence to operate data mining software, but the end users are nonetheless routinely buried in equations and formulas they shouldn’t have to decipher. It is imperative for end users to know what such techniques are used for, just as drivers must know how to read a speedometer and operate a brake, but it is not necessary for them to provide lemmas, or even know what a lemma is. While writing these mistutorial series, I’m trying to acquire the skills to do that for the end users by at least removing the bricks from the briefcase, so to speak, which means I’ll keep the equations and jargon down to a minimum and omit mathematical proofs altogether. The jargon is indispensable for helping mathematicians communicate with each other, but is an obstacle to implementing these techniques in practice. It is much easier for end users to think of this topic in terms of natural language, in which they’ve been unwittingly expressing fuzzy sets their whole lives on a daily basis. 
I can’t simplify this or any other data mining topic completely, so wall-of-text explanations like this are inevitable – but I’d wager it’s a vast improvement over having to wade through whole textbooks of dry equations, which is sometimes the only alternative. Throughout this series I will have to lean heavily on Klir and Yuan’s aforementioned work for the underlying math, which I will implement in T-SQL. If you want a well-written discussion of the concepts in human language, I’d recommend Dan McNeill’s 1993 book <em>Fuzzy Logic</em>.[vi]</p>
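The linguistic hedges mentioned above even have standard mathematical translations in the fuzzy set literature: “very” is commonly modeled as concentration (squaring a membership grade) and “fairly” as dilation (taking its square root). The sketch below assumes a hypothetical #People table with precomputed membership grades in a “Young” set.

```sql
-- A sketch of linguistic hedges applied to fuzzy membership grades, using
-- the common concentration/dilation operators. The table, names and grades
-- are hypothetical illustrations.
CREATE TABLE #People (Name nvarchar(50), YoungMembershipScore float);
INSERT INTO #People (Name, YoungMembershipScore)
VALUES (N'Tina', 0.8), (N'Bob', 0.3);

SELECT Name, YoungMembershipScore,
       POWER(YoungMembershipScore, 2) AS VeryYoungScore,   -- concentration
       SQRT(YoungMembershipScore) AS FairlyYoungScore      -- dilation
FROM #People;

DROP TABLE #People;
```

Note how the operators behave intuitively: squaring pushes middling grades toward 0, so fewer people count as “very young,” while the square root pulls them toward 1, so more people count as “fairly young.”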
<p style="text-align:center;"><strong>The Low-Hanging Fruits of Fuzzy Set Applications</strong></p>
<p> These concepts have often proved to be insanely useful whenever they’ve managed to percolate down to various sectors of the economy. The literature is so chock full of them I don’t even know where to begin; the only thing I see in common to the whole smorgasbord is that they seem to seep into industries almost haphazardly, rather than as part of some concerted push. Their “ability to control unstable systems” makes them an ideal choice for many control theory applications.[vii] Klir and Yuan spend several chapters on the myriad implementations already extant when they wrote two decades ago, in fields like robotics,[viii] estimation of longevity of equipment[ix], mechanical and industrial engineering[x], assessing the strength of bridges[xi], traffic scheduling problems[xii] (including the infamous Traveling Salesman) and image sharpening.[xiii] Another example is the field of reliability ratings, where Boolean all-or-nothing rankings like “working” vs. “broken” are often not sufficient to capture in-between states.[xiv] In one detailed example, they demonstrate how to couple weighted matrices of symptoms with fuzzy sets in medical diagnosis.[xv] Klir and Yuan also lament that these techniques are not put to obvious uses in psychology[xvi], where imprecision is rampant, and provide some colorful examples of how to model the imprecision inherent in interpersonal communication, particularly in dating.[xvii] As they point out, some messages are inherently uncertain, on top of any fuzz introduced by the receiver in interpretation; to that we can add the internal imprecision of the speaker, who might not be thinking through their statements thoroughly or selecting their words carefully.<br />
<span style="font-size:10pt;color:white;">…………</span>Then there is a whole class of applications directly relevant to data mining, such as fuzzy clustering algorithms (like C-Means)[xviii], fuzzy decision trees, neural nets, state sequencing (“fuzzy dynamic systems and automata”)[xix], fuzzified virtual chromosomes in genetic algorithms[xx], fuzzy parameter estimation, pattern recognition[xxi], fuzzy regression procedures and regression on fuzzy data.[xxii] Most of that falls under the rubric of “soft computing,” a catch-all term for bleeding edge topics like artificial intelligence. The one facet of the database server field where fuzzy sets have succeeded in permeating somewhat since Klir and Yuan mentioned the possibility[xxiii] is fuzzy information retrieval, which we can see in action in SQL Server full-text catalogs.</p>
<p style="text-align:center;"><strong>The Future of Fuzzy Sets in SQL Server</strong></p>
<p> Like many of their colleagues, however, they wrote about ongoing research into fuzzy relational databases by researchers like Bill Buckles and F.E. Petry that has not come into widespread use since then.[xxiv] That is where this series comes in. I won’t be following any explicit prescriptions for implementing fuzzy relational databases per se, but will instead leverage the existing capabilities of T-SQL to demonstrate how easy it is to add your own fuzz for imprecision modeling purposes. Researcher Vivek V. Badami pointed out more than two decades ago that fuzz takes more code, but is easier to think about.[xxv] It takes very little experience with fuzzy sets to grasp what he meant by this – especially now that set-based languages like T-SQL that are ideal for this topic are widely used. I wonder if someday it might be possible to extend SQL or systems like SQL Server to incorporate fuzziness more explicitly, for example, by performing the extra operations on membership functions that are required for joins between fuzzy sets, or, going further still, fuzzy joins between fuzzy sets; later in the series I’ll demonstrate how DBAs can quickly implement DIY versions of these things, but perhaps there are ways to do the dirty work under the hood, in SQL Server internals. Maybe a generation from now we’ll see fuzzy indexes and SQL Server execution plans with Fuzzy Anti-Semi-Join operators – although I wonder how Microsoft could implement the retrieval of only one-seventh of a record and a third of another, using B-trees or <em>any </em>other type of internal data structure. In order to determine if a record is worthy of inclusion, it first has to be retrieved and inspected instead of passed over, which could lead to a quandary if SQL Server developers tried to implement fuzzy sets in the internals.<br />
<span style="font-size:10pt;color:white;">…………</span>The good news is that we don’t have to wait for the theoreticians to hash out how to implement fuzzy relational databases, or for Microsoft and its competition to add the functionality for us. As it stands, T-SQL is already an ideal tool for implementing fuzzy sets. In the next article, I’ll demonstrate some trivial membership functions that any DBA can implement on their own quite easily, so that these concepts don’t seem so daunting. The difficulties can be boiled down chiefly to the fact that the possibilities are almost <em>too</em> wide open. Choosing the right membership functions to model the problem at hand is not necessarily straightforward, nor is selecting the right type of fuzzy set to model particular types of imprecision. As in regular data modeling, the wrong choices can sometimes lead not only to wasted server resources, but also to incorrect answers. The greatest risk, in fact, consists of fuzzifying relationships that are inherently crisp and vice-versa, which can lead to fallacious reasoning. Fuzz has become a buzzword of sorts, so it would be wise to come up with a standard to discern its true uses from its abuses. In the next installment, I’ll tackle some criteria we can use to discern the difference, plus provide a crude taxonomy of fuzzy sets and get into some introductory T-SQL samples.</p>
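As a foretaste of the DIY fuzzy set operations mentioned above, the classic operators can be implemented today with ordinary joins and CASE logic: the minimum of two membership grades gives a fuzzy intersection, the maximum a fuzzy union and 1 minus the grade a complement. The tables and scores below are hypothetical illustrations, not a prescription.

```sql
-- A sketch of standard fuzzy set operations between two graded sets.
-- Intersection = MIN of the grades, union = MAX, complement = 1 - grade.
-- All table names and membership scores are invented for illustration.
CREATE TABLE #TallMembers (PersonID int, MembershipScore float);
CREATE TABLE #HeavyMembers (PersonID int, MembershipScore float);
INSERT INTO #TallMembers (PersonID, MembershipScore) VALUES (1, 0.9), (2, 0.4);
INSERT INTO #HeavyMembers (PersonID, MembershipScore) VALUES (1, 0.6), (2, 0.7);

SELECT T.PersonID,
       CASE WHEN T.MembershipScore < H.MembershipScore
            THEN T.MembershipScore ELSE H.MembershipScore END AS TallAndHeavyScore,
       CASE WHEN T.MembershipScore > H.MembershipScore
            THEN T.MembershipScore ELSE H.MembershipScore END AS TallOrHeavyScore,
       1 - T.MembershipScore AS NotTallScore
FROM #TallMembers AS T
     INNER JOIN #HeavyMembers AS H ON T.PersonID = H.PersonID;

DROP TABLE #TallMembers;
DROP TABLE #HeavyMembers;
```

The CASE expressions stand in for MIN/MAX across columns, which T-SQL lacks as scalar functions; an ordinary INNER JOIN suffices because a crisp join on the keys is still doing the matching, with the fuzz confined to the grades.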
<p>[i] p. 18, Bonissone, Piero P., 1998, “Fuzzy Sets & Expert Systems in Computer Eng. (1).” Available online at <a href="http://homepages.rpi.edu/~bonisp/fuzzy-course/99/L1/mot-conc2.pdf">http://homepages.rpi.edu/~bonisp/fuzzy-course/99/L1/mot-conc2.pdf</a></p>
<p>[ii] pp. 101-102, Klir, George J. and Yuan, Bo, 1995, <span style="text-decoration:underline;">Fuzzy sets and Fuzzy Logic: Theory and Applications</span>. Prentice Hall: Upper Saddle River, N.J.</p>
<p>[iii] <em>IBID.</em>, pp. 222-225.</p>
<p>[iv] <em>IBID.</em>, pp. 225-226.</p>
<p>[v] <em>IBID.</em>, pp. 229-230.</p>
<p>[vi] McNeill, Dan, 1993, <u>Fuzzy Logic</u>. Simon & Schuster: New York.</p>
<p>[vii] p. 8, Bonissone.</p>
<p>[viii] Klir and Yuan, p. 440.</p>
<p>[ix] <em>IBID.</em>, p. 432.</p>
<p>[x] <em>IBID.</em>, pp. 427-432.</p>
<p>[xi] <em>IBID.</em>, p. 419.</p>
<p>[xii] <em>IBID.</em>, pp. 422-423.</p>
<p>[xiii] <em>IBID.</em>, pp. 374-376.</p>
<p>[xiv] <em>IBID.</em>, p. 439.</p>
<p>[xv] <em>IBID.</em>, pp. 443-450.</p>
<p>[xvi] <em>IBID.</em>, pp. 463-464.</p>
<p>[xvii] <em>IBID.</em>, pp. 459-461.</p>
<p>[xviii] <em>IBID.</em>, pp. 358-364.</p>
<p>[xix] <em>IBID.</em>, pp. 349-351.</p>
<p>[xx] <em>IBID.</em>, p. 453.</p>
<p>[xxi] <em>IBID.</em>, pp. 365-374.</p>
<p>[xxii] <em>IBID.</em>, pp. 454-459.</p>
<p>[xxiii] <em>IBID.</em>, p. 385.</p>
<p>[xxiv] <em>IBID.</em>, pp. 380-381.</p>
<p>[xxv] p. 278, McNeill.</p>
Goodness-of-Fit Testing with SQL Server Part 7.4: The Cramér–von Mises Criterion
http://www.sqlservercentral.com/blogs/multidimensionalmayhem/2016/05/31/goodness-of-fit-testing-with-sql-server-part-74-the-cramérvon-mises-criterion/
Tue, 31 May 2016 10:38:38 UT/blogs/multidimensionalmayhem/2016/05/31/goodness-of-fit-testing-with-sql-server-part-74-the-cramérvon-mises-criterion/0http://www.sqlservercentral.com/blogs/multidimensionalmayhem/2016/05/31/goodness-of-fit-testing-with-sql-server-part-74-the-cramérvon-mises-criterion/#comments<p><strong>By Steve Bolton</strong></p>
<p><span style="font-size:10pt;color:white;">…………</span>This last installment of this series of amateur tutorials features a goodness-of-fit metric that is closely related to <a href="https://multidimensionalmayhem.wordpress.com/2016/05/02/goodness-of-fit-testing-with-sql-server-part-7-3-the-anderson-darling-test/">the Anderson-Darling Test discussed in the last post</a>, with one important caveat: I couldn’t find any published examples to verify my code against. Given that the code is already written and the series is winding down, I’ll post it anyway in the off-chance it may be useful to someone, but my usual disclaimer applies more than ever: I’m writing this series in order to learn about the fields of statistics and data mining, not because I have any real expertise. Apparently, the paucity of information on the Cramér–von Mises Criterion stems from the fact that experience with this particular measure is a lot less common than that of its cousin, the Anderson-Darling Test. They’re both on the high end when it comes to statistical power, so the Cramér–von Mises Criterion might be a good choice when you need to be sure you’re detecting effects with sufficient accuracy.[i]<br />
<span style="font-size:10pt;color:white;">…………</span>Although it is equivalent to the Anderson-Darling with the weighting function[ii] set to 1, the calculations are more akin to those of the other methods based on the empirical distribution function (EDF) we’ve discussed in this segment of the series. It is in fact a refinement[iii] of the <a href="https://multidimensionalmayhem.wordpress.com/2016/03/23/goodness-of-fit-testing-with-sql-server-part-7-1-the-kolmogorov-smirnov-and-kuipers-tests/">Kolmogorov-Smirnov Test we discussed a few articles ago</a>, one that originated with separate papers published in 1928 by statisticians Harald Cramér and Richard Edler von Mises.[iv] One of the advantages it appears to enjoy over the Anderson-Darling Test is that it seems to perform much better, in the same league as the Kolmogorov-Smirnov, Kuiper’s and <a href="https://multidimensionalmayhem.wordpress.com/2016/04/14/goodness-of-fit-testing-with-sql-server-part-7-2-the-lilliefors-test/">Lilliefors Tests</a>. One of the disadvantages is that the value of the test statistic might be distorted by Big Data-sized value ranges and counts, which many established statistical tests were never designed to handle. It does appear, however, to suffer from this to a lesser degree than the Anderson-Darling Test, as discussed last time around. Judging from past normality tests I’ve performed on datasets I’m familiar with, it seems to assign higher values to those that were definitely not Gaussian in the proper order, although perhaps not in the correct proportion.<br />
<span style="font-size:10pt;color:white;">…………</span>That of course assumes that I coded it correctly, which I can’t verify in the case of this particular test. In fact, I had to cut out some of the T-SQL that calculated the Watson Test along with it, since the high negative numbers it routinely returned were obviously wrong. I’ve also omitted the commonly used two-sample version of the test, since sampling occupies a less prominent place in SQL Server use cases than it would in ordinary statistical testing and academic research; one of the benefits of having millions of rows of data in our cubes and relational tables is that we can derive more exact numbers, without depending as much on parameter estimation or inexact methods of random selection. I’ve also omitted the hypothesis testing step that usually accompanies the use of the criterion, for the usual reasons: loss of information content by boiling down the metric to a simple either-or choice, the frequency with which confidence intervals are misinterpreted and most of all, the fact that we’re normally going to be doing exploratory data mining with SQL Server, not narrower tasks like hypothesis testing. One of the things that sets the Cramér–von Mises Criterion apart from other tests I’ve covered in the last two tutorial series is that the test statistic is compared to critical values from the F-distribution rather than the Chi-Squared or Student’s T, but the same limitation still arises: most of the lookup tables have gaps or stop at a few hundred values at best, but calculating them for the millions of degrees of freedom we’d need for such large tables would be computationally costly. Moreover, since I can’t be sure the code below for this less common metric is correct, there is less point in performing those expensive calculations.<br />
<span style="font-size:10pt;color:white;">…………</span>The bulk of the procedure in Figure 1 is identical to the sample code posted for the Kolmogorov-Smirnov and Lilliefors Tests, which means they can and really ought to be calculated together. The only differences are in the final calculations of the test statistics, which are trivial in comparison to the derivation of the empirical distribution function (EDF) table variable from a dynamic SQL statement. The @Mean and @StDev aggregates are plugged into the Calculations.NormalDistributionSingleCDFFunction I wrote for <a href="https://multidimensionalmayhem.wordpress.com/2015/11/03/goodness-of-fit-testing-with-sql-server-part-2-1-implementing-probability-plots-in-reporting-services/">Goodness-of-Fit Testing with SQL Server, part 2.1: Implementing Probability Plots in Reporting Services. </a>If you want to test other distributions besides the Gaussian or “normal” distribution (i.e. the bell curve), simply substitute a different cumulative distribution function (CDF) here. The final calculation is straightforward: just subtract the CDF from the EDF, square the result and do a SUM over the whole dataset.[v] The two-sample test, which I haven’t included here for the sake of brevity and simplicity, merely involves adding together the results for the same calculation across two different samples and making a couple of minor corrections. I’ve also included a one-sample version of the test I saw cited at Wikipedia[vi], since it was trivial to calculate. I would’ve liked to include the Watson Test, since “it is useful for distributions on a circle since its value does not depend on the arbitrary point chosen to begin cumulating the probability density and the sample points”[vii] and therefore meets a distinct but important set of use cases related to circular and cyclical data, but my first endeavors were clearly inaccurate, probably due to mistranslations of the equations.</p>
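For readers who want the final calculation in isolation, the commonly published one-sample form of the criterion can be sketched as below, with the same caveat as the rest of this article: I couldn’t verify my code against published examples. The #EDFTable temp table and its RowRank and CDFValue columns are hypothetical stand-ins for a precomputed EDF table sorted by value, with RowRank running from 1 to the row count.

```sql
-- A condensed sketch of the one-sample Cramér–von Mises statistic: with n
-- rows sorted ascending and numbered 1..n in RowRank, the statistic is
-- 1/(12n) plus the sum of squared differences between each CDF value and
-- the EDF midpoint step (2i - 1) / (2n). #EDFTable is hypothetical.
DECLARE @n float = (SELECT COUNT(*) FROM #EDFTable);

SELECT 1 / (12 * @n)
       + SUM(POWER(CDFValue - ((2 * RowRank - 1) / (2 * @n)), 2))
       AS CramerVonMisesCriterion
FROM #EDFTable;
```

Declaring @n as a float sidesteps the integer division that would otherwise silently zero out the (2i - 1)/(2n) terms.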
<p><strong><u>Figure 1: T-SQL Code for the Cramér–von Mises Criterion</u></strong></p>
<pre style="font-size:9.5pt;font-family:Consolas;">
CREATE PROCEDURE [Calculations].[GoodnessOfFitCramerVonMisesCriterionSP]
@Database1 as nvarchar(128) = NULL, @Schema1 as nvarchar(128), @Table1 as nvarchar(128), @Column1 AS nvarchar(128)
AS

DECLARE @SchemaAndTable1 nvarchar(400), @SQLString nvarchar(max)
SET @SchemaAndTable1 = @Database1 + '.' + @Schema1 + '.' + @Table1

DECLARE @Mean float,
@StDev float,
@Count float

DECLARE @EDFTable table
(ID bigint IDENTITY (1,1),
Value float,
ValueCount bigint,
EDFValue float,
CDFValue decimal(38,37),
EDFCDFDifference decimal(38,37))

DECLARE @ExecSQLString nvarchar(max), @MeanOUT nvarchar(200), @StDevOUT nvarchar(200), @CountOUT nvarchar(200), @ParameterDefinition nvarchar(max)
SET @ParameterDefinition = '@MeanOUT nvarchar(200) OUTPUT,@StDevOUT nvarchar(200) OUTPUT,@CountOUT nvarchar(200) OUTPUT'
SET @ExecSQLString = 'SELECT @MeanOUT = CAST(Avg(Value) as float), @StDevOUT = StDev(Value), @CountOUT = CAST(Count(Value) as float)
       FROM (SELECT CAST(' + @Column1 + ' as float) as Value
       FROM ' + @SchemaAndTable1 + '
       WHERE ' + @Column1 + ' IS NOT NULL) AS T1'

EXEC sp_executesql @ExecSQLString, @ParameterDefinition, @MeanOUT = @Mean OUTPUT, @StDevOUT = @StDev OUTPUT, @CountOUT = @Count OUTPUT

SET @SQLString = 'SELECT Value, ValueCount, SUM(ValueCount) OVER (ORDER BY Value ASC) / CAST(' + CAST(@Count as nvarchar(50)) + ' AS float) AS EDFValue
       FROM (SELECT DISTINCT ' + @Column1 + ' AS Value, Count(' + @Column1 + ') OVER (PARTITION BY ' + @Column1 + ') AS ValueCount
       FROM ' + @SchemaAndTable1 + '
       WHERE ' + @Column1 + ' IS NOT NULL) AS T1'

INSERT INTO @EDFTable
(Value, ValueCount, EDFValue)
EXEC (@SQLString)

UPDATE T1
SET CDFValue = T3.CDFValue, EDFCDFDifference = EDFValue - T3.CDFValue
FROM @EDFTable AS T1
       INNER JOIN (SELECT DistinctValue, Calculations.NormalDistributionSingleCDFFunction(DistinctValue, @Mean, @StDev) AS CDFValue
       FROM (SELECT DISTINCT Value AS DistinctValue
       FROM @EDFTable) AS T2) AS T3
       ON T1.Value = T3.DistinctValue

-- the 1/(12n) term of the criterion
DECLARE @OneDividedByTwelveN As float = 1 / (12 * @Count)
DECLARE @TwoTimesCount as float = 2 * @Count
DECLARE @ReciprocalOfCount as float = 1 / CAST(@Count as float)

DECLARE @ResultTable table
(CramerVonMisesTest float)

INSERT INTO @ResultTable
SELECT CramerVonMisesTest
FROM (SELECT @OneDividedByTwelveN
       + Sum(Power(((((2 * ID) - 1) / CAST(@TwoTimesCount as float)) - CDFValue), 2)) AS CramerVonMisesTest
       FROM @EDFTable) AS T1

SELECT CramerVonMisesTest
FROM @ResultTable

SELECT ID, Value, ValueCount, EDFValue, CDFValue, EDFCDFDifference
FROM @EDFTable
</pre>
<p><span style="font-size:10pt;color:white;">…………</span>One potential issue with the code above is that I may need to use the EDF rather than the EDFCDFDifference in the final SELECT; in the absence of example data from published journal articles, I can’t be sure of that. Like some of its kin, the criterion can also be adapted “for comparing two empirical distributions”[viii] rather than using the difference between the EDF and the cumulative distribution function (CDF). Most of these concepts have already been covered in the last three articles in this segment of the series; in fact, all but the last two SELECTs are identical to those of the Kolmogorov-Smirnov, Kuiper’s and Lilliefors procedures. As usual, the parameters allow users to perform the test on any numerical column in any database they have sufficient access to.</p>
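<p>Since I can’t vouch for the figures the procedure returns, it may help to cross-check the final calculation against an independent implementation. The short Python sketch below computes the same one-sample statistic, T = 1/(12n) + Σ((2i – 1)/(2n) – F(x<sub>i</sub>))², with the Gaussian CDF standing in for Calculations.NormalDistributionSingleCDFFunction. The function names are my own inventions, not part of the procedure above, and the sketch follows the textbook per-observation form of the formula, which may differ slightly from the procedure’s distinct-value EDF when ties are present.</p>

```python
# Hypothetical cross-check of the one-sample Cramér–von Mises statistic:
# T = 1/(12n) + sum over i of ((2i - 1)/(2n) - F(x_i))^2,
# where F is the CDF of the normal distribution fitted to the sample.
import math

def normal_cdf(x, mean, stdev):
    # Gaussian CDF, the analogue of Calculations.NormalDistributionSingleCDFFunction
    return 0.5 * (1.0 + math.erf((x - mean) / (stdev * math.sqrt(2.0))))

def cramer_von_mises(sample):
    n = len(sample)
    mean = sum(sample) / n
    # sample standard deviation, matching T-SQL's StDev() aggregate
    stdev = math.sqrt(sum((v - mean) ** 2 for v in sample) / (n - 1))
    # the EDF ranks the values in ascending order, like the ORDER BY in the dynamic SQL
    ordered = sorted(sample)
    return 1.0 / (12.0 * n) + sum(
        ((2.0 * (i + 1) - 1.0) / (2.0 * n) - normal_cdf(x, mean, stdev)) ** 2
        for i, x in enumerate(ordered))

# Roughly symmetric data should yield a small statistic; badly skewed data a larger one.
print(cramer_von_mises([1, 2, 3, 4, 5]))
print(cramer_von_mises([1, 1, 1, 1, 10]))
```

<p>Running this against a few toy samples at least confirms the expected ordering: the statistic stays near its 1/(12n) floor for data that hugs the fitted bell curve and rises as the fit degrades.</p>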
<p><strong><u>Figure 2: Sample Results from the Duchennes Table</u></strong></p>
<pre style="font-size:9.5pt;font-family:Consolas;">EXEC Calculations.GoodnessOfFitCramerVonMisesCriterionSP
       @Database1 = N'DataMiningProjects',
       @Schema1 = N'Health',
       @Table1 = N'DuchennesTable',
       @Column1 = N'PyruvateKinase'</pre>
<p><a href="https://multidimensionalmayhem.wordpress.com/2016/05/31/goodness-of-fit-testing-with-sql-server-part-7-4-the-cramer-von-mises-criterion/cramer-von-mises-results/" rel="attachment wp-att-593"><img class="alignnone size-full wp-image-593" src="https://multidimensionalmayhem.files.wordpress.com/2016/05/cramer-von-mises-results.jpg?w=604&h=420" alt="Cramer Von Mises Results" width="604" height="420" /></a></p>
<p><span style="font-size:10pt;color:white;">…………</span>I performed queries like the one in Figure 2 against two datasets I’ve used throughout the last two tutorial series, one on the Duchennes form of muscular dystrophy made publicly available by <a href="http://biostat.mc.vanderbilt.edu/wiki/Main/DataSets">Vanderbilt University’s Department of Biostatistics</a> and another on the Higgs Boson provided by the <a href="https://archive.ics.uci.edu/ml/datasets/HIGGS">University of California at Irvine’s Machine Learning Repository</a>. Now that I’m familiar with how closely their constituent columns follow the bell curve, I was not surprised to see that the LactateDehydrogenase and PyruvateKinase enzymes scored 0.579157871602764 and 2.25027709042408 respectively, or that the test statistic for the Hemopexin protein was 0.471206505704088. Once again, I can’t guarantee that those figures are accurate in the case of this test, but the values follow the expected order (the same cannot be said of the one-sample Wikipedia version, which varied widely across all of the columns I tested it on). Given that the test statistic is supposed to rise in tandem with the lack of fit, I was likewise not surprised to see that the highly abnormal first float column of the Higgs Boson Dataset scored a 118.555073824395, while the second float column, which obviously follows a bell curve in histograms, had a test statistic of 0.6277795953021942279. Note that the results for the same columns in <a href="https://multidimensionalmayhem.wordpress.com/2016/05/02/goodness-of-fit-testing-with-sql-server-part-7-3-the-anderson-darling-test/">Goodness-of-Fit Testing with SQL Server Part 7.3: The Anderson-Darling Test</a> were 5.43863473749926, 17.4386371653374, 5.27843535947881, 870424.402686672 and 12987.3380102254 respectively.<br />
<span style="font-size:10pt;color:white;">…………</span>One of the reasons I posted the code for this week’s test statistic despite my misgivings about its accuracy is that the numbers are easier to read than those for its closest cousin. Unlike the Kolmogorov-Smirnov, Kuiper’s and Lilliefors Tests, the Cramér–von Mises Criterion is not bounded between 0 and 1, but at least it doesn’t reach such inflated sizes as the Anderson-Darling statistic. Furthermore, the vastly higher count in the Higgs Boson Dataset seems to swell the Anderson-Darling results even for clearly Gaussian data like the second float column, which makes it difficult to compare stats across datasets.</p>
<p><strong><u>Figure 3: Execution Plan for the Cramér–von Mises Criterion</u></strong> (click to enlarge)<strong><u><br />
<a href="https://multidimensionalmayhem.wordpress.com/2016/05/31/goodness-of-fit-testing-with-sql-server-part-7-4-the-cramer-von-mises-criterion/cramer-von-mises-execution-plan/" rel="attachment wp-att-594"><img class="alignnone size-full wp-image-594" src="https://multidimensionalmayhem.files.wordpress.com/2016/05/cramer-von-mises-execution-plan.jpg?w=604&h=282" alt="Cramer von Mises Execution Plan" width="604" height="282" /></a><br />
</u></strong></p>
<p><span style="font-size:10pt;color:white;">…………</span>Another advantage the Cramér–von Mises Criterion might enjoy over the Anderson-Darling Test is far better performance. The execution plan in Figure 3 is almost identical to those we saw in <a href="https://multidimensionalmayhem.wordpress.com/2016/03/23/goodness-of-fit-testing-with-sql-server-part-7-1-the-kolmogorov-smirnov-and-kuipers-tests/">Goodness-of-Fit Testing with SQL Server Part 7.1: The Kolmogorov-Smirnov and Kuiper’s Tests</a> and <a href="https://multidimensionalmayhem.wordpress.com/2016/04/14/goodness-of-fit-testing-with-sql-server-part-7-2-the-lilliefors-test/">Part 7.2: The Lilliefors Test</a>. Once again there are five queries, only two of which have any real cost. Both of those begin with nonclustered Index Seeks, which is usually a good sign. The only cost savings we might be able to dredge up would come from the Hash Match (Aggregate) operator, but there isn’t much point, since the procedure had the same stellar performance as the Kolmogorov-Smirnov, Kuiper’s and Lilliefors Tests. Given that the procedure is structured so similarly to its high-performing kin, it’s not surprising that it took only 24 seconds to race through the 11 million rows in the Higgs Boson Dataset when processing the first float column and 1:15 for the second; in contrast, the Anderson-Darling took a whopping 9:24 for Column1 and 9:09 for Column2. These times should improve dramatically when run on a real database server instead of my run-down development machine, but it would probably be wise to go with the better-performing measures anyway, assuming they’re otherwise a good fit for our use cases.<br />
<span style="font-size:10pt;color:white;">…………</span>I originally intended to include the Akaike, Bayesian and several other metrics with “Information Criterion” in their name, but decided that these measures of mining model fitness would best be handled in an upcoming series titled Information Measurement with SQL Server. The last tutorial series on outlier detection ended with articles on Cook’s Distance and Mahalanobis Distance, which were both intended to segue into that more advanced series (which I’m entirely unqualified to write), in which we’ll tackle various types of entropy, measures of structure, quantification of order and other mind-blowing topics. We’ll hear from such familiar names as Cramér and Kolmogorov again in that series, but first I must take a detour into a topic that could be of immediate benefit to a wide range of SQL Server users. In the upcoming tutorial series Implementing Fuzzy Sets with SQL Server, I’ll explain how the tragically neglected techniques of fuzzy set theory can be immediately applied to real-world problems that the SQL Server community encounters routinely, particularly those where modeling data on continuous scales would be preferable but is not currently done because of inexact measurement methods. This article was probably the weakest entry in the <a href="https://multidimensionalmayhem.wordpress.com/category/goodness-of-fit-testing-with-sql-server/">Goodness-of-Fit series</a>, which was itself only of narrow interest to certain niches in the SQL Server user base; I only went off on this tangent because I recognized my own weaknesses in this area while writing the <a href="https://multidimensionalmayhem.wordpress.com/category/diy-data-mining/outlier-detection-with-sql-server/">Outlier Detection with SQL Server</a> series and sought to rectify them through the school of hard knocks.<br />
<span style="font-size:10pt;color:white;">…………</span>In the future I may tack a few more articles onto the end of this series, such as sample code for Mardia’s Multivariate Skewness and Kurtosis and other multivariate goodness-of-fit tests, but the bulk of this series is complete. In my next post I’ll introduce some of the basic concepts behind fuzzy sets, before providing code samples that should make this treasure trove of techniques immediately available to a wide range of SQL Server users, in order to solve classes of problems that are not being solved efficiently today. As with much of the research into data mining techniques, the academic study of fuzzy sets is at least two decades ahead of the practice. It’s high time it was brought within reach of non-specialists, many of whom could derive surprising practical benefit from these techniques.</p>
<p> </p>
<p>[i] See the <span style="text-decoration:underline;">Wikipedia</span> webpage “Anderson-Darling Test” at <a href="http://en.wikipedia.org/wiki/Anderson%E2%80%93Darling_test">http://en.wikipedia.org/wiki/Anderson%E2%80%93Darling_test</a></p>
<p>[ii] <em>IBID.</em></p>
<p>[iii] See the comment “Several goodness-of-fit tests, such as the Anderson-Darling test and the Cramer Von-Mises test, are refinements of the K-S test,” at National Institute for Standards and Technology, 2014, “1.3.5.16 Kolmogorov-Smirnov Goodness-of-Fit Test,” published in the online edition of the <u>Engineering Statistics Handbook.</u> Available at <a href="http://www.itl.nist.gov/div898/handbook/eda/section3/eda35g.htm">http://www.itl.nist.gov/div898/handbook/eda/section3/eda35g.htm</a></p>
<p>[iv] See the <u>Wikipedia</u> page “Cramér–von Mises Criterion” at <a href="http://en.wikipedia.org/wiki/Cram%C3%A9r%E2%80%93von_Mises_criterion">http://en.wikipedia.org/wiki/Cram%C3%A9r%E2%80%93von_Mises_criterion</a></p>
<p>[v] I derived the formula from Anderson, T.W. and Stephens, M.A., 1994, The Modified Cramer-Von Mises Goodness-of-Fit Criterion for Time Series. Technical Report No. 47, Jan. 17, 1994. Published by the Office of Naval Research and the National Science Foundation. Available at the DTIC Online web address <a href="http://www.dtic.mil/dtic/tr/fulltext/u2/a275377.pdf">http://www.dtic.mil/dtic/tr/fulltext/u2/a275377.pdf</a> Also see Xiao, Yuanhui; Gordon, Alexander and Yakovlev, Andrei, 2006, “A C++ Program for the Cramér-Von Mises Two-Sample Test,” pp. 1-15 in Journal of Statistical Software, January 2007. Vol. 17, No. 8. Available online at <a href="http://www.jourlib.org/paper/2885039#.VIXYZP4o4uU">http://www.jourlib.org/paper/2885039#.VIXYZP4o4uU</a> All three authors taught at the University of Rochester, which is on the other side of the city from me. I had to dust off my C++ for this one, which brought back interesting memories of typing “Hello World” while precariously balancing a laptop on my knees at my sister’s house ages ago, after having a few beers and a whole box of chocolates on Valentine’s Day.</p>
<p>[vi] See the <u>Wikipedia</u> page “Cramér–von Mises Criterion” at <a href="https://en.wikipedia.org/wiki/Cram%C3%A9r%E2%80%93von_Mises_criterion">https://en.wikipedia.org/wiki/Cram%C3%A9r%E2%80%93von_Mises_criterion</a></p>
<p>[vii] Watson, G. S., 1962, “Goodness-of-Fit Tests on a Circle,” p. 57 in <u>Biometrika</u>, Vol. 49, No. 1 and 2. Available online at <a href="http://phdtree.org/pdf/33054228-goodness-of-fit-tests-on-a-circle-ii/">http://phdtree.org/pdf/33054228-goodness-of-fit-tests-on-a-circle-ii/</a></p>
<p>[viii] <em>IBID.</em></p>Goodness-of-Fit Testing with SQL Server Part 7.3: The Anderson-Darling Test
http://www.sqlservercentral.com/blogs/multidimensionalmayhem/2016/05/02/goodness-of-fit-testing-with-sql-server-part-73-the-anderson-darling-test/
Tue, 03 May 2016 05:22:37 UT<p><strong>By Steve Bolton</strong></p>
<p><span style="font-size:10pt;color:white;">…………</span>As mentioned in previous installments of this series of amateur self-tutorials, goodness-of-fit tests can be differentiated in many ways, including by the data and content types of the inputs and the mathematical properties, data types and cardinality of the outputs, not to mention the performance impact of the internal calculations in between. Their uses can be further differentiated by the types of probability distributions or regression models they can be applied to and the points within those distributions where their statistical power is highest, such as in the tails of a bell curve or the central point around the median or mean. The Anderson-Darling Test differs from <a href="https://multidimensionalmayhem.wordpress.com/2016/03/23/goodness-of-fit-testing-with-sql-server-part-7-1-the-kolmogorov-smirnov-and-kuipers-tests/">the Kolmogorov-Smirnov Test we recently surveyed</a> and others in its class in a plethora of ways, some of which I was able to glean from sundry comments scattered across the Internet. Unlike many other such tests, it can be applied beyond the usual Gaussian or “normal” distribution to other distributions, including the “lognormal, exponential, Weibull, logistic, extreme value type 1” and “Pareto, and logistic.”[1] This is perhaps its major drawing card, since many other such tests are limited to a narrow range of distributions, usually revolving around the Gaussian.<br />
<span style="font-size:10pt;color:white;">…………</span>In terms of interpretation of the test statistic, it is “generally valid to compare AD values between distributions and go with the lowest.”[2] When used with the normal distribution it is also “close to optimal” in terms of the Bahadur Slope, one of several methods of assessing the usefulness of the test statistics produced by goodness-of-fit tests.[3] One of the drawbacks is that “it performs poorly if there are many ties in the data.”[4] Another is that it may be necessary to multiply the final test statistic by specific constants when testing distributions other than the normal[5], but I was unable to find references to any of them in time to include them in this week’s T-SQL stored procedure. This is not true of the Kolmogorov-Smirnov Test we surveyed a few weeks back, which is “distribution-free as the critical values do not depend on whether Gaussianity is being tested or some other form.”[6]</p>
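<p>Stepping outside T-SQL for a moment, the “go with the lowest” rule quoted above is trivial to script once the statistics are in hand. A minimal Python sketch, with entirely hypothetical A² scores invented for illustration:</p>

```python
# Hypothetical Anderson-Darling scores for one column, computed against
# several candidate distributions; the numbers are invented for illustration.
ad_scores = {"normal": 0.46, "lognormal": 2.17, "weibull": 0.91, "exponential": 5.38}

# Per the rule quoted above, the best-fitting candidate is simply the lowest score.
best_fit = min(ad_scores, key=ad_scores.get)
print(best_fit)  # normal
```

<p>The same selection could of course be done in a T-SQL query with a MIN over a table of per-distribution scores; the point is only that the comparison itself requires no lookup tables.</p>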
<p style="text-align:center;"><strong>Parameter Estimation and EDA</strong></p>
<p> This particular limitation is not as much of an issue in the kind of exploratory data mining that the SQL Server community is more likely to use these tests for, given that we’re normally not performing hypothesis testing; I’ve generally shied away from that topic in this series for many reasons that I’ve belabored in previous articles, like the ease of misinterpretation of confidence intervals and the information loss involved in either-or hypothesis rejections. Don’t get me wrong, hypothesis testing is a valuable and often necessary step when trying to prove a specific point, at the stage of Confirmatory Data Analysis (CDA), but most of our mining use cases revolve around informal Exploratory Data Analysis (EDA), a distinction made in the ‘70s by John W. Tukey, the father of modern data mining.[7] Another issue with hypothesis testing is that most of the lookup tables and approximations weren’t designed with datasets of millions of rows in mind, yet that is exactly what DBAs and miners of SQL Server cubes encounter every day.<br />
This size difference has a side benefit, in that we generally don’t have to estimate the means and variances of our datasets, which is a much bigger issue in the kinds of small random samples that hypothesis tests are normally applied to. One of the properties of the Anderson-Darling Test is that parameter estimation is less of an issue with it, whereas the Lilliefors Test, the subject of last week’s article, is designed specifically for cases where the variance is unknown. There are apparently special formulations where different combinations of the mean and standard deviation are unknown, but these aren’t going to be common in our use cases, since the mean and variance are usually trivial to compute in an instant for millions of rows. Another noteworthy property that may be of more use to us is the fact that the Anderson-Darling Test is more sensitive to departures from normality in the tails of distributions in comparison with other popular fitness tests.[8]</p>
<p style="text-align:center;"><strong>The Perils and Pitfalls of Equation Translation</strong></p>
<p> It is not surprising that this long and varied list of properties differentiates the Anderson-Darling Test from the Kolmogorov-Smirnov, Kuiper’s and Lilliefors Tests we’ve surveyed in the last few articles, given that there are some marked differences in its internal calculations. The inner workings apparently involve transforming the inputs into a uniform distribution, which is still a bit above my head, because I’m still learning stats and stochastics as I go. The same can be said of some of the equations I had to translate for this week’s article, which contained some major stumbling blocks I wasn’t expecting. One of these was the fact that the Anderson-Darling Test is usually categorized along with other methods based on the empirical distribution function (EDF), which, as explained in recent articles, involves computing the difference between the actual values and the probabilities generated for them by the distribution’s cumulative distribution function (CDF). Nevertheless, the CDF is used twice in the calculation of the test statistic and the EDF is not used at all, which led to quite a bit of confusion on my part.<br />
<span style="font-size:10pt;color:white;">…………</span>Another issue I ran into is the fact that the term “N + 1 – i” in the formula actually requires the calculation of an order statistic of the kind we used in <a href="https://multidimensionalmayhem.wordpress.com/2016/02/29/goodness-of-fit-testing-with-sql-server-part-6-1-the-shapiro-wilk-test/">Goodness-of-Fit Testing with SQL Server, part 6.1: The Shapiro-Wilk Test</a>. I won’t recap that topic here, except to say that it is akin to writing all of the values in a dataset on a sheet of paper in order, then folding it in half and adding them up on each side. Prior to that discovery I was mired in trying various combinations of Lead and Lag that just weren’t returning the right outputs. I found an offhand remark after the fact in an academic paper (which I can’t recall in order to give proper credit) to the effect that the identification of this term as an order statistic is missing from most of the literature on the subject for some unknown reason. As I’ve learned over the past few months, the translation of equations[9] is not always as straightforward as I originally thought it would be (even though I already had some experience doing so back in fourth and fifth grade, when my father taught college physics classes and I used to read all of his textbooks). Other remaining issues with the code in Figure 1 include the fact that I may be setting the wrong defaults for the LOG operations on the CDFValues when they’re equal to zero, and the manner in which I handle ties in the order statistics, which may be incorrect. Some of the literature also refers to plugging in the standard normal distribution values of 0 and 1 for the mean and standard deviation. Nevertheless, I verified the output of the procedure on two different sets of examples I found on the Internet, so the code may be correct as is.[10]</p>
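<p>To make the order-statistic pairing concrete, here is a sketch of the same computational formula outside T-SQL, in plain Python; the normal_cdf helper is merely a stand-in for the Calculations.NormalDistributionSingleCDFFunction used in Figure 1. Note how the i-th ascending CDF value is paired with the (N + 1 – i)-th, i.e. the i-th value counted from the descending end, which is the same “folded sheet of paper” pairing described above:</p>

```python
import math

def normal_cdf(x, mean, sd):
    """Gaussian CDF via the error function; a stand-in for
    Calculations.NormalDistributionSingleCDFFunction."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

def anderson_darling(values):
    """Computational form of the test statistic against a fitted normal:
    A2 = -N - (1/N) * SUM (2i - 1) * [ln F(Y_i) + ln(1 - F(Y_(N+1-i)))]"""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    f = sorted(normal_cdf(v, mean, sd) for v in values)  # ascending order statistics
    total = 0.0
    for i in range(1, n + 1):
        # the descending partner F(Y_(N+1-i)) is f[n - i] in zero-based indexing
        total += (2 * i - 1) * (math.log(f[i - 1]) + math.log(1.0 - f[n - i]))
    return -n - total / n
```

<p>A symmetric, roughly bell-shaped sample scores far lower on this statistic than a badly lopsided one, which is the behavior we want from a normality test.</p>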
<p><strong><u>Figure 1: T-SQL Code for the Anderson-Darling Procedure</u></strong></p>
<pre>
CREATE PROCEDURE Calculations.GoodnessofFitAndersonDarlingTestSP
@Database1 as nvarchar(128) = NULL, @Schema1 as nvarchar(128), @Table1 as nvarchar(128), @Column1 AS nvarchar(128)
AS

DECLARE @SchemaAndTable1 nvarchar(400), @SQLString nvarchar(max)
SET @SchemaAndTable1 = @Database1 + '.' + @Schema1 + '.' + @Table1

DECLARE @Mean float,
@StDev float,
@Count float

DECLARE @ValueTable table
(Value float)

DECLARE @CDFTable table
(ID bigint IDENTITY (1,1),
Value float,
CDFValue float)

DECLARE @ExecSQLString nvarchar(max), @MeanOUT nvarchar(200), @StDevOUT nvarchar(200), @CountOUT nvarchar(200), @ParameterDefinition nvarchar(max)
SET @ParameterDefinition = '@MeanOUT nvarchar(200) OUTPUT,@StDevOUT nvarchar(200) OUTPUT,@CountOUT nvarchar(200) OUTPUT '
SET @ExecSQLString = 'SELECT @MeanOUT = CAST(Avg(' + @Column1 + ') as float),@StDevOUT = CAST(StDev(' + @Column1 + ') as float),@CountOUT = CAST(Count(' + @Column1 + ') as float)
       FROM ' + @SchemaAndTable1 + '
       WHERE ' + @Column1 + ' IS NOT NULL'

EXEC sp_executesql @ExecSQLString, @ParameterDefinition, @MeanOUT = @Mean OUTPUT, @StDevOUT = @StDev OUTPUT, @CountOUT = @Count OUTPUT

SET @SQLString = 'SELECT ' + @Column1 + ' AS Value
FROM ' + @SchemaAndTable1 + '
WHERE ' + @Column1 + ' IS NOT NULL'

INSERT INTO @ValueTable
(Value)
EXEC (@SQLString)

INSERT INTO @CDFTable
(Value, CDFValue)
SELECT T1.Value, CDFValue
FROM @ValueTable AS T1
INNER JOIN (SELECT DistinctValue, Calculations.NormalDistributionSingleCDFFunction (DistinctValue, @Mean, @StDev) AS CDFValue
       FROM (SELECT DISTINCT Value AS DistinctValue
       FROM @ValueTable) AS T2) AS T3
       ON T1.Value = T3.DistinctValue

SELECT SUM(((1 - (ID * 2)) / @Count) * (AscendingValue + DescendingValue)) - @Count AS AndersonDarlingTestStatistic
FROM (SELECT TOP 9999999999 ID, CASE WHEN CDFValue = 0 THEN 0 ELSE Log(CDFValue) END AS AscendingValue
FROM @CDFTable
ORDER BY ID) AS T1
       INNER JOIN (SELECT ROW_NUMBER() OVER (ORDER BY CDFValue DESC) AS RN, CASE WHEN 1 - CDFValue = 0 THEN 0 ELSE Log(1 - CDFValue) END AS DescendingValue
       FROM @CDFTable) AS T2
       ON T1.ID = T2.RN

-- this statement is included merely for convenience and can be eliminated
SELECT Value, CDFValue
FROM @CDFTable
ORDER BY Value
</pre>
<p><span style="font-size:10pt;color:white;">…………</span>The code turned out to be quite short in comparison to the length of time it took to write and the number of mistakes I made along the way. Most of it is self-explanatory for readers who are used to the format of the T-SQL procedures I’ve posted in the last two tutorial series. As usual, there is no null-handling, SQL injection or validation code, nor can the parameters handle brackets, which I don’t allow in my own code. The first four parameters allow users to run the procedure on any column in any database they have sufficient access to. The final statement returns the table of CDF values used to calculate the test statistic, since there’s no reason not to now that the costs have already been incurred; as noted above, “this statement is included merely for convenience and can be eliminated.” The joins in the INSERT statement lengthen the code but actually make it more efficient, by enabling the calculation of CDF values just once for each unique column value. The Calculations.<a href="https://www.dropbox.com/s/mph0gmymvbxer3x/NormalDistributionSingleCDFFunction.sql?dl=0">NormalDistributionSingleCDFFunction</a> has been introduced in several previous articles, so I won’t rehash it here. In the SELECT where the test statistic is derived, I used an identity value in the join because ROW_NUMBER operations can be expensive on big tables, so I wanted to avoid doing two in one statement.</p>
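<p>The distinct-value optimization in the INSERT is essentially the same trick as memoizing an expensive function: compute the CDF once per unique value, then fan the cached results back out to every row. A rough Python analogue of that join, using an error-function Gaussian CDF as a stand-in for the real function:</p>

```python
import math

def normal_cdf(x, mean, sd):
    # Gaussian CDF via the error function
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

# a column with heavy duplication, as is typical of real tables
values = [3.0, 3.0, 3.0, 7.0, 7.0, 9.0]
mean = sum(values) / len(values)
sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (len(values) - 1))

# compute the CDF once per DISTINCT value (the derived table in Figure 1)...
cdf_cache = {v: normal_cdf(v, mean, sd) for v in set(values)}
# ...then "join" every row back to its cached value (the INNER JOIN on Value)
cdf_per_row = [cdf_cache[v] for v in values]
```

<p>Here six rows cost only three CDF evaluations; on a column of millions of rows with comparatively few distinct values, the savings become substantial.</p>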
<p><strong><u>Figure 2: Sample Results from the Anderson-Darling Test</u></strong></p>
<pre>
EXEC Calculations.GoodnessofFitAndersonDarlingTestSP
       @Database1 = N'DataMiningProjects',
       @Schema1 = N'Health',
       @Table1 = N'DuchennesTable',
       @Column1 = N'PyruvateKinase'
</pre>
<p><a href="https://multidimensionalmayhem.wordpress.com/2016/05/02/goodness-of-fit-testing-with-sql-server-part-7-3-the-anderson-darling-test/anderson-darling-results/" rel="attachment wp-att-588"><img class="alignnone size-full wp-image-588" src="https://multidimensionalmayhem.files.wordpress.com/2016/05/anderson-darling-results.jpg?w=604" alt="Anderson-Darling Results" /></a></p>
<p><span style="font-size:10pt;color:white;">…………</span>One of the concerns I had when running queries like the one in Figure 2 against the 209 rows of the Duchennes muscular dystrophy dataset and the 11 million rows of the Higgs Boson dataset (which I downloaded from <a href="http://biostat.mc.vanderbilt.edu/wiki/Main/DataSets">Vanderbilt University’s Department of Biostatistics</a> and the <a href="https://archive.ics.uci.edu/ml/datasets/HIGGS">University of California at Irvine’s Machine Learning Repository</a> and converted into SQL Server tables for use in these tutorial series) is that the values seemed to be influenced by the value ranges and cardinality of the inputs. Unlike the last three tests covered in this series, the statistic is not bounded between 0 and 1, though its range might be distribution-dependent. As I discovered when verifying the procedure against other people’s examples, it’s not uncommon for the test statistic to reach the single or double digits. In the examples at the NIST webpage, those for the Lognormal and Cauchy distributions were in the double and triple digits respectively, while that of the double exponential distribution was well above 1, so it may not be unusual to get a test statistic this high. This is exactly what happened with the LactateDehydrogenase, PyruvateKinase and Hemopexin columns, which scored 5.43863473749926, 17.4386371653374 and 5.27843535947881 respectively. Now contrast that range with the Higgs Boson results, where the second float column scored a 12987.3380102254 and the first a whopping 870424.402686672. The problem is not with the accuracy of the results, which are about what I’d expect, given that Column2 clearly follows a bell curve in a histogram while Column1 is ridiculously lopsided. The issue is that for very large counts, the test statistic seems to be inflated, so that it can’t be compared across datasets. Furthermore, once a measure gets up to about six or seven digits to the left of the decimal point, it is common for readers to semiconsciously count the digits and interpolate commas, which is a slow and tedious process. The test statistics are accurate, but suffer from legibility and comparability issues at Big Data-sized record counts.</p>
<p><strong><u>Figure 3: Execution Plan for the Anderson-Darling Procedure<br />
<a href="https://multidimensionalmayhem.wordpress.com/2016/05/02/goodness-of-fit-testing-with-sql-server-part-7-3-the-anderson-darling-test/anderson-darling-execution-plan/" rel="attachment wp-att-589"><img class="alignnone size-full wp-image-589" src="https://multidimensionalmayhem.files.wordpress.com/2016/05/anderson-darling-execution-plan.jpg?w=604&h=336" alt="Anderson-Darling Execution Plan" width="604" height="336" /></a></u></strong></p>
<p><span style="font-size:10pt;color:white;">…………</span>The procedure also performed poorly on the super-sized Higgs Boson dataset, clocking in at 9:24 for Column1 and 9:09 for Column2; moreover, it gobbled up several gigs of RAM and a lot of space in TempDB, probably as a result of the Table Spool in the execution plan above. Perhaps some optimization could also be performed on the Merge Join, which was accompanied by some expensive Sort operators, by forcing a Hash Match or Nested Loops. The major stumbling block is the number of Table Scans, which I tried to overcome with a series of clustered and non-clustered indexes on the table variables, but this unexpectedly degraded the execution time badly, in tandem with outrageous transaction log growth. I’m sure a T-SQL expert could spot ways to optimize this procedure, but as it stands, the inferior performance means it’s not a good fit for our use cases, unless we’re dealing with small recordsets and need to leverage its specific properties. All told, the Anderson-Darling procedure has some limitations that make it a less attractive option for general-purpose fitness testing than the Kolmogorov-Smirnov Test, at least for our unique use cases. On the other hand, it has a well-defined set of use cases based on well-established properties, which means it could be applied to a wide variety of niche scenarios. Among these properties is its superior ability “for detecting most departures from normality.”[11] In the last installment of this series, we’ll discuss the Cramér–von Mises Criterion, another EDF-based method that is closely related to the Anderson-Darling Test and enjoys comparable statistical power in detecting non-normality.[12]</p>
<p>[1] See National Institute for Standards and Technology, 2014, “1.3.5.14 Anderson-Darling Test,” published in the online edition of the Engineering Statistics Handbook. Available at <a href="http://www.itl.nist.gov/div898/handbook/eda/section3/eda35e.htm">http://www.itl.nist.gov/div898/handbook/eda/section3/eda35e.htm</a></p>
<p>[2] Frost, Jim, 2012, “How to Identify the Distribution of Your Data using Minitab,” published March, 2012 at The Minitab Blog web address <a href="http://blog.minitab.com/blog/adventures-in-statistics/how-to-identify-the-distribution-of-your-data-using-minitab">http://blog.minitab.com/blog/adventures-in-statistics/how-to-identify-the-distribution-of-your-data-using-minitab</a></p>
<p>[3] p. 52, No author listed, 1997, <u>Encyclopaedia of Mathematics, Supplemental Vol. 1</u>. Reidel: Dordrecht. This particular page was retrieved from Google Books.</p>
<p>[4] No author listed, 2014, “Checking Gaussianity,” published online at the <u>MedicalBiostatistics.com</u> web address <a href="http://www.medicalbiostatistics.com/checkinggaussianity.pdf">http://www.medicalbiostatistics.com/checkinggaussianity.pdf</a></p>
<p>[5] See the Wikipedia webpage “Anderson-Darling Test” at <a href="http://en.wikipedia.org/wiki/Anderson%E2%80%93Darling_test">http://en.wikipedia.org/wiki/Anderson%E2%80%93Darling_test</a></p>
<p>[6] See the aforementioned <u>MedicalBiostatistics.com </u>article.</p>
<p>[7] See Tukey, John W., 1977, <u>Exploratory Data Analysis</u>. Addison-Wesley: Reading, Mass. I’ve merely heard about the book second-hand and have yet to read it, although I may have encountered a few sections here and there.</p>
<p>[8] <em>IBID.</em></p>
<p>[9] For the most part, I depended on the more legible version in National Institute for Standards and Technology, 2014, “1.3.5.14 Anderson-Darling Test,” published in the online edition of the Engineering Statistics Handbook. Available at <a href="http://www.itl.nist.gov/div898/handbook/eda/section3/eda35e.htm">http://www.itl.nist.gov/div898/handbook/eda/section3/eda35e.htm</a>. All of the sources I consulted, though, had the same notation, without using the term order statistic.</p>
<p>[10] See Frost, 2012, for the sample data he calculated in Minitab and see Alion System Reliability Center, 2014, “Anderson-Darling: A Goodness of Fit Test for Small Samples Assumptions,” published in <u>Selected Topics in Assurance Related Technologies</u>, Vol. 10, No. 5. Available online at the <u>Alion System Reliability Center </u>web address <a href="http://src.alionscience.com/pdf/A_DTest.pdf">http://src.alionscience.com/pdf/A_DTest.pdf</a>. These START publications are well-written, so I’m glad I discovered them recently through Freebird2008’s post on Sept. 4, 2008 at the TalkStats thread “The Anderson-Darling Test,” which is available at <a href="http://www.talkstats.com/showthread.php/5484-The-Anderson-Darling-Test">http://www.talkstats.com/showthread.php/5484-The-Anderson-Darling-Test</a></p>
<p>[11] See the Wikipedia webpage “Anderson-Darling Test” at <a href="http://en.wikipedia.org/wiki/Anderson%E2%80%93Darling_test">http://en.wikipedia.org/wiki/Anderson%E2%80%93Darling_test</a></p>
<p>[12] <em>IBID.</em></p>Goodness-of-Fit Testing with SQL Server Part 7.2: The Lilliefors Test
http://www.sqlservercentral.com/blogs/multidimensionalmayhem/2016/04/15/goodness-of-fit-testing-with-sql-server-part-72-the-lilliefors-test/
Fri, 15 Apr 2016 06:29:43 UT/blogs/multidimensionalmayhem/2016/04/15/goodness-of-fit-testing-with-sql-server-part-72-the-lilliefors-test/4http://www.sqlservercentral.com/blogs/multidimensionalmayhem/2016/04/15/goodness-of-fit-testing-with-sql-server-part-72-the-lilliefors-test/#comments<p><strong>By Steve Bolton</strong></p>
<p><span style="font-size:10pt;color:white;">…………</span>Since I’m teaching myself as I go in this series of self-tutorials, I often have only a vague idea of the challenges that will arise when trying to implement the next goodness-of-fit test with SQL Server. In retrospect, had I known that the Lilliefors Test was so similar to the Kolmogorov-Smirnov and Kuiper’s Tests, I probably would have combined them into a single article. The code for this week’s T-SQL stored procedure is nearly the same, as are the execution plan and the performance. The results are also quite similar to those of the Kolmogorov-Smirnov Test for some of the practice data I’ve used throughout the series, differing in some cases by just a few decimal places. The slight differences may arise from one of the characteristics of the Lilliefors Test that differentiate it from its more famous cousin, namely that “this test of normality is more powerful than others procedures for a wide range of nonnormal conditions.”[i] Otherwise, they share many mathematical properties in common, like location and scale invariance – i.e., the proportions of the test statistics aren’t altered when using a different starting point or multiplying by a common factor.<br />
<span style="font-size:10pt;color:white;">…………</span>On the other hand, the test is apparently more restrictive than the Kolmogorov-Smirnov, in that I’ve seen it referred to specifically as a normality test and I haven’t encountered any mention of it being applied to other distributions. Furthermore, its primary use cases seem to be those in which the variance of the data is unknown[ii], which often doesn’t apply in the types of million-row tables the SQL Server community works with daily. The late Hubert Lilliefors (1928-2008), a stats professor at George Washington University, published it in a Journal of the American Statistical Association article titled “On the Kolmogorov-Smirnov Test for Normality with Mean and Variance Unknown” back in 1967[iii] – so augmenting its more famous cousin in a few niche scenarios seems to have been the raison d’etre from the beginning. We can always use more statistical tests in our toolbox to meet the never-ending welter of distributions that arise from actual physical processes, but I won’t dwell on the Lilliefors Test for long because its narrower use cases are less suited to our needs than those of the broader Kolmogorov-Smirnov Test.</p>
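<p>As a quick numeric check of the location and scale invariance mentioned above, the sketch below (in Python rather than T-SQL, purely for illustration) computes the maximum absolute difference between the EDF and the fitted normal CDF that lies at the heart of these tests; the (i + 1)/n EDF convention and the randomly generated sample are my own assumptions for the demo. Shifting and rescaling the data leaves the statistic unchanged, because the fitted mean and standard deviation shift and rescale right along with it.</p>

```python
import math
import random

def normal_cdf(x, mu, sigma):
    # Gaussian CDF, computed via the error function
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def ks_style_statistic(data):
    # max |EDF - fitted normal CDF|, the common core of the
    # Kolmogorov-Smirnov, Kuiper's and Lilliefors statistics
    n = len(data)
    mu = sum(data) / n
    sigma = math.sqrt(sum((v - mu) ** 2 for v in data) / (n - 1))
    return max(abs((i + 1) / n - normal_cdf(x, mu, sigma))
               for i, x in enumerate(sorted(data)))

random.seed(0)
sample = [random.gauss(100, 15) for _ in range(500)]
d_original = ks_style_statistic(sample)
# a different starting point (-30) and a common factor (0.5)
d_shifted_scaled = ks_style_statistic([0.5 * v - 30 for v in sample])
assert abs(d_original - d_shifted_scaled) < 1e-9
```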
<p style="text-align:center;"><strong>Differences from the Kolmogorov-Smirnov</strong></p>
<p> Another reason for not dwelling on it for too long is that most of the code is identical to that of the stored procedure posted in last week’s article. The Lilliefors Test quantifies the difference between the empirical distribution function (EDF) and cumulative distribution function (CDF) in exactly the same way as the Kolmogorov-Smirnov and Kuiper’s Tests do; in plain English, it orders the actual values and ranks them on a scale of 0 to 1 and computes the difference from the theoretical probability for the Gaussian “normal” distribution, or bell curve, which is also ranked on a scale of 0 to 1. A couple of notes of caution are in order here, because some of the sources I consulted mentioned inputting Z-Scores into the formula and using the standard normal distribution rather than the actual mean and standard deviation of the dataset, but I verified that the procedure is correct as it stands now against an example at Statd.com.[iv]<br />
<span style="font-size:10pt;color:white;">…………</span>One of the main characteristics that set it apart from the Kolmogorov-Smirnov Test is that the test statistic is compared against the Lilliefors distribution, which apparently has a better Bahadur Slope[v] (one of many measures of the efficiency of test statistics) than its competitors in certain hypothesis testing scenarios. That is a broad topic I’ve downplayed for several reasons throughout the last two tutorial series. Among the reasons I’ve brought up in the past are the fact that SQL Server users are more likely to be using these tests for exploratory data mining, not proving specific points of evidence, as well as the ease of misinterpretation of p-values, critical values and confidence intervals even among professional academic researchers. What we need are continuous measures of <em>how</em> closely a dataset follows a particular distribution, not simple Boolean either-or choices of the kind used in hypothesis testing, which reduce the information content of the test statistics as sharply as casting a float data type to a bit would in T-SQL. Furthermore, many of the lookup tables and approximations used in hypothesis testing are only valid up to a few hundred values, not the several million that we would need in Big Data scenarios.</p>
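<p>The plain-English description above can be sketched in a few lines of Python (again, purely as an illustration of the logic, not a substitute for the procedure; the sample values are made up). The EDF is computed the same way the procedure computes it, as a running count divided by the total at each distinct value. The sketch also bears out the note of caution about Z-Scores: standardizing the data first and comparing it against the standard normal yields the same test statistic, because the fitted mean and standard deviation of the Z-Scores are simply 0 and 1.</p>

```python
import math
from collections import Counter

def normal_cdf(z):
    # standard normal CDF via the error function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def lilliefors_statistic(values):
    # EDF = running count / n at each distinct value, ranked 0 to 1;
    # CDF fitted with the sample mean and standard deviation
    n = len(values)
    mean = sum(values) / n
    stdev = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    counts = Counter(values)
    running, d = 0, 0.0
    for value in sorted(counts):
        running += counts[value]
        d = max(d, abs(running / n - normal_cdf((value - mean) / stdev)))
    return d

data = [23.1, 24.6, 24.6, 25.0, 26.2, 26.9, 27.4, 28.8, 29.9, 31.5]
d = lilliefors_statistic(data)

# standardizing first gives the same statistic
mean = sum(data) / len(data)
sd = math.sqrt(sum((v - mean) ** 2 for v in data) / (len(data) - 1))
zscores = [(v - mean) / sd for v in data]
assert abs(d - lilliefors_statistic(zscores)) < 1e-9
```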
<p style="text-align:center;"><strong>Abdi and Molin’s Approximation</strong></p>
<p> The Lilliefors distribution was originally derived from Monte Carlo simulations (a broad term encompassing many types of randomized trials) and at least one attempt has been made to approximate it through a set of constants and equations.[vi] I implemented the approximation developed by Hervé Abdi and Paul Molin, but the first couple of SELECTs and the declarations following the comment “code for Molin and Abdi’s approximation” can be safely deleted if you don’t have a need for the P-values the block generates. I verified the P-Values and @A constants used to generate it against the examples given in their undated manuscript “Lilliefors/Van Soest’s Test of Normality,” but as is commonly the case with such workarounds in hypothesis testing, the algorithm is inapplicable when Big Data-sized values and counts are plugged into it.<br />
<span style="font-size:10pt;color:white;">…………</span>Once @A falls below about 0.74, the approximation begins to return negative P-values, and when it climbs above about 5.66 it produces P-values greater than 1; both are invalid, since probabilities must fall within the range of 0 to 1. Most of the practice datasets I plugged into the approximation returned invalid outputs, most of them strongly negative. This is a problem I’ve seen with other approximation techniques when they’re fed values beyond the expected ranges. Nevertheless, since I already coded it, I’ll leave that section intact in case anyone runs into scenarios where they can apply it to smaller datasets.</p>
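<p>For anyone who wants to sanity-check the approximation outside of T-SQL, here is a Python transcription of the same quadratic root and degree-10 polynomial, using the constants declared in the procedure below; the test statistic and record count inputs are made-up examples of my own. It reproduces the breakdown described above: a Big Data-sized record count paired with a small test statistic drives @A well below the ~0.74 threshold, where the polynomial returns a negative and therefore invalid P-value.</p>

```python
import math

# constants from the procedure's @b0, @b1 and @b2 declarations
B0, B1, B2 = 0.37872256037043, 1.30748185078790, 0.08861783849346
# polynomial coefficients for @PValue, in ascending powers of A
COEFFS = [-0.37782822932809, 1.67819837908004, -3.02959249450445,
          2.80015798142101, -1.39874347510845, 0.40466213484419,
          -0.06353440854207, 0.00287462087623, 0.00069650013110,
          -0.00011872227037, 0.00000575586834]

def abdi_molin_pvalue(d, n):
    # A is the positive root of a quadratic in the count and the
    # test statistic; the P-value is a degree-10 polynomial in A
    a = (-(B1 + n) + math.sqrt((B1 + n) ** 2
         - 4.0 * B2 * (B0 - d ** -2))) / (2.0 * B2)
    return sum(c * a ** i for i, c in enumerate(COEFFS)), a

# a small-sample case keeps A inside the roughly 0.74 to 5.66 range
p_small, a_small = abdi_molin_pvalue(0.1030, 50)
# a million-row count with a tiny test statistic pushes A below ~0.74,
# where the polynomial returns a strongly negative "P-value"
p_big, a_big = abdi_molin_pvalue(0.01, 1_000_000)
assert 0.0 <= p_small <= 1.0
assert p_big < 0.0
```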
<p><strong><u>Figure 1: T-SQL Code for the Lilliefors Goodness-of-Fit Test<br />
</u></strong><span style="font-size:9.5pt;font-family:Consolas;color:blue;">CREATE</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:blue;">PROCEDURE</span> [Calculations]<span class="GramE"><span style="color:gray;">.</span>[</span><span class="SpellE">GoodnessOfFitLillieforsTest</span>]<br />
</span><span style="font-size:9.5pt;font-family:Consolas;">@Database1 <span style="color:blue;">as</span> <span style="color:blue;">nvarchar</span><span style="color:gray;">(</span>128<span style="color:gray;">)</span> <span style="color:gray;">=</span> <span style="color:gray;">NULL,</span> @Schema1 <span style="color:blue;">as</span> <span style="color:blue;">nvarchar</span><span style="color:gray;">(</span>128<span style="color:gray;">),</span> @Table1 <span style="color:blue;">as</span> <span style="color:blue;">nvarchar</span><span style="color:gray;">(</span>128<span style="color:gray;">),</span>@Column1 <span style="color:blue;">AS</span> <span style="color:blue;">nvarchar</span><span style="color:gray;">(</span>128<span style="color:gray;">),</span> @OrderByCode <span style="color:blue;">as</span> <span style="color:blue;">tinyint<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:blue;">AS</span></p>
<p class="MsoNormal"><span style="font-size:9.5pt;font-family:Consolas;color:blue;">DECLARE </span><span style="font-size:9.5pt;font-family:Consolas;">@SchemaAndTable1 <span class="GramE"><span style="color:blue;">nvarchar</span><span style="color:gray;">(</span></span>400<span style="color:gray;">),</span>@<span class="SpellE">SQLString</span> <span style="color:blue;">nvarchar</span><span style="color:gray;">(</span><span style="color:fuchsia;">max</span><span style="color:gray;">)<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:blue;">SET </span><span style="font-size:9.5pt;font-family:Consolas;">@SchemaAndTable1 <span style="color:gray;">=</span> @Database1 <span style="color:gray;">+</span> <span style="color:red;">‘.’</span> <span style="color:gray;">+</span> @Schema1 <span style="color:gray;">+</span> <span style="color:red;">‘.’</span> <span style="color:gray;">+</span> @Table1 </span></p>
<p class="MsoNormal"><span style="font-size:9.5pt;font-family:Consolas;color:blue;">DECLARE</span><span style="font-size:9.5pt;font-family:Consolas;"> @Mean <span style="color:blue;">float</span><span style="color:gray;">,<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;">@StDev <span style="color:blue;">float</span><span style="color:gray;">,<br />
</span></span><span class="GramE"><span style="font-size:9.5pt;font-family:Consolas;">@Count<span> </span><span style="color:blue;">float</span></span></span></p>
<p class="MsoNormal"><span style="font-size:9.5pt;font-family:Consolas;color:blue;">DECLARE</span><span style="font-size:9.5pt;font-family:Consolas;"> @<span class="SpellE">EDFTable</span> <span style="color:blue;">table<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:gray;">(</span><span style="font-size:9.5pt;font-family:Consolas;">ID <span style="color:blue;">bigint</span> <span style="color:blue;">IDENTITY </span><span style="color:gray;">(</span>1<span class="GramE"><span style="color:gray;">,</span>1</span><span style="color:gray;">),<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;">Value <span style="color:blue;">float</span><span style="color:gray;">,<br />
</span></span><span class="SpellE"><span style="font-size:9.5pt;font-family:Consolas;">ValueCount</span></span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:blue;">bigint</span><span style="color:gray;">,</span><br />
</span><span class="SpellE"><span style="font-size:9.5pt;font-family:Consolas;">EDFValue</span></span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:blue;">float</span><span style="color:gray;">,<br />
</span></span><span class="SpellE"><span style="font-size:9.5pt;font-family:Consolas;">CDFValue</span></span><span style="font-size:9.5pt;font-family:Consolas;"> <span class="GramE"><span style="color:blue;">decimal</span><span style="color:gray;">(</span></span>38<span style="color:gray;">,</span>37<span style="color:gray;">),<br />
</span></span><span class="SpellE"><span style="font-size:9.5pt;font-family:Consolas;">EDFCDFDifference</span></span><span style="font-size:9.5pt;font-family:Consolas;"> <span class="GramE"><span style="color:blue;">decimal</span><span style="color:gray;">(</span></span>38<span style="color:gray;">,</span>37<span style="color:gray;">))</span></span><span style="font-size:9.5pt;font-family:Consolas;"> </span></p>
<p class="MsoNormal"><span style="font-size:9.5pt;font-family:Consolas;color:blue;">DECLARE</span><span style="font-size:9.5pt;font-family:Consolas;"> @<span class="SpellE">ExecSQLString</span> <span style="color:blue;">nvarchar</span><span style="color:gray;">(</span><span style="color:fuchsia;">max</span><span style="color:gray;">),</span> @<span class="SpellE">MeanOUT</span> <span style="color:blue;">nvarchar</span><span style="color:gray;">(</span>200<span style="color:gray;">),</span>@<span class="SpellE">StDevOUT</span> <span style="color:blue;">nvarchar</span><span style="color:gray;">(</span>200<span style="color:gray;">),</span>@<span class="SpellE">CountOUT</span> <span style="color:blue;">nvarchar</span><span style="color:gray;">(</span>200<span style="color:gray;">),</span> @<span class="SpellE">ParameterDefinition</span> <span style="color:blue;">nvarchar</span><span style="color:gray;">(</span><span style="color:fuchsia;">max</span><span style="color:gray;">)<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:blue;">SET</span><span style="font-size:9.5pt;font-family:Consolas;"> @<span class="SpellE">ParameterDefinition</span> <span style="color:gray;">=</span> <span style="color:red;">‘@<span class="SpellE">MeanOUT</span> <span class="GramE">nvarchar(</span>200) OUTPUT,@<span class="SpellE">StDevOUT</span> nvarchar(200) OUTPUT,@<span class="SpellE">CountOUT</span> nvarchar(200) OUTPUT ‘<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:blue;">SET</span><span style="font-size:9.5pt;font-family:Consolas;"> @<span class="SpellE">ExecSQLString</span> <span style="color:gray;">=</span> <span style="color:red;">‘SELECT @<span class="SpellE">MeanOUT</span> = CAST(<span class="SpellE">Avg</span>(‘</span> <span style="color:gray;">+</span> @Column1 <span style="color:gray;">+</span> <span style="color:red;">‘) as float),@<span class="SpellE">StDevOUT</span> = CAST(StDev(‘</span> <span style="color:gray;">+ </span>@Column1 <span style="color:gray;">+</span> <span style="color:red;">‘) as float),@<span class="SpellE">CountOUT</span> = CAST(Count(‘</span> <span style="color:gray;">+</span> @Column1 <span style="color:gray;">+</span> <span style="color:red;">‘) as float)<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:red;"><span> </span>FROM ‘</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:gray;">+ </span>@SchemaAndTable1 <span class="GramE"><span style="color:gray;">+</span> <span style="color:red;">‘<br />
</span></span></span><span style="font-size:9.5pt;font-family:Consolas;color:red;"><span> </span>WHERE ‘</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:gray;">+ </span>@Column1 <span style="color:gray;">+</span> <span style="color:red;">‘ IS NOT NULL’</span></span></p>
<p class="MsoNormal"><span style="font-size:9.5pt;font-family:Consolas;color:blue;">EXEC</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:maroon;">sp_executesql</span> @<span class="SpellE">ExecSQLString</span><span class="GramE"><span style="color:gray;">,</span>@<span class="SpellE">ParameterDefinition</span></span><span style="color:gray;">, </span>@<span class="SpellE">MeanOUT</span> <span style="color:gray;">=</span> @Mean <span style="color:blue;">OUTPUT</span><span style="color:gray;">,</span>@<span class="SpellE">StDevOUT</span> <span style="color:gray;">=</span> @StDev <span style="color:blue;">OUTPUT</span><span style="color:gray;">,</span>@<span class="SpellE">CountOUT</span> <span style="color:gray;">=</span> @Count <span style="color:blue;">OUTPUT</span></span></p>
<p class="MsoNormal"><span style="font-size:9.5pt;font-family:Consolas;color:blue;">SET</span><span style="font-size:9.5pt;font-family:Consolas;"> @<span class="SpellE">SQLString</span> <span class="GramE"><span style="color:gray;">= </span><span style="color:red;">‘</span></span></span><span style="font-size:9.5pt;font-family:Consolas;color:red;">SELECT Value, <span class="SpellE">ValueCount</span>, <span class="GramE">SUM(</span><span class="SpellE">ValueCount</span>) OVER (ORDER BY Value ASC) / CAST(‘</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:gray;">+</span> <span style="color:fuchsia;">CAST</span><span style="color:gray;">(</span>@Count <span style="color:blue;">as</span> <span style="color:blue;">nvarchar</span><span style="color:gray;">(</span>50<span style="color:gray;">))</span> <span style="color:gray;">+</span> <span style="color:red;">‘AS float) AS <span class="SpellE">EDFValue<br />
</span></span></span><span style="font-size:9.5pt;font-family:Consolas;color:red;"><span> </span>FROM (SELECT DISTINCT<span> </span>‘</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:gray;">+ </span>@Column1 <span style="color:gray;">+</span> <span style="color:red;">‘ AS <span> </span>Value, <span class="GramE">Count(</span>‘ </span><span style="color:gray;">+</span> @Column1 <span style="color:gray;">+</span> <span style="color:red;">‘) OVER (PARTITION BY ‘</span> <span style="color:gray;">+ </span>@Column1 <span style="color:gray;">+</span> <span style="color:red;">‘ ORDER BY ‘ </span><span style="color:gray;">+</span> @Column1 <span style="color:gray;">+</span> <span style="color:red;">‘) AS <span class="SpellE">ValueCount<br />
</span></span></span><span style="font-size:9.5pt;font-family:Consolas;color:red;"><span> </span>FROM ‘</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:gray;">+</span> @SchemaAndTable1 <span class="GramE"><span style="color:gray;">+</span> <span style="color:red;">‘<br />
</span></span></span><span style="font-size:9.5pt;font-family:Consolas;color:red;">WHERE ‘</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:gray;">+</span> @Column1 <span style="color:gray;">+ </span><span style="color:red;">‘ IS NOT NULL) AS T1</span></span><span style="font-size:9.5pt;font-family:Consolas;color:red;">‘</span><span style="font-size:9.5pt;font-family:Consolas;"><span> </span></span><span style="font-size:9.5pt;font-family:Consolas;"> </span></p>
<p class="MsoNormal"><span style="font-size:9.5pt;font-family:Consolas;color:blue;">INSERT</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:blue;">INTO</span> @<span class="SpellE">EDFTable </span></span><span style="font-size:9.5pt;font-family:Consolas;color:gray;">(</span><span style="font-size:9.5pt;font-family:Consolas;">Value<span style="color:gray;">,</span> <span class="SpellE">ValueCount</span><span style="color:gray;">,</span> <span class="SpellE">EDFValue)<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:blue;">EXEC </span><span style="font-size:9.5pt;font-family:Consolas;color:gray;">(</span><span style="font-size:9.5pt;font-family:Consolas;">@<span class="SpellE">SQLString</span><span style="color:gray;">)</span> </span></p>
<p class="MsoNormal"><span style="font-size:9.5pt;font-family:Consolas;color:fuchsia;">UPDATE</span><span style="font-size:9.5pt;font-family:Consolas;"> T1<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:blue;">SET</span><span style="font-size:9.5pt;font-family:Consolas;"> <span class="SpellE">CDFValue</span> <span style="color:gray;">=</span> T3<span style="color:gray;">.</span>CDFValue<span style="color:gray;">,</span> <span class="SpellE">EDFCDFDifference</span> <span style="color:gray;">=</span> <span class="SpellE">EDFValue</span> <span style="color:gray;">-</span> T3<span style="color:gray;">.</span>CDFValue<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:blue;">FROM</span><span style="font-size:9.5pt;font-family:Consolas;"> @<span class="SpellE">EDFTable</span> <span style="color:blue;">AS</span> T1<br />
</span><span style="font-size:9.5pt;font-family:Consolas;"><span> </span><span style="color:gray;">INNER</span> <span style="color:gray;">JOIN</span> </span><span style="font-size:9.5pt;font-family:Consolas;color:gray;">(</span><span style="font-size:9.5pt;font-family:Consolas;color:blue;">SELECT</span><span style="font-size:9.5pt;font-family:Consolas;"> <span class="SpellE">DistinctValue</span><span style="color:gray;">,</span> <span class="SpellE">Calculations<span style="color:gray;">.</span>NormalCalculationsingleCDFFunction</span> <span style="color:gray;">(</span><span class="SpellE">DistinctValue</span><span style="color:gray;">,</span> @Mean<span style="color:gray;">,</span> @StDev<span style="color:gray;">)</span> <span style="color:blue;">AS</span> <span class="SpellE">CDFValue</span><br />
</span><span style="font-size:9.5pt;font-family:Consolas;"><span> </span><span style="color:blue;">FROM </span><span style="color:gray;">(</span><span style="color:blue;">SELECT</span> <span style="color:blue;">DISTINCT</span> Value <span style="color:blue;">AS</span> <span class="SpellE">DistinctValue<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;"><span> </span><span style="color:blue;">FROM</span> @<span class="SpellE">EDFTable</span><span style="color:gray;">)</span> <span style="color:blue;">AS</span> T2<span style="color:gray;">)</span> <span style="color:blue;">AS</span> T3<br />
</span><span style="font-size:9.5pt;font-family:Consolas;"><span style="color:blue;"><span> </span>ON</span> T1<span style="color:gray;">.</span>Value <span style="color:gray;">=</span> T3<span style="color:gray;">.</span>DistinctValue</span></p>
<p class="MsoNormal"><span style="font-size:9.5pt;font-family:Consolas;color:blue;">DECLARE </span><span style="font-size:9.5pt;font-family:Consolas;">@b0 <span style="color:blue;">float</span> <span style="color:gray;">= </span>0.37872256037043<span style="color:gray;">,<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;">@b1 <span style="color:blue;">float</span> <span style="color:gray;">= </span>1.30748185078790<span style="color:gray;">,<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;">@b2 <span style="color:blue;">float</span> <span style="color:gray;">= </span>0.08861783849346<span style="color:gray;">,<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;">@A <span style="color:blue;">float</span><span style="color:gray;">,<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;">@<span class="SpellE">PValue</span> <span style="color:blue;">float</span><span style="color:gray;">,<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;">@<span class="SpellE">LillieforsTestStatistic</span> <span style="color:blue;">float</span></span><span style="font-size:9.5pt;font-family:Consolas;"> </span></p>
<p class="MsoNormal"><span style="font-size:9.5pt;font-family:Consolas;color:blue;">SELECT</span><span style="font-size:9.5pt;font-family:Consolas;"> @<span class="SpellE">LillieforsTestStatistic</span> <span style="color:gray;">=</span> <span class="GramE"><span style="color:fuchsia;">Max</span><span style="color:gray;">(</span></span><span style="color:fuchsia;">ABS</span><span style="color:gray;">(</span><span class="SpellE">EDFCDFDifference</span><span style="color:gray;">))<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:blue;">FROM</span><span style="font-size:9.5pt;font-family:Consolas;"> @<span class="SpellE">EDFTable</span></span></p>
<p class="MsoNormal"><span style="font-size:9.5pt;font-family:Consolas;color:green;">-- code for Molin and Abdi’s approximation<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:green;">-- =======================================<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:blue;">SELECT</span><span style="font-size:9.5pt;font-family:Consolas;"> @A <span style="color:gray;">=</span> <span style="color:gray;">((-</span>1 <span style="color:gray;">*</span> <span style="color:gray;">(</span>@b1 <span style="color:gray;">+</span> @Count<span style="color:gray;">))</span> <span class="GramE"><span style="color:gray;">+ </span><span style="color:fuchsia;">Power</span></span><span style="color:gray;">(</span><span style="color:fuchsia;">Power</span><span style="color:gray;">((</span>@b1 <span style="color:gray;">+</span> @Count<span style="color:gray;">),</span> 2<span style="color:gray;">)</span> <span style="color:gray;">-</span> <span style="color:gray;">(</span>4 <span style="color:gray;">*</span> @b2 <span style="color:gray;">*</span> <span style="color:gray;">(</span>@b0 <span style="color:gray;">-</span> <span style="color:fuchsia;">Power</span><span style="color:gray;">(</span>@<span class="SpellE">LillieforsTestStatistic</span><span style="color:gray;">,</span> <span style="color:gray;">-</span>2<span style="color:gray;">))),</span> 0.5<span style="color:gray;">))</span> <span style="color:gray;">/</span> <span style="color:gray;">(</span>2 <span style="color:gray;">*</span> @b2<span style="color:gray;">)<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:blue;">SELECT</span><span style="font-size:9.5pt;font-family:Consolas;"> <span class="GramE">@<span class="SpellE">PValue </span><span style="color:gray;">=</span></span> <span style="color:gray;">–</span>0.37782822932809 </span><span style="font-size:9.5pt;font-family:Consolas;color:gray;">+ </span><span style="font-size:9.5pt;font-family:Consolas;color:gray;">(</span><span style="font-size:9.5pt;font-family:Consolas;">1.67819837908004 <span style="color:gray;">*</span> @A<span style="color:gray;">)<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:gray;">– </span><span style="font-size:9.5pt;font-family:Consolas;color:gray;">(</span><span style="font-size:9.5pt;font-family:Consolas;">3.02959249450445 <span style="color:gray;">*</span> <span class="GramE"><span style="color:fuchsia;">Power</span><span style="color:gray;">(</span></span>@A<span style="color:gray;">,</span> 2<span style="color:gray;">))<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:gray;">+ </span><span style="font-size:9.5pt;font-family:Consolas;color:gray;">(</span><span style="font-size:9.5pt;font-family:Consolas;">2.80015798142101 <span style="color:gray;">*</span> <span class="GramE"><span style="color:fuchsia;">Power</span><span style="color:gray;">(</span></span>@A<span style="color:gray;">,</span> 3<span style="color:gray;">))<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:gray;">- </span><span style="font-size:9.5pt;font-family:Consolas;color:gray;">(</span><span style="font-size:9.5pt;font-family:Consolas;">1.39874347510845 <span style="color:gray;">*</span> <span class="GramE"><span style="color:fuchsia;">Power</span><span style="color:gray;">(</span></span>@A<span style="color:gray;">,</span> 4<span style="color:gray;">))<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:gray;">+ </span><span style="font-size:9.5pt;font-family:Consolas;color:gray;">(</span><span style="font-size:9.5pt;font-family:Consolas;">0.40466213484419 <span style="color:gray;">*</span> <span class="GramE"><span style="color:fuchsia;">Power</span><span style="color:gray;">(</span></span>@A<span style="color:gray;">,</span> 5<span style="color:gray;">))<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:gray;">- </span><span style="font-size:9.5pt;font-family:Consolas;color:gray;">(</span><span style="font-size:9.5pt;font-family:Consolas;">0.06353440854207 <span style="color:gray;">*</span> <span class="GramE"><span style="color:fuchsia;">Power</span><span style="color:gray;">(</span></span>@A<span style="color:gray;">,</span> 6<span style="color:gray;">))<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:gray;">+ </span><span style="font-size:9.5pt;font-family:Consolas;color:gray;">(</span><span style="font-size:9.5pt;font-family:Consolas;">0.00287462087623 <span style="color:gray;">*</span> <span class="GramE"><span style="color:fuchsia;">Power</span><span style="color:gray;">(</span></span>@A<span style="color:gray;">,</span> 7<span style="color:gray;">))<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:gray;">+ </span><span style="font-size:9.5pt;font-family:Consolas;color:gray;">(</span><span style="font-size:9.5pt;font-family:Consolas;">0.00069650013110 <span style="color:gray;">*</span> <span class="GramE"><span style="color:fuchsia;">Power</span><span style="color:gray;">(</span></span>@A<span style="color:gray;">,</span> 8<span style="color:gray;">))</span><br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:gray;">- </span><span style="font-size:9.5pt;font-family:Consolas;color:gray;">(</span><span style="font-size:9.5pt;font-family:Consolas;">0.00011872227037 <span style="color:gray;">*</span> <span class="GramE"><span style="color:fuchsia;">Power</span><span style="color:gray;">(</span></span>@A<span style="color:gray;">,</span> 9<span style="color:gray;">))<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:gray;">+ </span><span style="font-size:9.5pt;font-family:Consolas;color:gray;">(</span><span style="font-size:9.5pt;font-family:Consolas;">0.00000575586834 <span style="color:gray;">*</span> <span class="GramE"><span style="color:fuchsia;">Power</span><span style="color:gray;">(</span></span>@A<span style="color:gray;">,</span> 10<span style="color:gray;">))</span></span></p>
<p class="MsoNormal"><span class="GramE"><span style="font-size:9.5pt;font-family:Consolas;color:blue;">SELECT</span><span style="font-size:9.5pt;font-family:Consolas;"><span> </span>@<span class="SpellE">LillieforsTestStatistic</span></span></span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:blue;">AS</span> <span class="SpellE">LillieforsTestStatistic</span><span style="color:gray;">,</span> @<span class="SpellE">PValue</span> <span style="color:blue;">AS</span> <span class="SpellE">PValueAbdiMollinApproximation<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:blue;">SELECT</span><span style="font-size:9.5pt;font-family:Consolas;"> ID<span style="color:gray;">,</span> Value<span style="color:gray;">,</span> <span class="SpellE">ValueCount</span><span style="color:gray;">,</span> <span class="SpellE">EDFValue</span><span style="color:gray;">,</span> <span class="SpellE">CDFValue</span><span style="color:gray;">,</span> <span class="SpellE">EDFCDFDifference</span><br />
<span style="color:blue;">FROM</span> </span><span style="font-size:9.5pt;font-family:Consolas;">@<span class="SpellE">EDFTable<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;"><span style="color:blue;">ORDER</span> <span style="color:blue;">BY </span></span><span style="font-size:9.5pt;font-family:Consolas;color:blue;">CASE</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:blue;">WHEN</span> @OrderByCode <span style="color:gray;">=</span> 1 <span style="color:blue;">THEN</span> ID <span style="color:blue;">END</span> <span style="color:blue;">ASC</span><span style="color:gray;">,<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:blue;">CASE</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:blue;">WHEN</span> @OrderByCode <span style="color:gray;">=</span> 2 <span style="color:blue;">THEN</span> ID <span style="color:blue;">END</span> <span style="color:blue;">DESC</span><span style="color:gray;">,<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:blue;">CASE</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:blue;">WHEN</span> @OrderByCode <span style="color:gray;">=</span> 3 <span style="color:blue;">THEN</span> Value <span style="color:blue;">END</span> <span style="color:blue;">ASC</span><span style="color:gray;">,<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:blue;">CASE</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:blue;">WHEN</span> @OrderByCode <span style="color:gray;">=</span> 4 <span style="color:blue;">THEN</span> Value <span style="color:blue;">END</span> <span style="color:blue;">DESC</span><span style="color:gray;">,<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:blue;">CASE</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:blue;">WHEN</span> @OrderByCode <span style="color:gray;">=</span> 5 <span style="color:blue;">THEN</span> <span class="SpellE">ValueCount</span> <span style="color:blue;">END</span> <span style="color:blue;">ASC</span><span style="color:gray;">,<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:blue;">CASE</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:blue;">WHEN</span> @OrderByCode <span style="color:gray;">=</span> 6 <span style="color:blue;">THEN</span> <span class="SpellE">ValueCount</span> <span style="color:blue;">END</span> <span style="color:blue;">DESC</span><span style="color:gray;">,<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:blue;">CASE</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:blue;">WHEN</span> @OrderByCode <span style="color:gray;">=</span> 7 <span style="color:blue;">THEN</span> <span class="SpellE">EDFValue</span> <span style="color:blue;">END</span> <span style="color:blue;">ASC</span><span style="color:gray;">,<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:blue;">CASE</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:blue;">WHEN</span> @OrderByCode <span style="color:gray;">=</span> 8 <span style="color:blue;">THEN</span> <span class="SpellE">EDFValue</span> <span style="color:blue;">END</span> <span style="color:blue;">DESC</span><span style="color:gray;">,<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:blue;">CASE</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:blue;">WHEN</span> @OrderByCode <span style="color:gray;">=</span> 9 <span style="color:blue;">THEN</span> <span class="SpellE">CDFValue</span> <span style="color:blue;">END</span> <span style="color:blue;">ASC</span><span style="color:gray;">,<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:blue;">CASE</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:blue;">WHEN</span> @OrderByCode <span style="color:gray;">= </span>10 <span style="color:blue;">THEN</span> <span class="SpellE">CDFValue</span> <span style="color:blue;">END</span> <span style="color:blue;">DESC</span><span style="color:gray;">,<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:blue;">CASE</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:blue;">WHEN</span> @OrderByCode <span style="color:gray;">= </span>11 <span style="color:blue;">THEN</span> <span class="SpellE">EDFCDFDifference </span><span style="color:blue;">END</span> <span style="color:blue;">ASC</span><span style="color:gray;">,<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:blue;">CASE</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:blue;">WHEN</span> @OrderByCode <span style="color:gray;">= </span>12 <span style="color:blue;">THEN</span> <span class="SpellE">EDFCDFDifference </span><span style="color:blue;">END</span> <span style="color:blue;">DESC</span></span></p>
<p><strong><u>Figure 2: Sample Results from the Lilliefors Goodness-of-Fit Test<br />
</u></strong><span style="font-size:9.5pt;font-family:Consolas;color:blue;">EXEC</span><span style="font-size:9.5pt;font-family:Consolas;"><span> </span><span class="SpellE"><span style="color:teal;">Calculations</span><span style="color:gray;">.</span><span style="color:teal;">GoodnessofFitLillieforsTestSP<br />
</span></span></span><span style="font-size:9.5pt;font-family:Consolas;color:blue;"><span> </span></span><span style="font-size:9.5pt;font-family:Consolas;color:teal;">@DatabaseName</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:gray;">=</span> <span class="SpellE"><span style="color:red;">N'DataMiningProjects</span></span><span style="color:red;">'</span><span style="color:gray;">,<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;"><span> </span><span style="color:teal;">@<span class="SpellE">SchemaName</span></span> <span style="color:gray;">=</span> <span class="SpellE"><span style="color:red;">N'Health</span></span><span style="color:red;">'</span><span style="color:gray;">,<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;"><span> </span><span style="color:teal;">@<span class="SpellE">TableName</span></span> <span style="color:gray;">=</span> <span class="SpellE"><span style="color:red;">N'DuchennesTable</span></span><span style="color:red;">'</span><span style="color:gray;">,<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;"><span> </span><span style="color:teal;">@<span class="SpellE">ColumnName</span></span> <span style="color:gray;">=</span> <span class="SpellE"><span style="color:red;">N'LactateDehydrogenase</span></span><span style="color:red;">'</span><span style="color:gray;">,<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;"><span> </span><span style="color:teal;">@OrderByCode </span><span style="color:gray;">=</span> <span style="color:red;">'1'</span></span></p>
<p><a href="https://multidimensionalmayhem.wordpress.com/2016/04/14/goodness-of-fit-testing-with-sql-server-part-7-2-the/lillieforsresults/" rel="attachment wp-att-582"><img class="alignnone size-full wp-image-582" src="https://multidimensionalmayhem.files.wordpress.com/2016/04/lillieforsresults.jpg?w=604&h=395" alt="LillieforsResults" width="604" height="395" /></a></p>
<p><span style="font-size:10pt;color:white;">…………</span>Aside from the approximation section, the code in Figure 1 is almost identical to that of last week’s procedure, so I won’t belabor the point by rehashing the explanation here. As usual, I used queries like the one in Figure 2 to test the procedure against several columns in a 209-row dataset on the Duchennes form of muscular dystrophy and an 11-million-row dataset on the Higgs Boson, which are made publicly available by <a href="http://biostat.mc.vanderbilt.edu/wiki/Main/DataSets">Vanderbilt University’s Department of Biostatistics</a> and the <a href="https://archive.ics.uci.edu/ml/datasets/HIGGS">University of California at Irvine’s Machine Learning Repository</a> respectively. It is not surprising that the results nearly matched the Kolmogorov-Smirnov test statistic for many practice columns. For example, the LactateDehydrogenase enzyme scored 0.128712871287129 here and 0.131875117324784 on the Kolmogorov-Smirnov, while the less abnormal Hemopexin protein scored 0.116783569553499 on the Lilliefors and 0.0607407215998911 on the Kolmogorov-Smirnov Test. Likewise, the highly abnormal first float column and Gaussian second column in the Higgs Boson table had test statistics of 0.276267238731715 and 0.0181893798916693 respectively, which were quite close to the results of the Kolmogorov-Smirnov. I cannot say whether the departure in the case of Hemopexin was the result of some property of the test itself, like its aforementioned higher statistical power for detecting non-normality, or perhaps a coding error on my part. Either way, it would probably be worthwhile to calculate the Lilliefors test statistic together with the Kolmogorov-Smirnov and Kuiper’s measures and return them in one batch, to give end users a sharper picture of their data at virtually no computational cost.</p>
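Since the Lilliefors statistic is computed exactly like the Kolmogorov-Smirnov D (the largest gap between the EDF and the CDF), returning all three measures in one batch mostly amounts to sharing a single EDF pass and applying different significance lookups at the end. As a rough illustration of that last step, here is the standard asymptotic p-value series for the plain Kolmogorov-Smirnov statistic, sketched in Python rather than T-SQL; the Kolmogorov series and Stephens' finite-sample correction are textbook formulas, whereas the Lilliefors variant must instead consult its own tables or the Abdi-Molin approximation shown in Figure 1, because estimating the mean and standard deviation from the data shifts the null distribution.

```python
from math import exp, sqrt

def ks_p_value(d_statistic, n, terms=100):
    """Asymptotic two-sided p-value for the Kolmogorov-Smirnov D statistic,
    using the Kolmogorov distribution series with Stephens' finite-sample
    correction. Not valid for the Lilliefors case, where the parameters
    were estimated from the same data and stricter critical values apply."""
    lam = (sqrt(n) + 0.12 + 0.11 / sqrt(n)) * d_statistic
    # Alternating series Q(lam) = 2 * sum_k (-1)^(k-1) * exp(-2 k^2 lam^2)
    series = sum((-1) ** (k - 1) * exp(-2.0 * (k * lam) ** 2)
                 for k in range(1, terms + 1))
    return max(0.0, min(1.0, 2.0 * series))
```

Note that on an 11-million-row table even a modest D drives this p-value to zero almost immediately, which is one reason interpretation on large SQL Server datasets usually rests on the statistic itself rather than on the significance level.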
<p><strong><u>Figure 3: Execution Plan for the Lilliefors Goodness-of-Fit Test</u></strong> (click to enlarge)<br />
<a href="https://multidimensionalmayhem.wordpress.com/2016/04/14/goodness-of-fit-testing-with-sql-server-part-7-2-the/lillieforsexecutionplan/" rel="attachment wp-att-583"><img class="alignnone size-full wp-image-583" src="https://multidimensionalmayhem.files.wordpress.com/2016/04/lillieforsexecutionplan.jpg?w=604&h=282" alt="LillieforsExecutionPlan" width="604" height="282" /></a><br />
<span style="font-size:10pt;color:white;">…………</span>There were six queries in the execution plan, just as there were for last week’s tests, but the first accounted for 19 percent and the second 82 percent of the batch total. Both of those began with non-clustered Index Seeks, which is exactly what we want to see. Only the second would provide any worthwhile opportunities for further optimization, perhaps by targeting the only operators besides the seek that contributed significantly to the query cost: a Hash Match (Aggregate) at 14 percent, a Stream Aggregate at 10 percent and two Parallelism (Repartition Streams) operators that together amounted to 53 percent. Optimization might not really be necessary, given that the first float column in the mammoth Higgs Boson dataset returned in just 23 seconds and the second in 27. Your results are likely to be several orders of magnitude better, considering that the procedure was executed on an <a href="https://www.youtube.com/watch?v=_7QiQUFQ53U">antiquated semblance of a workstation that is an adventure to start up</a>, not a real database server. The only other fitness tests in this series this fast were the Kolmogorov-Smirnov and Kuiper’s Tests, which I would have calculated together with this test in a single procedure if I’d known there was so much overlap between them. The Anderson-Darling Test we’ll survey in the next installment of the series belongs to the same category of EDF-based fitness tests, but has less in common with the Lilliefors Test and its aforementioned cousins. Unfortunately, high performance is apparently not among the characteristics the Anderson-Darling Test shares with its fellow EDF-based methods. That’s something of a shame, since it is more widely used by real statisticians than many other goodness-of-fit tests.</p>
<p>[i] p. 1, Abdi, Hervé and Molin, Paul, undated manuscript “Lilliefors/Van Soest’s Test of Normality,” published at the <u>University of Texas at Dallas School of Behavioral and Brain Sciences</u> web address <a href="https://www.utdallas.edu/~herve/Abdi-Lillie2007-pretty.pdf">https://www.utdallas.edu/~herve/Abdi-Lillie2007-pretty.pdf</a></p>
<p>[ii] See the <u>Wikipedia</u> page “Lilliefors Test” at <a href="http://en.wikipedia.org/wiki/Lilliefors_test">http://en.wikipedia.org/wiki/Lilliefors_test</a></p>
<p>[iii] Lilliefors, Hubert W., 1967, “On the Kolmogorov-Smirnov Test for Normality with Mean and Variance Unknown,” pp. 399-402 in <u>Journal of the American Statistical Association</u>, Vol. 62, No. 318. June, 1967.</p>
<p>[iv] See the <u>Statd.com</u> webpage “Lilliefors Normality Test” at <a href="http://statltd.com/articles/lilliefors.htm">http://statltd.com/articles/lilliefors.htm</a></p>
<p>[v] See Arcones, Miguel A., 2006, “On the Bahadur Slope of the Lilliefors and the Cramér–von Mises Tests of Normality,” pp. 196-206 in the <u>Institute of Mathematical Statistics Lecture Notes – Monograph Series</u>. No. 51. Available at the web address <a href="https://projecteuclid.org/euclid.lnms/1196284113">https://projecteuclid.org/euclid.lnms/1196284113</a></p>
<p>[vi] See p. 3, Abdi and Molin and the aforementioned <u>Wikipedia</u> page “Lilliefors Test” at <a href="http://en.wikipedia.org/wiki/Lilliefors_test">http://en.wikipedia.org/wiki/Lilliefors_test</a></p>Goodness-of-Fit Testing with SQL Server Part 7.1: The Kolmogorov-Smirnov and Kuiper’s Tests
http://www.sqlservercentral.com/blogs/multidimensionalmayhem/2016/03/24/goodness-of-fit-testing-with-sql-server-part-71-the-kolmogorov-smirnov-and-kuipers-tests/
Thu, 24 Mar 2016 08:26:48 UT/blogs/multidimensionalmayhem/2016/03/24/goodness-of-fit-testing-with-sql-server-part-71-the-kolmogorov-smirnov-and-kuipers-tests/1http://www.sqlservercentral.com/blogs/multidimensionalmayhem/2016/03/24/goodness-of-fit-testing-with-sql-server-part-71-the-kolmogorov-smirnov-and-kuipers-tests/#comments<p><strong>By Steve Bolton</strong></p>
<p><span style="font-size:10pt;color:white;">…………</span>“The names statisticians use for non-parametric analyses are misnomers too, in my opinion: Kruskal-Wallis tests and Kolmogorov-Smirnov statistics, for example. Good grief! These analyses are simple applications of parametric modeling that belie their intimidating exotic names.”[i]<a href="#_edn1" name="_ednref1"><br />
</a> Apparently even experts like Will G. Hopkins, the author of the plain English online guide to stats <a href="http://www.sportsci.org/resource/stats/index.html">A New View of Statistics</a>, perceive just how dry the subject can be. They feel our pain. Sometimes the topics are simply too difficult to express efficiently without brain-taxing equations and really terse writing, but this is not the case with the Kolmogorov-Smirnov Test, the topic of this week’s mistutorial on how to perform goodness-of-fit tests with SQL Server. This particular test got its lengthy moniker from two Russian mathematicians, Nikolai Smirnov (1900-1966) and Andrey Kolmogorov (1903-1987), the latter of whom is well-known in the field but hardly a household name beyond it.[ii] He made many important contributions to information theory, neural nets and other fields directly related to data mining, which I hope to shed some light on in a future tutorial series, Information Measurement with SQL Server. Among them was Kolmogorov Complexity, a fascinating topic that can be used to embed data mining algorithms more firmly into the use of reason, in order to make inferences based on strict logical necessity. Even more importantly, he was apparently sane – unlike most famous mathematicians and physicists, who as I have noted before tend to be not merely eccentric, but often shockingly degenerate or dangerous, or both.[iii] Despite the imposing name, I was actually looking forward to coding this particular test because Kolmogorov’s work always seems to turn out to be quite useful. I wasn’t disappointed. The concepts aren’t half as hard to grasp as the name is to pronounce, because aside from the usual inscrutable equations (which I almost always omit from these articles after translating them into code) the logic behind it is really common sense. 
Perhaps best of all, the Kolmogorov-Smirnov Test is hands down the fastest and best-performing goodness-of-fit measure we have yet surveyed in this series. The code I provided for the last few articles was some of the weakest I’ve written in all of my tutorial series, which was compounded by the fact that the tests I surveyed aren’t a good match for SQL Server use cases, but all in all, the T-SQL below for the Kolmogorov-Smirnov is some of the best I’ve written to date. After several rewrites, it now executes on an 11-million-row dataset on a beat-up desktop in less than 30 seconds.</p>
<p style="text-align:center;"><strong>The Benefits of Kolmogorov’s Test</strong></p>
<p> Several studies comparing the various goodness-of-fit tests often rank the Kolmogorov-Smirnov measure near the bottom (along with other popular ones like the Chi-Squared Test) because it has lower statistical power (i.e., the ability to detect an effect on a variable when it is actually present) than rivals like the Shapiro-Wilk. As we have seen in previous articles, however, many of these alternate measures are not as well-suited to the use cases the SQL Server community is likely to encounter – particularly the popular Shapiro-Wilk Test, since it can only be applied to very small datasets. Our scenarios are distinctly different from those encountered in the bulk of academic research, since we’re using recordsets of millions or even billions of rows. These datasets are often big enough to reduce the need for random sampling, since they may represent the full population or something close to it. Furthermore, parameters like averages, counts, standard deviations and variances can be instantly calculated for the entire dataset, thereby obviating the need for the complicated statistical techniques often used to estimate them. This advantage forestalls one of the stumbling blocks otherwise associated with the Kolmogorov-Smirnov Test, i.e. the need to fully specify all of the parameters (typically aggregates) for the distribution being tested.[iv]<br />
<span style="font-size:10pt;color:white;">…………</span>The ideal goodness-of-fit test for our purposes would be one applicable to the widest number of distributions, but many of them are limited to the Gaussian “normal” distribution or bell curve. That is not true of the Kolmogorov-Smirnov Test, which can be applied to any distribution that would have a continuous Content type in SQL Server Data Mining (SSDM). It is also an exact test whose accuracy is not dependent on the number of data points fed into it.[v] I would also count among its advantages the fact that it has clear bounds, between 0 and 1; other statistical tests sometimes continually increase in tandem with the values and counts fed into them and can be difficult to read, once the number of digits of the decimal place exceeds six or seven, thereby requiring users to waste time counting them. As we shall see, there is a lingering interpretation issue with this test, or at least my amateur implementation of it. The test can also be “more sensitive near the center of the distribution than at the tails,” but this inconvenience is heavily outweighed by its many other advantages.<br />
<span style="font-size:10pt;color:white;">…………</span>Another plus in its favor is the relative ease with which the inner workings can be grasped. End users should always know how to interpret the numbers returned to them, but there is no reason to burden them with the internal calculations and arcane equations; I think most of the instructional materials on math and stats lose their audiences precisely because they bury non-experts under a mountain of internal details that require a lot of skills they don’t have, nor need. End users are like commuters, who don’t need to give a dissertation in automotive engineering in order to drive to work each day; what they should be able to do is read a speedometer correctly. It is the job of programmers to put the ideas of mathematicians and statisticians into practice and make them accessible to end users, in the same way that mechanics are the middlemen between drivers and automotive engineers. It does help, however, if the internal calculations have the minimum possible level of difficulty, so that programmers and end users alike can interpret and troubleshoot the results better; it’s akin to the way many Americans in rural areas become driveway mechanics on weekends, which isn’t possible if the automotive design becomes too complex for them to work on efficiently.</p>
<p style="text-align:center;"><strong>Plugging in CDFs and EDFs</strong></p>
<p> The Kolmogorov-Smirnov Test isn’t trivial to understand, but end users may find its inner workings easier to grasp than those of most other goodness-of-fit tests. The most difficult concept is that of the cumulative distribution function (CDF), which I covered back in <a href="https://multidimensionalmayhem.wordpress.com/2015/11/03/goodness-of-fit-testing-with-sql-server-part-2-1-implementing-probability-plots-in-reporting-services/">Goodness-of-Fit Testing with SQL Server, part 2.1: Implementing Probability Plots in Reporting Services</a> and won’t rehash here. Suffice it to say that the probabilities for all of the possible values of a column are arranged so that they accumulate from 0 to 1. The concept is easier to understand than to code, at least for the normal distribution. One of the strongest points of the Kolmogorov-Smirnov Test is that we can plug the CDF of any continuous distribution into it, but I’ll keep things short by simply reusing one of the CDF functions I wrote for the Implementing Probability Plots article.<br />
<span style="font-size:10pt;color:white;">…………</span>All we have to do to derive the Kolmogorov-Smirnov metric is to add in the concept of the empirical distribution function (EDF), in which we merely put the recordset in the order of the values and assign an EDF value at each point that is equal to the reciprocal of the overall count. Some sources make cautionary statements like this: “Warning: ties should not be present for the Kolmogorov-Smirnov test,”[vi] which would render all of the tests based on EDFs useless for our purposes, since our billion-row tables are bound to have repeat values. I was fortunate to find a workaround in some undated, uncredited course notes at the Penn State University website, which turned out to be the most useful source of info I’ve yet found on implementing EDFs.[vii] To circumvent this issue, all we have to do is use the distinct count for values with ties as the dividend rather than one. Like the CDF, the EDF starts at 0 and accumulates up to a limit of 1, which means the two can be easily compared by simply subtracting them at each level. The Kolmogorov-Smirnov test statistic is merely the highest difference between the two.[viii] That’s it. All we’re basically doing is seeing if the order follows the probability we’d expect for each value, if they came from a particular distribution. In fact, we can get two measures for the price of one by using the minimum difference as the test statistic for Kuiper’s Test, which is sometimes used in cases where cyclical variations in the data are an issue.[ix]</p>
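To make the recipe concrete, here is a minimal Python sketch of the same logic (my own illustration, not a port of the full T-SQL procedure that follows; for Kuiper’s Test I use the conventional form V = D+ + D-, which combines the largest gaps above and below the CDF):

```python
from collections import Counter
from math import erf, sqrt

def normal_cdf(x, mean, std_dev):
    # Gaussian CDF via the error function; accumulates from 0 to 1.
    return 0.5 * (1.0 + erf((x - mean) / (std_dev * sqrt(2.0))))

def ks_and_kuiper(values):
    """Kolmogorov-Smirnov and Kuiper statistics against a normal
    distribution whose mean and standard deviation are computed from
    the data itself (assumes at least two distinct values)."""
    n = len(values)
    mean = sum(values) / n
    std_dev = sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    counts = Counter(values)           # each distinct value becomes one row
    running, d_plus, d_minus = 0, 0.0, 0.0
    for value in sorted(counts):
        running += counts[value]       # cumulative count handles the ties
        edf = running / n              # empirical distribution function
        diff = edf - normal_cdf(value, mean, std_dev)
        d_plus = max(d_plus, diff)     # largest gap above the CDF
        d_minus = max(d_minus, -diff)  # largest gap below the CDF
    # K-S takes the single worst gap; Kuiper adds the two extremes.
    return max(d_plus, d_minus), d_plus + d_minus
```

On the four-row sample [1, 2, 2, 3], for instance, the worst gap falls at the mean, where the EDF reaches 0.75 while the fitted normal CDF is exactly 0.5, so the Kolmogorov-Smirnov statistic is 0.25.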
<p><strong><u>Figure 1: T-SQL Code for the Kolmogorov-Smirnov and Kuiper’s Tests</u></strong></p>
<p class="MsoNormal"><span style="font-size:9.5pt;font-family:Consolas;color:blue;">CREATE</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:blue;">PROCEDURE</span> [Calculations]<span style="color:gray;">.</span>[GoodnessOfFitKolomgorovSmirnovAndKuipersTestsSP]<br />
</span><span style="font-size:9.5pt;font-family:Consolas;">@Database1 <span style="color:blue;">as</span> <span style="color:blue;">nvarchar</span><span style="color:gray;">(</span>128<span style="color:gray;">)</span> <span style="color:gray;">=</span> <span style="color:gray;">NULL,</span> @Schema1 <span style="color:blue;">as</span> <span style="color:blue;">nvarchar</span><span style="color:gray;">(</span>128<span style="color:gray;">),</span> @Table1 <span style="color:blue;">as</span> <span style="color:blue;">nvarchar</span><span style="color:gray;">(</span>128<span style="color:gray;">),</span>@Column1 <span style="color:blue;">AS</span> <span style="color:blue;">nvarchar</span><span style="color:gray;">(</span>128<span style="color:gray;">),</span> @OrderByCode <span style="color:blue;">as</span> <span style="color:blue;">tinyint<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:blue;">AS</span></p>
<p class="MsoNormal"><span style="font-size:9.5pt;font-family:Consolas;color:blue;">DECLARE </span><span style="font-size:9.5pt;font-family:Consolas;">@SchemaAndTable1 <span style="color:blue;">nvarchar</span><span style="color:gray;">(</span>400<span style="color:gray;">),</span>@SQLString <span style="color:blue;">nvarchar</span><span style="color:gray;">(</span><span style="color:fuchsia;">max</span><span style="color:gray;">)<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:blue;">SET </span><span style="font-size:9.5pt;font-family:Consolas;">@SchemaAndTable1 <span style="color:gray;">=</span> @Database1 <span style="color:gray;">+</span> <span style="color:red;">'.'</span> <span style="color:gray;">+</span> @Schema1 <span style="color:gray;">+</span> <span style="color:red;">'.'</span> <span style="color:gray;">+</span> @Table1 </span></p>
<p class="MsoNormal"><span style="font-size:9.5pt;font-family:Consolas;color:blue;">DECLARE</span><span style="font-size:9.5pt;font-family:Consolas;"> @Mean <span style="color:blue;">float</span><span style="color:gray;">,<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;">@StDev <span style="color:blue;">float</span><span style="color:gray;">,<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;">@Count <span style="color:blue;">float</span></span></p>
<p class="MsoNormal"><span style="font-size:9.5pt;font-family:Consolas;color:blue;">DECLARE</span><span style="font-size:9.5pt;font-family:Consolas;"> @EDFTable <span style="color:blue;">table<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:gray;">(</span><span style="font-size:9.5pt;font-family:Consolas;">ID <span style="color:blue;">bigint</span> <span style="color:blue;">IDENTITY </span><span style="color:gray;">(</span>1<span style="color:gray;">,</span>1<span style="color:gray;">),<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;">Value <span style="color:blue;">float</span><span style="color:gray;">,<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;">ValueCount <span style="color:blue;">bigint</span><span style="color:gray;">,</span><br />
</span><span style="font-size:9.5pt;font-family:Consolas;">EDFValue <span style="color:blue;">float</span><span style="color:gray;">,<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;">CDFValue <span style="color:blue;">decimal</span><span style="color:gray;">(</span>38<span style="color:gray;">,</span>37<span style="color:gray;">),<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;">EDFCDFDifference <span style="color:blue;">decimal</span><span style="color:gray;">(</span>38<span style="color:gray;">,</span>37<span style="color:gray;">))</span></span><span style="font-size:9.5pt;font-family:Consolas;"> </span></p>
<p class="MsoNormal"><span style="font-size:9.5pt;font-family:Consolas;color:blue;">DECLARE </span><span style="font-size:9.5pt;font-family:Consolas;">@ExecSQLString <span style="color:blue;">nvarchar</span><span style="color:gray;">(</span><span style="color:fuchsia;">max</span><span style="color:gray;">),</span> @MeanOUT <span style="color:blue;">nvarchar</span><span style="color:gray;">(</span>200<span style="color:gray;">),</span>@StDevOUT <span style="color:blue;">nvarchar</span><span style="color:gray;">(</span>200<span style="color:gray;">),</span>@CountOUT <span style="color:blue;">nvarchar</span><span style="color:gray;">(</span>200<span style="color:gray;">),</span> @ParameterDefinition <span style="color:blue;">nvarchar</span><span style="color:gray;">(</span><span style="color:fuchsia;">max</span><span style="color:gray;">)<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:blue;">SET </span><span style="font-size:9.5pt;font-family:Consolas;">@ParameterDefinition <span style="color:gray;">=</span> <span style="color:red;">'@MeanOUT nvarchar(200) OUTPUT,@StDevOUT nvarchar(200) OUTPUT,@CountOUT nvarchar(200) OUTPUT '<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:blue;">SET </span><span style="font-size:9.5pt;font-family:Consolas;">@ExecSQLString <span style="color:gray;">=</span> <span style="color:red;">'SELECT @MeanOUT = Avg('</span> <span style="color:gray;">+</span> @Column1 <span style="color:gray;">+</span> <span style="color:red;">'),@StDevOUT = StDev(' </span><span style="color:gray;">+</span> @Column1 <span style="color:gray;">+</span> <span style="color:red;">'),@CountOUT = Count('</span> <span style="color:gray;">+ </span>@Column1 <span style="color:gray;">+</span> <span style="color:red;">')<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:red;"> FROM '</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:gray;">+ </span>@SchemaAndTable1 <span style="color:gray;">+</span> <span style="color:red;">'<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:red;"> WHERE '</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:gray;">+ </span>@Column1 <span style="color:gray;">+</span> <span style="color:red;">' IS NOT NULL'</span></span></p>
<p class="MsoNormal"><span style="font-size:9.5pt;font-family:Consolas;color:blue;">EXEC</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:maroon;">sp_executesql</span> @ExecSQLString<span style="color:gray;">,</span>@ParameterDefinition<span style="color:gray;">, </span>@MeanOUT <span style="color:gray;">=</span> @Mean <span style="color:blue;">OUTPUT</span><span style="color:gray;">,</span>@StDevOUT <span style="color:gray;">=</span> @StDev <span style="color:blue;">OUTPUT</span><span style="color:gray;">,</span>@CountOUT <span style="color:gray;">=</span> @Count <span style="color:blue;">OUTPUT<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:blue;">SET </span><span style="font-size:9.5pt;font-family:Consolas;">@SQLString <span style="color:gray;">=</span><span style="color:#ff0000;"> '</span></span><span style="color:#ff0000;"><span style="font-size:9.5pt;font-family:Consolas;color:red;">SELECT Value, ValueCount, SUM(ValueCount) OVER (ORDER BY Value ASC) / CAST(' +<br />
CAST(@Count as nvarchar(50)) + ' AS float) AS EDFValue<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;"> FROM (SELECT DISTINCT '</span></span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:gray;">+ </span>@Column1 <span style="color:gray;">+</span> <span style="color:red;">' AS Value, Count('</span> <span style="color:gray;">+</span> @Column1 <span style="color:gray;">+</span> <span style="color:red;">') OVER (PARTITION BY '</span> <span style="color:gray;">+ </span>@Column1 <span style="color:gray;">+</span> <span style="color:red;">' ORDER BY ' </span><span style="color:gray;">+</span> @Column1 <span style="color:gray;">+</span> <span style="color:red;">') AS ValueCount<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:red;"> FROM '</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:gray;">+</span> @SchemaAndTable1 <span style="color:gray;">+</span> <span style="color:#ff0000;">'<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:red;"><span style="color:#ff0000;">WHERE '</span></span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:gray;">+</span> @Column1 <span style="color:gray;">+ </span><span style="color:red;">' IS NOT NULL) AS T1</span></span><span style="font-size:9.5pt;font-family:Consolas;color:red;">'</span><span style="font-size:9.5pt;font-family:Consolas;"> </span><span style="font-size:9.5pt;font-family:Consolas;"><br />
</span></p>
<p class="MsoNormal"><span style="font-size:9.5pt;font-family:Consolas;color:blue;">INSERT</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:blue;">INTO</span> @EDFTable<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:gray;">(</span><span style="font-size:9.5pt;font-family:Consolas;">Value<span style="color:gray;">,</span> ValueCount<span style="color:gray;">, </span>EDFValue<span style="color:gray;">)<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:blue;">EXEC </span><span style="font-size:9.5pt;font-family:Consolas;color:gray;">(</span><span style="font-size:9.5pt;font-family:Consolas;">@SQLString<span style="color:gray;">)</span> </span></p>
<p class="MsoNormal"><span style="font-size:9.5pt;font-family:Consolas;color:fuchsia;">UPDATE</span><span style="font-size:9.5pt;font-family:Consolas;"> T1<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:blue;">SET</span><span style="font-size:9.5pt;font-family:Consolas;"> CDFValue <span style="color:gray;">=</span> T3<span style="color:gray;">.</span>CDFValue<span style="color:gray;">,</span> EDFCDFDifference <span style="color:gray;">= </span>EDFValue <span style="color:gray;">-</span> T3<span style="color:gray;">.</span>CDFValue<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:blue;">FROM</span><span style="font-size:9.5pt;font-family:Consolas;"> @EDFTable <span style="color:blue;">AS</span> T1<br />
</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:gray;">INNER</span> <span style="color:gray;">JOIN</span> </span><span style="font-size:9.5pt;font-family:Consolas;color:gray;">(</span><span style="font-size:9.5pt;font-family:Consolas;color:blue;">SELECT </span><span style="font-size:9.5pt;font-family:Consolas;">DistinctValue<span style="color:gray;">,</span> Calculations<span style="color:gray;">.</span>NormalDistributionSingleCDFFunction <span style="color:gray;">(</span>DistinctValue<span style="color:gray;">,</span> @Mean<span style="color:gray;">,</span> @StDev<span style="color:gray;">)</span> <span style="color:blue;">AS</span> CDFValue<br />
</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:blue;">FROM </span><span style="color:gray;">(</span><span style="color:blue;">SELECT</span> <span style="color:blue;">DISTINCT</span> Value <span style="color:blue;">AS </span>DistinctValue<br />
</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:blue;">FROM </span>@EDFTable<span style="color:gray;">)</span> <span style="color:blue;">AS </span>T2<span style="color:gray;">)</span> <span style="color:blue;">AS</span> T3<br />
</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:blue;">ON</span> T1<span style="color:gray;">.</span>Value <span style="color:gray;">=</span> T3<span style="color:gray;">.</span>DistinctValue</span></p>
<p class="MsoNormal"><span style="font-size:9.5pt;font-family:Consolas;color:blue;">SELECT </span><span style="font-size:9.5pt;font-family:Consolas;">KolomgorovSmirnovSupremum <span style="color:blue;">AS </span>KolomgorovSmirnovTest<span style="color:gray;">, </span>KolomgorovSmirnovSupremum <span style="color:gray;">- </span>KolomgorovSmirnovMinimum <span style="color:blue;">AS</span> KuipersTest<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:blue;">FROM </span><span style="font-size:9.5pt;font-family:Consolas;color:gray;">(</span><span style="font-size:9.5pt;font-family:Consolas;color:blue;">SELECT</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:fuchsia;">Max</span><span style="color:gray;">(</span><span style="color:fuchsia;">ABS</span><span style="color:gray;">(</span>EDFValue <span style="color:gray;">-</span> CDFValue<span style="color:gray;">))</span> <span style="color:blue;">AS </span>KolomgorovSmirnovSupremum<span style="color:gray;">,</span> <span style="color:green;">-- the supremum, i.e. the max<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:fuchsia;">Min</span><span style="font-size:9.5pt;font-family:Consolas;color:gray;">(</span><span style="font-size:9.5pt;font-family:Consolas;color:fuchsia;">ABS</span><span style="font-size:9.5pt;font-family:Consolas;color:gray;">(</span><span style="font-size:9.5pt;font-family:Consolas;">EDFValue <span style="color:gray;">-</span> CDFValue<span style="color:gray;">))</span> <span style="color:blue;">AS</span> KolomgorovSmirnovMinimum<br />
</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:blue;">FROM </span>@EDFTable<br />
</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:blue;">WHERE </span>EDFCDFDifference <span style="color:gray;">></span> 0</span><span style="font-size:9.5pt;font-family:Consolas;"><span style="color:gray;">)</span> <span style="color:blue;">AS</span> T3</span></p>
<p class="MsoNormal"><span style="font-size:9.5pt;font-family:Consolas;color:blue;">SELECT</span><span style="font-size:9.5pt;font-family:Consolas;"> ID<span style="color:gray;">,</span> Value<span style="color:gray;">,</span> ValueCount<span style="color:gray;">,</span> EDFValue<span style="color:gray;">,</span> CDFValue<span style="color:gray;">,</span> EDFCDFDifference<br />
<span style="color:blue;">FROM </span></span><span style="font-size:9.5pt;font-family:Consolas;">@EDFTable<br />
</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:blue;">ORDER</span> <span style="color:blue;">BY </span></span><span style="font-size:9.5pt;font-family:Consolas;color:blue;">CASE</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:blue;">WHEN</span> @OrderByCode <span style="color:gray;">=</span> 1 <span style="color:blue;">THEN</span> ID <span style="color:blue;">END</span> <span style="color:blue;">ASC</span><span style="color:gray;">,<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:blue;">CASE</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:blue;">WHEN</span> @OrderByCode <span style="color:gray;">=</span> 2 <span style="color:blue;">THEN</span> ID <span style="color:blue;">END</span> <span style="color:blue;">DESC</span><span style="color:gray;">,<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:blue;">CASE</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:blue;">WHEN</span> @OrderByCode <span style="color:gray;">=</span> 3 <span style="color:blue;">THEN</span> Value <span style="color:blue;">END</span> <span style="color:blue;">ASC</span><span style="color:gray;">,<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:blue;">CASE</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:blue;">WHEN</span> @OrderByCode <span style="color:gray;">=</span> 4 <span style="color:blue;">THEN</span> Value <span style="color:blue;">END</span> <span style="color:blue;">DESC</span><span style="color:gray;">,<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:blue;">CASE</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:blue;">WHEN</span> @OrderByCode <span style="color:gray;">=</span> 5 <span style="color:blue;">THEN</span> ValueCount <span style="color:blue;">END </span><span style="color:blue;">ASC</span><span style="color:gray;">,<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:blue;">CASE</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:blue;">WHEN</span> @OrderByCode <span style="color:gray;">=</span> 6 <span style="color:blue;">THEN</span> ValueCount <span style="color:blue;">END </span><span style="color:blue;">DESC</span><span style="color:gray;">,<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:blue;">CASE</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:blue;">WHEN</span> @OrderByCode <span style="color:gray;">=</span> 7 <span style="color:blue;">THEN</span> EDFValue <span style="color:blue;">END </span><span style="color:blue;">ASC</span><span style="color:gray;">,<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:blue;">CASE</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:blue;">WHEN</span> @OrderByCode <span style="color:gray;">=</span> 8 <span style="color:blue;">THEN</span> EDFValue <span style="color:blue;">END </span><span style="color:blue;">DESC</span><span style="color:gray;">,<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:blue;">CASE</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:blue;">WHEN</span> @OrderByCode <span style="color:gray;">=</span> 9 <span style="color:blue;">THEN</span> CDFValue <span style="color:blue;">END </span><span style="color:blue;">ASC</span><span style="color:gray;">,<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:blue;">CASE</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:blue;">WHEN</span> @OrderByCode <span style="color:gray;">= </span>10 <span style="color:blue;">THEN</span> CDFValue <span style="color:blue;">END </span><span style="color:blue;">DESC</span><span style="color:gray;">,<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:blue;">CASE</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:blue;">WHEN</span> @OrderByCode <span style="color:gray;">= </span>11 <span style="color:blue;">THEN</span> EDFCDFDifference <span style="color:blue;">END</span> <span style="color:blue;">ASC</span><span style="color:gray;">,<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:blue;">CASE</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:blue;">WHEN</span> @OrderByCode <span style="color:gray;">= </span>12 <span style="color:blue;">THEN</span> EDFCDFDifference <span style="color:blue;">END</span> <span style="color:blue;">DESC</span></span></p>
<p><span style="font-size:10pt;color:white;">…………</span>The code above is actually a lot simpler than it looks, given that the last 12 lines are dedicated to implementing the @OrderByCode parameter, which I’ve occasionally provided as an affordance over the course of the last two tutorial series. It’s particularly useful in this test when the column values, distinct counts, EDF and CDF results in the @EDFTable are of interest in addition to the test statistic; ordinarily, this would be taken care of in an app’s presentation layer, so the ordering code can be safely deleted if you’re not using SQL Server Management Studio (SSMS). In this instance, 1 orders the results by ID ASC, 2 is by ID DESC, 3 is by Value ASC, 4 is by Value DESC, 5 is by ValueCount ASC, 6 is by ValueCount DESC, 7 is by EDFValue ASC, 8 is by EDFValue DESC, 9 is by CDFValue ASC, 10 is by CDFValue DESC, 11 is by EDFCDFDifference ASC and 12 is by EDFCDFDifference DESC. The rest of the parameters and the first couple of lines of dynamic SQL allow users to perform the tests against any column in any database they have sufficient access to. As usual, you’ll have to add in your own validation, null handling and SQL injection protection code. Two dynamic SQL statements are necessary because the count, mean and standard deviation have to be extracted from the original base table separately. The retrieval of those aggregates needed for subsequent calculations occurs shortly after the declarations section.<br />
<span style="font-size:10pt;color:white;">…………</span>Note that this procedure was markedly faster after substituting the sp_executesql statement for a dynamic INSERT EXEC on the base table (which had been used to populate the @EDFTable in an inefficient way). One quirk I should point out though is the use of the DISTINCT clause in the UPDATE subquery, which is needed to prevent unnecessary repetitive calls to the somewhat expensive Calculations.NormalDistributionSingleCDFFunction in the case of duplicate values. This somewhat convoluted method actually spares us a big performance hit on large tables with lots of duplicates. In the final query, I wagered that the outer subquery would be less expensive than retrieving the max twice in a single query. One of the few concerns I have about the procedure is the use of the actual mean and standard deviation in calculating the CDF values. Some sources recommended using the standard normal, but this typically resulted in ridiculous distortions for most of the recordsets I tested it against. On the other hand, I verified the correctness of the calculations as they stand now by working through the example in the Alion System Reliability Center’s Selected Topics in Assurance Related Technologies, a series of publications on stats I recently discovered and now can’t live without.[x]</p>
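<p>To make the flow of the procedure concrete, here is a minimal Python sketch of the same computation: build the EDF from the distinct-value counts, evaluate the normal CDF at each distinct value, and take the supremum of the absolute differences. The function names are mine, not the procedure’s, and the erf-based normal CDF merely stands in for the author’s Calculations.NormalDistributionSingleCDFFunction, whose internals are not shown here.</p>

```python
import math
from collections import Counter

def normal_cdf(x, mean, stdev):
    # CDF of a normal distribution, via the error function identity
    return 0.5 * (1.0 + math.erf((x - mean) / (stdev * math.sqrt(2.0))))

def ks_statistic(values):
    """Kolmogorov-Smirnov supremum of |EDF - CDF|, using the sample's own
    mean and (n-1) standard deviation, as the T-SQL procedure does."""
    n = len(values)
    mean = sum(values) / n
    stdev = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    counts = Counter(values)
    running, sup = 0, 0.0
    for value in sorted(counts):              # one row per DISTINCT value
        running += counts[value]              # cumulative ValueCount
        edf = running / n                     # EDFValue
        cdf = normal_cdf(value, mean, stdev)  # CDFValue
        sup = max(sup, abs(edf - cdf))
    return sup
```

<p>Running it on a lopsided sample versus a symmetric one shows the expected pattern: the further the data strays from a bell curve, the larger the supremum.</p>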
<p><strong><u>Figure 2: Sample Results from the Kolmogorov-Smirnov and Kuiper’s Tests<br />
</u></strong><span style="font-size:9.5pt;font-family:Consolas;color:blue;">EXEC</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:teal;">Calculations</span><span style="color:gray;">.</span><span style="color:teal;">GoodnessOfFitKolomgorovSmirnovAndKuipersTestsSP<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:blue;"> </span><span style="font-size:9.5pt;font-family:Consolas;color:teal;">@Database1</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:gray;">=</span> <span style="color:red;">N'DataMiningProjects'</span><span style="color:gray;">,<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:teal;">@Schema1 </span><span style="color:gray;">=</span> <span style="color:red;">N'Health'</span><span style="color:gray;">,<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:teal;">@Table1 </span><span style="color:gray;">=</span> <span style="color:red;">N'DuchennesTable'</span><span style="color:gray;">,<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:teal;">@Column1 </span><span style="color:gray;">=</span> <span style="color:red;">N'LactateDehydrogenase'</span><span style="color:gray;">,<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:teal;">@OrderByCode </span><span style="color:gray;">=</span> <span style="color:red;">1</span></span></p>
<p><a href="https://multidimensionalmayhem.wordpress.com/2016/03/23/goodness-of-fit-testing-with-sql-server-part-7-1-the-kolmogorov-smirnov-and-kuipers-tests/kolmogorov-smirnov-fixed/" rel="attachment wp-att-575"><img class="alignnone size-full wp-image-575" src="https://multidimensionalmayhem.files.wordpress.com/2016/03/kolmogorov-smirnov-fixed.jpg?w=604&h=391" alt="Kolmogorov-Smirnov Fixed" width="604" height="391" /></a></p>
<p><span style="font-size:10pt;color:white;">…………</span>Even more good news: when I tested it on the 209 rows of the tiny 9-kilobyte dataset on Duchennes muscular dystrophy and on the 11 million rows and nearly 6 gigabytes of data in the Higgs Boson dataset (which I downloaded from the <a href="http://biostat.mc.vanderbilt.edu/wiki/Main/DataSets">Vanderbilt University’s Department of Biostatistics</a> and <a href="https://archive.ics.uci.edu/ml/datasets/HIGGS">University of California at Irvine’s Machine Learning Repository</a> respectively) I got pretty much the results I expected. After using the same datasets for the last dozen articles or so, I know which ones follow the Gaussian distribution and which do not, and the Kolmogorov-Smirnov Test consistently returned lower figures for the ones that followed a bell curve and higher ones for those that did not. For example, the query in Figure 2 returned a value of 0.131875117324784 for the LactateDehydrogenase enzyme, while the less abnormal Hemopexin scored a 0.0607407215998911. On the other hand, the highly abnormal, really lopsided first float column in the Higgs Boson dataset scored a whopping 0.276266847552121, while the second float column scored just 0.0181892303151281, probably because it clearly follows a bell curve in a histogram.<br />
<span style="font-size:10pt;color:white;">…………</span>Other programmers may also want to consider adding in their own logic to implement confidence intervals and the like, which I typically omit for reasons of simplicity, the difficulty of deriving lookup values on a Big Data scale and philosophical concerns about their applicability, not to mention the widespread concern among many professional statisticians about the rampant misuse and misinterpretation of hypothesis testing methods. Suffice it to say that my own interval of confidence in them is steadily narrowing, at least for the unique use cases the SQL Server community faces. The good news is that if you decide to use standard hypothesis testing methods, then the Kolmogorov-Smirnov test statistic doesn’t require modifications before plugging it into lookup tables, unlike the popular Shapiro-Wilk and Anderson-Darling tests.[xi]</p>
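<p>For readers who do want to run the classical hypothesis test, the large-sample two-sided critical value for the Kolmogorov-Smirnov statistic has a simple closed form, sqrt(-ln(α/2) / (2n)), roughly 1.36/√n at α = 0.05, so no lookup table is strictly needed at Big Data scales. A sketch (bear in mind that when the mean and standard deviation are estimated from the same data, as in the procedure above, the stricter Lilliefors critical values are technically the appropriate ones):</p>

```python
import math

def ks_critical_value(n, alpha=0.05):
    # Asymptotic two-sided critical value for the Kolmogorov-Smirnov test;
    # valid for large n when the reference distribution is fully specified.
    # Reject normality when the test statistic exceeds this value.
    return math.sqrt(-math.log(alpha / 2.0) / (2.0 * n))
```

<p>At the 11-million-row scale of the Higgs Boson dataset, the threshold shrinks below one one-thousandth, which illustrates why trivially small departures from normality become "statistically significant" on huge tables.</p>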
<p><strong><u>Figure 3: Execution Plan for the Kolmogorov-Smirnov and Kuiper’s Procedure</u></strong> (click to enlarge)<br />
<a href="https://multidimensionalmayhem.wordpress.com/2016/03/23/goodness-of-fit-testing-with-sql-server-part-7-1-the-kolmogorov-smirnov-and-kuipers-tests/kolmogorov-smirnov-test-execution-plan-better/" rel="attachment wp-att-576"><img class="alignnone size-full wp-image-576" src="https://multidimensionalmayhem.files.wordpress.com/2016/03/kolmogorov-smirnov-test-execution-plan-better.jpg?w=604&h=261" alt="Kolmogorov-Smirnov Test Execution Plan - Better" width="604" height="261" /></a></p>
<p><span style="font-size:10pt;color:white;">…………</span>When all is said and done, the Kolmogorov-Smirnov Test is the closest thing to the ideal goodness-of-fit measure for our use cases. It may have low statistical power, but it can handle big datasets and a wide range of distributions. The internals are shorter to code and moderately easier to explain to end users than those of some other procedures, and the final test statistic is easy to read because it has clear bounds. It also comes with some freebies, like the ability to simultaneously calculate Kuiper’s Test at virtually no extra cost. For most columns I tested there wasn’t much of a difference between the Kolmogorov-Smirnov and Kuiper’s Test results till we got down to the second through fifth decimal places, but there’s no reason not to calculate it if the costs are dwarfed by those incurred by the rest of the procedure. Note that I also return the full @EDFTable, including the ValueCount for each distinct Value, since there’s no point in discarding all that information once the burden of computing it all has been borne. One of the few remaining concerns I have about the test is that much of this information may be wasted in the final test statistics, since merely taking minimums and maximums is often an inefficient way of making inferences about a dataset. This means that more useful, expanded versions of the tests might be possible by calculating more sophisticated measures on the same EDF and CDF data.<br />
<span style="font-size:10pt;color:white;">…………</span>Best of all, the test outperforms any of the others we’ve used in the last two tutorial series. After eliminating most of the dynamic SQL I overused in previous articles, the execution time actually worsened, till I experimented with some different execution plans. On the first float column in the 11-million-row, 6-gig Higgs Boson dataset, the procedure returned in just 24 seconds, but for the equally-sized second float column, it returned in an average of just 29. That’s not shabby at all for such a useful statistical test on such a huge dataset, on a clunker of a desktop that’s held together with duct tape. I can’t account for that difference, given that the execution plans were identical and the two columns share the same data type and count; the only significant difference I know of is that one is highly abnormal and the other follows a bell curve. For smaller datasets of a few thousand rows the test was almost instantaneous. I don’t think the execution plan in Figure 3 can be improved upon much, given that just two of the five queries account for practically all of the cost and both of them begin with Index Seeks. In the case of the first, that initial Seek accounts for 92 percent of the cost. The second ought to be the target of any optimization efforts, since it accounts for 85 percent of the batch; within it, however, the only operators that might be worth experimenting with are the Hash Match (Aggregate) and the Sort. Besides, the procedure already performs well enough as it is and should run practically instantaneously on a real database server. In the next installment, we’ll see whether the Lilliefors Test, another measure based on the EDF, can compete with the Kolmogorov-Smirnov Test, which is thus far the most promising measure of fit we’ve yet covered in the series.</p>
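<p>As a cross-check on the Kuiper’s Test "freebie," note that the form usually given in the literature is V = D⁺ + D⁻, where D⁺ is the largest amount by which the EDF exceeds the CDF and D⁻ the largest amount by which it falls short; this differs slightly from the supremum-minus-minimum shortcut used in the procedure above. A small sketch that computes both statistics from the same paired EDF/CDF values stored in the @EDFTable (the function and parameter names are mine):</p>

```python
def ks_and_kuiper(edf_values, cdf_values):
    # Kolmogorov-Smirnov supremum and the textbook Kuiper statistic,
    # computed from paired EDF/CDF values (one pair per distinct value).
    d_plus = max(e - c for e, c in zip(edf_values, cdf_values))   # EDF above CDF
    d_minus = max(c - e for e, c in zip(edf_values, cdf_values))  # EDF below CDF
    ks = max(abs(e - c) for e, c in zip(edf_values, cdf_values))
    return ks, d_plus + d_minus
```

<p>Since both statistics come from one pass over the same differences, the incremental cost over the Kolmogorov-Smirnov calculation alone really is negligible, which is the point made above.</p>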
<p>[i] See Hopkins, 2014, “Rank Transformation: Non-Parametric Models,” published at the <u>A New View of Statistics </u>webpage <a href="http://www.sportsci.org/resource/stats/nonparms.html">http://www.sportsci.org/resource/stats/nonparms.html</a></p>
<p>[ii] See the Wikipedia pages “Andrey Kolmogorov” and “Vladimir Smirnov” at <a href="http://en.wikipedia.org/wiki/Andrey_Kolmogorov">http://en.wikipedia.org/wiki/Andrey_Kolmogorov</a> and <a href="http://en.wikipedia.org/wiki/Vladimir_Smirnov_(mathematician)">http://en.wikipedia.org/wiki/Vladimir_Smirnov_(mathematician)</a> respectively.</p>
<p>[iii] I’m slowly compiling a list of the crazy ones and their bizarre antics for a future editorial or whatever – which will include such cases as Rene Descartes’ charming habit of carrying a dummy of his dead sister around Europe and carrying on conversations with it in public. I’m sure there’ll also be room for Kurt Gödel, who had a bizarre fear of being poisoned – so he forced his wife to serve as his food-taster. Nothing says romance like putting the love of your life in harm’s way when you think people are out to get you. When she was hospitalized, he ended up starving to death. Such tales are the norm among the great names in these fields, which is why I’m glad I deliberately decided back in fifth grade not to pursue my fascination with particle physics.</p>
<p>[iv] See National Institute for Standards and Technology, 2014, “1.3.5.16 Kolmogorov-Smirnov Goodness-of-Fit Test,” published in the online edition of the <u>Engineering Statistics Handbook.</u> Available at <a href="http://www.itl.nist.gov/div898/handbook/eda/section3/eda35g.htm">http://www.itl.nist.gov/div898/handbook/eda/section3/eda35g.htm</a></p>
<p>[v] <em>IBID.</em></p>
<p>[vi] p. 14, Hofmann, Heike, 2013, “Nonparametric Inference and Bootstrap { Q-Q plots; Kolmogorov Test,” lecture notes published Oct. 11, 2013 at the Iowa State University web address <a href="http://www.public.iastate.edu/~hofmann/stat415/lectures/07-qqplots.pdf">http://www.public.iastate.edu/~hofmann/stat415/lectures/07-qqplots.pdf</a></p>
<p>[vii] Penn State University, 2014, “Empirical Distribution Functions,” undated course notes posted at the <u>Penn State University </u>website and retrieved Nov. 5, 2014 from the web address <a href="https://onlinecourses.science.psu.edu/stat414/node/333">https://onlinecourses.science.psu.edu/stat414/node/333</a></p>
<p>[viii] I also consulted the Wikipedia page “Kolmogorov-Smirnov Test” at <a href="http://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test">http://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test</a> for some of these calculations.</p>
<p>[ix] See the Wikipedia article “Kuiper’s Test” at <a href="http://en.wikipedia.org/wiki/Kuiper's_test">http://en.wikipedia.org/wiki/Kuiper’s_test</a></p>
<p>[x] See Alion System Reliability Center, 2014, “Kolmogorov-Smirnov: A Goodness of Fit Test for Small Samples,” published in <u>Selected Topics in Assurance Related Technologies</u>, Vol. 10, No. 6. Available online at the <u>Alion System Reliability Center</u> web address <a href="https://src.alionscience.com/pdf/K_STest.pdf">https://src.alionscience.com/pdf/K_STest.pdf</a></p>
<p>[xi] “Critical value beyond which the hypothesis is rejected in Anderson-Darling test is different when Gaussian pattern is being tested than when another distribution such as lognormal is being tested. Shapiro-Wilk critical value also depends on the distribution under test. But Kolmogorov-Smirnov test is distribution-free as the critical values do not depend on whether Gaussianity is being tested or some other form.” No author listed, 2014, “Checking Gaussianity,” published online at the MedicalBiostatistics.com web address <a href="http://www.medicalbiostatistics.com/checkinggaussianity.pdf">http://www.medicalbiostatistics.com/checkinggaussianity.pdf</a></p>Goodness-of-Fit Testing with SQL Server Part 6.2: The Ryan-Joiner Test
http://www.sqlservercentral.com/blogs/multidimensionalmayhem/2016/03/12/goodness-of-fit-testing-with-sql-server-part-62-the-ryan-joiner-test/
Sat, 12 Mar 2016 10:06:40 UT/blogs/multidimensionalmayhem/2016/03/12/goodness-of-fit-testing-with-sql-server-part-62-the-ryan-joiner-test/3http://www.sqlservercentral.com/blogs/multidimensionalmayhem/2016/03/12/goodness-of-fit-testing-with-sql-server-part-62-the-ryan-joiner-test/#comments<p><strong>By Steve Bolton</strong></p>
<p><span style="font-size:10pt;color:white;">…………</span>In the last installment of this amateur series of self-tutorials, we saw how the Shapiro-Wilk Test will probably prove less useful to SQL Server users, despite the fact that it is one of the most popular goodness-of-fit tests among statisticians and researchers. Its impressive statistical power is rendered impotent by the fact that the logic of its internal calculations limits it to inputs of just 50 rows (or up to 2,000 when certain revisions are applied), which is chump change when we’re talking about SQL Server tables that often number in the millions of rows. Thankfully, a little-known rival in the same category[1] is available that shares few of the drawbacks that make Shapiro-Wilk such a disappointing choice for our particular use cases. In fact, the Ryan-Joiner Test “is very highly correlated with that of Shapiro and Wilk, so either test may be used and will produce very similar results.”[2] When wading through various head-to-head comparisons of the goodness-of-fit tests published on the Internet, I noticed that on the occasions when the Ryan-Joiner Test was mentioned, it received favorable marks, like this review by Jim Colton at the Minitab Blog:</p>
<blockquote><p> “I should note that the three scenarios evaluated in this blog are not designed to assess the validity of the Normality assumption for tests that benefit from the Central Limit Theorem, such as 1-sample, 2-sample, and paired t-tests. Our focus here is detecting Non-Normality when using a distribution to estimate the probability of manufacturing defective (out-of-spec) units.<br />
<span style="font-size:10pt;color:white;">…………</span>“In scenario 1, the Ryan-Joiner test was a clear winner. The simulation results are below…<br />
<span style="font-size:10pt;color:white;">…………</span>“…The Anderson-Darling test was never the worst test, but it was not nearly as effective as the RJ test at detecting a 4-sigma outlier. If you’re analyzing data from a manufacturing process that tends to produce individual outliers, the Ryan-Joiner test is the most appropriate…”<br />
<span style="font-size:10pt;color:white;">…………</span>“…The RJ test performed very well in two of the scenarios, but was poor at detecting Non-Normality when there was a shift in the data. If you’re analyzing data from a manufacturing process that tends to shift due to unexpected changes, the AD test is the most appropriate.”[3]</p></blockquote>
<p><span style="font-size:10pt;color:white;">…………</span>The most striking drawback is the paucity of public information available on the test, which doesn’t even have a Wikipedia page, thereby forcing me to resort to even less professional sources like Answers.com for matter-of-fact explanations like this: “The Ryan-Joiner test is implemented in the Minitab software package but not widely elsewhere.”[4] It was apparently the brainchild of Brian Joiner and Barbara Ryan, “the founder of Minitab,” but I was unable to find a publicly available copy of the original academic paper they published on the test back in 1976 until after I’d already written most of the code below.[5] Publication of this kind signifies that it is not a proprietary algorithm exclusively owned by Minitab, so we are free to implement it ourselves – provided we can find adequate detail on its inner workings, which turned out to be a tall order. The main drawback of the Ryan-Joiner Test is the difficulty in finding information that can be applied to implementation and testing, which is certainly a consequence of its close association with Minitab, a stats package that competes only tangentially with SQL Server Data Mining (SSDM) as I addressed in <a href="https://multidimensionalmayhem.wordpress.com/2015/06/30/integrating-other-data-mining-tools-with-sql-server-part-2-1-the-minuscule-hassles-of-minitab/">Integrating Other Data Mining Tools with SQL Server, Part 2.1: The Minuscule Hassles of Minitab</a> and <a href="https://multidimensionalmayhem.wordpress.com/2015/07/08/integrating-other-data-mining-tools-with-sql-server-part-2-2-minitab-vs-ssdm-and-reporting-services/">Integrating Other Data Mining Tools with SQL Server, Part 2.2: Minitab vs. SSDM and Reporting Services</a>. This makes it somewhat opaque, but I was able to overcome this inscrutability enough to get a T-SQL version of it up and running.<br />
<span style="font-size:10pt;color:white;">…………</span>The underlying mechanisms are still somewhat unclear, but this brief introduction in the LinkedIn discussion group Lean Sigma Six Group Brazil is adequate enough for our purposes: “This test assesses normality by calculating the correlation between your data and the normal scores of your data. If the correlation coefficient is near 1, the population is likely to be normal. The Ryan-Joiner statistic assesses the strength of this correlation; if it falls below the appropriate critical value, you will reject the null hypothesis of population normality.”[6] As usual, I’ll be omitting those critical values, because of the numerous issues with hypothesis testing I’ve pointed out in previous blog posts. Apparently my misgivings are widely shared by professional statisticians and mathematicians who actually know what they’re talking about, particularly when it comes to the ease and frequency with which the caveats and context that accompany statements of statistical significance are carelessly dispensed with. It is not that significance level stats aren’t useful, but that the either-or nature of standard hypothesis testing techniques discards an awful lot of information by effectively shrinking our hard-won calculations down to simple Boolean choices; not only is this equivalent to casting a float or decimal value down to a SQL Server bit data type, but it can also easily lead to errors in interpretation. For this reason and concerns about brevity and simplicity, I’ll leave out the critical values, which can be easily tacked on to my code by anyone with a need for them.</p>
<p style="text-align:center;"><strong>Interpreting the Results in Terms of the Bell Curve</strong></p>
<p> Aside from that, the final test statistic isn’t that hard to interpret: the closer we get to 1, the more closely the data follows the Gaussian or “normal” distribution, i.e. the bell curve. So far, my test results have all remained within the range of 0 to 1 as expected, but I cannot rule out the possibility that in some situations an undiscovered error will cause them to exceed these bounds. When writing the T-SQL code in Figure 1 I had to make use of just two incomplete sources[7], before finally finding the original paper by Ryan and Joiner at the Minitab website late in the game.[8] This find was invaluable because it pointed out that the Z-Scores (a basic topic I explained way back in <a href="http://www.sqlservercentral.com/blogs/multidimensionalmayhem/2014/10/28/outlier-detection-with-sql-server-part-21-z-scores/">Outlier Detection with SQL Server, part 2.1: Z-Scores</a>) in the internal calculations should be done against the standard normal distribution, not the data points.<br />
<span style="font-size:10pt;color:white;">…………</span>My standard disclaimer that I am still a novice in the fields of data mining and statistics and that my sample code has not yet been thoroughly tested ought not to be glossed over, given the number of other mistakes I caught myself making when writing the code below. At one point I accidentally used a minus sign rather than an asterisk in the top divisor; I tested it once against the wrong online calculator, for the normal probability density function (PDF) rather than the cumulative distribution function (CDF); later, I realized I should have used the standard normal inverse CDF rather than the CDF or PDF; I also used several different improper step values for the RangeCTE, including one that was based on minimum and maximum values rather than the count and another based on fractions. Worst of all, I garbled my code at the last minute by accidentally (and not for the first time) using the All Open Documents option with Quick Replace in SQL Server Management Studio (SSMS). Once I figured out my mistakes, the procedure ended up being a lot shorter and easier to follow than I ever expected. Keep in mind, however, that I didn’t have any published examples to test it against, so there may be other reliability issues lurking within.</p>
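<p><span style="font-size:10pt;color:white;">…………</span>In the absence of published worked examples, one way to sanity-check the procedure is to replicate the math outside of SQL Server. The sketch below is my own hypothetical Python cross-check of the same calculation that Figure 1 performs, not part of the original procedure: the standard library’s NormalDist.inv_cdf stands in for the Calculations.NormalDistributionInverseCDFFunction, and the function name is invented for illustration.</p>

```python
# A minimal sketch of the same calculation as Figure 1, for cross-checking:
# the statistic is sum(y * b) / sqrt(Var(y) * (n - 1) * sum(b^2)), where y is
# the sorted data and b the approximate normal scores ("rankits").
from statistics import NormalDist
import math

def ryan_joiner_statistic(values):
    y = sorted(values)
    n = len(y)
    c = 0.5 if n > 10 else 0.375      # same CASE logic as @ConstantBasedOnCount
    inv_cdf = NormalDist().inv_cdf    # stands in for the inverse CDF function
    b = [inv_cdf((i - c) / (n + 1 - 2 * c)) for i in range(1, n + 1)]
    mean = sum(y) / n
    sum_sq_dev = sum((v - mean) ** 2 for v in y)   # equals @Var * (@Count - 1)
    return sum(v * bi for v, bi in zip(y, b)) / math.sqrt(sum_sq_dev * sum(bi * bi for bi in b))
```

<p>Feeding it data that already follows a bell curve should return a value very close to 1, while skewed data should score noticeably lower, which is the behavior we want to see from the T-SQL version as well.</p>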
<p><strong><u>Figure 1: T-SQL Code for the Ryan-Joiner Test Procedure<br />
</u></strong><span style="font-size:9.5pt;font-family:Consolas;color:blue;">CREATE</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:blue;">PROCEDURE</span> [Calculations]<span style="color:gray;">.</span>[NormalityTestRyanJoinerTestSP]<br />
</span><span style="font-size:9.5pt;font-family:Consolas;">@Database1 <span style="color:blue;">as</span> <span style="color:blue;">nvarchar</span><span style="color:gray;">(</span>128<span style="color:gray;">)</span> <span style="color:gray;">=</span> <span style="color:gray;">NULL,</span> @Schema1 <span style="color:blue;">as</span> <span style="color:blue;">nvarchar</span><span style="color:gray;">(</span>128<span style="color:gray;">),</span> @Table1 <span style="color:blue;">as</span> <span style="color:blue;">nvarchar</span><span style="color:gray;">(</span>128<span style="color:gray;">),</span>@Column1 <span style="color:blue;">AS</span> <span style="color:blue;">nvarchar</span><span style="color:gray;">(</span>128<span style="color:gray;">)<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:blue;">AS</p>
<p></span><span style="font-size:9.5pt;font-family:Consolas;color:blue;">DECLARE </span><span style="font-size:9.5pt;font-family:Consolas;">@SchemaAndTable1 <span class="GramE"><span style="color:blue;">nvarchar</span><span style="color:gray;">(</span></span>400<span style="color:gray;">),</span>@SQLString <span style="color:blue;">nvarchar</span><span style="color:gray;">(</span><span style="color:fuchsia;">max</span><span style="color:gray;">)<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:blue;">SET </span><span style="font-size:9.5pt;font-family:Consolas;">@SchemaAndTable1 <span style="color:gray;">=</span> @Database1 <span style="color:gray;">+</span> <span style="color:red;">'.'</span> <span style="color:gray;">+</span> @Schema1 <span style="color:gray;">+</span> <span style="color:red;">'.'</span> <span style="color:gray;">+</span> @Table1</span></p>
<p class="MsoNormal"><span style="font-size:9.5pt;font-family:Consolas;color:blue;">DECLARE</span><span style="font-size:9.5pt;font-family:Consolas;"> @<span class="SpellE">ValueTable</span> <span style="color:blue;">table<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;">(ID <span style="color:blue;">bigint</span> <span style="color:blue;">IDENTITY </span><span style="color:gray;">(</span>1<span class="GramE"><span style="color:gray;">,</span>1</span><span style="color:gray;">),<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;">Value <span style="color:blue;">float</span><span style="color:gray;">)</span></span></p>
<p class="MsoNormal"><span style="font-size:9.5pt;font-family:Consolas;color:blue;">SET</span><span style="font-size:9.5pt;font-family:Consolas;"> @<span class="SpellE">SQLString</span> <span style="color:gray;">= </span><span style="color:red;">'SELECT '</span> <span style="color:gray;">+</span> @Column1 <span style="color:gray;">+ </span><span style="color:red;">' AS Value<br />
FROM '</span> <span style="color:gray;">+</span> @SchemaAndTable1 <span style="color:gray;">+</span><br />
<span style="color:red;">' WHERE '</span> <span style="color:gray;">+</span> @Column1 <span style="color:gray;">+ </span><span style="color:red;">' IS NOT NULL'</span></span></p>
<p class="MsoNormal"><span style="font-size:9.5pt;font-family:Consolas;color:blue;">INSERT</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:blue;">INTO</span> @<span class="SpellE">ValueTable<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:gray;">(</span><span style="font-size:9.5pt;font-family:Consolas;">Value<span style="color:gray;">)<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:blue;">EXEC </span><span style="font-size:9.5pt;font-family:Consolas;color:gray;">(</span><span style="font-size:9.5pt;font-family:Consolas;">@<span class="SpellE">SQLString</span><span style="color:gray;">)</span></span></p>
<p class="MsoNormal"><span class="GramE"><span style="font-size:9.5pt;font-family:Consolas;color:blue;">DECLARE</span><span style="font-size:9.5pt;font-family:Consolas;"><span> </span>@<span class="SpellE">Var</span></span></span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:blue;">AS</span> <span style="color:blue;">decimal</span><span style="color:gray;">(</span>38<span style="color:gray;">,</span>11<span style="color:gray;">),<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;">@Count <span style="color:blue;">bigint</span><span style="color:gray;">,<br />
</span></span><span lang="PT" style="font-size:9.5pt;font-family:Consolas;">@ConstantBasedOnCount <span style="color:blue;">decimal</span><span style="color:gray;">(</span>5<span style="color:gray;">,</span>4<span style="color:gray;">),</span><br />
</span><span lang="PT" style="font-size:9.5pt;font-family:Consolas;">@Mean <span style="color:blue;">AS</span> <span style="color:blue;">decimal</span><span style="color:gray;">(</span>38<span style="color:gray;">,</span>11<span style="color:gray;">)</span></span><span lang="PT" style="font-size:9.5pt;font-family:Consolas;"> </span></p>
<p class="MsoNormal"><span style="font-size:9.5pt;font-family:Consolas;color:blue;">SELECT</span><span style="font-size:9.5pt;font-family:Consolas;"> @Count <span style="color:gray;">=</span> <span class="GramE"><span style="color:fuchsia;">Count</span><span style="color:gray;">(</span></span><span style="color:gray;">*),</span> @<span class="SpellE">Var</span> <span style="color:gray;">=</span> <span class="SpellE"><span style="color:fuchsia;">Var</span></span><span style="color:gray;">(</span>Value<span style="color:gray;">)</span><br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:blue;">FROM</span><span style="font-size:9.5pt;font-family:Consolas;"> @<span class="SpellE">ValueTable<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:green;">-- the NOT NULL clause is not necessary here because that’s taken care of in the @<span class="SpellE">SQLString</span></span><span style="font-size:9.5pt;font-family:Consolas;"> </span></p>
<p class="MsoNormal"><span style="font-size:9.5pt;font-family:Consolas;color:blue;">SELECT</span><span style="font-size:9.5pt;font-family:Consolas;"> @<span class="SpellE">ConstantBasedOnCount</span> <span style="color:gray;">=</span> <span style="color:blue;">CASE</span> <span style="color:blue;">WHEN</span> @Count <span style="color:gray;">></span> 10 <span style="color:blue;">THEN</span> 0.5 <span style="color:blue;">ELSE</span> 0.375 <span style="color:blue;">END</span></span></p>
<p class="MsoNormal"><span style="font-size:9.5pt;font-family:Consolas;color:gray;">;</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:blue;">WITH</span> <span class="SpellE"><span class="GramE">RangeCTE</span></span><span class="GramE"><span style="color:gray;">(</span></span><span class="SpellE">RangeNumber</span><span style="color:gray;">)</span> <span style="color:blue;">as<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:gray;">(<br />
</span><span style="font-size:9.5pt;font-family:Consolas;"><span> </span><span style="color:blue;">SELECT</span> 1 <span style="color:blue;">as</span> <span class="SpellE">RangeNumber</span></span><span style="font-size:9.5pt;font-family:Consolas;"><span> </span></span></p>
<p class="MsoNormal"><span style="font-size:9.5pt;font-family:Consolas;"><span> </span><span style="color:blue;">UNION</span> <span style="color:gray;">ALL</span></span></p>
<p class="MsoNormal"><span style="font-size:9.5pt;font-family:Consolas;"><span> </span><span style="color:blue;">SELECT</span> <span class="SpellE">RangeNumber</span> <span style="color:gray;">+</span> 1<br />
</span><span style="font-size:9.5pt;font-family:Consolas;"><span> </span><span style="color:blue;">FROM</span> <span class="SpellE">RangeCTE<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;"><span> </span><span style="color:blue;">WHERE</span> <span class="SpellE"><span class="GramE">RangeNumber</span></span><span class="GramE"><span> </span><span style="color:gray;">< </span></span>@Count<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:gray;">)</span><span style="font-size:9.5pt;font-family:Consolas;"> </span></p>
<p class="MsoNormal"><span style="font-size:9.5pt;font-family:Consolas;color:blue;">SELECT</span><span style="font-size:9.5pt;font-family:Consolas;"> <span class="GramE"><span style="color:fuchsia;">SUM</span><span style="color:gray;">(</span></span>Value <span style="color:gray;">*</span> <span class="SpellE">RankitApproximation</span><span style="color:gray;">)</span> <span style="color:gray;">/</span><span> </span><span style="color:fuchsia;">Power</span><span style="color:gray;">((</span>@<span class="SpellE">Var</span> <span style="color:gray;">*</span> <span style="color:gray;">(</span>@Count <span style="color:gray;">-</span> 1<span style="color:gray;">)</span><span> </span><span style="color:gray;">*</span> <span style="color:fuchsia;">SUM</span><span style="color:gray;">(</span><span style="color:fuchsia;">Power</span><span style="color:gray;">(</span><span style="color:fuchsia;">CAST</span><span style="color:gray;">(</span><span class="SpellE">RankitApproximation</span> <span style="color:blue;">AS</span> <span style="color:blue;">float</span><span style="color:gray;">),</span> 2<span style="color:gray;">))),</span> 0.5<span style="color:gray;">)</span> </span><span style="font-size:9.5pt;font-family:Consolas;color:blue;">AS</span><span style="font-size:9.5pt;font-family:Consolas;"> <span class="SpellE">RyanJoinerTestStatistic<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:blue;">FROM </span><span style="font-size:9.5pt;font-family:Consolas;color:gray;">(</span><span style="font-size:9.5pt;font-family:Consolas;color:blue;">SELECT</span><span style="font-size:9.5pt;font-family:Consolas;"> RN<span style="color:gray;">,</span> Value<span style="color:gray;">,</span> <span class="SpellE"><span class="GramE">Calculations<span style="color:gray;">.</span>NormalDistributionInverseCDFFunction</span></span><span class="GramE"><span style="color:gray;">(</span></span><span style="color:gray;">(</span>RN <span style="color:gray;">-</span> @<span class="SpellE">ConstantBasedOnCount</span><span style="color:gray;">)</span> <span style="color:gray;">/</span> <span style="color:gray;">(</span>@Count <span style="color:gray;">+</span> 1 <span style="color:gray;">-</span> <span style="color:gray;">(</span>2 <span style="color:gray;">*</span> @<span class="SpellE">ConstantBasedOnCount</span><span style="color:gray;">)))</span> <span style="color:blue;">AS</span> <span class="SpellE">RankitApproximation<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;"><span> </span><span style="color:blue;">FROM </span><span style="color:gray;">(</span><span style="color:blue;">SELECT</span> <span style="color:fuchsia;">ROW_<span class="GramE">NUMBER<span style="color:gray;">(</span></span></span><span style="color:gray;">)</span> <span style="color:blue;">OVER </span><span style="color:gray;">(</span><span style="color:blue;">ORDER</span> <span style="color:blue;">BY</span> Value<span style="color:gray;">)</span> <span style="color:blue;">AS</span> RN<span style="color:gray;">,</span> Value<br />
</span><span style="font-size:9.5pt;font-family:Consolas;"><span> </span><span style="color:blue;">FROM </span>@<span class="SpellE">ValueTable</span> <span style="color:blue;">AS</span> T0<span style="color:gray;">)</span> <span style="color:blue;">AS</span> T1<br />
</span><span style="font-size:9.5pt;font-family:Consolas;"><span> </span><span style="color:gray;">INNER</span> <span style="color:gray;">JOIN</span> <span class="SpellE">RangeCTE</span> <span style="color:blue;">AS</span> T2<br />
</span><span style="font-size:9.5pt;font-family:Consolas;"><span> </span><span class="GramE"><span style="color:blue;">ON</span><span> </span>T1<span style="color:gray;">.</span>RN</span> <span style="color:gray;">=</span> T2<span style="color:gray;">.</span>RangeNumber<span style="color:gray;">)</span> <span style="color:blue;">AS</span> T3<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:blue;">OPTION </span><span style="font-size:9.5pt;font-family:Consolas;color:gray;">(</span><span style="font-size:9.5pt;font-family:Consolas;">MAXRECURSION 0<span style="color:gray;">)</span></span><span style="font-size:9.5pt;font-family:Consolas;"> </span></p>
<p><span style="font-size:10pt;color:white;">…………</span>Much of the code above is easy to follow if you’ve seen procedures I’ve posted over the last two tutorial series. As usual, the parameters and first couple of lines in the body allow users to perform the test on any column in any database they have sufficient access to, as well as to adjust the precision of the calculations to avoid arithmetic overflows. Starting with my last article, I began using a lot less dynamic SQL to code procedures like these, by instead caching the original values in a @ValueTable table variable. A couple of simple declarations and assignments needed for the rest of the computations follow this. The RangeCTE generates a set of integers that is fed to the Calculations.NormalDistributionInverseCDFFunction I introduced in <a href="https://multidimensionalmayhem.wordpress.com/2015/11/03/goodness-of-fit-testing-with-sql-server-part-2-1-implementing-probability-plots-in-reporting-services/">Goodness-of-Fit Testing with SQL Server, part 2: Implementing Probability Plots in Reporting Services</a>.<br />
<span style="font-size:10pt;color:white;">…………</span>In lieu of making this article any more verbose and dry than it absolutely has to be, I’ll omit a rehash of that topic and simply point users back to the code from that previous article. Once those numbers are derived, the calculations are actually quite simple in comparison to some of the more complex procedures I’ve posted in the past. As usual, I’ll avoid a precise explanation of how I translated the mathematical formulas into code, for the same reason that driver’s ed classes don’t require a graduate degree in automotive engineering: end users need to be able to interpret the final test statistic accurately – which is why I’m not including the easily misunderstood critical values – but shouldn’t be bothered with the internals. I’ve supplied just enough detail so that the T-SQL equivalent of a mechanic can fix my shoddy engineering, if need be. It may be worth noting though that I can’t simply use the standard deviation in place of the root of the variance as I normally do, because the square root in the Ryan-Joiner equations is calculated after the variance has been multiplied by other terms.</p>
<p><strong><u>Figure 2: Sample Results from Ryan-Joiner Test Procedure<br />
</u></strong><span style="font-size:9.5pt;font-family:Consolas;color:blue;">EXEC</span><span style="font-size:9.5pt;font-family:Consolas;"><span> </span><span style="color:teal;">Calculations</span><span style="color:gray;">.</span><span style="color:teal;">NormalityTestRyanJoinerTestSP<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:blue;"><span> </span></span><span style="font-size:9.5pt;font-family:Consolas;color:teal;">@Database1</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:gray;">=</span> <span style="color:red;">N'DataMiningProjects'</span><span style="color:gray;">,<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;"><span> </span><span style="color:teal;">@Schema1 </span><span style="color:gray;">=</span> <span style="color:red;">N'Health'</span><span style="color:gray;">,<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;"><span> </span><span style="color:teal;">@Table1 </span><span style="color:gray;">=</span> <span style="color:red;">N'DuchennesTable'</span><span style="color:gray;">,<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;"><span> </span><span style="color:teal;">@Column1 </span><span style="color:gray;">=</span> <span style="color:red;">N'Hemopexin'</span></span></p>
<p><a href="https://multidimensionalmayhem.wordpress.com/2016/03/11/goodness-of-fit-testing-with-sql-server-part-6-2-the-ryan-joiner-test/ryanjoinertestresults/" rel="attachment wp-att-571"><img class="alignnone size-full wp-image-571" src="https://multidimensionalmayhem.files.wordpress.com/2016/03/ryanjoinertestresults.jpg?w=604" alt="RyanJoinerTestResults" /></a></p>
<p><span style="font-size:10pt;color:white;">…………</span>The sample query in Figure 2 was run against a column of data on the Hemopexin protein contained within a dataset on the Duchennes form of muscular dystrophy, which I downloaded long ago from <a href="http://biostat.mc.vanderbilt.edu/wiki/Main/DataSets">Vanderbilt University’s Department of Biostatistics</a> and converted to a SQL Server table. Since this table only has 209 rows and occupies just 9 kilobytes, I customarily stress test the procedures I post in these tutorials against an 11-million-row table of data on the Higgs Boson, which is made freely available by the <a href="https://archive.ics.uci.edu/ml/datasets/HIGGS">University of California at Irvine’s Machine Learning Repository</a> and now occupies nearly 6 gigabytes in the same database.<br />
<span style="font-size:10pt;color:white;">…………</span>I also tested it against columns in other tables I’m quite familiar with and discovered a pattern that is both comforting and disconcerting: the test statistic is indeed lower on columns I already know to be abnormal, but there may be a scaling issue in the internal calculations because the values are still unexpectedly high for all columns. I know from previous goodness-of-fit and outlier detection tests that the Hemopexin column is more abnormal than some of the other Duchennes columns and as expected, it had a lower Ryan-Joiner statistic; the problem is that it was still fairly close to 1. Likewise, the histograms I posted way back in <a href="https://multidimensionalmayhem.wordpress.com/2015/04/21/outlier-detection-with-sql-server-part-6-1-visual-outlier-detection-with-reporting-services/">Outlier Detection with SQL Server, part 6.1: Visual Outlier Detection with Reporting Services</a> clearly show that the first float column in the Higgs Boson dataset is hopelessly lopsided and therefore can’t come from a normal distribution, while the second follows a clear bell curve. It is not surprising then that the former scored a 0.909093348747035 and the latter a 0.996126753961487 in the Ryan-Joiner Test. The order of the values always seems to correctly match the degree of normality for every practice dataset I use, which is a good sign, but the gaps between them may not be proportionally correct. In the absence of example data to verify my procedure against, I can’t tell for sure if this is a problem or not.</p>
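<p><span style="font-size:10pt;color:white;">…………</span>The compressed range near 1 is easy to reproduce outside SQL Server. In the hypothetical Python experiment below (my own reimplementation of the same rankit formula as Figure 1, written purely for illustration), even evenly spaced data from a flat, plainly non-Gaussian distribution scores well above 0.9, which suggests that the high values may be inherent to any correlation-based statistic of this kind rather than an error in the procedure.</p>

```python
# Hypothetical experiment (not the author's code): the Figure 1 statistic
# applied to flat, evenly spaced data, which is plainly not bell-shaped.
from statistics import NormalDist
import math

def ryan_joiner_statistic(values):
    y = sorted(values)
    n = len(y)
    c = 0.5 if n > 10 else 0.375
    inv_cdf = NormalDist().inv_cdf
    b = [inv_cdf((i - c) / (n + 1 - 2 * c)) for i in range(1, n + 1)]
    mean = sum(y) / n
    sum_sq_dev = sum((v - mean) ** 2 for v in y)
    return sum(v * bi for v, bi in zip(y, b)) / math.sqrt(sum_sq_dev * sum(bi * bi for bi in b))

uniform_grid = [i / 100.0 for i in range(1, 100)]  # a flat distribution, no bell curve
print(ryan_joiner_statistic(uniform_grid))         # still lands well above 0.9
```

<p>In other words, the ordering of the results may be trustworthy even if the absolute gaps between them are squeezed toward the top of the scale.</p>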
<p style="text-align:center;"><strong>Next Up: Kolmogorov and EDF-Based Tests</strong></p>
<p> Either way, the test is useful as-is, because it at least assigns test statistic values that are in the expected order, regardless of whether or not they are scaled correctly. These results come at a moderate, tolerable performance cost, clocking in at 6:43 for the first float column and 6:14 for the second. As usual, your results will probably be several orders of magnitude better than mine, given that I’m using a clunker of a development machine, not a real database server. The execution plans consist of two queries, the second of which accounts for 97 percent of the cost of the whole batch; out of the 24 operators in that query, a single Sort accounts for 96 percent of the cost. It occurs prior to a Merge Join, so there may be some way to optimize the procedure with join hints or recoding with optimization in mind. We’re unlikely to get much benefit out of analyzing the execution plan further, because it consists almost entirely of spools and Compute Scalar operators with infinitesimal costs, plus two Index Seeks, which is what we want to see.<br />
<span style="font-size:10pt;color:white;">…………</span>The Ryan-Joiner Test performs well enough that DBAs and data miners might find it a more useful addition to their toolbelt than the far better-known Shapiro-Wilk Test, which is simply inapplicable to most Big Data scenarios because of its fatal limitations on input sizes. There may be some lingering concerns about its reliability, but these can be rectified through a more diligent search of the available literature for examples that we can test it against; if we really need this particular statistic, then conferring with a professional statistician for ten minutes to verify the correctness of the results might also get the job done. If misgivings about its reliability are a real concern, then we can always turn to the alternatives we’ll cover in the next segment of this series, like the Kolmogorov-Smirnov Test (my personal favorite, which was also invented by my favorite mathematician), the Anderson-Darling, Kuiper’s and Lilliefors Tests, as well as the Cramér–von Mises Criterion. Judging from the fact that experts seem to divide the various goodness-of-fit tests into categories along the same lines[9], I was right to segregate the Jarque-Bera and D’Agostino’s K-Squared Test into a separate segment at the beginning of this series for measures based on kurtosis and skewness. The Shapiro-Wilk and Ryan-Joiner Tests likewise share a separate set of internal mechanisms in common, based on measures of correlation. In the next five articles, we’ll cover a set of goodness-of-fit measures that rely on a different type of internal mechanism, the empirical distribution function (EDF), which is a lot easier to calculate and explain than the long-winded name would suggest.</p>
<p>[1] These authors say it is “similar to the SW test”: p. 2142, Yap, B. W. and Sim, C. H., 2011, “Comparisons of Various Types of Normality Tests,” pp. 2141-2155 in <u>Journal of Statistical Computation and Simulation</u>, Vol. 81, No. 12. Also see the remark to the effect that “This test is similar to the Shapiro-Wilk normality test” at Gilberto, S. 2013, “Which Normality Test May I Use?” published in the<u> Lean Sigma Six Group Brazil</u> discussion group, at the LinkedIn web address <a href="http://www.linkedin.com/groups/Which-normality-test-may-I-3713927.S.51120536" rel="nofollow">http://www.linkedin.com/groups/Which-normality-test-may-I-3713927.S.51120536</a></p>
<p>[2] See the <u>Answers.com</u> webpage “What is Ryan Joiner Test” at <a href="http://www.answers.com/Q/What_is_Ryan_joiner_test">http://www.answers.com/Q/What_is_Ryan_joiner_test</a></p>
<p>[3] Colton, Jim, 2013, “Anderson-Darling, Ryan-Joiner, or Kolmogorov-Smirnov: Which Normality Test Is the Best?” published Oct. 10, 2013 at <u>The Minitab Blog</u> web address <a href="http://blog.minitab.com/blog/the-statistical-mentor/anderson-darling-ryan-joiner-or-kolmogorov-smirnov-which-normality-test-is-the-best">http://blog.minitab.com/blog/the-statistical-mentor/anderson-darling-ryan-joiner-or-kolmogorov-smirnov-which-normality-test-is-the-best</a></p>
<p>[4] See the aforementioned <u>Answers.com</u> webpage.</p>
<p>[5] See the comment by the user named Mikel on Jan. 23, 2008 in the<u> iSixSigma</u> thread “Ryan-Joiner Test” at <a href="http://www.isixsigma.com/topic/ryan-joiner-test/">http://www.isixsigma.com/topic/ryan-joiner-test/</a></p>
<p>[6] Gilberto, S. 2013, “Which Normality Test May I Use?” published in the<u> Lean Sigma Six Group Brazil</u> discussion group, at the LinkedIn web address <a href="http://www.linkedin.com/groups/Which-normality-test-may-I-3713927.S.51120536">http://www.linkedin.com/groups/Which-normality-test-may-I-3713927.S.51120536</a></p>
<p>[7] No author listed, 2014, “7.5 – Tests for Error Normality,” published at the <u>Penn State University</u> web address <a href="https://onlinecourses.science.psu.edu/stat501/node/366" rel="nofollow">https://onlinecourses.science.psu.edu/stat501/node/366</a> .This source has several other goodness-of-test formulas arranged in a convenient format. Also see Uaieshafizh, 2011, “Normality Test Dengan Menggunakan Uji Ryan-Joiner,” published Nov. 1, 2011 at the <u>Coretan Uaies Hafizh</u> web address <a href="http://uaieshafizh.wordpress.com/2011/11/01/uji-ryan-joiner/" rel="nofollow">http://uaieshafizh.wordpress.com/2011/11/01/uji-ryan-joiner/</a> . Translated from Indonesian by Google Translate.</p>
<p>[8] Ryan, Jr., Thomas A. and Joiner, Brian L., 1976, “Normal Probability Plots and Tests for Normality,” Technical Report, published by the <u>Pennsylvania State University Statistics Department</u>. Available online at the Minitab web address <a href="http://www.minitab.com/uploadedFiles/Content/News/Published_Articles/normal_probability_plots.pdf">http://www.minitab.com/uploadedFiles/Content/News/Published_Articles/normal_probability_plots.pdf</a></p>
<p>[9] For an example, see p. 2143, Yap, B. W. and Sim, C. H., 2011, “Comparisons of Various Types of Normality Tests,” pp. 2141-2155 in <u>Journal of Statistical Computation and Simulation</u>, Vol. 81, No. 12. Available online at the web address <a href="http://www.tandfonline.com/doi/pdf/10.1080/00949655.2010.520163" rel="nofollow">http://www.tandfonline.com/doi/pdf/10.1080/00949655.2010.520163</a> “Normality tests can be classified into tests based on regression and correlation (SW, Shapiro–Francia and Ryan–Joiner tests), CSQ test, empirical distribution test (such as KS, LL, AD and CVM), moment tests (skewness test, kurtosis test, D’Agostino test, JB test), spacings test (Rao’s test, Greenwood test) and other special tests.” I have yet to see the latter two tests mentioned anywhere else, so I’ll omit them from the series for now on the grounds that sufficient information will likely be even harder to find than it was for the Ryan-Joiner Test.</p>
<p> </p>Goodness-of-Fit Testing with SQL Server Part 6.1: The Shapiro-Wilk Test
http://www.sqlservercentral.com/blogs/multidimensionalmayhem/2016/02/29/goodness-of-fit-testing-with-sql-server-part-61-the-shapiro-wilk-test/
Tue, 01 Mar 2016 04:45:41 UT/blogs/multidimensionalmayhem/2016/02/29/goodness-of-fit-testing-with-sql-server-part-61-the-shapiro-wilk-test/0http://www.sqlservercentral.com/blogs/multidimensionalmayhem/2016/02/29/goodness-of-fit-testing-with-sql-server-part-61-the-shapiro-wilk-test/#comments<p><strong>By Steve Bolton</strong></p>
<p><span style="font-size:10pt;color:white;">…………</span>Just as a good garage mechanic will fill his or her Craftsman toolbox with tools designed to fix specific problems, it is obviously wise for data miners to stockpile a wide range of algorithms, statistical tools, software packages and the like to deal with a wide variety of user scenarios. Some of the tests and algorithms I’ve covered in this amateur self-tutorial series and the previous one on outlier detection are applicable to a broad range of problems, while others are tailor-made to address specific issues; what works in one instance may be entirely inappropriate in a different context. For example, some fitness tests are specifically applicable only to linear regression and others to logistic regression, as explained in <a href="https://multidimensionalmayhem.wordpress.com/2016/01/13/goodness-of-fit-testing-with-sql-server-part-4-1-r2-rmse-and-regression-related-routines/">Goodness-of-Fit Testing with SQL Server, part 4.1: R2, RMSE and Regression-Related Routines</a> and <a href="https://multidimensionalmayhem.wordpress.com/2016/01/26/goodness-of-fit-testing-with-sql-server-part-4-2-the-hosmer-lemeshow-test-with-logistic-regression/">Goodness-of-Fit Testing with SQL Server part 4.2: The Hosmer-Lemeshow Test</a>. Other measures we’ve surveyed recently, like the Chi-Squared, Jarque-Bera and D’Agostino-Pearson Tests, can only be applied to particular probability distributions or are calculated in ways that can be a drag on performance, when run against the wrong type of dataset. The metric I’ll be discussing this week stands out as one of the most popular goodness-of-fit tests, in large part because it has better “statistical power,” which is a numerical measure of how often the actual effects of a variable are detected by a particular test.<br />
<span style="font-size:10pt;color:white;">…………</span>The Shapiro-Wilk Test is also apparently flexible enough to be extended to other distributions beyond the “normal” Gaussian (i.e. the bell curve), such as the uniform, the exponential, and to a certain extent “to any symmetric distribution.”[1] Its flexibility is augmented by scale and origin invariance, two properties that statisticians prefer to endow their metrics with, because multiplying the terms by a common factor or choosing a different starting point doesn’t lead to incomparable values.[2] For these reasons it is widely implemented in statistical software that competes in a tangential way with SQL Server Data Mining (SSDM), most notably “R, Stata, SPSS and SAS.”[3] As we shall see, however, there is less incentive to implement it in SQL Server than in these dedicated stats packages, because of the specific nature of the datasets we work with.</p>
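<p><span style="font-size:10pt;color:white;">…………</span>That invariance is easy to verify numerically. The short pure-Python sketch below (the helper name <code>w_stat</code> is mine, not from any library) computes W for the simplest case of n = 3, where the single published coefficient is exactly the square root of one-half, and shows that rescaling and shifting the data leaves the statistic unchanged:</p>

```python
import math

def w_stat(x, a1=math.sqrt(0.5)):
    """W for n = 3: the single folded pair (max - min), weighted by the
    exact coefficient sqrt(1/2), squared, over the sum of squared deviations."""
    x = sorted(x)
    b = a1 * (x[-1] - x[0])
    mean = sum(x) / 3
    s2 = sum((v - mean) ** 2 for v in x)
    return b * b / s2

original = [4.0, 9.0, 7.0]
transformed = [5.0 * v + 3.0 for v in original]  # rescale and shift the origin

# b and the standard deviation both scale by the same factor under
# x -> c*x + d, so W = b^2 / S^2 comes out identical for both samples.
print(w_stat(original), w_stat(transformed))
```

Both the numerator and denominator pick up the same factor of c², which is why the ratio survives any change of scale or origin.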
<p style="text-align:center;"><strong>The Fatal Flaw of Shapiro-Wilk for Big Data</strong></p>
<p> The usefulness of the Shapiro-Wilk Test is severely constrained by a number of drawbacks, such as sensitivity to outliers and the fact that its authors envisioned it as an adjunct to the kind of visualizations we covered in <a href="https://multidimensionalmayhem.wordpress.com/2015/11/03/goodness-of-fit-testing-with-sql-server-part-2-1-implementing-probability-plots-in-reporting-services/">Goodness-of-Fit Testing with SQL Server, part 2.1: Implementing Probability Plots in Reporting Services</a>, not as a replacement for them.[4] The fatal flaw, however, is that the Shapiro-Wilk Test can only handle datasets of up to 50 rows; approximations developed by statisticians like Patrick Royston can extend it to at least 2,000 rows, but that is still a drop in the bucket compared to the millions of rows found in SQL Server tables. As I’ve pointed out in previous articles, one of the great strengths of the “Big Data” era is that we can now plumb the depths of such huge treasure troves to derive information of greater detail, an advantage we shouldn’t have to sacrifice merely to accommodate metrics that were designed generations ago with entirely different contexts in mind. Furthermore, the test is normally used in hypothesis testing on random samples whose means and variances are unknown, which, as I have explained in the past, is not a user scenario the SQL Server community will encounter often.[5] The means and variances of particular columns are trivial to calculate with built-in T-SQL functions. Moreover, random sampling is not as necessary in our field because we have access to such huge repositories of information, which are often equivalent to the full population, depending on what questions we choose to ask about our data.<br />
<span style="font-size:10pt;color:white;">…………</span>I’ll have to implement the T-SQL code for this article against a small sample of our available practice data, simply because of the built-in limitation on row counts. In order to accommodate larger datasets, we’d have to find a different way of performing the internal calculations, which are subject to combinatorial explosion. The main sticking point is a constant in the Shapiro-Wilk equations which must be derived through covariance matrices, which grow too large to calculate for sizable datasets, to say nothing of the performance costs. As Royston notes, deriving the constant for a 1,500-row table would require the storage of 1,126,500 reals, given that the covariance matrix requires a number of comparisons on the order of the count of the table multiplied by itself. Halving those results doesn’t ameliorate that quadratic growth by much; strictly speaking the storage requirement grows polynomially rather than exponentially, but at the row counts found in modern databases it is prohibitive all the same.</p>
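<p><span style="font-size:10pt;color:white;">…………</span>A quick back-of-the-envelope sketch makes the storage problem concrete: the lower triangle of an n-by-n covariance matrix (including the diagonal) holds n(n+1)/2 distinct entries, which for n = 1,500 comes to 1,125,750 reals, in the same ballpark as Royston’s quoted figure of 1,126,500. The assumed 8 bytes per value corresponds to a SQL Server float:</p>

```python
def covariance_entries(n):
    # Distinct entries in a symmetric n-by-n covariance matrix:
    # the lower triangle plus the diagonal, i.e. n * (n + 1) / 2.
    return n * (n + 1) // 2

for rows in (50, 1500):
    entries = covariance_entries(rows)
    megabytes = entries * 8 / 1024 ** 2  # assuming 8-byte floats
    print(rows, entries, round(megabytes, 2))

# At 11,000,000 rows the count would exceed 60 trillion entries --
# quadratic rather than exponential growth, but still far beyond any TempDB.
```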
<p style="text-align:center;"><strong>Workarounds for Combinatorial Explosion</strong></p>
<p> My math may be off, but I calculated that stress-testing the Shapiro-Wilk procedure against the first float column in the 11-million-row Higgs Boson Dataset (which I downloaded from the <a href="https://archive.ics.uci.edu/ml/datasets/HIGGS">University of California at Irvine’s Machine Learning Repository</a> and converted into a SQL Server table of about 6 gigabytes) would require about 1.2 trillion float values and 67 terabytes of storage space. I have the sneaking suspicion that no one in the SQL Server community has that much free space in their TempDB. And that is before factoring in such further performance hits as matrix inversion and related transforms.<br />
<span style="font-size:10pt;color:white;">…………</span>While writing a recent article on <a href="https://multidimensionalmayhem.wordpress.com/2015/09/11/outlier-detection-with-sql-server-part-8-a-t-sql-hack-for-mahalanobis-distance/">Mahalanobis Distance</a>, combinatorial explosion of matrix determinants forced me to scrap my sample code for a type of covariance matrix that compared the global variance values for each column against one another; even that was a cheap workaround for calculating what amounts to a cross product against each set of local values. In this case, we’re only talking about a bivariate comparison, so inserting the easily calculable global variance value would leave us with a covariance matrix of just one entry, which isn’t going to fly.[6] We can’t fudge the covariance matrix in this way, but it might be possible to use one of Royston’s approximations to derive that pesky constant more efficiently. Alas, I was only able to read a couple of pages of his 1991 academic journal article on the topic, since Springer.com charges an arm and a leg for full access. I had the distinct sense, however, that it would still not scale to the size of datasets typically associated with the Big Data buzzword. Furthermore, a lot of it was still over my head, as was the original 1965 paper by Samuel S. Shapiro and Martin B. Wilk (although not as far over my head as such material used to be, which is precisely why I am using exercises like these to acquire the skills I lack). Thankfully, that article in <em>Biometrika</em> provides an easily adaptable table of lookup values for the constant[7], as well as a legible example that I was able to verify my results against. Figure 1 below provides DDL for creating a lookup table to hold those values, which you’ll have to copy yourself from one of the many publicly available sources on the Internet, including the original paper.[8]</p>
<p><strong><u>Figure 1: DDL for the Shapiro-Wilk Lookup Table</u></strong></p>
<pre style="font-size:9.5pt;font-family:Consolas;">
CREATE TABLE [Calculations].[ShapiroWilkLookupTable](
	[ID] [smallint] IDENTITY(1,1) NOT NULL,
	[ICount] bigint NULL,
	[NCount] bigint NULL,
	[Coefficient] [decimal](5, 4) NULL,
	CONSTRAINT [PK_ShapiroWilkLookupTable] PRIMARY KEY CLUSTERED ([ID] ASC)
	WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF,
	ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
</pre>
<p><strong><u>Figure 2: T-SQL Code for the Shapiro-Wilk Test</u></strong></p>
<pre style="font-size:9.5pt;font-family:Consolas;">
CREATE PROCEDURE [Calculations].[GoodnessOfFitShapiroWilkTestSP]
@DatabaseName as nvarchar(128) = NULL, @SchemaName as nvarchar(128),
@TableName as nvarchar(128), @ColumnName AS nvarchar(128)
AS
DECLARE @SchemaAndTableName nvarchar(400), @SQLString nvarchar(max)
SET @SchemaAndTableName = @DatabaseName + '.' + @SchemaName + '.' + @TableName

DECLARE @ValueTable table
(ID bigint IDENTITY (1,1),
Value float)

-- retrieve the non-null values of the column being tested
SET @SQLString = 'SELECT ' + @ColumnName + ' AS Value
FROM ' + @SchemaAndTableName + '
WHERE ' + @ColumnName + ' IS NOT NULL'

INSERT INTO @ValueTable
(Value)
EXEC (@SQLString)

DECLARE @Count bigint,
@CountPlusOneQuarter decimal(38,2),
@CountIsOdd bit = 0,
@CountDivisor float,
@S2 float,
@ShapiroWilkTestStatistic float,
@One float = 1

SELECT @Count = Count(*)
FROM @ValueTable

SELECT @CountPlusOneQuarter = @Count + 0.25
SELECT @CountIsOdd = CASE WHEN @Count % 2 = 1 THEN 1 ELSE 0 END
-- the middle value drops out of the pairings when the count is odd
SELECT @CountDivisor = CASE WHEN @CountIsOdd = 1 THEN (@Count / CAST(2 as float)) + 1
	ELSE (@Count / CAST(2 as float)) END

-- @S2 is the sum of squared deviations, via the identity Sum(x^2) - (Sum(x))^2 / n
SELECT TOP 1 @S2 = Sum(Power(Value, 2)) OVER (ORDER BY Value)
	- (Power(Sum(Value) OVER (ORDER BY Value), 2) * (@One / CAST(@Count as float)))
FROM @ValueTable
ORDER BY Value DESC

-- fold the sorted dataset in half, pairing the Ith smallest value with the
-- Ith largest, then weight each difference by the matching lookup coefficient
SELECT @ShapiroWilkTestStatistic = Power(CoefficientSum, 2) / @S2
FROM (SELECT TOP 1 SUM(FactorByShapiroWilkLookup * Coefficient) OVER (ORDER BY Coefficient DESC) AS CoefficientSum
	FROM (SELECT T1.RN AS RN, T2.Value - T1.Value AS FactorByShapiroWilkLookup
		FROM (SELECT TOP 99999999999 Value, ROW_NUMBER () OVER (ORDER BY Value ASC) AS RN
			FROM @ValueTable
			WHERE Value IS NOT NULL
			ORDER BY Value ASC) AS T1
		INNER JOIN (SELECT TOP 99999999999 Value, ROW_NUMBER () OVER (ORDER BY Value DESC) AS RN
			FROM @ValueTable
			WHERE Value IS NOT NULL
			ORDER BY Value DESC) AS T2
		ON T1.RN = T2.RN
		WHERE T1.RN &lt;= @CountDivisor) AS T3
	INNER JOIN Calculations.ShapiroWilkLookupTable
	ON RN = ICount AND NCount = @Count
	ORDER BY RN DESC) AS T4

SELECT @ShapiroWilkTestStatistic AS ShapiroWilkTestStatistic
</pre>
<p><span style="font-size:10pt;color:white;">…………</span>The use of the lookup table removes the need for the complex matrix logic, which might have made the T-SQL in Figure 2 even longer than the matrix code I originally wrote for <a href="https://multidimensionalmayhem.wordpress.com/2015/09/11/outlier-detection-with-sql-server-part-8-a-t-sql-hack-for-mahalanobis-distance/">Outlier Detection with SQL Server, part 8: A T-SQL Hack for Mahalanobis Distance</a> (which might have set a record for the lengthiest T-SQL samples ever posted in a blog, if I hadn’t found a workaround at the last minute). Longtime readers may notice a big change in the format of my SQL; gone is the @DecimalPrecision parameter, which enabled users to set their own precision and scale, but which made the code a lot less legible by requiring much bigger blocks of dynamic SQL. From now on, I’ll be using short dynamic SQL statements like the one included in @SQLString and performing a lot of the math operations on a table variable that holds the results. I ought to have done this sooner, but one of the disadvantages of working in isolation is that you’re missing the feedback that would ferret out bad coding habits more quickly. As usual, the parameters and first couple of lines within the body enable users to perform the test on any table column in any database they have sufficient access to.<br />
<span style="font-size:10pt;color:white;">…………</span>Most of the internal variables and constants we’ll need for our computations are declared near the top, followed by some simple assignments of values based on the record count. The @S2 assignment requires a little more code. It is then employed in a simple division operation in the last block, which is a series of subqueries and windowing operations that retrieve the appropriate lookup value, which depends on the record count. It also sorts the dataset by value, then derives order statistics by essentially folding the table in half, so that the first and last values are compared, then the second from the beginning and the second from the end, and so on, right up to the midpoint. The final calculations on the lookup values and these order statistics are actually quite simple. For this part, I also consulted the National Institute of Standards and Technology’s Engineering Statistics Handbook, which is one of the most succinctly written sources of information I’ve found to date on the topic of statistics.[9] Because I’m still a novice, the reasons why these particular calculations are used are still a mystery to me, although I’ve frequently seen Shapiro and Wilk mentioned in connection with Analysis of Variance (ANOVA), which is a simpler topic to grasp if not to implement. Since a float will do in place of variable precision, the code can be kept simple by inserting the results of a query on the @SchemaAndTableName into a table variable and then performing all the math on it outside of the dynamic SQL block.</p>
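<p><span style="font-size:10pt;color:white;">…………</span>The folding logic is easier to see outside of T-SQL. The following pure-Python sketch mirrors the procedure’s final block, substituting the exact n = 3 coefficient (the square root of one-half) for the lookup table; for larger samples the coefficients must come from the published Shapiro-Wilk tables, and the function name is mine rather than any library’s:</p>

```python
import math

def shapiro_wilk_w(values, coeffs):
    """W = b^2 / S^2, where b pairs the Ith smallest value with the Ith
    largest (the 'fold'), weighted by the tabled coefficients, and S^2 is
    the sum of squared deviations via sum(x^2) - (sum(x))^2 / n."""
    x = sorted(values)
    n = len(x)
    half = n // 2  # the middle value pairs with itself when n is odd, adding 0
    b = sum(coeffs[i] * (x[n - 1 - i] - x[i]) for i in range(half))
    s2 = sum(v * v for v in x) - sum(x) ** 2 / n
    return b * b / s2

A3 = [math.sqrt(0.5)]  # exact tabled coefficient for n = 3

# Three equally spaced values lie perfectly on a normal Q-Q line, so W = 1.
print(shapiro_wilk_w([1.0, 2.0, 3.0], A3))
# A heavily skewed triple drives W down to 0.75.
print(shapiro_wilk_w([0.0, 0.0, 10.0], A3))
```

Values of W near 1 indicate close agreement with normality, which is why the skewed sample scores so much lower than the evenly spaced one.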
<p><strong><u>Figure 3: Sample Results from the Shapiro-Wilk Test</u></strong></p>
<pre style="font-size:9.5pt;font-family:Consolas;">
EXEC Calculations.GoodnessOfFitShapiroWilkTestSP
	@DatabaseName = N'DataMiningProjects',
	@SchemaName = N'Health',
	@TableName = N'First50RowsPyruvateKinaseView',
	@ColumnName = N'PyruvateKinase'
</pre>
<p><a href="https://multidimensionalmayhem.wordpress.com/2016/02/29/goodness-of-fit-testing-with-sql-server-part-6-1-the-shapiro-wilk-test/shapirowilkqueryresults/" rel="attachment wp-att-563"><img class="alignnone size-full wp-image-563" src="https://multidimensionalmayhem.files.wordpress.com/2016/02/shapirowilkqueryresults.jpg?w=604" alt="ShapiroWilkQueryResults" /></a></p>
<p><span style="font-size:10pt;color:white;">…………</span>In Figure 3, I ran the procedure against a view created on the first 50 non-null values of the Pyruvate Kinase enzyme, derived from the 209-row table of Duchenne muscular dystrophy data I downloaded from the <a href="http://biostat.mc.vanderbilt.edu/wiki/Main/DataSets">Vanderbilt University’s Department of Biostatistics</a>. Given that we can’t calculate this on more than 50 rows at this point, doing the traditional performance test of the procedure on the HiggsBosonTable is basically pointless. Only if the lookup table could somehow be extended with new coefficients would it pay to look at the execution plan. When run against the trivial 7-row example in the Shapiro-Wilk paper, it had a couple of Clustered Index Scans that could probably be turned into Seeks with proper indexing on both the lookup table and the table being tested. It also had a couple of expensive Sort operators and a Hash Match that might warrant further inspection if the procedure could somehow be extended to datasets big enough to affect performance.<br />
<span style="font-size:10pt;color:white;">…………</span>Interpretation of the final test statistics is straightforward in one sense, yet tricky in another. The closer the statistic is to 1, the more closely the data approaches a normal distribution. It is common to assign confidence intervals, P-values and the like with the Shapiro-Wilk Test, but I am omitting this step out of growing concern about the applicability of hypothesis testing to our use cases. I’ve often questioned the wisdom of reducing high-precision test statistics down to simple Boolean, yes-no answers about whether a particular column is normally distributed, or a particular value is an outlier; not only is it akin to taking a float column in a table and casting it to a bit, but it prevents us from asking more sophisticated questions of our hard-won computations like, “<em>How</em> normally distributed is my data?”</p>
<p style="text-align:center;"><strong>More Misgivings About Hypothesis Testing-Style Metrics</strong></p>
<p> The more I read from professional statisticians and data miners who really know what they’re talking about, the less at ease I feel. Doubts about the utility of hypothesis tests of normality are routinely expressed in the literature; for some easily accessible examples that pertain directly to today’s metric, see the StackOverflow threads “Seeing if Data is Normally Distributed in R”[10] and <a href="http://stackoverflow.com/questions/15427692/perform-a-shapiro-wilk-normality-test">“Perform a Shapiro-Wilk Normality Test”</a>.[11] Some of the books I’ve read recently in my crash course in stats have not just echoed the same sentiments, but added dozens of different potential pitfalls in interpretation.[12] Hypothesis testing encompasses a set of techniques that are routinely wielded without the precision and skill required to derive useful information from them, as many professional statisticians lament. Worse still, the inherent difficulties are greatly magnified by Big Data, which comes with a unique set of use cases. The SQL Server user community might find bona fide niches for applying hypothesis testing, but for the foreseeable future I’ll forgo that step and simply use the test statistics as measures in their own right, which still gives end users the freedom to implement confidence intervals and the like if they find a need.<br />
<span style="font-size:10pt;color:white;">…………</span>The Shapiro-Wilk Test in its current form is likewise not likely to be as useful to us as it is to researchers in other fields, in large part because of the severe limitations on input sizes. As a rule, DBAs and data miners are going to be more interested in exploratory data mining than in hypothesis testing, using very large datasets where the means and variances are often easily discernible and sampling is less necessary. Perhaps the Shapiro-Wilk Test could be adapted to accommodate much larger datasets, as Royston apparently attempted to do by using quintic regression coefficients to approximate the constant the Shapiro-Wilk equations depend upon.[13] In fact, given that I’m still learning about the field of statistics, it is entirely possible that a better workaround is already available. I’ve already toyed with the idea of breaking up entire datasets into random samples of no more than 50 rows, but I’m not qualified to say if averaging the test statistics together would be a logically valid measure. I suspect that the measure would be incorrectly scaled because of the higher record counts.<br />
<span style="font-size:10pt;color:white;">…………</span>Until some kind of enhancement becomes available, it is unlikely that the Shapiro-Wilk Test will occupy a prominent place in any DBA’s fitness testing toolbox. There may be niches where small random samples and hypothesis testing make it a good choice, but for now its input limit simply cannot accommodate the sheer size of the data we’re working with. I looked into another potential workaround in the form of the Shapiro-Francia Test, but since it is calculated in a similar way and is “asymptotically equivalent”[14] to the Shapiro-Wilk (i.e., they basically converge and become equal for all intents and purposes), I chose to skip that alternative for the time being. In next week’s article we’ll instead discuss the Ryan-Joiner Test, which is often lumped in the same category with the Shapiro-Wilk. After that, we’ll survey a set of loosely related techniques that are likely to be of more use to the SQL Server community, encompassing the Kolmogorov-Smirnov, Anderson-Darling, Kuiper’s and Lilliefors Tests, as well as the Cramér–von Mises Criterion.</p>
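For anyone who wants to experiment with the random-sampling workaround floated above, a minimal sketch follows (Python, for brevity; the helper name, the fixed seed and the plain average are my own choices, and as the text cautions, the averaged figure has no established statistical validity):

```python
import random

def averaged_sample_statistic(values, stat_fn, sample_size=50, seed=0):
    """Prototype of the untested workaround discussed above: shuffle the
    column, split it into random samples of at most sample_size rows,
    apply a per-sample test statistic (e.g. a Shapiro-Wilk implementation
    capped at 50 rows), and average the results.  Whether that average is
    a logically valid measure remains an open question."""
    rng = random.Random(seed)  # fixed seed so trials are repeatable
    shuffled = list(values)
    rng.shuffle(shuffled)
    samples = [shuffled[i:i + sample_size]
               for i in range(0, len(shuffled), sample_size)]
    stats = [stat_fn(s) for s in samples]
    return sum(stats) / len(stats)
```

Passing in any per-sample statistic function lets you compare the averaged figure against the single-sample result and judge the scaling problem mentioned above for yourself.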
<p>[1] Royston, Patrick, 1992, “Approximating the Shapiro-Wilk W-Test for Non-Normality,” pp. 117-119 in <u>Statistics and Computing</u>, September, 1992. Vol. 2, No. 3. Available online at <a href="http://link.springer.com/article/10.1007/BF01891203#page-1">http://link.springer.com/article/10.1007/BF01891203#page-1</a></p>
<p>[2] p. 591, Shapiro, Samuel S. and Wilk, Martin B., 1965, “An Analysis of Variance Test for Normality (Complete Samples),” pp. 591-611 in <u>Biometrika</u>, December 1965. Vol. 52, Nos. 3-4.</p>
<p>[3] See the Wikipedia page “Shapiro-Wilk Test” at <a href="http://en.wikipedia.org/wiki/Shapiro%E2%80%93Wilk_test">http://en.wikipedia.org/wiki/Shapiro%E2%80%93Wilk_test</a></p>
<p>[4] p. 610, Shapiro and Wilk, 1965.</p>
<p>[5] p. 593, Shapiro and Wilk, 1965.</p>
<p>[6] Apparently there is another competing definition of the term, in which values are compared within a particular column, not across columns. See the Wikipedia page “Covariance Matrix” at <a href="http://en.wikipedia.org/wiki/Covariance_matrix#Conflicting_nomenclatures_and_notations">http://en.wikipedia.org/wiki/Covariance_matrix#Conflicting_nomenclatures_and_notations</a></p>
<p>[7] pp. 603-604, Shapiro and Wilk, 1965.</p>
<p>[8] Another source of the Shapiro-Wilk coefficient is Zaiontz, Charles, 2014, “Shapiro-Wilk Tables,” posted at the Real Statistics Using Excel blog web address <a href="http://www.real-statistics.com/statistics-tables/shapiro-wilk-table/">http://www.real-statistics.com/statistics-tables/shapiro-wilk-table/</a></p>
<p>[9] National Institute for Standards and Technology, 2014, “7.2.1.3 Anderson-Darling and Shapiro-Wilk Tests,” published in the online edition of the <u>Engineering Statistics Handbook</u>. Available at <a href="http://www.itl.nist.gov/div898/handbook/prc/section2/prc213.htm">http://www.itl.nist.gov/div898/handbook/prc/section2/prc213.htm</a></p>
<p>[10] See especially the comment by Ian Fellows on Oct. 17, 2011:</p>
<blockquote><p> “Normality tests don’t do what most think they do. Shapiro’s test, Anderson Darling, and others are null hypothesis tests AGAINST the assumption of normality. These should not be used to determine whether to use normal theory statistical procedures. In fact they are of virtually no value to the data analyst. Under what conditions are we interested in rejecting the null hypothesis that the data are normally distributed? I have never come across a situation where a normal test is the right thing to do. When the sample size is small, even big departures from normality are not detected, and when your sample size is large, even the smallest deviation from normality will lead to a rejected null…”<br />
<span style="font-size:10pt;color:white;">…………</span>“…So, in both these cases (binomial and lognormal variates) the p-value is > 0.05 causing a failure to reject the null (that the data are normal). Does this mean we are to conclude that the data are normal? (hint: the answer is no). Failure to reject is not the same thing as accepting. This is hypothesis testing 101.”<br />
<span style="font-size:10pt;color:white;">…………</span>“But what about larger sample sizes? Let’s take the case where there the distribution is very nearly normal.”<br />
<span style="font-size:10pt;color:white;">…………</span>“Here we are using a t-distribution with 200 degrees of freedom. The qq-plot shows the distribution is closer to normal than any distribution you are likely to see in the real world, but the test rejects normality with a very high degree of confidence.”<br />
<span style="font-size:10pt;color:white;">…………</span>“Does the significant test against normality mean that we should not use normal theory statistics in this case? (another hint: the answer is no :) )”</p></blockquote>
<p>[11] Note these helpful comments by Paul Hiemstra on March 15, 2013:</p>
<blockquote><p> “An additional issue with the Shapiro-Wilks test is that when you feed it more data, the chances of the null hypothesis being rejected becomes larger. So what happens is that for large amounts of data even veeeery small deviations from normality can be detected, leading to rejection of the null hypothesis even though for practical purposes the data is more than normal enough…”<br />
<span style="font-size:10pt;color:white;">…………</span>“…In practice, if an analysis assumes normality, e.g. lm, I would not do this Shapiro-Wilks test, but do the analysis and look at diagnostic plots of the outcome of the analysis to judge whether any assumptions of the analysis where violated too much. For linear regression using lm this is done by looking at some of the diagnostic plots you get using plot (lm()). Statistics is not a series of steps that cough up a few numbers (hey p < 0.05!) but requires a lot of experience and skill in judging how to analysis your data correctly.”</p></blockquote>
<p>[12] A case in point with an entire chapter devoted to the shortcomings of hypothesis testing methods is Kault, David, 2003, <u>Statistics with Common Sense</u>. Greenwood Press: Westport, Connecticut.</p>
<p>[13] His approximation method is also based on Weisberg, Sanford and Bingham, Christopher, 1975, “An Approximate Analysis of Variance Test for Non-Normality Suitable for Machine Calculation,” pp. 133-134 in <u>Technometrics</u>, Vol. 17.</p>
<p>[14] p. 117, Royston.</p>Goodness-of-Fit Testing with SQL Server Part 5: The Chi-Squared Test
http://www.sqlservercentral.com/blogs/multidimensionalmayhem/2016/02/13/goodness-of-fit-testing-with-sql-server-part-5-the-chi-squared-test/
Sat, 13 Feb 2016 09:19:38 UT/blogs/multidimensionalmayhem/2016/02/13/goodness-of-fit-testing-with-sql-server-part-5-the-chi-squared-test/0http://www.sqlservercentral.com/blogs/multidimensionalmayhem/2016/02/13/goodness-of-fit-testing-with-sql-server-part-5-the-chi-squared-test/#comments<p><strong>By Steve Bolton</strong></p>
<p><span style="font-size:10pt;color:white;">…………</span>As I’ve cautioned before, I’m writing this series of amateur self-tutorials in order to learn how to use SQL Server to perform goodness-of-fit testing on probability distributions and regression lines, not because I already know the topic well. Along the way, one of the things I’ve absorbed is that the use cases for the various goodness-of-fit tests are more sharply delineated than those of the statistical tests for outlier detection, which were the topic of my last tutorial series. I covered some of the more general measures <a href="https://multidimensionalmayhem.wordpress.com/2015/10/16/goodness-of-fit-testing-with-sql-server-part-1-the-simplest-methods/">in Goodness-of-Fit Testing with SQL Server, part 1: The Simplest Methods</a>, but even some of these – like the 3-Sigma Rule – are limited only to the Gaussian or “normal” distribution, i.e. the bell curve. Many of the other metrics we’ll survey later in this series are likewise limited to specific data types, such as the popular Kolmogorov-Smirnov and Anderson-Darling Tests, which cannot be applied to nominal data (i.e. corresponding to the Discrete Content type in SQL Server Data Mining).[1] For that task, you need a metric like the Chi-Squared Test (or χ²), which can handle nominal data types as well as continuous ones measured in infinitesimal gradations; cases in point would include the decimal and float types in SQL Server.<br />
<span style="font-size:10pt;color:white;">…………</span>In addition to the bell curve, the χ² Test can handle such other popular distributions as the Poisson, log normal, Weibull, exponential, binomial and logistic, plus any others that have an associated cumulative distribution function (CDF).[2] Ralph B. D’Agostino, one of the inventors of the metric we discussed in <a href="https://multidimensionalmayhem.wordpress.com/2015/12/21/goodness-of-fit-testing-with-sql-server-part-3-2-dagostinos-k-squared-test/">Goodness-of-Fit Testing with SQL Server, part 3.2: D’Agostino’s K-Squared Test</a>, cautions, though, that analyses of the χ² Test indicate this flexibility comes at the cost of decreased statistical power; as he and some of his colleagues put it in a 1990 academic paper, “The extensive power studies just mentioned have also demonstrated convincingly that the old warhorses, the chi-squared test and the Kolmogorov test (1933), have poor power properties and should not be used when testing for normality.”[3] Some experts consider this flaw to be almost fatal, to the point where one writes, “If you want to test normality, a chi-squared test is a really bad way to do it. Why not, say, a Shapiro-Francia test or say an Anderson-Darling adjusted for estimation? You’ll have far more power.”[4] As we shall see in a few weeks, the Anderson-Darling Test has other limitations beyond its inability to handle nominal columns, whereas I believe the Shapiro-Francia Test is based on the Shapiro-Wilk, which is computationally expensive and limited to what the SQL Server community would consider very small sample sizes. Each test has its own unique set of strengths and weaknesses, which ought to strongly influence a data miner’s choices.</p>
<p style="text-align:center;"><strong>More “Gotchas” with the χ² Test (and Its Inventor)</strong></p>
<p> A further caveat of the χ² Test is that the population ought to be ten times more numerous than the sample[5], but one of the strengths of Big Data-era analysis is that we can use modern set-based methods to traverse gigantic datasets, rather than taking dinky slices of the kind normally seen in hypothesis testing. As discussed over the course of the last two tutorial series, I’m shying away from the whole field of hypothesis testing because it is not well-suited to our use cases, which may involve hundreds of millions of rows representing a full population, rather than the 50 or 100 sampled rows that rarely do; furthermore, the act of applying the usual confidence and significance levels and the like reduces such tests down to a simple Boolean, yes-no answer. This represents a substantial reduction in the information provided by the test statistics, akin to truncating a float or decimal column down to a SQL Server bit data type; by retaining the full statistic, we can measure <em>how</em> normal or exponential or uniform a particular dataset may be.[6]<a href="#_edn6" name="_ednref6"><br />
</a> That is why in the last article, I skipped the usual step of plugging the Hosmer-Lemeshow Test results into a χ² Test, to derive confidence levels and the like based on how well they approximate a χ² distribution.[7] In fact, such comparisons to the χ² distribution seem to be as common in hypothesis testing as those to Student’s T-distribution, or the F-distribution in the case of Analysis of Variance (ANOVA). Further adding to the confusion is the fact that there is also a χ² Test of Independence, in which contingency tables are used to establish relationships between multiple variables. There is some overlap in the underlying concepts, but the two χ² Tests are not identical.[8] The goodness-of-fit version we’re speaking of here was developed by Karl Pearson, one of the most brilliant statisticians of the 19<sup>th</sup> Century – but also one of the most twisted. As I’ve pointed out several times since beginning my series on <a href="http://www.sqlservercentral.com/blogs/multidimensionalmayhem/2012/11/28/a-rickety-stairway-to-sql-server-data-mining-part-01-data-in-data-out/">A Rickety Stairway to SQL Server Data Mining</a>, ordinary mathematicians might be stable people, but a frightening share of the rare geniuses among them have been not just eccentric, but Norman Bates-level crazy. Pearson was a blatant racist and Social Darwinist who sought to extinguish the “unfit” through such schemes as eugenics[9], and thereby helped feed the intellectual current in Europe that eventually brought Hitler to power. We can still use his statistical tests, just as we use the rockets devised by Wernher von Braun and the quantum mechanics pioneered by Werner Heisenberg – provided that we put them to better purposes.</p>
<p style="text-align:center;"><strong>Deriving the χ² from CDFs</strong></p>
<p> You don’t have to be a proverbial rocket scientist in order to calculate the χ² Test, nor do you need to do the kind of mental gymnastics required for the Heisenberg Uncertainty Principle. The equation is actually quite simple, especially since it follows a form similar to that of many of the other test statistics surveyed in the last two tutorial series. Like Z-Scores and so many other metrics, the χ² Test involves subtracting one value from another for each row, squaring the result and then summing the results across the entire dataset, all of which can be easily implemented with T-SQL windowing functions. The difference is that in this case, we’re putting the data in ascending order, then subtracting the probabilities generated by the CDF of the distribution we’re testing from the actual values.<br />
Some CDFs are trivial to calculate, but as I mentioned in <a href="https://multidimensionalmayhem.wordpress.com/2015/11/03/goodness-of-fit-testing-with-sql-server-part-2-1-implementing-probability-plots-in-reporting-services/">Goodness-of-Fit Testing with SQL Server, part 2: Implementing Probability Plots in Reporting Services</a>, I had a hell of a time deriving the correct values for the normal distribution’s CDF – as do many novices, in large part because there is no closed-form solution to that particular formula. Rather than rehash that whole topic of how to use approximations to derive the normal CDF, I’ll simply reuse most of the code from that article to implement that part of the χ² equation. I had to tweak it a little so that I could calculate only the handful of CDF values we actually need, rather than every single probability in a defined range; this called for passing it a table parameter of the type shown below, which is populated in the middle of Figure 1. Keep in mind that this Gaussian CDF is based on the simplest approximation I could find, so once you get about five or six places right of the decimal point, some inaccuracy creeps in, which might be magnified in certain cases by the use of float rather than decimal in the type definition.</p>
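It may help to see the approximation in isolation before wading through the dynamic SQL. The sketch below (in Python, stripped of the T-SQL plumbing; the function names are mine) mirrors the arithmetic of the procedure in Figure 1: Winitzki's error-function approximation, with its 0.147 constant and the 1.27324 literal standing in for 4/π, plugged into the Gaussian CDF, plus the same reflect-below-the-mean CASE logic.

```python
from math import erf, exp, pi, sqrt

A = 0.147  # Winitzki's constant, the 0.147 literal in the procedure

def normal_cdf_approx(x, mean=0.0, stdev=1.0):
    """Gaussian CDF via Winitzki's erf approximation -- the same
    arithmetic as the T-SQL in Figure 1, including the reflection below
    the mean.  Accurate to roughly four decimal places, as noted above."""
    z = (x - mean) / (stdev * sqrt(2.0))
    z2 = z * z
    # 4/pi is the 1.27324 literal hard-coded in the procedure.
    half_erf = 0.5 * sqrt(1.0 - exp((-(A * z2 * z2) - (4.0 / pi) * z2)
                                    / (1.0 + A * z2)))
    return 0.5 + half_erf if x >= mean else 0.5 - half_erf

def normal_cdf_exact(x, mean=0.0, stdev=1.0):
    """Reference value from the library error function, for comparison."""
    return 0.5 * (1.0 + erf((x - mean) / (stdev * sqrt(2.0))))
```

Comparing the two functions across a range of inputs shows the error staying comfortably below a ten-thousandth, which is the five-or-six-decimal-place caveat in practice.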
<p><strong><u>Figure 1: DDL for the χ² Test<br />
</u></strong><span style="font-size:9.5pt;font-family:Consolas;color:blue;">CREATE</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:blue;">TYPE</span> <span style="color:teal;">[Calculations]</span><span class="GramE"><span style="color:gray;">.</span><span style="color:teal;">[</span></span><span style="color:teal;">SimpleFloatValueTableParameter]</span> <span style="color:blue;">AS</span> <span style="color:blue;">TABLE</span><span style="color:gray;">(<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;"><span> </span><span style="color:teal;">[RN]</span> <span style="color:teal;">[<span class="GramE">bigint</span>]</span> <span style="color:gray;">NULL,<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;"><span> </span><span style="color:teal;">[Value]</span> <span class="GramE"><span style="color:blue;">float</span><span style="color:gray;">(</span></span>53<span style="color:gray;">)</span> <span style="color:gray;">NULL</span></span><span style="font-size:9.5pt;font-family:Consolas;color:gray;">)</span></p>
<p class="MsoNormal"><span style="font-size:10pt;"> </span><span style="font-size:9.5pt;font-family:Consolas;color:blue;">CREATE</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:blue;">PROCEDURE</span> <span style="color:teal;">[Calculations]</span><span class="GramE"><span style="color:gray;">.</span><span style="color:teal;">[</span></span><span class="SpellE"><span style="color:teal;">NormalDistributionCDFSupplyTableParameterSP</span></span><span style="color:teal;">]<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:teal;">@Mean</span><span style="font-size:9.5pt;font-family:Consolas;"> <span class="GramE"><span style="color:blue;">decimal</span><span style="color:gray;">(</span></span>38<span style="color:gray;">,</span>21<span style="color:gray;">),</span> <span style="color:teal;">@StDev</span> <span style="color:blue;">decimal</span><span style="color:gray;">(</span>38<span style="color:gray;">,</span>21<span style="color:gray;">),</span><span> </span><span style="color:teal;">@<span class="SpellE">InputTableParameter</span></span> <span style="color:blue;">AS</span> <span style="color:teal;">[Calculations]</span><span style="color:gray;">.</span><span style="color:teal;">[<span class="SpellE">SimpleFloatValueTableParameter</span>] </span><span style="color:blue;">READONLY<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:blue;">AS</span></p>
<p class="MsoNormal"><span lang="PT" style="font-size:9.5pt;font-family:Consolas;color:blue;">DECLARE</span><span lang="PT" style="font-size:9.5pt;font-family:Consolas;"> <span style="color:teal;">@StDevTimesSquareRootOf2</span> <span style="color:blue;">as </span><span style="color:blue;">decimal</span><span style="color:gray;">(</span>38<span style="color:gray;">,</span>21<span style="color:gray;">),</span><span> </span></span><span lang="PT" style="font-size:9.5pt;font-family:Consolas;color:teal;">@One</span><span lang="PT" style="font-size:9.5pt;font-family:Consolas;"> <span style="color:blue;">as</span> <span style="color:blue;">decimal</span><span style="color:gray;">(</span>38<span style="color:gray;">,</span>37<span style="color:gray;">)</span> <span style="color:gray;">=</span> 1<span style="color:gray;">,</span><span> </span></span><span lang="PT" style="font-size:9.5pt;font-family:Consolas;color:teal;">@Two</span><span lang="PT" style="font-size:9.5pt;font-family:Consolas;"> <span style="color:blue;">as</span> <span style="color:blue;">decimal</span><span style="color:gray;">(</span>38<span style="color:gray;">,</span>37<span style="color:gray;">)</span> <span style="color:gray;">=</span> 2<span style="color:gray;">,</span><span> </span></span><span lang="PT" style="font-size:9.5pt;font-family:Consolas;color:teal;">@EulersConstant</span><span lang="PT" style="font-size:9.5pt;font-family:Consolas;"> <span style="color:blue;">decimal</span><span style="color:gray;">(</span>38<span style="color:gray;">,</span>37<span style="color:gray;">)</span> <span style="color:gray;">=</span> 2.7182818284590452353602874713526624977<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:blue;">SELECT</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:teal;">@StDevTimesSquareRootOf2</span> <span style="color:gray;">= </span><span style="color:teal;">@StDev</span> <span style="color:gray;">*</span> <span class="GramE"><span style="color:fuchsia;">Power</span><span style="color:gray;">(</span></span><span style="color:teal;">@Two</span><span style="color:gray;">,</span> 0.5<span style="color:gray;">)</span><br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:blue;"><br />
SELECT</span><span style="font-size:9.5pt;font-family:Consolas;"> <span class="SpellE"><span style="color:teal;">ColumnValue</span></span><span style="color:gray;">,</span> <span style="color:blue;">CASE</span> <span style="color:blue;">WHEN</span> <span class="SpellE"><span style="color:teal;">ColumnValue </span></span><span style="color:gray;">>=</span> <span style="color:teal;">@Mean</span> <span style="color:blue;">THEN</span> <span class="SpellE"><span style="color:teal;">CDFValue </span></span><span style="color:blue;">ELSE</span> 1 <span style="color:gray;">-</span> <span class="SpellE"><span style="color:teal;">CDFValue</span></span> <span style="color:blue;">END</span> <span style="color:blue;">AS</span> <span class="SpellE"><span style="color:teal;">CDFValue<br />
</span></span></span><span style="font-size:9.5pt;font-family:Consolas;color:blue;">FROM </span><span style="font-size:9.5pt;font-family:Consolas;color:gray;">(</span><span style="font-size:9.5pt;font-family:Consolas;color:blue;">SELECT</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:teal;">Value</span> <span style="color:blue;">AS</span> <span class="SpellE"><span style="color:teal;">ColumnValue</span></span><span style="color:gray;">,</span> 0.5 <span style="color:gray;">+</span> <span style="color:gray;">(</span>0.5 <span style="color:gray;">*</span> <span class="GramE"><span style="color:fuchsia;">Power</span><span style="color:gray;">(</span></span><span style="color:teal;">@One</span> <span style="color:gray;">–</span> <span style="color:fuchsia;">Power</span><span style="color:gray;">(</span><span style="color:teal;">@<span class="SpellE">EulersConstant</span></span><span style="color:gray;">,</span> <span style="color:gray;">((-</span>0.147 <span style="color:gray;">*</span> <span style="color:fuchsia;">Power</span><span style="color:gray;">(</span><span class="SpellE"><span style="color:teal;">ErrorFunctionInput</span></span><span style="color:gray;">,</span> 4<span style="color:gray;">))</span> <span style="color:gray;">–</span> <span style="color:gray;">(</span>1.27324 <span style="color:gray;">*</span> <span style="color:fuchsia;">Power</span><span style="color:gray;">(</span><span class="SpellE"><span style="color:teal;">ErrorFunctionInput</span></span><span style="color:gray;">,</span> 2<span style="color:gray;">)))</span> <span style="color:gray;">/</span> <span style="color:gray;">(</span><span style="color:teal;">@One</span> <span style="color:gray;">+</span> <span style="color:gray;">(</span>0.147 <span style="color:gray;">*</span> <span style="color:fuchsia;">Power</span><span style="color:gray;">(</span><span class="SpellE"><span style="color:teal;">ErrorFunctionInput</span></span><span style="color:gray;">,</span> 
2<span style="color:gray;">)))),</span> 0.5<span style="color:gray;">))</span> <span style="color:blue;">AS</span> <span class="SpellE"><span style="color:teal;">CDFValue<br />
</span></span></span><span style="font-size:9.5pt;font-family:Consolas;color:blue;">FROM </span><span style="font-size:9.5pt;font-family:Consolas;color:gray;">(</span><span style="font-size:9.5pt;font-family:Consolas;"><span> </span><span style="color:blue;">SELECT</span> <span style="color:teal;">Value</span><span style="color:gray;">,</span> <span style="color:gray;">(</span><span style="color:teal;">Value</span> <span style="color:gray;">-</span> <span style="color:teal;">@Mean</span><span style="color:gray;">)</span> <span style="color:gray;">/</span> <span style="color:teal;">@StDevTimesSquareRootOf2 </span><span style="color:blue;">AS</span> <span class="SpellE"><span style="color:teal;">ErrorFunctionInput<br />
</span></span></span><span style="font-size:9.5pt;font-family:Consolas;"><span> </span><span class="GramE"><span style="color:blue;">FROM</span><span> </span><span style="color:teal;">@<span class="SpellE">InputTableParameter<br />
</span></span></span></span><span style="font-size:9.5pt;font-family:Consolas;"><span> </span><span style="color:blue;">WHERE</span> <span style="color:teal;">Value</span> <span style="color:gray;">IS</span> <span style="color:gray;">NOT</span> <span style="color:gray;">NULL)</span> <span style="color:blue;">AS</span> <span style="color:teal;">T1</span><span style="color:gray;">)</span> <span style="color:blue;">AS</span> <span style="color:teal;">T2</span></span><span style="font-size:10pt;"> </span></p>
<p><span style="font-size:10pt;color:white;">…………</span>As annoying as it might be to create these extra objects just to run the procedure in Figure 2, it saves us from having to calculate zillions of CDF values on large tables, when we only need the minimum and maximum values for each band. The χ² Test is applied to distributions rather than regression lines as the Hosmer-Lemeshow Test is, but they have at least one thing in common: the division of the dataset into probability bands, which are then graded on how closely the expected values match the actual observations. The criteria for membership in these bands are up to you, but in my implementation, I’m simply using the NTILE windowing function to break up a dataset into subsets of almost equal size, in order of the values of the column being tested. Several sources caution that the type of banding can have a strong effect on the final test statistic. As the <a href="http://www.itl.nist.gov/div898/handbook/index.htm">National Institute for Standards and Technology’s Engineering Statistics Handbook</a> (one of the best online resources for anyone learning statistics) puts it, “This test is sensitive to the choice of bins. There is no optimal choice for the bin width (since the optimal bin width depends on the distribution). Most reasonable choices should produce similar, but not identical, results… The chi-square goodness-of-fit test is applied to binned data (i.e., data put into classes). This is actually not a restriction since for non-binned data you can simply calculate a histogram or frequency table before generating the chi-square test. However, the value of the chi-square test statistic are dependent on how the data is binned.”[10]<a href="#_edn10" name="_ednref10"><br />
</a> They’re not kidding, as I observed first-hand. I set the @NumberOfBands parameter to a default of 10, but you’re probably going to want to run several trials and experiment with higher and lower values, especially when it’s calculated against large tables, because it can dramatically affect the test statistic. Many sources mention that the count of records in each bucket ought to be more than 5, so you don’t want to set the @NumberOfBands so high that the bucket size falls below this threshold. I found it helpful to look at the output of the @FrequencyTable to make sure there weren’t too many bands with identical bounds, which will happen if the @NumberOfBounds is too high. Use some common sense: if you’re operating on nominals that can only be assigned integer values between 0 and 5, then a bin count of 6 might be a wise starting point.</p>
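The NIST caveat about bin sensitivity is easy to reproduce outside the database. Below is a minimal Python sketch (the function names are mine, purely for illustration, not part of the procedure) that bands a normally distributed sample into NTILE-style equal-count subsets, derives each band’s expected count from the normal CDF of its minimum and maximum values, and sums the χ² terms; rerunning it with different band counts yields noticeably different statistics.

```python
import math
import random

def normal_cdf(x, mean, sd):
    # Normal CDF expressed through the error function
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

def chi_squared(data, bands):
    """NTILE-style equal-count banding, then the chi-squared statistic:
    expected counts come from the CDF difference between each band's
    minimum and maximum value, mirroring the @FrequencyTable logic."""
    values = sorted(data)
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    stat = 0.0
    for b in range(bands):
        lo, hi = b * n // bands, (b + 1) * n // bands  # equal-count slice
        band = values[lo:hi]
        observed = len(band)
        expected = (normal_cdf(band[-1], mean, sd)
                    - normal_cdf(band[0], mean, sd)) * n
        if expected > 0:
            stat += (observed - expected) ** 2 / expected
    return stat

random.seed(42)
sample = [random.gauss(0.0, 1.0) for _ in range(10000)]
for bands in (5, 10, 50):
    print(bands, round(chi_squared(sample, bands), 2))
```

Note that because each band’s expected count is taken between its observed minimum and maximum, the gaps between adjacent bands leak a little probability mass, which is one more reason the statistic shifts with the band count in this sketch.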
<p style="text-align:center;"><strong>An Explanation of the Sample Code</strong></p>
<p> Most of the rest of the code is self-explanatory to anyone who has slogged through one of my procedures before. As usual, the first four parameters allow you to run it against any numerical column in any database you have adequate access to, while the first couple of lines in the body help implement this. The rest is all dynamic SQL, beginning with the usual declaration sections and assignments of the aggregate values we’ll need for other calculations. After that I declare a couple of table variables and a table parameter to hold the final results as well as some intermediate steps. Most of the work occurs in the first INSERT, which divides the dataset into bands; a few statements later, the distinct minimum and maximum values inserted in this step are fed to the CDF procedure to derive probability values. Note that the procedure can be drastically simplified if the flexible variable ranges that the @DecimalPrecision parameter implements are not needed; in that case, simply return the results from the table referenced by @SchemaAndTableName into a table variable and perform all the math on it outside the dynamic SQL block.<br />
<span style="font-size:10pt;color:white;">…………</span>If you receive NULL values for your CDFs in the final results, it’s a clue that you probably need to try a @DecimalPrecision parameter (which I normally provide to help end users avoid arithmetic overflows) with a smaller scale; it signifies that the procedure can’t match values in the joins properly due to rounding somewhere. For a distribution other than the normal, simply plug in a different CDF and adjust the degrees of freedom to account for additional parameters, such as the shape parameter used in the Weibull. There might be a more efficient way to do the updates to the @FrequencyTable that follow, but the costs of these statements compared to the rest of the batch are inconsequential, plus the procedure is easier to follow this way. The two cumulative frequency counts are provided just as a convenience and can be safely eliminated if you don’t need them. After that, I return the full @FrequencyTable to the user (since the costs of calculating it have already been incurred) and compute the final test statistic in a single line in the last SELECT.<br />
As mentioned in previous articles, many of these older tests were not designed for datasets of the size found in modern relational databases and data warehouses, so there are no checks built in to keep the final test statistic from being grossly inflated by the accumulation of millions of values. For that reason, I’m using a variant known as “Reduced χ²” that simply divides the summed statistic by the number of bands to scale the results back down to a user-friendly, easily readable stat. Note that in previous articles, I misidentified Euler’s Number in my variable names as Euler’s Constant, for understandable reasons. Adding to the confusion is the fact that the former is sometimes also known as Napier’s Constant or the Exponential Constant, while the latter is also referred to as the Euler-Mascheroni Constant, which I originally thought to be distinct from Euler’s Constant. I used the correct constant and a high-precision value for it, but applied the wrong name in my variable declarations.</p>
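To make that scaling concrete, here is a tiny illustrative Python snippet (the function name is mine, not from the procedure) showing the same arithmetic the final SELECT performs: each band contributes (observed - expected)²/expected, with a guard for zero expected counts, and the sum is divided by the number of bands.

```python
def reduced_chi_squared(observed, expected):
    """Sum the chi-squared terms over the bands, then divide by the band
    count -- the same scaling the procedure's final SELECT applies."""
    terms = [0.0 if e == 0 else (o - e) ** 2 / e  # guard mirrors CASE WHEN e = 0
             for o, e in zip(observed, expected)]
    return sum(terms) / len(terms)

# Four bands: actual counts vs. expected counts from the CDF differences
print(reduced_chi_squared([12, 9, 11, 8], [10.0, 10.0, 10.0, 10.0]))  # 0.25
```

The division keeps a table of ten million rows from dwarfing the statistic of a table of ten thousand, at the cost of no longer matching the textbook χ² value.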
<p><strong><u>Figure 2: T-SQL for the </u></strong><strong><u>χ</u></strong><strong><u>² Goodness-of-Fit Test<br />
</u></strong><span style="font-size:9.5pt;font-family:Consolas;color:blue;">CREATE</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:blue;">PROCEDURE</span> <span style="color:teal;">[Calculations]</span><span class="GramE"><span style="color:gray;">.</span><span style="color:teal;">[</span></span><span class="SpellE"><span style="color:teal;">GoodnessOfFitChiSquaredTestSP</span></span><span style="color:teal;">]<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:teal;">@DatabaseName</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:blue;">as</span> <span style="color:blue;">nvarchar</span><span style="color:gray;">(</span>128<span style="color:gray;">)</span> <span style="color:gray;">=</span> <span style="color:gray;">NULL,</span> <span style="color:teal;">@<span class="SpellE">SchemaName</span></span> <span style="color:blue;">as</span> <span style="color:blue;">nvarchar</span><span style="color:gray;">(</span>128<span style="color:gray;">),</span> <span style="color:teal;">@<span class="SpellE">TableName</span></span> <span style="color:blue;">as</span> <span style="color:blue;">nvarchar</span><span style="color:gray;">(</span>128<span style="color:gray;">),</span><span style="color:teal;">@<span class="SpellE">ColumnName</span></span> <span style="color:blue;">AS</span> <span style="color:blue;">nvarchar</span><span style="color:gray;">(</span>128<span style="color:gray;">),</span> <span style="color:teal;">@DecimalPrecision</span> <span style="color:blue;">AS</span> <span style="color:blue;">nvarchar</span><span style="color:gray;">(</span>50<span style="color:gray;">),</span> <span style="color:teal;">@<span class="SpellE">NumberOfBands </span></span><span style="color:blue;">as</span> <span style="color:blue;">bigint</span> <span style="color:gray;">=</span> 10<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:blue;">AS</span></p>
<p class="MsoNormal"><span style="font-size:9.5pt;font-family:Consolas;color:blue;">DECLARE</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:teal;">@<span class="SpellE">SchemaAndTableName</span></span> <span class="GramE"><span style="color:blue;">nvarchar</span><span style="color:gray;">(</span></span>400<span style="color:gray;">),</span><span style="color:teal;">@<span class="SpellE">SQLString </span></span><span style="color:blue;">nvarchar</span><span style="color:gray;">(</span><span style="color:fuchsia;">max</span><span style="color:gray;">)<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:blue;">SET</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:teal;">@<span class="SpellE">SchemaAndTableName</span></span> <span style="color:gray;">=</span> <span style="color:teal;">@DatabaseName</span> <span style="color:gray;">+</span> <span style="color:red;">‘.’</span> <span style="color:gray;">+</span> <span style="color:teal;">@<span class="SpellE">SchemaName </span></span><span style="color:gray;">+</span> <span style="color:red;">‘.’</span> <span style="color:gray;">+</span> <span style="color:teal;">@<span class="SpellE">TableName</span></span> </span><span style="font-size:9.5pt;font-family:Consolas;"><br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:blue;">SET</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:teal;">@<span class="SpellE">SQLString</span></span> <span style="color:gray;">=</span> </span><span lang="PT" style="font-size:9.5pt;font-family:Consolas;color:red;">‘DECLARE @Mean decimal(‘</span><span lang="PT" style="font-size:9.5pt;font-family:Consolas;"> <span style="color:gray;">+</span> <span style="color:teal;">@DecimalPrecision </span><span style="color:gray;">+</span> <span style="color:red;">‘), </span></span><span lang="PT" style="font-size:9.5pt;font-family:Consolas;color:red;">@StDev decimal(‘</span><span lang="PT" style="font-size:9.5pt;font-family:Consolas;"> <span style="color:gray;">+</span> <span style="color:teal;">@DecimalPrecision </span><span style="color:gray;">+</span> <span style="color:red;">‘), </span></span><span lang="PT" style="font-size:9.5pt;font-family:Consolas;color:red;">@Count decimal(‘</span><span lang="PT" style="font-size:9.5pt;font-family:Consolas;"> <span style="color:gray;">+ </span><span style="color:teal;">@DecimalPrecision</span> <span style="color:gray;">+ </span><span style="color:red;">‘),<br />
</span></span><span lang="PT" style="font-size:9.5pt;font-family:Consolas;color:red;">@EulersNumber decimal(38,37) = 2.7182818284590452353602874713526624977 </span></p>
<p class="MsoNormal"><span lang="PT" style="font-size:9.5pt;font-family:Consolas;color:red;">SELECT @Count=Count(CAST(‘</span><span lang="PT" style="font-size:9.5pt;font-family:Consolas;"> <span style="color:gray;">+</span> <span style="color:teal;">@ColumnName </span><span style="color:gray;">+</span> <span style="color:red;">‘ AS Decimal(‘ </span><span style="color:gray;">+</span> <span style="color:teal;">@DecimalPrecision </span><span style="color:gray;">+</span> <span style="color:red;">‘))), @Mean = Avg(CAST(‘</span> <span style="color:gray;">+</span> <span style="color:teal;">@ColumnName </span><span style="color:gray;">+</span> <span style="color:red;">‘ AS Decimal(‘ </span><span style="color:gray;">+</span> <span style="color:teal;">@DecimalPrecision </span><span style="color:gray;">+</span> <span style="color:red;">‘))), @StDev = StDev(CAST(‘</span> <span style="color:gray;">+</span> <span style="color:teal;">@ColumnName </span><span style="color:gray;">+</span> <span style="color:red;">‘ AS Decimal(‘ </span><span style="color:gray;">+</span> <span style="color:teal;">@DecimalPrecision </span><span style="color:gray;">+</span> <span style="color:red;">‘)))<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:red;">FROM ‘</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:gray;">+</span> <span style="color:teal;">@<span class="SpellE">SchemaAndTableName</span></span> <span class="GramE"><span style="color:gray;">+</span> <span style="color:red;">‘ </span></span></span><span style="font-size:9.5pt;font-family:Consolas;color:red;">WHERE ‘</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:gray;">+</span> <span style="color:teal;">@<span class="SpellE">ColumnName</span></span> <span style="color:gray;">+</span> <span style="color:red;">‘ IS NOT NULL</span></span></p>
<p class="MsoNormal"><span style="font-size:9.5pt;font-family:Consolas;color:red;">DECLARE @<span class="SpellE">CDFTable</span> table<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;">(ID bigint IDENTITY (1<span class="GramE">,1</span>),<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;">Value <span class="GramE">decimal(</span>‘</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:gray;">+ </span><span style="color:teal;">@DecimalPrecision</span> <span style="color:gray;">+ </span><span style="color:red;">‘),<br />
</span></span><span class="SpellE"><span lang="ES" style="font-size:9.5pt;font-family:Consolas;color:red;">CDFValue</span></span><span lang="ES" style="font-size:9.5pt;font-family:Consolas;color:red;"> <span class="GramE">decimal(</span>‘</span><span lang="ES" style="font-size:9.5pt;font-family:Consolas;"> <span style="color:gray;">+</span> <span style="color:teal;">@DecimalPrecision</span> <span style="color:gray;">+</span> <span style="color:red;">‘))</span></span><span lang="ES" style="font-size:9.5pt;font-family:Consolas;"> </span></p>
<p class="MsoNormal"><span style="font-size:9.5pt;font-family:Consolas;color:red;">DECLARE @<span class="SpellE">FrequencyTable</span> table<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;">(ID bigint,<br />
</span><span lang="PT" style="font-size:9.5pt;font-family:Consolas;color:red;">MinValue decimal(‘</span><span lang="PT" style="font-size:9.5pt;font-family:Consolas;"> <span style="color:gray;">+</span> <span style="color:teal;">@DecimalPrecision </span><span style="color:gray;">+</span> <span style="color:red;">‘),<br />
</span></span><span class="SpellE"><span style="font-size:9.5pt;font-family:Consolas;color:red;">MaxValue</span></span><span style="font-size:9.5pt;font-family:Consolas;color:red;"> <span class="GramE">decimal(</span>‘</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:gray;">+</span> <span style="color:teal;">@DecimalPrecision</span> <span style="color:gray;">+</span> <span style="color:red;">‘),<br />
</span></span><span class="SpellE"><span style="font-size:9.5pt;font-family:Consolas;color:red;">LowerCDFValue</span></span><span style="font-size:9.5pt;font-family:Consolas;color:red;"> <span class="GramE">decimal(</span>‘</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:gray;">+</span> <span style="color:teal;">@DecimalPrecision</span> <span style="color:gray;">+</span> <span style="color:red;">‘),<br />
</span></span><span class="SpellE"><span style="font-size:9.5pt;font-family:Consolas;color:red;">UpperCDFValue</span></span><span style="font-size:9.5pt;font-family:Consolas;color:red;"> <span class="GramE">decimal(</span>‘</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:gray;">+</span> <span style="color:teal;">@DecimalPrecision</span> <span style="color:gray;">+</span> <span style="color:red;">‘),<br />
</span></span><span class="SpellE"><span style="font-size:9.5pt;font-family:Consolas;color:red;">ActualFrequencyCount </span></span><span style="font-size:9.5pt;font-family:Consolas;color:red;">bigint,<br />
</span><span class="SpellE"><span style="font-size:9.5pt;font-family:Consolas;color:red;">ExpectedFrequencyCount </span></span><span style="font-size:9.5pt;font-family:Consolas;color:red;"><span class="GramE">decimal(</span>‘</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:gray;">+ </span><span style="color:teal;">@DecimalPrecision</span> <span style="color:gray;">+ </span><span style="color:red;">‘),<br />
</span></span><span class="SpellE"><span style="font-size:9.5pt;font-family:Consolas;color:red;">CumulativeActualFrequencyCount </span></span><span style="font-size:9.5pt;font-family:Consolas;color:red;"><span class="GramE">decimal(</span>‘</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:gray;">+ </span><span style="color:teal;">@DecimalPrecision</span> <span style="color:gray;">+ </span><span style="color:red;">‘),<br />
</span></span><span class="SpellE"><span style="font-size:9.5pt;font-family:Consolas;color:red;">CumulativeExpectedFrequencyCount </span></span><span style="font-size:9.5pt;font-family:Consolas;color:red;"><span class="GramE">decimal(</span>‘</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:gray;">+ </span><span style="color:teal;">@DecimalPrecision</span> <span style="color:gray;">+ </span><span style="color:red;">‘)<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:red;">)</span></p>
<p class="MsoNormal"><span style="font-size:9.5pt;font-family:Consolas;color:red;">INSERT INTO @<span class="SpellE">FrequencyTable<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:red;">(ID, M<span class="SpellE">inValue</span>, <span class="SpellE">MaxValue</span>, <span class="SpellE">ActualFrequencyCount</span>)</span></p>
<p class="MsoNormal"><span style="font-size:9.5pt;font-family:Consolas;color:red;">SELECT DISTINCT <span class="SpellE">BandNumber</span>, <span class="GramE">Min(</span>CAST(‘</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:gray;">+</span> <span style="color:teal;">@<span class="SpellE">ColumnName </span></span><span style="color:gray;">+</span> <span style="color:red;">‘ AS decimal(‘ </span><span style="color:gray;">+</span> <span style="color:teal;">@DecimalPrecision </span><span style="color:gray;">+</span> <span style="color:red;">‘))) OVER (PARTITION BY <span class="SpellE">BandNumber</span> ORDER BY ‘</span> <span style="color:gray;">+</span> <span style="color:teal;">@<span class="SpellE">ColumnName </span></span><span style="color:gray;">+</span> <span style="color:red;">‘) AS <span class="SpellE">BandMin</span>,<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:red;"><span> </span><span class="GramE">Max(</span>CAST(‘</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:gray;">+</span> <span style="color:teal;">@<span class="SpellE">ColumnName </span></span><span style="color:gray;">+</span> <span style="color:red;">‘ AS decimal(‘ </span><span style="color:gray;">+</span> <span style="color:teal;">@DecimalPrecision </span><span style="color:gray;">+</span> <span style="color:red;">‘))) OVER (PARTITION BY <span class="SpellE">BandNumber</span> ORDER BY ‘</span> <span style="color:gray;">+</span> <span style="color:teal;">@<span class="SpellE">ColumnName </span></span><span style="color:gray;">+</span> <span style="color:red;">‘ DESC) AS <span class="SpellE">BandMax</span>, -- note the DESC to go in the opposite order<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:red;"><span> </span><span> </span><span class="GramE">Count(</span>*) OVER (PARTITION BY <span class="SpellE">BandNumber</span>) AS <span class="SpellE">BandFrequencyCount<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:red;">FROM</span><span style="font-size:9.5pt;font-family:Consolas;color:red;"><span> </span>(SELECT ‘</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:gray;">+ </span><span style="color:teal;">@<span class="SpellE">ColumnName</span></span> <span style="color:gray;">+</span> <span style="color:red;">‘, <span class="SpellE"><span class="GramE">NTile</span></span><span class="GramE">(</span>‘</span> <span style="color:gray;">+</span> <span style="color:fuchsia;">CAST</span><span style="color:gray;">(</span><span style="color:teal;">@<span class="SpellE">NumberOfBands </span></span><span style="color:blue;">as</span> <span style="color:blue;">nvarchar</span><span style="color:gray;">(</span>128<span style="color:gray;">))</span> <span style="color:gray;">+</span> <span style="color:red;">‘) OVER (ORDER BY ‘</span> <span style="color:gray;">+</span> <span style="color:teal;">@<span class="SpellE">ColumnName </span></span><span style="color:gray;">+</span> <span style="color:red;">‘) AS <span class="SpellE">BandNumber<br />
</span></span></span><span style="font-size:9.5pt;font-family:Consolas;color:red;"><span> </span>FROM ‘</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:gray;">+ </span><span style="color:teal;">@<span class="SpellE">SchemaAndTableName</span></span> <span class="GramE"><span style="color:gray;">+</span> <span style="color:red;">‘<br />
</span></span></span><span style="font-size:9.5pt;font-family:Consolas;color:red;"><span> </span>WHERE ‘</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:gray;">+ </span><span style="color:teal;">@<span class="SpellE">ColumnName</span></span> <span style="color:gray;">+</span> <span style="color:red;">‘ IS NOT NULL) AS T1</span></span></p>
<p class="MsoNormal"><span style="font-size:9.5pt;font-family:Consolas;color:red;">DECLARE @<span class="SpellE">InputTableParameter</span> AS <span class="SpellE">Calculations.SimpleFloatValueTableParameter</span></span></p>
<p class="MsoNormal"><span style="font-size:9.5pt;font-family:Consolas;color:red;">INSERT INTO @<span class="SpellE">InputTableParameter<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:red;">(Value)<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;">SELECT DISTINCT Value FROM (SELECT <span class="SpellE">MinValue</span> AS Value<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;"><span> </span>FROM @<span class="SpellE">FrequencyTable<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:red;"><span> </span>UNION<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;"><span> </span>SELECT <span class="SpellE"><span class="GramE">MaxValue</span></span><span class="GramE"><span> </span>AS</span> Value<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;"><span> </span>FROM @<span class="SpellE">FrequencyTable</span>)<br />
AS T1</span></p>
<p class="MsoNormal"><span style="font-size:9.5pt;font-family:Consolas;color:red;">INSERT INTO @<span class="SpellE">CDFTable<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:red;">(Value, <span class="SpellE">CDFValue</span>)<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;">EXEC <span class="SpellE">Calculations.NormalDistributionCDFSupplyTableParameterSP </span>@Mean, @StDev, @<span class="SpellE">InputTableParameter</span></span></p>
<p class="MsoNormal"><span style="font-size:9.5pt;font-family:Consolas;color:red;">UPDATE T1<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;">SET <span class="SpellE">LowerCDFValue</span> = T2.CDFValue<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;">FROM @<span class="SpellE">FrequencyTable</span> AS T1<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;"><span> </span>INNER JOIN @<span class="SpellE">CDFTable </span>AS T2<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;"><span> </span>ON T1.MinValue = T2.Value</span></p>
<p class="MsoNormal"><span style="font-size:9.5pt;font-family:Consolas;color:red;">UPDATE T1<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;">SET <span class="SpellE">UpperCDFValue</span> = T2.CDFValue<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;">FROM @<span class="SpellE">FrequencyTable</span> AS T1<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;"><span> </span>INNER JOIN @<span class="SpellE">CDFTable </span>AS T2<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;"><span> </span>ON<span> </span>T1.MaxValue = T2.Value</span></p>
<p class="MsoNormal"><span style="font-size:9.5pt;font-family:Consolas;color:red;">UPDATE @<span class="SpellE">FrequencyTable<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:red;">SET <span class="SpellE">ExpectedFrequencyCount</span> = (<span class="SpellE">UpperCDFValue </span>- <span class="SpellE">LowerCDFValue</span>) * @Count</span></p>
<p class="MsoNormal"><span style="font-size:9.5pt;font-family:Consolas;color:red;">-- <span class="GramE">the</span> Cumulatives are just for convenience and can be safely eliminated from the table if you <span class="SpellE">don’t</span> need them</span></p>
<p class="MsoNormal"><span style="font-size:9.5pt;font-family:Consolas;color:red;">UPDATE T1<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;">SET T1.CumulativeActualFrequencyCount = T2.CumulativeActualFrequencyCount,<br />
T1.CumulativeExpectedFrequencyCount = T2.CumulativeExpectedFrequencyCount<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;">FROM @<span class="SpellE">FrequencyTable</span> AS T1<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;">INNER JOIN </span><span style="font-size:9.5pt;font-family:Consolas;color:red;">(SELECT ID, <span class="GramE">Sum(</span><span class="SpellE">ActualFrequencyCount</span>) OVER (ORDER BY ID)<span> </span>AS <span class="SpellE">CumulativeActualFrequencyCount</span>, Sum(<span class="SpellE">ExpectedFrequencyCount</span>)<br />
OVER (ORDER BY ID)<span> </span>AS <span class="SpellE">CumulativeExpectedFrequencyCount<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:red;">FROM @<span class="SpellE">FrequencyTable</span>) AS T2<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;">ON T1.ID = T2.ID</span></p>
<p class="MsoNormal"><span style="font-size:9.5pt;font-family:Consolas;color:red;">-- return all of the results<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;">SELECT ID, <span class="SpellE">MinValue</span>, <span class="SpellE">MaxValue</span>, <span class="SpellE">LowerCDFValue</span>, <span class="SpellE">UpperCDFValue</span>, <span class="SpellE">ActualFrequencyCount</span>, <span class="SpellE">ExpectedFrequencyCount</span>, <span class="SpellE">CumulativeActualFrequencyCount</span>, <span class="SpellE">CumulativeExpectedFrequencyCount<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:red;">FROM @<span class="SpellE">FrequencyTable<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:red;">ORDER BY ID</span></p>
<p class="MsoNormal"><span style="font-size:9.5pt;font-family:Consolas;color:red;">-- this is an alternate version of the test called “reduced chi squared” in which the degrees of freedom are taken into account to scale the results back down<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;">SELECT <span class="GramE">Sum(</span><span class="SpellE">ExpectedFrequencyCountSum</span>) /<span> </span>Count(*) AS <span class="SpellE">ChiSquaredTestReduced,<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:red;">@Count AS <span class="SpellE">FullPopulationCount</span>, @Mean AS Mean, @StDev AS StDev<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;">FROM </span><span style="font-size:9.5pt;font-family:Consolas;color:red;">(SELECT CASE WHEN <span class="SpellE">ExpectedFrequencyCount</span> = 0 THEN 0 <span class="GramE">ELSE<span> </span>Power</span>(<span class="SpellE">ActualFrequencyCount</span> - <span class="SpellE">ExpectedFrequencyCount</span>, 2) / <span class="SpellE">ExpectedFrequencyCount</span> END AS <span class="SpellE">ExpectedFrequencyCountSum<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:red;">FROM @<span class="SpellE">FrequencyTable</span>) AS T1</span><span style="font-size:9.5pt;font-family:Consolas;color:red;">‘</span></p>
<p class="MsoNormal"><span style="font-size:9.5pt;font-family:Consolas;color:green;">--SELECT @<span class="SpellE">SQLString</span> -- <span class="GramE">uncomment this </span>to debug dynamic SQL errors<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:blue;">EXEC </span><span style="font-size:9.5pt;font-family:Consolas;color:gray;">(</span><span style="font-size:9.5pt;font-family:Consolas;color:teal;">@SQLString</span><span style="font-size:9.5pt;font-family:Consolas;color:gray;">)</span></p>
<p><span style="font-size:10pt;color:white;">…………</span>As has become standard fare over the past two tutorial series, I first tested the results against a tiny 9-kilobyte table of data on the Duchennes form of muscular dystrophy from the <a href="http://biostat.mc.vanderbilt.edu/wiki/Main/DataSets">Vanderbilt University’s Department of Biostatistics</a>. Then I stress-tested it against the 11 million rows in the Higgs Boson Dataset I downloaded from the <a href="https://archive.ics.uci.edu/ml/datasets/HIGGS">University of California at Irvine’s Machine Learning Repository</a> and converted into a nearly 6-gigabyte SQL Server table. The query in Figure 3 on the Hemopexin protein produced the first resultset below it, while the much longer resultset was the product of a similar query on the first float column in the HiggsBosonTable. An ill-chosen band count can apparently also affect performance; on my first trial against the HiggsBosonTable, I left the number at just 7, which may be why it took 7:26. Subsequent trials with values around 100 took between 5:46 and 5:52; the results depicted here are only for the first 22 of 110 bands.<br />
<span style="font-size:10pt;color:white;">…………</span>I’m not surprised that the final test statistic has six digits to the left of the decimal point, given that I know from previous outlier detection and goodness-of-fit tests that Column 1 is highly abnormal. Column 2 follows an obvious bell curve when displayed in a histogram, so it is likewise not surprising that its χ² Test result was only 1,326, less than a hundredth of the statistic for Column 1. I have the feeling that the sheer size of the dataset can distort the final test statistic, thereby making it difficult to compare results across datasets, but probably not as severely as in other measures, particularly the Jarque-Bera and K² Tests. The query on the second float column likewise took 5:45 on my beat-up development machine, which more closely resembles the <a href="https://www.youtube.com/watch?v=L8hX2Ex58os">Bluesmobile</a> than a real server, so your mileage will probably be a lot better. It’s not as quick as the procedure I wrote in <a href="https://multidimensionalmayhem.wordpress.com/2016/01/13/goodness-of-fit-testing-with-sql-server-part-4-1-r2-rmse-and-regression-related-routines/">Goodness-of-Fit Testing with SQL Server Part 4.1: R<sup>2</sup>, RMSE and Regression-Related Routines</a>, but certainly faster than many others I’ve written in past articles.</p>
<p><strong><u>Figure 3: Sample Results from the Duchennes and Higgs Boson Datasets<br />
</u></strong><span style="font-size:9.5pt;font-family:Consolas;color:blue;">EXEC</span><span style="font-size:9.5pt;font-family:Consolas;"><span> </span><span style="color:teal;">@return_value </span><span style="color:gray;">=</span> <span style="color:teal;">[Calculations]</span><span class="GramE"><span style="color:gray;">.</span><span style="color:teal;">[</span></span><span class="SpellE"><span style="color:teal;">GoodnessOfFitChiSquaredTestSP</span></span><span style="color:teal;">]<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:blue;"><span> </span></span><span style="font-size:9.5pt;font-family:Consolas;color:teal;">@DatabaseName</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:gray;">=</span> <span class="SpellE"><span style="color:red;">N’DataMiningProjects</span></span><span style="color:red;">‘</span><span style="color:gray;">,<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;"><span> </span><span style="color:teal;">@<span class="SpellE">SchemaName</span></span> <span style="color:gray;">=</span> <span class="SpellE"><span style="color:red;">N’Health</span></span><span style="color:red;">‘</span><span style="color:gray;">,<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;"><span> </span><span style="color:teal;">@<span class="SpellE">TableName</span></span> <span style="color:gray;">=</span> <span class="SpellE"><span style="color:red;">N’DuchennesTable</span></span><span style="color:red;">‘</span><span style="color:gray;">,<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;"><span> </span><span style="color:teal;">@<span class="SpellE">ColumnName</span></span> <span style="color:gray;">=</span> <span class="SpellE"><span style="color:red;">N’Hemopexin</span></span><span style="color:red;">‘</span><span style="color:gray;">,<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;"><span> </span><span style="color:teal;">@DecimalPrecision </span><span style="color:gray;">=</span> <span style="color:red;">N’38<span class="GramE">,17′</span></span><span style="color:gray;">,<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;"><span> </span><span style="color:teal;">@<span class="SpellE">NumberOfBands</span></span> <span style="color:gray;">=</span> 7</span><span style="font-size:10pt;"> </span></p>
<p><a href="https://multidimensionalmayhem.wordpress.com/2016/02/12/goodness-of-fit-testing-with-sql-server-part-5-the-chi-squared-test/chisquaredresults-1/" rel="attachment wp-att-558"><img class="alignnone size-full wp-image-558" src="https://multidimensionalmayhem.files.wordpress.com/2016/02/chisquaredresults-1.jpg?w=604&h=164" alt="ChiSquaredResults 1" width="604" height="164" /></a> <a href="https://multidimensionalmayhem.wordpress.com/2016/02/12/goodness-of-fit-testing-with-sql-server-part-5-the-chi-squared-test/chisquaredresults-2/" rel="attachment wp-att-559"><img class="alignnone size-full wp-image-559" src="https://multidimensionalmayhem.files.wordpress.com/2016/02/chisquaredresults-2.jpg?w=604&h=274" alt="ChiSquaredResults 2" width="604" height="274" /></a></p>
<p><span style="font-size:10pt;color:white;">…………</span>The full execution plan is too large to depict here, but suffice it to say that it consists of 11 separate queries – with one of them, the insert into the @FrequencyTable, accounting for 99 percent of the computational cost of the entire batch. I’m not sure at this point how to go about optimizing that particular query, given that it starts with an Index Seek, which is normally what we want to see; there are also a couple of Nested Loops operators and a Hash Match within that query, but together they only account for about 12 percent of its internal costs. Almost all of the performance hit comes on two Sort operators, which a better-trained T-SQL aficionado might be able to dispose of with a few optimizations.<br />
Efficiency is something I’ll sorely need for next week’s article, in which I tackle the Shapiro-Wilk Test. Many sources I’ve stumbled upon while researching this series indicate that it has better statistical power than most of the competing goodness-of-fit tests, but it has many limitations which severely crimp its usability, at least for our purposes. First, it can apparently be calculated on only about 50 values, although I’ve seen figures as high as a couple of hundred. Either way, that’s a few hundred million rows short; the sheer sizes of datasets available to DBAs and data miners today are one of their strengths, and we shouldn’t have to sacrifice that hard-won advantage by taking Lilliputian slices of them. Worst of all, the calculations are dogged by a form of combinatorial explosion, which can be the kiss of death for Big Analysis. I have learned to fear the dreaded factorial symbol n! and the more insidious menace posed by calculations upon infinitesimal reciprocals, of the kind that afflicted the Hosmer-Lemeshow Test in last week’s article. My implementation of the Shapiro-Wilk Test will sink or swim depending on whether or not I can find a reasonable workaround for the covariance matrices, which are calculated based on a cross product of rows. In a table of a million rows, that means 1 trillion calculations just to derive an intermediary statistic. A workaround might be worthwhile, however, given the greater accuracy most sources ascribe to the Shapiro-Wilk Test.</p>
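Since the chi-squared statistic itself is just a sum over the frequency bands, its arithmetic is easy to sanity-check outside of T-SQL. The following Python sketch (with made-up band counts for illustration, not figures from the DuchennesTable) shows the (Observed - Expected)² / Expected summation that the procedure performs internally when comparing the @FrequencyTable bands against their expected counts:

```python
# Illustrative sketch of the chi-squared goodness-of-fit arithmetic:
# sum (O - E)^2 / E over the frequency bands. The band counts below are
# invented for demonstration, not taken from the Duchennes dataset.

def chi_squared_statistic(observed, expected):
    """Return the chi-squared goodness-of-fit statistic for paired band counts."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected band counts must align")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Seven bands, mirroring @NumberOfBands = 7 in the EXEC call above.
observed = [12, 45, 80, 110, 78, 40, 11]
expected = [10.0, 42.5, 85.1, 108.2, 85.1, 42.5, 10.0]

statistic = chi_squared_statistic(observed, expected)
# Degrees of freedom = number of bands - 1 (less any estimated parameters);
# the statistic is then compared against the chi-squared distribution.
```

The statistic grows as the observed band counts drift away from the expected ones, which is why a small value signals a good fit.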
<p>[1] See National Institute for Standards and Technology, 2014, “1.3.5.15 Chi-Square Goodness-of-Fit Test,” published in the online edition of the Engineering Statistics Handbook. Available at <a href="http://www.itl.nist.gov/div898/handbook/eda/section3/eda35f.htm">http://www.itl.nist.gov/div898/handbook/eda/section3/eda35f.htm</a></p>
<p>[2] <em>Ibid.</em></p>
<p>[3] D’Agostino, Ralph B.; Belanger, Albert and D’Agostino Jr., Ralph B., 1990, “A Suggestion for Using Powerful and Informative Tests of Normality,” pp. 316–321 in <u>The American Statistician</u>, Vol. 44, No. 4 (see p. 316). Available online at <a href="http://www.ohio.edu/plantbio/staff/mccarthy/quantmet/D'Agostino.pdf">http://www.ohio.edu/plantbio/staff/mccarthy/quantmet/D’Agostino.pdf</a></p>
<p>[4] See the reply by the user named Glen_b in the <u>CrossValidated</u> thread “How to Get the Expected Counts When Computing a Chi-Squared Test?” dated March 14, 2013, which is available at the web address</p>
<p><a href="http://stats.stackexchange.com/questions/52209/how-to-get-the-expected-counts-when-computing-a-chi-squared-test">http://stats.stackexchange.com/questions/52209/how-to-get-the-expected-counts-when-computing-a-chi-squared-test</a></p>
<p>[5] See the StatTrek webpage titled “When to Use the Chi-Square Goodness of Fit Test” at <a href="http://stattrek.com/chi-square-test/goodness-of-fit.aspx">http://stattrek.com/chi-square-test/goodness-of-fit.aspx</a></p>
<p>[6] I may be a novice, but I am apparently not alone in my reluctance to use tests that enforce either-or choices. See the reply by the user named Glen_b in the <u>CrossValidated</u> thread “How to Get the Expected Counts When Computing a Chi-Squared Test?” dated March 14, 2013, which is available at the web address <a href="http://stats.stackexchange.com/questions/52209/how-to-get-the-expected-counts-when-computing-a-chi-squared-test">http://stats.stackexchange.com/questions/52209/how-to-get-the-expected-counts-when-computing-a-chi-squared-test</a> as well as the reply by the same user to the thread “What Tests Do I Use to Confirm That Residuals are Normally Distributed?” posted Sept. 13, 2013 at the CrossValidated forum web address <a href="http://stats.stackexchange.com/questions/36212/what-tests-do-i-use-to-confirm-that-residuals-are-normally-distributed/36220#36220">http://stats.stackexchange.com/questions/36212/what-tests-do-i-use-to-confirm-that-residuals-are-normally-distributed/36220#36220</a>. He makes several very good points about goodness-of-fit testing that are worth quoting here. In the first, he says that</p>
<blockquote><p>“No test will prove your data is normally distributed. In fact I bet that it isn’t. (Why would any distribution be exactly normal? Can you name anything that actually is?)<br />
2) When considering the distributional form, usually, hypothesis tests answer the wrong question<br />
What’s a good reason to use a hypothesis test for checking normality?<br />
I can think of a few cases where it makes some sense to formally test a distribution. One common use is in testing some random number generating algorithm for generating a uniform or a normal.</p></blockquote>
<p>In the second thread, he similarly points out that:</p>
<blockquote><p>“1.No test will tell you your residuals are normally distributed. In fact, you can reliably bet that they are not.”<br />
“2.Hypothesis tests are not generally a good idea as checks on your assumptions. The effect of non-normality on your inference is not generally a function of sample size*, but the result of a significance test is. A small deviation from normality will be obvious at a large sample size even though the answer to the question of actual interest (‘to what extent did this impact my inference?’) may be ‘hardly at all’. Correspondingly, a large deviation from normality at a small sample size may not approach significance…”<br />
“…If you must use a test, Shapiro-Wilk is probably as good as anything else. (But it’s answering a question you already know the answer to – and every time you fail to reject, giving an answer you can be sure is wrong.)”</p></blockquote>
<p>[7] Just a side note on terminology: I see both the tests and the distribution referred to as “Chi-Squared” with a final D as often as I do “Chi-Square” without one, which are sometimes mixed together in the same sources. I’ll stick with a closing D for the sake of consistency, even if it turns out to be semantically incorrect.</p>
<p>[8] For a readable explanation of the independence test, see Hopkins, Will G., 2001, “Contingency Table (Chi-Squared Test),” published at the <u>A New View of Statistics</u> website address <a href="http://www.sportsci.org/resource/stats/continge.html">http://www.sportsci.org/resource/stats/continge.html</a></p>
<p>[9] For a quick introduction to this sordid tale, see the Wikipedia page “Karl Pearson” at <a href="http://en.wikipedia.org/wiki/Karl_Pearson">http://en.wikipedia.org/wiki/Karl_Pearson</a></p>
<p>[10] See National Institute for Standards and Technology, 2014, “1.3.5.15 Chi-Square Goodness-of-Fit Test,” published in the online edition of the Engineering Statistics Handbook. Available at <a href="http://www.itl.nist.gov/div898/handbook/eda/section3/eda35f.htm">http://www.itl.nist.gov/div898/handbook/eda/section3/eda35f.htm</a> The formula for the goodness-of-fit test is widely available, but I depended mostly on this NIST webpage when writing my code because their equation was more legible.</p>Goodness-of-Fit Testing with SQL Server Part 4.2: The Hosmer–Lemeshow Test with Logistic Regression
http://www.sqlservercentral.com/blogs/multidimensionalmayhem/2016/01/26/goodness-of-fit-testing-with-sql-server-part-42-the-hosmerlemeshow-test-with-logistic-regression/
Wed, 27 Jan 2016 05:03:13 UT/blogs/multidimensionalmayhem/2016/01/26/goodness-of-fit-testing-with-sql-server-part-42-the-hosmerlemeshow-test-with-logistic-regression/0http://www.sqlservercentral.com/blogs/multidimensionalmayhem/2016/01/26/goodness-of-fit-testing-with-sql-server-part-42-the-hosmerlemeshow-test-with-logistic-regression/#comments<p><strong>By Steve Bolton</strong></p>
<p><span style="font-size:10pt;color:white;">…………</span>The last installment of this amateur series of self-tutorials was the beginning of a short detour into using SQL Server to perform goodness-of-fit testing on regression lines, rather than on probability distributions. These are actually quite simple concepts; any college freshman ought to be able to grasp the idea of a single regression line, since all they do is graph the values of one variable as it changes in tandem with another, whereas distributions merely summarize the distinct counts for each value of a variable, as the famous bell curve does. In the first case we’re dealing with how well a regression line models the relationship between two variables – assuming one exists – and in the second, we’re assessing whether the distribution of a single variable matches one of many distributions that constantly recur in nature, like the Gaussian “normal” distribution, or other common ones like the Poisson, binomial, uniform and exponential distributions.<br />
<span style="font-size:10pt;color:white;">…………</span>These topics can of course become quite complex rather quickly once we factor in modifications like multivariate cases, but I’ll avoid these topics for the sake of simplicity, plus the fact that there are an endless number of variants of them that could keep us busy to the end of time. I really can’t end this segment though without addressing a particular variant, since logistic regression is a widely used algorithm that addresses a distinctive set of use cases. There is likewise an endless array of alternative takes on the basic logistic regression algorithm, such as adaptations for multinomial cases, but I’ll stick to the old maxim, KISS: Keep It Simple, Stupid. The Hosmer-Lemeshow Test that is commonly used for fitness testing on logistic regression may not be applicable in some of these more advanced cases, but it is indispensable for the bare bones version of the algorithm.</p>
<p style="text-align:center;"><strong>Adapting Regression Lines to Use the Logistic Function</strong></p>
<p> This regression algorithm is a topic I became somewhat familiar with while writing <a href="https://multidimensionalmayhem.wordpress.com/2013/01/23/a-rickety-stairway-to-sql-server-data-mining-algorithm-4-logistic-regression/">A Rickety Stairway to SQL Server Data Mining, Algorithm 4: Logistic Regression</a> a while back, in a different tutorial series. Suffice it to say that the algorithm is ideal for situations in which you need to place bounds on the outputs that a regression line can generate; most commonly this is a Boolean yes-no choice that ranges between 0 and 1, but it can be adapted to other scales, such as 0 to 100. A linear regression line would produce nonsensical values in these instances, since the values would be free to go off the charts, but a logistic regression is guaranteed to stay within its assigned bounds. This is accomplished by using the logistic function, which is really not much more difficult to implement than linear regression (with one critical limitation). The formula is widely available on the Internet, so when writing the code for this article I retrieved it from the most convenient source as usual: Wikipedia.[1]<br />
<span style="font-size:10pt;color:white;">…………</span>In many ways, the logistic function behaves like the cumulative distribution functions (CDFs) I mentioned in <a href="https://multidimensionalmayhem.wordpress.com/2015/11/03/goodness-of-fit-testing-with-sql-server-part-2-1-implementing-probability-plots-in-reporting-services/">Goodness-of-Fit Testing with SQL Server, Part 2: Implementing Probability Plots in Reporting Services</a>, in that the probabilities assigned to the lowest value begin at 0 and accumulate up to 1 by the time we reach the final value in an ordered dataset. It also behaves in many ways like a weighted function, in that the pressure on it to conform to the bounds increases as it nears them; I think of it in terms of the way quarks are inextricably bound together by the strong force within hadrons, which increases as they come closer to breaking free. In between the upper and lower bounds the regression takes on the appearance of an S-curve rather than the lines seen in normal regression.<br />
<span style="font-size:10pt;color:white;">…………</span>Another easily readable and insightful commentary on logistic regression can be found at University of Toronto Prof. Saed Sayad’s website[2], in which he provides a succinct explanation of the logistic function equation and some alternative measures for the accuracy of the mining models it generates. Three of these are subspecies of the R<sup>2</sup> measure we discussed last week in connection with linear regression, which are discussed together under the rubric of Pseudo R<sup>2</sup>. Fitness testing of this kind is as necessary on regression lines as it is for distribution matching, because they rarely model the relationships between variables perfectly; as statistician George Box (one of the pioneers of Time Series) once put it so colorfully, “All models are wrong, but some are useful.”[3] Sayad also mentions alternative methods like Likelihood Ratio tests and the Wald Test. The measure of fit I’ve seen mentioned most often in connection with logistic regression goes by the memorable moniker of the Hosmer-Lemeshow Test. It apparently has its limitations, as we shall see, but it is not terribly difficult to implement – with some important caveats.</p>
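For readers unfamiliar with the function itself, a minimal Python sketch (used here only because it is more compact than the dynamic T-SQL that follows) shows how the logistic function bounds a linear predictor of the form slope * x + intercept to the open interval (0, 1), producing the S-curve described above. The parameter names are my own choices for illustration, not part of any standard definition:

```python
import math

def logistic(x, slope=1.0, intercept=0.0):
    """Squash the linear predictor slope*x + intercept into the interval (0, 1).

    This is the standard logistic function 1 / (1 + e^-t); the midpoint of the
    S-curve falls where the predictor equals zero."""
    return 1.0 / (1.0 + math.exp(-(slope * x + intercept)))
```

No matter how extreme the input, the output never escapes its bounds, which is exactly why a logistic regression cannot "go off the charts" the way a linear regression line can.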
<p style="text-align:center;"><strong>Banding Issues with Coding the Hosmer-Lemeshow Test</strong></p>
<p> In fact, the first three steps in the dynamic SQL in Figure 1 are almost identical to those used to calculate regression lines in last week’s article and a tutorial from a different series, <a href="https://multidimensionalmayhem.wordpress.com/2015/08/24/outlier-detection-with-sql-server-part-7-cooks-distance/">Outlier Detection with SQL Server Part 7: Cook’s Distance</a>, which I won’t waste time recapping here.[4] The fourth step just applies the logistic function to the results, in a single, simple line of T-SQL. After this, I simply insert the logistic regression values into a table variable for later retrieval, including returning it to the user towards the end of the procedure; there’s really no reason not to return the correlation, covariance, slope, intercept and standard deviations for both variables as well, given that we’ve already calculated them. Step 5 is where the meat and potatoes of the Hosmer-Lemeshow Test can be found. Its strategy is essentially to divide the values into bands, which are often referred to as “deciles of risk” when the test is employed in one of its most common applications, risk analysis.[5] The bands are then compared to the values we’d expect for them based on probabilistic calculations and the gap between them is quantified and summarized.<br />
<span style="font-size:10pt;color:white;">…………</span>It is now time for my usual disclaimer: I am writing this series in order to familiarize myself with these data mining tools, not because I have expertise in using them, so it would be wise to check my code thoroughly before putting it to use (or even, God forbid, in a production environment). Normally I check my code against the examples provided in various sources, especially the original academic papers whenever possible. In this case, however, I couldn’t find any at a juncture where they would have come in handy, given that I am not quite certain that I am splitting the data into bands on the right axis. I am still learning how to decipher the equations that underpin algorithms of this kind and some of them differ significantly in notation and nomenclature, so it may be that I ought to be counting the observed and expected values differently, which affects how they are split; from the wording of <em>Applied Logistic Regression</em>, the book written by the test’s inventors, David W. Hosmer Jr. and Stanley Lemeshow, it seems that the counts between the bands are supposed to vary much more significantly[6] than they do in my version, which could be a side effect of incorrect banding. There are apparently many different ways of doing banding[7], however, including my method below, in which the @BandCount parameter is plugged into the NTILE windowing function in Step 1.<br />
<span style="font-size:10pt;color:white;">…………</span>I’ve seen two general rules of thumb mentioned in the literature for setting the @BandCount to an optimal level: using groups of fewer than five members leads to incorrect results, while using fewer than six groups “almost always” leads to passing the fitness test.[8] Averages for both the X and Y axes are calculated for each band, then the regression line is derived through the usual methods in Steps 3 through 5, with one crucial difference: some of the aggregates have to be derived from the bands rather than the original table, otherwise we’d end up with an apples vs. oranges comparison. This is one of several points where the procedure can go wrong, since the banding obscures the detail of the original table and can lead to a substantially different regression line. Keep in mind though that the literature mentions several alternative methods of banding, so there may be better ways of accomplishing this.</p>
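As a rough illustration of what the NTILE banding in Step 1 and the per-band averages in Step 2 accomplish, here is a hypothetical Python sketch of the same idea. The function names ntile and band_averages are my own inventions for this example, and T-SQL's actual NTILE tie-handling may differ in edge cases; like T-SQL's NTILE, this version gives the earlier bands the extra rows when the split is uneven:

```python
def ntile(values, band_count):
    """Split sorted values into band_count groups, mimicking T-SQL's NTILE:
    when the row count doesn't divide evenly, earlier bands get one extra row."""
    ordered = sorted(values)
    base, extra = divmod(len(ordered), band_count)
    bands, start = [], 0
    for band in range(1, band_count + 1):
        size = base + (1 if band <= extra else 0)
        bands.append(ordered[start:start + size])
        start += size
    return bands

def band_averages(values, band_count):
    """Per-band means, analogous to the AverageX/AverageY columns in Step 2."""
    return [sum(band) / len(band) for band in ntile(values, band_count)]
```

As the text notes, the regression in the procedure is then fitted to these band averages rather than to the raw rows, which is why the banded statistics can diverge from those of the original table.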
<p><strong><u>Figure 1: T-SQL Code for the Hosmer–Lemeshow Test Procedure<br />
</u></strong><span style="font-size:9.5pt;font-family:Consolas;color:blue;">CREATE</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:blue;">PROCEDURE</span> <span style="color:teal;">[Calculations]</span><span style="color:gray;">.</span><span style="color:teal;">[GoodnessOfFitLogisticRegressionHosmerLemsehowTestSP]<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:teal;">@DatabaseName</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:blue;">as</span> <span style="color:blue;">nvarchar</span><span style="color:gray;">(</span>128<span style="color:gray;">)</span> <span style="color:gray;">=</span> <span style="color:gray;">NULL,</span> <span style="color:teal;">@<span class="SpellE">SchemaName</span></span> <span style="color:blue;">as</span> <span style="color:blue;">nvarchar</span><span style="color:gray;">(</span>128<span style="color:gray;">),</span> <span style="color:teal;">@<span class="SpellE">TableName</span></span> <span style="color:blue;">as</span> <span style="color:blue;">nvarchar</span><span style="color:gray;">(</span>128<span style="color:gray;">),</span><span style="color:teal;">@ColumnName1</span> <span style="color:blue;">AS</span> <span style="color:blue;">nvarchar</span><span style="color:gray;">(</span>128<span style="color:gray;">),</span><span style="color:teal;">@ColumnName2</span> <span style="color:blue;">AS</span> <span style="color:blue;">nvarchar</span><span style="color:gray;">(</span>128<span style="color:gray;">),</span> <span style="color:teal;">@<span class="SpellE">BandCount</span></span> <span style="color:blue;">As</span><span> </span><span style="color:blue;">bigint<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:blue;">AS</span></p>
<p class="MsoNormal"><span style="font-size:9.5pt;font-family:Consolas;color:blue;">DECLARE</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:teal;">@<span class="SpellE">SchemaAndTableName</span></span> <span class="GramE"><span style="color:blue;">nvarchar</span><span style="color:gray;">(</span></span>400<span style="color:gray;">),</span><span style="color:teal;">@<span class="SpellE">SQLString </span></span><span style="color:blue;">nvarchar</span><span style="color:gray;">(</span><span style="color:fuchsia;">max</span><span style="color:gray;">)<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:blue;">SET</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:teal;">@<span class="SpellE">SchemaAndTableName</span></span> <span style="color:gray;">=</span> <span style="color:teal;">@DatabaseName</span> <span style="color:gray;">+</span> <span style="color:red;">'.'</span> <span style="color:gray;">+</span> <span style="color:teal;">@<span class="SpellE">SchemaName </span></span><span style="color:gray;">+</span> <span style="color:red;">'.'</span> <span style="color:gray;">+</span> <span style="color:teal;">@<span class="SpellE">TableName</span></span><br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:blue;">SET</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:teal;">@<span class="SpellE">SQLString</span></span> <span style="color:gray;">=</span> </span><span style="font-size:9.5pt;font-family:Consolas;color:red;">'DECLARE </span><span style="font-size:9.5pt;font-family:Consolas;color:red;">@<span class="SpellE">MeanX</span> float,@<span class="SpellE">MeanY</span> float, </span><span style="font-size:9.5pt;font-family:Consolas;color:red;">@<span class="SpellE">StDevX</span> <span class="GramE">float</span>, @<span class="SpellE">StDevY </span>float,<br />
</span><span class="GramE"><span style="font-size:9.5pt;font-family:Consolas;color:red;">@Count<span> </span>float, </span></span><span style="font-size:9.5pt;font-family:Consolas;color:red;">@Correlation<span> </span>float, </span><span style="font-size:9.5pt;font-family:Consolas;color:red;">@Covariance float, </span><span style="font-size:9.5pt;font-family:Consolas;color:red;">@Slope float, </span><span style="font-size:9.5pt;font-family:Consolas;color:red;">@Intercept float,<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;">@<span class="SpellE">ValueRange</span> float, </span><span style="font-size:9.5pt;font-family:Consolas;color:red;">@<span class="SpellE">HosmerLemeshowTest</span> float</span><span style="font-size:9.5pt;font-family:Consolas;"> </span></p>
<p class="MsoNormal"><span style="font-size:9.5pt;font-family:Consolas;color:red;">-- STEP #1 -- GET THE RANGE OF VALUES FOR THE Y COLUMN<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;">SELECT @<span class="SpellE">ValueRange</span> = CASE WHEN <span class="SpellE">RecordCount </span>% 2 = 0 THEN <span class="SpellE">ValueRange</span> + 1 ELSE <span class="SpellE">ValueRange </span>END<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;">FROM (SELECT <span class="GramE">Max(</span>'</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:gray;">+ </span><span style="color:teal;">@ColumnName2</span> <span style="color:gray;">+</span> <span style="color:red;">') - Min('</span> <span style="color:gray;">+</span> <span style="color:teal;">@ColumnName2</span> <span style="color:gray;">+</span> <span style="color:red;">') AS <span class="SpellE">ValueRange</span>, Count(*) AS <span class="SpellE">RecordCount<br />
</span></span></span><span style="font-size:9.5pt;font-family:Consolas;color:red;">FROM '</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:gray;">+</span> <span style="color:teal;">@<span class="SpellE">SchemaAndTableName</span></span> <span class="GramE"><span style="color:gray;">+</span> <span style="color:red;">'<br />
</span></span></span><span style="font-size:9.5pt;font-family:Consolas;color:red;">WHERE '</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:gray;">+</span> <span style="color:teal;">@ColumnName2 </span><span style="color:gray;">+</span> <span style="color:red;">' IS NOT NULL) AS T1</span></span></p>
<p class="MsoNormal"><span style="font-size:9.5pt;font-family:Consolas;color:red;">-- STEP #2 -- CREATE THE BANDS<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;">DECLARE @<span class="SpellE">LogisticRegressionTable</span> table<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;">(ID bigint, -- ID is the decile identifier<br />
</span><span class="SpellE"><span style="font-size:9.5pt;font-family:Consolas;color:red;">CountForGroup</span></span><span style="font-size:9.5pt;font-family:Consolas;color:red;"> bigint,<br />
</span><span class="SpellE"><span style="font-size:9.5pt;font-family:Consolas;color:red;">AverageX</span></span><span style="font-size:9.5pt;font-family:Consolas;color:red;"> float,<br />
</span><span class="SpellE"><span style="font-size:9.5pt;font-family:Consolas;color:red;">AverageY</span></span><span style="font-size:9.5pt;font-family:Consolas;color:red;"> float,<br />
</span><span class="SpellE"><span style="font-size:9.5pt;font-family:Consolas;color:red;">RescaledY</span></span><span style="font-size:9.5pt;font-family:Consolas;color:red;"> float,<br />
</span><span class="SpellE"><span style="font-size:9.5pt;font-family:Consolas;color:red;">LogisticResult</span></span><span style="font-size:9.5pt;font-family:Consolas;color:red;"> float<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;">)</span><span style="font-size:9.5pt;font-family:Consolas;"> </span></p>
<p class="MsoNormal"><span style="font-size:9.5pt;font-family:Consolas;color:red;">INSERT INTO @<span class="SpellE">LogisticRegressionTable<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:red;">(ID, <span class="SpellE">CountForGroup</span>, <span class="SpellE">AverageX</span>, <span class="SpellE">AverageY</span>)</span></p>
<p class="MsoNormal"><span style="font-size:9.5pt;font-family:Consolas;color:red;">SELECT DISTINCT <span class="SpellE">DecileNumber</span>, <span class="GramE">COUNT(</span>*) OVER (PARTITION BY <span class="SpellE">DecileNumber</span>) AS <span class="SpellE">CountForGroup</span>, <span class="SpellE">Avg</span>(CAST('</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:gray;">+</span> <span style="color:teal;">@ColumnName1</span> <span style="color:gray;">+</span> <span style="color:red;">' AS float)) OVER<br />
(PARTITION BY <span class="SpellE">DecileNumber</span> ORDER BY <span class="SpellE">DecileNumber</span>) AS <span class="SpellE">AverageX</span>, <span class="SpellE">Avg</span>(CAST('</span> <span style="color:gray;">+</span> <span style="color:teal;">@ColumnName2</span> <span style="color:gray;">+</span> <span style="color:red;">' AS float)) OVER (PARTITION BY <span class="SpellE">DecileNumber</span><br />
ORDER BY <span class="SpellE">DecileNumber</span>)<span> </span>AS <span class="SpellE">AverageY<br />
</span></span></span><span style="font-size:9.5pt;font-family:Consolas;color:red;">FROM </span><span style="font-size:9.5pt;font-family:Consolas;color:red;">(SELECT '</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:gray;">+</span> <span style="color:teal;">@ColumnName1 </span><span style="color:gray;">+</span> <span style="color:red;">', '</span> <span style="color:gray;">+</span> <span style="color:teal;">@ColumnName2</span> <span style="color:gray;">+</span> <span style="color:red;">', <span class="GramE">NTILE(</span>' </span><span style="color:gray;">+</span> <span style="color:fuchsia;">CAST</span><span style="color:gray;">(</span><span style="color:teal;">@<span class="SpellE">BandCount </span></span><span style="color:blue;">AS</span> <span style="color:blue;">nvarchar</span><span style="color:gray;">(</span>128<span style="color:gray;">))</span> <span style="color:gray;">+</span> <span style="color:red;">') OVER (ORDER BY '</span> <span style="color:gray;">+</span> <span style="color:teal;">@ColumnName1</span> <span style="color:gray;">+</span> <span style="color:red;">') AS <span class="SpellE">DecileNumber<br />
</span></span></span><span style="font-size:9.5pt;font-family:Consolas;color:red;">FROM '</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:gray;">+</span> <span style="color:teal;">@<span class="SpellE">SchemaAndTableName</span></span> <span style="color:gray;">+ </span><span style="color:red;">') AS T1</span></span></p>
<p class="MsoNormal"><span style="font-size:9.5pt;font-family:Consolas;color:red;">UPDATE @<span class="SpellE">LogisticRegressionTable</span> -- this could be done in one step, but <span class="SpellE">Im</span> leaving it this way for legibility purposes<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;">SET <span class="SpellE">RescaledY</span> = <span class="SpellE">AverageY</span> / @<span class="SpellE">ValueRange</span></span><span style="font-size:9.5pt;font-family:Consolas;"> </span></p>
<p class="MsoNormal"><span style="font-size:9.5pt;font-family:Consolas;color:red;">-- STEP #3 - <span class="GramE">RETRIEVE</span> THE GLOBAL AGGREGATES NEEDED FOR OTHER CALCULATIONS<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;">-- note that we <span class="SpellE"><span class="GramE">cant</span></span> operate on the original table here, otherwise the stats would be different from those of the bands<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;">SELECT @<span class="SpellE">MeanX</span> = <span class="SpellE"><span class="GramE">Avg</span></span><span class="GramE">(</span>CAST(X AS float)), @<span class="SpellE">MeanY</span> = <span class="SpellE">Avg</span>(CAST(Y AS float)), </span><span style="font-size:9.5pt;font-family:Consolas;color:red;">@<span class="SpellE">StDevX</span> = <span class="GramE">StDev(</span>CAST(X AS float)), @<span class="SpellE">StDevY</span> = StDev(CAST(Y AS float))<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;">FROM (SELECT '</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:gray;">+</span> <span style="color:teal;">@ColumnName1 </span><span style="color:gray;">+</span> <span style="color:red;">' AS X, '</span> <span style="color:gray;">+</span> <span style="color:teal;">@ColumnName2</span> <span style="color:gray;">+</span> <span style="color:red;">' AS Y<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;">FROM '</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:gray;">+</span> <span style="color:teal;">@<span class="SpellE">SchemaAndTableName</span></span> <span class="GramE"><span style="color:gray;">+</span> <span style="color:red;">'<br />
</span></span></span><span style="font-size:9.5pt;font-family:Consolas;color:red;">WHERE '</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:gray;">+</span> <span style="color:teal;">@ColumnName1 </span><span style="color:gray;">+</span> <span style="color:red;">' IS NOT NULL AND ' </span><span style="color:gray;">+</span> <span style="color:teal;">@ColumnName2</span> <span style="color:gray;">+</span> <span style="color:red;">' IS NOT NULL) AS T1</span></span></p>
<p class="MsoNormal"><span style="font-size:9.5pt;font-family:Consolas;color:red;">— STEP #4 – CALCULATE THE CORRELATION (BY FIRST GETTING THE COVARIANCE)<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;">SELECT @Covariance = <span class="GramE">SUM(</span>(<span class="SpellE">AverageX </span>- @<span class="SpellE">MeanX</span>) * (<span class="SpellE">AverageY</span> - @<span class="SpellE">MeanY</span>)) / (‘</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:gray;">+</span> <span style="color:fuchsia;">CAST</span><span style="color:gray;">(</span><span style="color:teal;">@<span class="SpellE">BandCount</span></span> <span style="color:blue;">AS</span> <span style="color:blue;">nvarchar</span><span style="color:gray;">(</span>128<span style="color:gray;">))</span> <span style="color:gray;">+</span> <span style="color:red;">‘ - 1)<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:red;">FROM @<span class="SpellE">LogisticRegressionTable</span></span><span style="font-size:9.5pt;font-family:Consolas;"> </span></p>
<p class="MsoNormal"><span style="font-size:9.5pt;font-family:Consolas;color:red;">SELECT @Correlation = @Covariance / (@<span class="SpellE">StDevX</span> * @<span class="SpellE">StDevY</span>)</span></p>
<p class="MsoNormal"><span style="font-size:9.5pt;font-family:Consolas;color:red;">— STEP #5 – CALCULATE THE SLOPE AND INTERCEPT<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;">SELECT @Slope = @Correlation * (@<span class="SpellE">StDevY</span> / @<span class="SpellE">StDevX</span>)<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;">SELECT @Intercept = <span class="GramE">@<span class="SpellE">MeanY</span></span> - (@Slope * @<span class="SpellE">MeanX</span>)</span></p>
<p class="MsoNormal"><span style="font-size:9.5pt;font-family:Consolas;color:red;">— STEP #6 – CALCULATE THE LOGISTIC FUNCTION<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;">UPDATE T1<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;">SET <span class="SpellE">LogisticResult</span> = 1 / (1 + <span class="GramE">EXP(</span>-1 * (@Intercept + (@Slope * <span class="SpellE">AverageX</span>))))<br />
</span><span class="GramE"><span style="font-size:9.5pt;font-family:Consolas;color:red;">FROM<span> </span>@<span class="SpellE">LogisticRegressionTable </span></span></span><span style="font-size:9.5pt;font-family:Consolas;color:red;">AS T1</span><span style="font-size:9.5pt;font-family:Consolas;"> </span></p>
<p class="MsoNormal"><span style="font-size:9.5pt;font-family:Consolas;color:red;">— RETURN THE RESULTS<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;">SELECT @<span class="SpellE">HosmerLemeshowTest</span> = <span class="GramE">SUM(</span>Power((<span class="SpellE">RescaledY</span> - <span class="SpellE">LogisticResult</span>), 2) / (<span class="SpellE">CountForGroup</span> * <span class="SpellE">LogisticResult </span>* (1 - <span class="SpellE">LogisticResult</span>)))<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;">FROM @<span class="SpellE">LogisticRegressionTable</span> AS T1</span></p>
<p class="MsoNormal"><span style="font-size:9.5pt;font-family:Consolas;color:red;">SELECT * FROM @<span class="SpellE">LogisticRegressionTable</span></span><span style="font-size:9.5pt;font-family:Consolas;"><br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;">SELECT @<span class="SpellE">HosmerLemeshowTest</span> AS <span class="SpellE">HosmerLemeshowTest, </span></span><span style="font-size:9.5pt;font-family:Consolas;color:red;">@Covariance AS Covariance, </span><span style="font-size:9.5pt;font-family:Consolas;color:red;">@Correlation AS Correlation, </span><span style="font-size:9.5pt;font-family:Consolas;color:red;">@<span class="SpellE">MeanX</span> <span class="GramE">As</span> <span class="SpellE">MeanX</span>, @<span class="SpellE">MeanY</span> As <span class="SpellE">MeanY, </span></span><span style="font-size:9.5pt;font-family:Consolas;color:red;">@<span class="SpellE">StDevX</span> as <span class="SpellE">StDevX</span>, @<span class="SpellE">StDevY</span> AS <span class="SpellE">StDevY</span>, </span><span style="font-size:9.5pt;font-family:Consolas;color:red;">@Intercept AS Intercept, @Slope AS Slope</span><span style="font-size:9.5pt;font-family:Consolas;color:red;">‘</span></p>
<p class="MsoNormal"><span style="font-size:9.5pt;font-family:Consolas;color:green;">–SELECT @<span class="SpellE">SQLString</span> — <span class="GramE">uncomment this </span>to debug dynamic SQL errors<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:blue;">EXEC </span><span style="font-size:9.5pt;font-family:Consolas;color:gray;">(</span><span style="font-size:9.5pt;font-family:Consolas;color:teal;">@<span class="SpellE">SQLString</span></span><span style="font-size:9.5pt;font-family:Consolas;color:gray;">)</span></p>
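For readers who want to sanity-check the figure assembled in the “RETURN THE RESULTS” step, the band-level sum is easy to replicate outside the database. Below is a minimal Python sketch of the same formula; the function name and the three sample bands are hypothetical, and it mirrors this procedure’s variant of the statistic rather than the textbook count-based formulation:

```python
def hosmer_lemeshow_variant(bands):
    """Mirror of the final SELECT:
    SUM(POWER(RescaledY - LogisticResult, 2)
        / (CountForGroup * LogisticResult * (1 - LogisticResult)))
    Each band is a (rescaled_y, logistic_result, count_for_group) tuple."""
    return sum((y - p) ** 2 / (n * p * (1.0 - p)) for (y, p, n) in bands)

# Three hypothetical bands: (RescaledY, LogisticResult, CountForGroup)
bands = [(0.10, 0.12, 50), (0.45, 0.40, 50), (0.90, 0.85, 50)]
stat = hosmer_lemeshow_variant(bands)
```

Each tuple stands in for one row of @LogisticRegressionTable, i.e. its RescaledY, LogisticResult and CountForGroup columns.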
<p><span style="font-size:10pt;color:white;">…………</span>Another potential problem arises when we derive the expected values of column Y for each band, using the logistic function in Step 6.[9] This is one of those maddening scaling issues that continually seem to arise whenever these older statistical tests are applied to Big Data-sized recordsets. This very simple and well-known function is implemented in the formula 1 / (1 + EXP(-1 * (@Intercept + (@Slope * AverageX)))), but when the result of the exponent operation is an infinitesimally small value, it gets truncated when adding the 1. This often occurs when the values for the @Intercept are high, particularly above 100. When 1 is divided by the resulting 1, the logistic function result is 1, which leads to division by zero errors when calculating the test statistic in the last SELECT assignment. There might be a way for a more mathematically adept programmer to rescale the @Intercept, @Slope and other variables so that this truncation doesn’t occur, but I’m not going to attempt to implement a workaround unless I’m sure it won’t lead to incorrect results in some unforeseen way.<br />
<span style="font-size:10pt;color:white;">…………</span>Yet another issue is that my implementation allows the second column to be Continuous rather than the usual Boolean either-or choice seen in simple logistic regression. That requires rescaling to the range of permissible values, but the way I’ve implemented it through the RescaledY table variable column and @ValueRange variable may be incorrect. The SELECT that assigns the value to @HosmerLemeshowTest could probably also be written more concisely, but I want to highlight the internal logic so that it is easier to follow and debug. The rest of the code follows much the same format as usual, in which null handling, SQL injection protection, bracket handling and validation code are all omitted for the sake of legibility and simplicity. Most of the parameters are designed to allow the user to perform the regression on any two columns in the same table or view, in any database they have sufficient access to. The next-to-last line allows programmers to debug the dynamic SQL, which will probably be necessary before putting this procedure to professional use. In the last two statements I return all of the bands in the regression table plus the regression stats, since the costs of calculating them have already been incurred. It would be child’s play for us to also calculate the Mean Squared Error from these figures with virtually no computational cost, but I’m not yet sure if it enjoys the same validity and significance with logistic regression as it does with linear.</p>
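The truncation hazard described above is easy to reproduce outside the procedure. This short Python sketch (the intercept and slope values are arbitrary) shows how a large intercept drives the logistic function to exactly 1, which in turn zeroes out the (1 - LogisticResult) factor in the test statistic’s denominator:

```python
import math

def logistic(intercept, slope, x):
    # Same formula as Step 6: 1 / (1 + EXP(-1 * (@Intercept + (@Slope * AverageX))))
    return 1.0 / (1.0 + math.exp(-1.0 * (intercept + slope * x)))

# A modest intercept behaves well: the result stays strictly between 0 and 1.
moderate = logistic(2.0, 0.5, 3.0)

# A large intercept makes EXP() return a value so small that adding 1 to it
# truncates it away entirely in floating point, so the function returns exactly
# 1.0, and (1 - LogisticResult) in the test statistic becomes exactly zero.
extreme = logistic(150.0, 0.5, 3.0)
```

Any band whose LogisticResult lands at exactly 1.0 (or 0.0) this way triggers the divide-by-zero error in the final SELECT.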
<p><strong><u>Figure 2: Sample Results from Duchennes and Higgs Boson Datasets<br />
</u></strong><span style="font-size:9.5pt;font-family:Consolas;color:blue;">EXEC</span><span style="font-size:9.5pt;font-family:Consolas;"><span> </span><span style="color:teal;">Calculations</span><span style="color:gray;">.</span><span style="color:teal;">GoodnessOfFitLogisticRegressionHosmerLemsehowTestSP<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:blue;"><span> </span></span><span style="font-size:9.5pt;font-family:Consolas;color:teal;">@DatabaseName</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:gray;">=</span> <span style="color:red;">N’DataMiningProjects’</span><span style="color:gray;">,<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;"><span> </span><span style="color:teal;">@SchemaName </span><span style="color:gray;">=</span> <span style="color:red;">N’Physics’</span><span style="color:gray;">,<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;"><span> </span></span><span lang="ES" style="font-size:9.5pt;font-family:Consolas;color:teal;">@TableName</span><span lang="ES" style="font-size:9.5pt;font-family:Consolas;"> <span style="color:gray;">=</span> <span style="color:red;">N’HiggsBosonTable’</span><span style="color:gray;">,<br />
</span></span><span lang="ES" style="font-size:9.5pt;font-family:Consolas;"><span> </span><span style="color:teal;">@ColumnName1</span> <span style="color:gray;">=</span> <span style="color:red;">N’Column1′</span><span style="color:gray;">,<br />
</span></span><span lang="ES" style="font-size:9.5pt;font-family:Consolas;"><span> </span><span style="color:teal;">@ColumnName2</span> <span style="color:gray;">=</span> <span style="color:red;">N’Column2′</span><span style="color:gray;">,<br />
</span></span><span lang="ES" style="font-size:9.5pt;font-family:Consolas;"><span> </span></span><span style="font-size:9.5pt;font-family:Consolas;color:teal;">@BandCount</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:gray;">=</span> 12<br />
<a href="https://multidimensionalmayhem.wordpress.com/2016/01/26/goodness-of-fit-testing-with-sql-server-part-4-2-the-hosmer-lemeshow-test-with-logistic-regression/hosmerlemeshowresults/" rel="attachment wp-att-554"><img class="alignnone size-full wp-image-554" src="https://multidimensionalmayhem.files.wordpress.com/2016/01/hosmerlemeshowresults.jpg?w=604&h=219" alt="HosmerLemeshowResults" width="604" height="219" /></a><br />
</span></p>
<p><span style="font-size:10pt;color:white;">…………</span>I’ve tested most of the procedures for the last two tutorial series against a 9-kilobyte dataset on the Duchennes form of muscular dystrophy, which is made publicly available by <a href="http://biostat.mc.vanderbilt.edu/wiki/Main/DataSets">Vanderbilt University’s Department of Biostatistics</a>. For this week’s article, however, the intercepts were too high for most combinations of comparisons between the PyruvateKinase, CreatineKinase, LactateDehydrogenase and Hemopexin columns, resulting in the aforementioned divide-by-zero errors. For the few combinations that worked, the test statistic was ridiculously inflated; for other databases I’m familiar with, it returned results in the single and double digits (which is apparently permissible, since I’ve seen professionals post Hosmer-Lemeshow results online that fall in that range), but for whatever reason, this was not the case with the Duchennes dataset.<br />
<span style="font-size:10pt;color:white;">…………</span>That is why I derived the sample results in Figure 2 from the first two float columns of the Higgs Boson Dataset I downloaded from <a href="https://archive.ics.uci.edu/ml/datasets/HIGGS">University of California at Irvine’s Machine Learning Repository</a>, which I normally use for stress-testing because its 11 million rows occupy nearly 6 gigabytes in the SQL Server table I converted it into. Given that the first column is obviously non-normal and the second clearly follows a bell curve, I expected the results to indicate a serious lack of fit, but the test statistic was only a minuscule 1.30909988070185E-05. In fact, the values seem to shrink in tandem with the record count, which makes me wonder if another, more familiar scaling issue is operative here. As we’ve seen through the last two tutorial series, many common statistical measures were not designed with today’s Big Data table sizes in mind and thus end up distorted when we try to cram too much data into them. Given that there are so many other issues with my implementation, it is hard to tell if that is an additional problem or some inaccuracy in my code. Substituting the AverageY value for the RescaledY I used in the test statistic only seems to introduce new problems, without solving this one.</p>
<p style="text-align:center;"><strong>The Case Against Hosmer-Lemeshow</strong></p>
<p><span style="font-size:10pt;color:white;">…………</span>Regardless of the quality of my implementation, the usefulness of the Hosmer-Lemeshow Test is apparently still questionable, given that even professionals publish articles with inspiring titles like “Why I Don’t Trust the Hosmer-Lemeshow Test for Logistic Regression.”[10] University of Pennsylvania Sociology Prof. Paul Allison lists several drawbacks to the test even when it is correctly implemented, including the most common one mentioned on the Internet, the weaknesses inherent in dividing the dataset into groups. This can even lead to significantly different results when different stats software is run against the same data.[11] Hosmer and Lemeshow themselves point out that the choice of boundaries (“cut points”) for the groups can lead to significantly different results.[12] Furthermore, as Frank Harrell puts it in a thread in the CrossValidated forum, “The Hosmer-Lemeshow test is to some extent obsolete because it requires arbitrary binning of predicted probabilities and does not possess excellent power to detect lack of calibration. It also does not fully penalize for extreme overfitting of the model. …More importantly, this kind of assessment just addresses overall model calibration (agreement between predicted and observed) and does not address lack of fit such as improperly transforming a predictor.”[13] He recommends alternatives like a “generalized R<sup>2</sup>,” while Allison gets into “daunting” alternatives like “standardized Pearson, unweighted sum of squared residuals, Stukel’s test, and the information matrix test.”[14]<br />
<span style="font-size:10pt;color:white;">…………</span>Nevertheless, despite these well-known shortcomings, the Hosmer-Lemeshow Test remains perhaps the best-known goodness-of-fit measure for logistic regression. Fitness tests seem to have well-defined use cases in comparison to the outlier detection methods we covered in my last tutorial series, with the Hosmer-Lemeshow among those that occupy a very distinct niche. The other methods mentioned by these authors seem to be more advanced and less well-known, so I thought it still worthwhile to post the code, even if there are certain to be problems with it. On the other hand, it is not worth it at this point to optimize the procedures much until the accuracy issues can either be fixed or debunked. It performed well in comparison to other procedures in this series anyway, with a time of 3:08 for the trial run in Figure 2. Only 1 of the 8 queries accounted for 89 percent of the cost of the whole execution plan, and that one contained two expensive Nested Loops operators, which might mean there’s room for further optimization if and when the accuracy can be verified.<br />
<span style="font-size:10pt;color:white;">…………</span>Given the number of issues with my code as well as the inherent issues with the test itself, it might be fitting to write a rebuttal to my mistutorial titled along the lines of Allison’s article, such as “Why I Don’t Trust Steve Bolton’s Version of the Hosmer-Lemeshow Test for Logistic Regression.” It may still be useful in the absence of any other measures, but I’d assign it a lot less trust than some of the other code I’ve posted in the last two series. On the other hand, this also gives me an opportunity to jump into my usual spiel about my own lack of trust in hypothesis testing methods. I have a lack of confidence in confidence intervals and the like, at least as far as our use cases go, for multiple reasons. First and foremost, plugging test statistics into distributions just to derive a simple Boolean pass/fail measure sharply reduces the information content. Another critical problem is that most of the lookup tables for the tests and distributions they’re plugged into stop at just a few hundred values and are often full of gaps; furthermore, calculating the missing values yourself for millions of degrees of freedom can be prohibitively expensive. Once again, these measures were designed long before Big Data became a buzz word, in an era when most statistical tests were done against a few dozen or few hundred records at best. For that reason I have often omitted the hypothesis testing stage that accompanies many of the goodness-of-fit measures in this series, including the Hosmer-Lemeshow Test, which is normally plugged into a Chi-Squared distribution.[15] At the same time, we can make use of the separate Chi-Squared goodness-of-fit measure, which as we shall see next week, is a versatile metric that can be adapted to assess the fit of a wide variety of probability distributions, with a much higher degree of confidence than we can assign to the Hosmer-Lemeshow Test results on logistic regression.</p>
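To illustrate why lookup tables are unnecessary in at least some cases: for even degrees of freedom, the Chi-Squared tail probability that the Hosmer-Lemeshow statistic would normally be plugged into has a simple closed form. The sketch below is my own illustration (the even-df restriction and the function name are mine), not part of the procedure:

```python
import math

def chi_squared_sf_even_df(x, df):
    """Tail probability P(X > x) for a chi-squared variable with EVEN df,
    using the closed form exp(-x/2) * SUM_{k=0}^{df/2 - 1} (x/2)^k / k!.
    A sketch only; odd df would require the incomplete gamma function."""
    if df < 2 or df % 2 != 0:
        raise ValueError("this closed form requires an even df >= 2")
    half = x / 2.0
    return math.exp(-half) * sum(half ** k / math.factorial(k) for k in range(df // 2))

# The familiar 5 percent critical values are recovered without any lookup table:
p2 = chi_squared_sf_even_df(5.991, 2)   # close to 0.05
p4 = chi_squared_sf_even_df(9.488, 4)   # close to 0.05
```

For the Hosmer-Lemeshow Test the degrees of freedom are conventionally the number of bands minus 2, so an even @BandCount would fit this closed form.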
<p> </p>
<p>[1] See the Wikipedia article “Logistic Regression” at <a href="http://en.wikipedia.org/wiki/Logistic_regression">http://en.wikipedia.org/wiki/Logistic_regression</a></p>
<p>[2] See Sayad, Saed, 2014, “Logistic Regression,” published at the <u>SaedSayad.com</u> web address <a href="http://www.saedsayad.com/logistic_regression.htm">http://www.saedsayad.com/logistic_regression.htm</a></p>
<p>[3] See the undated publication “Goodness of Fit in Linear Regression” retrieved from Lawrence Joseph’s course notes on Bayesian Statistics on Oct. 30, 2014, which are published at the website of the <u>McGill University Faculty of Medicine</u>. Available at the web address <a href="http://www.medicine.mcgill.ca/epidemiology/joseph/courses/EPIB-621/fit.pdf">http://www.medicine.mcgill.ca/epidemiology/joseph/courses/EPIB-621/fit.pdf</a>. No author is listed but I presume that Prof. Joseph wrote it.</p>
<p>[4] I derived some of the code for regular linear regression routine long ago from the Dummies.com webpage “How to Calculate a Regression Line” at <a href="http://www.dummies.com/how-to/content/how-to-calculate-a-regression-line.html">http://www.dummies.com/how-to/content/how-to-calculate-a-regression-line.html</a></p>
<p>[5] See the Wikipedia article Hosmer-Lemeshow Test at the web address <a href="http://en.wikipedia.org/wiki/Hosmer%E2%80%93Lemeshow_test">http://en.wikipedia.org/wiki/Hosmer%E2%80%93Lemeshow_test</a></p>
<p>[6] p. 160, Hosmer Jr., David W.; Lemeshow, Stanley and Sturdivan, Rodney X., 2013, <u>Applied Logistic Regression</u>. John Wiley & Sons: Hoboken, New Jersey.</p>
<p>[7] <em>IBID.</em>, pp. 160-163.</p>
<p>[8] <em>IBID.</em>, p. 161 for the second comment.</p>
<p>[9] I was initially confused about the assignment of the expected values (as well as the use of mean scores), but they are definitely derived from the logistic function, according to p. 2 of the undated manuscript, “Logistic Regression,” published at the Portland State University web address <a href="http://www.upa.pdx.edu/IOA/newsom/da2/ho_logistic.pdf">http://www.upa.pdx.edu/IOA/newsom/da2/ho_logistic.pdf</a> . It is part of the instructional materials for one of Prof. Jason Newsom’s classes so I assume he wrote it, but cannot be sure.</p>
<p>[10] Allison, Paul, 2013, “Why I Don’t Trust the Hosmer-Lemeshow Test for Logistic Regression,” published March 5, 2013 at the <u>Statistical Horizons</u> web address <a href="http://www.statisticalhorizons.com/hosmer-lemeshow">http://www.statisticalhorizons.com/hosmer-lemeshow</a></p>
<p>[11] Allison, Paul, 2014, untitled article published in March, 2014 at the <u>Statistical Horizons</u> web address <a href="http://www.statisticalhorizons.com/2014/04">http://www.statisticalhorizons.com/2014/04</a></p>
<p>[12] pp. 965-966, 968, Hosmer, D.W.; T. Hosmer; Le Cessie, S. and Lemeshow, S., 1997, “A Comparison of Goodness-of-Fit Tests for the Logistic Regression Model,” pp. 965-980 in <u>Statistics in Medicine</u>. Vol. 16.</p>
<p>[13] Harrell, Frank, 2011, “Hosmer-Lemeshow vs AIC for Logistic Regression,” published Nov. 22, 2011 at the <u>CrossValidated</u> web address <a href="http://stats.stackexchange.com/questions/18750/hosmer-lemeshow-vs-aic-for-logistic-regression">http://stats.stackexchange.com/questions/18750/hosmer-lemeshow-vs-aic-for-logistic-regression</a></p>
<p>[14] See Allison, 2014.</p>
<p>[15] For more on the usual implementation involving the Chi-Squared distribution, see p. 977, Hosmer et al., 1997 and p. 158, Hosmer et al., 2013.</p>Goodness-of-Fit Testing with SQL Server Part 4.1: R2, RMSE and Regression-Related Routines
http://www.sqlservercentral.com/blogs/multidimensionalmayhem/2016/01/13/goodness-of-fit-testing-with-sql-server-part-41-r2-rmse-and-regression-related-routines/
Thu, 14 Jan 2016 06:46:00 UT/blogs/multidimensionalmayhem/2016/01/13/goodness-of-fit-testing-with-sql-server-part-41-r2-rmse-and-regression-related-routines/0http://www.sqlservercentral.com/blogs/multidimensionalmayhem/2016/01/13/goodness-of-fit-testing-with-sql-server-part-41-r2-rmse-and-regression-related-routines/#comments<p><strong>By Steve Bolton</strong></p>
<p><span style="font-size:10pt;color:white;">…………</span>Throughout most of this series of amateur self-tutorials, the main topic has been and will continue to be using SQL Server to perform goodness-of-fit testing on probability distributions. Don’t let the long syllables (or the alliteration practice in the title) fool you, because the underlying concept really isn’t all that hard; all these statistical tests tell us is whether the distinct counts of our data points approximate shapes like the famous bell curve, i.e. the Gaussian or “normal” distribution. While researching the topic, I found out that the term “goodness-of-fit” is also used to describe how much confidence we can assign to a particular regression line. Recall that in regression, we’re trying to learn something about the relationships between two or more variables, whereas in the case of probability distributions, we’re normally talking about univariate cases, so we’re really trying to learn something about the internal structure of a single variable (or in our case, a database column). Once again, don’t be intimidated by the big words though, because regression is really a very simple idea that every college freshman has been exposed to at some point.<br />
<span style="font-size:10pt;color:white;">…………</span>As I explain in more detail in a post from an earlier mistutorial series, <a href="http://www.sqlservercentral.com/blogs/multidimensionalmayhem/2013/01/08/a-rickety-stairway-to-sql-server-data-mining-algorithm-2-linear-regression/">A Rickety Stairway to SQL Server Data Mining, Algorithm 2: Linear Regression</a>, regression in its simplest form is just the graph of a line that depicts how much one variable increases or decreases as another changes in value. There are certainly complex variants of regression that could blow up someone’s brain like that poor guy in the horror film <a href="http://www.imdb.com/title/tt0081455/?ref_=fn_al_tt_1">Scanners</a>, but the fundamentals are not that taxing on the mind. Thankfully, coding simple regression lines in T-SQL isn’t that difficult either. There are some moderate performance costs, as can be expected whenever we have to traverse a whole dataset, but the actual calculations aren’t terribly difficult to follow or debug (presuming, that is, that you understand set-based languages like SQL). That is especially true for the metrics calculated upon those regression lines, which tell us how well our data mining model might approximate the true relationships between the variables.</p>
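To make the point concrete, the whole fundamental computation fits in a few lines. This Python sketch follows the same covariance-to-correlation-to-slope sequence the T-SQL procedures in this series use; the function name and sample data are hypothetical:

```python
import math

def regression_line(xs, ys):
    """Simple linear regression via the covariance -> correlation -> slope route:
    slope = correlation * (stdev_y / stdev_x), intercept = mean_y - slope * mean_x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    stdev_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    stdev_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    correlation = covariance / (stdev_x * stdev_y)
    slope = correlation * (stdev_y / stdev_x)
    intercept = mean_y - slope * mean_x
    return slope, intercept, correlation

# Hypothetical data lying exactly on y = 3 + 2x:
slope, intercept, correlation = regression_line([1, 2, 3, 4], [5, 7, 9, 11])
```

Squaring the returned correlation yields the coefficient of determination discussed next.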
<p style="text-align:center;"><strong>The Foibles and Follies of Calculating R2 and RMSE</strong></p>
<p><span style="font-size:10pt;color:white;">…………</span> Once we’ve incurred the cost of traversing a dataset, there’s really little incentive not to squeeze all the benefit out of the trip by computing all of the relevant goodness-of-fit regression stats afterwards. For that reason, plus the fact that they’re not terribly challenging to explain, I’ll dispense with them all in a single procedure, beginning with R<sup>2</sup>. In my earlier article <a href="https://multidimensionalmayhem.wordpress.com/2015/08/24/outlier-detection-with-sql-server-part-7-cooks-distance/">Outlier Detection with SQL Server Part 7: Cook’s Distance</a> we already dealt with the coefficient of determination (also known as R<sup>2</sup>), which is simply the square of the correlation coefficient. This is a long name for the very simple process of quantifying the relationship between two variables, by computing the difference of each value of the first variable (usually labeled X) from its own mean, doing the same for the second (usually labeled Y), then multiplying the deviations together. This gives us the covariance, which is then transformed into the correlation by dividing the result by the product of the two standard deviations. All we need to do is implement the same code from the Cook’s Distance article, beginning with the regression calculations, then add a new step: squaring the result of the correlation. That changes all negative values to positives and thus scales the result for easier interpretation. The higher the R<sup>2</sup>, the more closely the two variables are related, and the closer to 0, the less linkage there is between their values.<br />
<span style="font-size:10pt;color:white;">…………</span>One of the few pitfalls to watch out for is that the values are often below 1 but can exceed it in some circumstances. End users don’t need to know all of the implementation details and intermediate steps I just mentioned, any more than a commuter needs to give a dissertation on automotive engineering; they only have to be able to read the result, which is highly intuitive and can be easily depicted in a visualization like a Reporting Services gauge. R<sup>2</sup> is as mercifully simple as a gas gauge. The same is true of the covariance and correlation stats it is built on, which cost us nothing to return to the users within the same queries.<br />
Mean Square Error (MSE) is a little more difficult to calculate, but not much harder to interpret, since all end users need to know is that zero represents “perfect accuracy”[i] and that values further from it signify a worse fit; the only catch might be that the measure moves in the opposite direction as R<sup>2</sup>, which might cause confusion unless a tooltip or some other handy reminder is given to end users. Root Mean Square Error (RMSE, a.k.a. Root-Mean-Square Deviation) is derived from it by taking the square root, which brings the measure back to the same units as the original data; the squaring of the residuals inside the MSE is the kind of rescaling statisticians often perform so that metrics only have positive values. Keep in mind that SQL Server can easily calculate standard deviation through the T-SQL StDev function, which gives us a measure of how dispersed the values in a dataset are; practically all of the procedures I’ve posted in the last two tutorial series have made use of it. What RMSE does is take standard deviation to the next level, by measuring the dispersion between multiple variables instead of just one. I really can’t explain it any better than Will G. Hopkins does at his website <a href="http://www.sportsci.org/resource/stats/rmse.html">A New View of Statistics</a>, which I highly recommend to novices in the field of statistics like myself:</p>
<blockquote><p> “The RMSE is a kind of generalized standard deviation. It pops up whenever you look for differences between subgroups or for other effects or relationships between variables. It’s the spread left over when you have accounted for any such relationships in your data, or (same thing) when you have fitted a statistical model to the data. Hence its other name, residual variation. I’ll say more about residuals for models, about fitting models in general, and about fitting them to data like these much later.”<br />
<span style="font-size:10pt;color:white;">…………</span>“Here’s an example. Suppose you have heights for a group of females and males. If you analyze the data without regard to the sex of the subjects, the measure of spread you get will be the total variation. But stats programs can take into account the sex of each subject, work out the means for the boys and the girls, then derive a single SD that will do for the boys and the girls. That single SD is the RMSE. Yes, you can also work out the SDs for the boys and girls separately, but you may need a single one to calculate effect sizes. You can’t simply average the SDs.”[ii]</p></blockquote>
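In terms of bare arithmetic, the RMSE is nothing more than the square root of the average squared residual. A minimal Python sketch, with hypothetical actual and predicted values:

```python
import math

def rmse(actuals, predictions):
    """Root Mean Square Error: the square root of the mean squared residual.
    Zero means perfect accuracy; the value grows as the fit gets worse."""
    n = len(actuals)
    mean_squared_error = sum((a - p) ** 2 for a, p in zip(actuals, predictions)) / n
    return math.sqrt(mean_squared_error)

# Hypothetical actual vs. predicted values; the residuals are 1, 0 and -2.
error = rmse([3.0, 5.0, 7.0], [2.0, 5.0, 9.0])
```

Because the residuals are squared before averaging, the result is always non-negative, which is exactly the behavior described in the quotation above.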
<p><span style="font-size:10pt;color:white;">…………</span>RMSE and R<sup>2</sup> can be used for goodness-of-fit because they are intimately related to the differences between the actual and predicted values for a regression line; they essentially quantify how much of the standard deviation or variance can be ascribed to these errors, i.e. residuals.[iii] There are many complex variants of these stats, just as there are for regression models as a whole; for example, Wikipedia provides several alternate formulas for RMSE, including some for biased estimators, which is a topic we needn’t worry about as much, given the whopping sizes of the datasets the SQL Server community works with.[iv] We have unique cases in which the standard methods of hypothesis testing are less applicable, which is why I’ve generally shied away from applying confidence intervals, significance levels and the like to the stats covered in my last two tutorial series. Such tests sharply reduce the information provided by our hard-won calculations, from float or decimal data types down to simple Boolean, yes-or-no answers that a particular value is an outlier, or that subsets of values do not fit a particular distribution; retaining that information allows us to gauge <em>how much</em> a value qualifies as an outlier or a set of them follows a distribution, or a set of columns follows a regression line.<br />
<span style="font-size:10pt;color:white;">…………</span>For that reason, I won’t get into a discussion of the F-Tests often performed on our last regression measure, Lack-of-Fit Sum-of-Squares, particularly in connection with Analysis of Variance (ANOVA). The core concepts with this measure are only slightly more advanced than with RMSE and R<sup>2</sup>. Once again, we’re essentially slicing up the residuals of the regression line in a way that separates the portion that can be ascribed to the inaccuracy of the model, just through alternate means. It is important here to note that with all three measures, the terms “error” and “residual” are often used interchangeably, although there is a strictly definable difference between them: a residual quantifies the difference between actual and predicted values, while errors refer to the difference between actual values and “the (unobservable) true function value.”[v] Despite this subtle yet distinguishable difference, the two terms are often used inappropriately even by experts, to the point that novices like myself can’t always discern which of the two is under discussion. Further partitioning of the residuals and errors occurs in the internal calculations of Lack-of-Fit Sum-of-Squares, but I can’t comment at length on the differences between such constituent components as Residual Sum-of-Squares and Sum-of-Squares for Pure Error, except to recommend the explanation by Mukesh Mahadeo, a frequent contributor on statistical concepts at Yahoo! Answers:</p>
<blockquote><p> “For certain designs with replicates at the levels of the predictor variables, the residual sum of squares can be further partitioned into meaningful parts which are relevant for testing hypotheses. Specifically, the residual sums of squares can be partitioned into lack-of-fit and pure-error components. This involves determining the part of the residual sum of squares that can be predicted by including additional terms for the predictor variables in the model (for example, higher-order polynomial or interaction terms), and the part of the residual sum of squares that cannot be predicted by any additional terms (i.e., the sum of squares for pure error). A test of lack-of-fit for the model without the additional terms can then be performed, using the mean square pure error as the error term. This provides a more sensitive test of model fit, because the effects of the additional higher-order terms is removed from the error.”[vi]</p></blockquote>
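<p>The partition Mahadeo describes can be sketched in a few lines of Python, assuming replicates at each predictor value; the grouping mirrors the PARTITION BY logic in Step 4 of the T-SQL procedure in Figure 1, and the data here are fabricated purely for illustration:</p>

```python
# A hedged sketch of splitting the residual sum of squares into a pure-error
# part (spread of replicates around their own group mean) and a lack-of-fit
# part (distance of each group mean from the regression prediction).
from collections import defaultdict

def partition_residual_ss(points, slope, intercept):
    groups = defaultdict(list)
    for x, y in points:
        groups[x].append(y)  # replicates share the same predictor value
    ss_pure_error = 0.0
    ss_lack_of_fit = 0.0
    for x, ys in groups.items():
        group_mean = sum(ys) / len(ys)
        predicted = slope * x + intercept
        ss_pure_error += sum((y - group_mean) ** 2 for y in ys)
        # count * (group mean - predicted)^2, as in the procedure's Step 4
        ss_lack_of_fit += len(ys) * (group_mean - predicted) ** 2
    return ss_lack_of_fit, ss_pure_error

# Replicates at x = 1, 2, 3 scattered around the line y = 2x
pts = [(1, 1.9), (1, 2.1), (2, 4.2), (2, 3.8), (3, 6.1), (3, 5.9)]
lof, pe = partition_residual_ss(pts, slope=2.0, intercept=0.0)

# The two parts recompose the ordinary residual sum of squares
ss_res = sum((y - (2.0 * x + 0.0)) ** 2 for x, y in pts)
```

On this toy data the line passes through every group mean, so the lack-of-fit component is zero and all of the residual scatter is pure error; a poorly chosen line would shift weight into the lack-of-fit term instead.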
<p><span style="font-size:10pt;color:white;">…………</span>The important thing is that the code for the Lack-of-Fit Sum-of-Squares formulas[vii] gets the job done. Of course it always helps if a data mining programmer can write a dissertation on the logic and math of the equations they’re working with, but ordinarily, that’s best left to mathematicians; their assignment is analogous to that of an automotive engineer, while our role is that of a garage mechanic, whose main responsibility is to make sure that the car runs, one way or another. If the owner can drive it away without the engine stalling, then mission accomplished.<br />
<span style="font-size:10pt;color:white;">…………</span>We only need to add two elements to make the Lack-of-Fit Sum-of-Squares code below useful to end users. The first is simply to interpret higher numbers as greater lack of fit. The second is to define what lack-of-fit is, since it represents the opposite of goodness-of-fit and can therefore cause the same kind of directional confusion that’s possible when providing RMSE and R<sup>2</sup> side-by-side. The two terms are sometimes used interchangeably, but in a more specific sense they’re polar opposites: measures that rise as fit improves can be termed goodness-of-fit, while those that rise as the fit of a model declines can be termed lack-of-fit. CrossValidated forum contributor Nick Cox provided the most succinct explanation of the difference I’ve seen to date: “Another example comes from linear regression. Here two among several possible figures of merit are the coefficient of determination R<sup>2</sup> and the root mean square error (RMSE), or (loosely) the standard deviation of residuals. R<sup>2</sup> could be described as a measure of goodness of fit in both weak and strict senses: it measures how good the fit is of the regression predictions to data for the response and the better the fit, the higher it is. The RMSE is a measure of lack of fit in so far as it increases with the badness of fit, but many would be happy with calling it a measure of goodness of fit in the weak or broad sense.”[viii] Ordinarily, however, end users only need to know that the return values for the @LackOfFitSumOfSquares variable below will rise as the accuracy of their model gets worse, and vice-versa.</p>
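<p>Cox’s point about direction can be verified numerically in a few lines; this Python sketch uses fabricated data and deliberately compares the true line against a poor one:</p>

```python
# A hedged illustration of the directional difference: as the fitted line
# worsens, RMSE (a lack-of-fit measure) rises while R^2 (a goodness-of-fit
# measure) falls. The data and the "bad" line are fabricated for the demo.
import math

def rmse_and_r2(xs, ys, slope, intercept):
    n = len(xs)
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    mse = sum(r ** 2 for r in residuals) / n
    mean_y = sum(ys) / n
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return math.sqrt(mse), 1 - sum(r ** 2 for r in residuals) / ss_tot

xs, ys = [1, 2, 3, 4, 5], [2, 4, 6, 8, 10]
good = rmse_and_r2(xs, ys, slope=2.0, intercept=0.0)  # the true line
bad = rmse_and_r2(xs, ys, slope=1.0, intercept=3.0)   # a deliberately poor line
```

The perfect line returns an RMSE of zero and an R<sup>2</sup> of one; the poor line pushes RMSE up and R<sup>2</sup> down, which is exactly the mirror-image behavior end users need to keep in mind when both stats are reported side-by-side.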
<p><strong><u>Figure 1: T-SQL Code for the Regression Goodness-of-Fit Tests</u></strong></p>
<pre style="font-size:9.5pt;font-family:Consolas;">
CREATE PROCEDURE [Calculations].[GoodnessOfFitRegressionTestSP]
@DatabaseName as nvarchar(128) = NULL, @SchemaName as nvarchar(128), @TableName as nvarchar(128),
@ColumnName1 AS nvarchar(128), @ColumnName2 AS nvarchar(128), @DecimalPrecision AS nvarchar(50)
AS

DECLARE @SchemaAndTableName nvarchar(400), @SQLString nvarchar(max)
SET @SchemaAndTableName = @DatabaseName + '.' + @SchemaName + '.' + @TableName
SET @SQLString = 'DECLARE @MeanX decimal(' + @DecimalPrecision + '), @MeanY decimal(' + @DecimalPrecision + '), @StDevX decimal(' + @DecimalPrecision + '),
@StDevY decimal(' + @DecimalPrecision + '), @Count decimal(' + @DecimalPrecision + '), @Correlation decimal(' + @DecimalPrecision + '),
@Covariance decimal(' + @DecimalPrecision + '), @Slope decimal(' + @DecimalPrecision + '), @Intercept decimal(' + @DecimalPrecision + '),
@MeanSquaredError decimal(' + @DecimalPrecision + '), @LackOfFitSumOfSquares decimal(' + @DecimalPrecision + ')

-- STEP #1 - RETRIEVE THE GLOBAL AGGREGATES NEEDED FOR OTHER CALCULATIONS
SELECT @Count = Count(CAST(' + @ColumnName1 + ' AS Decimal(' + @DecimalPrecision + '))), @MeanX = Avg(CAST(' + @ColumnName1 + ' AS Decimal(' + @DecimalPrecision + '))),
@MeanY = Avg(CAST(' + @ColumnName2 + ' AS Decimal(' + @DecimalPrecision + '))),
@StDevX = StDev(CAST(' + @ColumnName1 + ' AS Decimal(' + @DecimalPrecision + '))), @StDevY = StDev(CAST(' + @ColumnName2 + ' AS Decimal(' + @DecimalPrecision + ')))
FROM ' + @SchemaAndTableName + '
WHERE ' + @ColumnName1 + ' IS NOT NULL AND ' + @ColumnName2 + ' IS NOT NULL

-- STEP #2 - CALCULATE THE CORRELATION (BY FIRST GETTING THE COVARIANCE)
SELECT @Covariance = SUM((' + @ColumnName1 + ' - @MeanX) * (' + @ColumnName2 + ' - @MeanY)) / (@Count - 1)
FROM ' + @SchemaAndTableName + '
WHERE ' + @ColumnName1 + ' IS NOT NULL AND ' + @ColumnName2 + ' IS NOT NULL

-- once we have the covariance, it is trivial to calculate the correlation
SELECT @Correlation = @Covariance / (@StDevX * @StDevY)

-- STEP #3 - CALCULATE THE SLOPE AND INTERCEPT
SELECT @Slope = @Correlation * (@StDevY / @StDevX)
SELECT @Intercept = @MeanY - (@Slope * @MeanX)

-- STEP #4 - CALCULATE THE MEAN SQUARED ERROR AND LACK OF FIT SUM OF SQUARES TOGETHER
SELECT @MeanSquaredError = SUM(Power((PredictedValue - ' + @ColumnName2 + '), 2)) * (1 / @Count), @LackOfFitSumOfSquares = SUM(LackofFitInput)
FROM (SELECT ' + @ColumnName1 + ', ' + @ColumnName2 + ', PredictedValue, Count(CAST(' + @ColumnName2 + ' AS Decimal(' + @DecimalPrecision + '))) OVER (PARTITION BY ' +
@ColumnName1 + ' ORDER BY ' + @ColumnName1 + ') * (Power(Avg(CAST(' + @ColumnName2 + ' AS Decimal(' + @DecimalPrecision + '))) OVER (PARTITION BY ' +
@ColumnName1 + ' ORDER BY ' + @ColumnName1 + ') - PredictedValue, 2)) AS LackofFitInput
       FROM (SELECT ' + @ColumnName1 + ', ' + @ColumnName2 + ', (' + @ColumnName1 + ' * @Slope) + @Intercept AS PredictedValue
       FROM ' + @SchemaAndTableName + '
       WHERE ' + @ColumnName1 + ' IS NOT NULL AND ' + @ColumnName2 + ' IS NOT NULL) AS T1) AS T2

SELECT @MeanSquaredError AS MeanSquaredError, Power(@MeanSquaredError, 0.5) AS RMSE, @LackOfFitSumOfSquares AS LackOfFitSumOfSquares,
Power(@Correlation, 2) * 100 AS R2, @Covariance AS Covariance, @Correlation AS Correlation, @Slope AS Slope, @Intercept AS Intercept'

--SELECT @SQLString -- uncomment this to debug dynamic SQL errors
EXEC (@SQLString)
</pre>
<p><span style="font-size:10pt;color:white;">…………</span>Most of the code for this procedure is identical to that of the aforementioned Cook’s Distance procedure, which requires regression, covariance and correlation computations.[ix] For the sake of brevity, I won’t rehash here how to derive the slope, intercept and other such constituent calculations. The really striking thing is how few lines of code it takes to derive all of these incredibly useful stats in one fell swoop, for which we can thank the powerful T-SQL windowing functions introduced in SQL Server 2012. It is noteworthy though that the outer query in Step 4 is necessary because of T-SQL error 4109, “Windowed functions cannot be used in the context of another windowed function or aggregate,” which prevents us from performing the calculations in one big gulp and plugging them into the SUM. Besides a few departures like that, the procedure closely follows the format used in the last two tutorial series, in which I start with a common set of parameters that allow users to perform the test on any table in any database they have sufficient access to. The first two lines of code in the procedure body help make this happen, while the rest is dynamic SQL that begins with declarations of the constants, stats and variables the procedure needs to perform its calculations. As usual, the @DecimalPrecision parameter is provided to help users set their own precision and scale (to avoid errors like arithmetic overflows while still accommodating columns of all sizes), and the SELECT @SQLString near the end can be uncommented for debugging purposes.</p>
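<p>For readers who want to sanity-check the procedure’s logic outside of SQL Server, the four steps can be mirrored in a few lines of Python on a toy dataset; the numbers are fabricated, and only the formulas follow the T-SQL above:</p>

```python
# A hedged cross-check of the procedure's four steps on made-up data, with the
# slope derived from the correlation and the two standard deviations, exactly
# as in Step #3 of the T-SQL.
import math

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.8, 4.3, 5.9, 8.2, 9.8]
n = len(xs)

# Step 1: global aggregates (sample standard deviations, like T-SQL's StDev)
mean_x, mean_y = sum(xs) / n, sum(ys) / n
stdev_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
stdev_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))

# Step 2: covariance first, then the correlation
covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
correlation = covariance / (stdev_x * stdev_y)

# Step 3: slope and intercept
slope = correlation * (stdev_y / stdev_x)
intercept = mean_y - slope * mean_x

# Step 4: mean squared error from the predicted values
predicted = [slope * x + intercept for x in xs]
mse = sum((p - y) ** 2 for p, y in zip(predicted, ys)) / n
rmse = math.sqrt(mse)
r2 = correlation ** 2  # as a proportion; the procedure reports it * 100
```

Note that the slope computed this way algebraically reduces to covariance divided by the variance of the first column, so either derivation should land on the same line.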
<p><strong><u>Figure 2: Sample Results from the Regression Goodness-of-Fit Test on the Duchennes Dataset</u></strong> (click to enlarge)</p>
<pre style="font-size:9.5pt;font-family:Consolas;">
EXEC [Calculations].[GoodnessOfFitRegressionTestSP]
              @DatabaseName = N'DataMiningProjects',
              @SchemaName = N'Health',
              @TableName = N'DuchennesTable',
              @ColumnName1 = N'Hemopexin',
              @ColumnName2 = N'LactateDehydrogenase',
              @DecimalPrecision = N'38,21'
</pre>
<p><a href="https://multidimensionalmayhem.wordpress.com/2016/01/13/goodness-of-fit-testing-with-sql-server-part-4-1-r2-rmse-and-regression-related-routines/goodnessoffitregressionresults-1/" rel="attachment wp-att-547"><img class="alignnone size-full wp-image-547" src="https://multidimensionalmayhem.files.wordpress.com/2016/01/goodnessoffitregressionresults-1.jpg?w=604&h=33" alt="GoodnessOfFitRegressionResults (1)" width="604" height="33" /></a></p>
<p><strong><u>Figure 3: Sample Results from the Regression Goodness-of-Fit Test on the Higgs Boson Dataset</u></strong> (click to enlarge)</p>
<pre style="font-size:9.5pt;font-family:Consolas;">
EXEC [Calculations].[GoodnessOfFitRegressionTestSP]
              @DatabaseName = N'DataMiningProjects',
              @SchemaName = N'Physics',
              @TableName = N'HiggsBosonTable',
              @ColumnName1 = N'Column1',
              @ColumnName2 = N'Column2',
              @DecimalPrecision = N'38,29'
</pre>
<p><span style="font-size:10pt;color:white;"><a href="https://multidimensionalmayhem.wordpress.com/2016/01/13/goodness-of-fit-testing-with-sql-server-part-4-1-r2-rmse-and-regression-related-routines/goodnessoffitregressionresults-2/" rel="attachment wp-att-548"><img class="alignnone size-full wp-image-548" src="https://multidimensionalmayhem.files.wordpress.com/2016/01/goodnessoffitregressionresults-2.jpg?w=604&h=31" alt="GoodnessOfFitRegressionResults (2)" width="604" height="31" /></a></span></p>
<p><span style="font-size:10pt;color:white;">…………</span>I’ve made it standard practice in the last two tutorial series to test my procedures first on the 209 rows of a dataset on the Duchennes form of muscular dystrophy provided by <a href="http://biostat.mc.vanderbilt.edu/wiki/Main/DataSets">Vanderbilt University’s Department of Biostatistics</a>, then on the first couple of float columns in the 11-million-row Higgs Boson Dataset, which is made publicly available by the <a href="https://archive.ics.uci.edu/ml/datasets/HIGGS">University of California at Irvine’s Machine Learning Repository</a>. As depicted in Figure 2, the regression line for the Hemopexin protein and Lactate Dehydrogenase enzyme data in the Duchennes table fits poorly, as indicated by the MeanSquaredError, RMSE, LackOfFitSumOfSquares and R<sup>2</sup> results. The graphic below it demonstrates clearly that the first two float columns in the Higgs Boson dataset don’t fit well on a regression line either. Neither result is surprising, given that the correlation coefficients for both are near zero, which indicates a lack of relationship between the variables (a strongly negative value would indicate a strongly inverse relationship, whereas positive values would do the opposite).<br />
<span style="font-size:10pt;color:white;">…………</span>What was truly surprising is how well the latter query performed on the Higgs Boson table, which takes up nearly 6 gigabytes in the DataMiningProjects database I assembled from various practice datasets. It only took 2:04 to execute on my <a href="https://www.youtube.com/watch?v=X-rkFaIPyL4">clunker</a> of a development machine, which hardly qualifies as a real database server. The execution plan in Figure 4 may provide clues as to why: most of the costs come in terms of three non-clustered index seeks, which is normally what we want to see. Nor are there any expensive Sort operators. Most of the others are parallelism and Compute Scalar operators that come at next to no cost. In last week’s article, I mentioned that it really doesn’t hurt to calculate both the Jarque-Bera and D’Agostino-Pearson Omnibus Test together, since the costs are incurred almost exclusively in traversing a whole table to derive the constituent skewness and kurtosis values. In the same way, it doesn’t cost us much to calculate the MSE, RMSE and Lack-of-Fit Sum-of-Squares together in Step 4, once we’ve already gone to the trouble of traversing the whole table by calculating one of them. It also costs us just a single operation to derive the R<sup>2</sup> once we’ve done the regression and have the correlation, and nothing at all to return the covariance, correlation, slope and intercept if we’re going to go to the trouble of getting the R<sup>2</sup>. The execution plan essentially bears this out, since the Index Seeks perform almost all the work.</p>
<p><strong><u>Figure 4: Execution Plan for the Regression Goodness-of-Fit Test on the Higgs Boson Dataset</u></strong> (click to enlarge)<br />
<a href="https://multidimensionalmayhem.wordpress.com/2016/01/13/goodness-of-fit-testing-with-sql-server-part-4-1-r2-rmse-and-regression-related-routines/goodnessoffitregressionexecutionplan/" rel="attachment wp-att-546"><img class="alignnone size-full wp-image-546" src="https://multidimensionalmayhem.files.wordpress.com/2016/01/goodnessoffitregressionexecutionplan.jpg?w=604&h=214" alt="GoodnessOfFitRegressionExecutionPlan" width="604" height="214" /></a></p>
<p><span style="font-size:10pt;color:white;">…………</span>There are of course limitations and drawbacks with this procedure and the formulas it is meant to reflect. It is always possible that I’m not implementing them accurately, since I’m writing this in order to learn the topic, not because I know it already; as usual, my sample code is more of a suggested means of implementation, not a well-tested workhorse ready to go into a production environment tomorrow. I still lack the level of understanding I wish I had of the internal mechanics of the equations; in fact, I’m still having trouble wrapping my head around such concepts as the difference between the coefficient of determination and variance explained, which seem to overlap quite closely.[x] Moreover, the MSE can place too much weight on outliers for some use cases, even when implemented accurately.[xi] The RMSE also can’t be used to compare regressions between different sets of columns, “as it is scale-dependent.”[xii]<br />
<span style="font-size:10pt;color:white;">…………</span>The values for some of the stats returned above also suffer from a different scaling issue, in that they tend to increase too quickly as the number of records accumulates. They’re not in the same league as the truly astronomical values I’ve seen with other stats I’ve surveyed in the last two tutorial series, but the fact that the Lack-of-Fit Sum-of-Squares reaches eight digits above the decimal place is bothersome. That’s about the upper limit of what end users can read before they have to start counting the digits by hand, which rapidly degrades the legibility of the statistic.</p>
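<p>To see why a raw sum of squares outgrows legibility while a mean-based statistic does not, consider this small Python simulation (hypothetical residuals with a fixed spread; only the row count changes between the two calls):</p>

```python
import random

random.seed(42)

def sum_of_squared_residuals(n):
    # Hypothetical residuals with a constant spread of 10;
    # only the number of rows differs between the two calls below.
    return sum(random.gauss(0, 10) ** 2 for _ in range(n))

small_ss = sum_of_squared_residuals(1_000)
large_ss = sum_of_squared_residuals(1_000_000)

# The raw sums differ by roughly three orders of magnitude even though
# the quality of the "fit" is identical...
print(round(small_ss), round(large_ss))

# ...while dividing by the row count keeps both on the same legible scale.
print(round(small_ss / 1_000, 1), round(large_ss / 1_000_000, 1))
```

<p>The per-row version hovers near the same value at any table size, which is the sense in which rescaling by the record count restores interpretability.</p>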
<p style="text-align:center;"><strong>Traditional Metrics and Tests in the “Big Data” Era</strong></p>
<p><span style="font-size:10pt;color:white;">…………</span>That just adds to my growing conviction that the vastly larger datasets in use today may require new statistical measures or rescaling of tried-and-true ones in order to accommodate their sheer size. We shouldn’t have to sacrifice the main strength of Big Data[xiii], which is the fact that we can now quickly derive very detailed descriptions of very large datasets, just to use these methods. As we have seen throughout the last two tutorial series, this issue has consistently thrown a monkey wrench into many of the established statistical procedures, which were designed decades or even centuries ago with datasets of a few dozen records in mind, not several million. We’ve seen it in the exponent and factorial operations required to derive many well-established measures, which simply can’t be performed at all on values of more than a few hundred without leading to disastrous arithmetic overflows and loss of precision. We’ve seen it again this week and last, in which the high record counts made the final statistics a little less legible.<br />
<span style="font-size:10pt;color:white;">…………</span>We’ve also seen it in some of the hypothesis testing methods, which require lookup tables that often only go up to record counts of a few hundred at best. That’s a problem that will rear its head again in a few weeks when I try, and fail, to implement the popular Shapiro-Wilk Test of normality, which supposedly has excellent statistical power yet is only usable up to about 50 records.[xiv] Such goodness-of-fit tests for probability distributions can also be applied to regression, to determine if the residuals are distributed in a bell curve; cases in point include the histograms discussed in <a href="https://multidimensionalmayhem.wordpress.com/2015/04/21/outlier-detection-with-sql-server-part-6-1-visual-outlier-detection-with-reporting-services/">Outlier Detection with SQL Server, Part 6.1: Visual Outlier Detection with Reporting Services</a> and the Chi-Squared Test, which I’ll cover in a few weeks.[xv] Rather than applying these tests to regression in this segment of the series, I’ll introduce the ones I haven’t covered yet separately. For the sake of simplicity, I won’t delve into complicated topics like lack-of-fit testing on variants like multiple regression at this point. It would be useful, however, to finish off this segment of the series next week by introducing the Hosmer–Lemeshow Test, which can be applied to Logistic Regression, one of the most popular alternative regression algorithms. As discussed in <a href="https://multidimensionalmayhem.wordpress.com/2013/01/23/a-rickety-stairway-to-sql-server-data-mining-algorithm-4-logistic-regression/">A Rickety Stairway to SQL Server Data Mining, Algorithm 4: Logistic Regression</a>, a logistic function is applied to produce an S-shaped curve that bounds outcomes between 0 and 1, which fits many user scenarios. 
Thankfully, the code will be much simpler to implement now that we’ve got this week’s T-SQL and concepts out of the way, so it should make for an easier read.</p>
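<p>As a concrete instance of the overflow problem mentioned above: the factorials that appear in many classical test statistics exceed the range of a float at 171!, whereas the standard workaround of computing in logarithmic space via the log-gamma function stays finite even at a million rows. A minimal Python sketch (purely illustrative, not taken from any of the cited formulas):</p>

```python
import math

# 171! is roughly 1.24e309, which exceeds the largest double (~1.8e308),
# so exponentiating its log-gamma value overflows a float.
try:
    math.exp(math.lgamma(171 + 1))
    overflowed = False
except OverflowError:
    overflowed = True

# Staying in log space sidesteps the overflow entirely:
# lgamma(n + 1) = ln(n!), which remains finite even for n = 1,000,000.
log_fact_million = math.lgamma(1_000_000 + 1)
print(overflowed, log_fact_million)
```

<p>The same trick applies to large exponents: sum the logarithms, and only exponentiate (if at all) after any cancellation has brought the result back into range.</p>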
<p> </p>
<p>[i] See the <u>Wikipedia</u> page “Mean Squared Error” at <a href="http://en.wikipedia.org/wiki/Mean_squared_error">http://en.wikipedia.org/wiki/Mean_squared_error</a></p>
<p>[ii] Hopkins, Will G., 2001, “Root Mean-Square Error (RMSE),” published at the <u>A New View of Statistics</u> web address <a href="http://www.sportsci.org/resource/stats/rmse.html">http://www.sportsci.org/resource/stats/rmse.html</a></p>
<p>[iii] For a more in depth explanation of the interrelationships between these stats and why they operate as they do, see Hopkins, Will G., 2001, “Models: Important Details,” published at the <u>A New View of Statistics</u> web address <a href="http://www.sportsci.org/resource/stats/modelsdetail.html#residuals">http://www.sportsci.org/resource/stats/modelsdetail.html#residuals</a></p>
<p>[iv] See the <u>Wikipedia</u> page “Root Mean Square Deviation” at <a href="http://en.wikipedia.org/wiki/Root-mean-square_deviation" rel="nofollow">http://en.wikipedia.org/wiki/Root-mean-square_deviation</a></p>
<p>[v] See the succinct explanation at the <u>Wikipedia</u> page “Errors and Residuals in Statistics” at <a href="http://en.wikipedia.org/wiki/Errors_and_residuals_in_statistics">http://en.wikipedia.org/wiki/Errors_and_residuals_in_statistics</a></p>
<p>[vi] Mukesh Mahadeo’s reply to the thread “What is Mean by Lack of Fit in Least Square Method?” at the Yahoo! Answers web address <a href="https://in.answers.yahoo.com/question/index?qid=20100401082012AAf0yXg">https://in.answers.yahoo.com/question/index?qid=20100401082012AAf0yXg</a></p>
<p>[vii] Which I derived from the formulas at the <u>Wikipedia</u> webpage “Lack-of-Fit Sum of Squares” at <a href="http://en.wikipedia.org/wiki/Lack-of-fit_sum_of_squares">http://en.wikipedia.org/wiki/Lack-of-fit_sum_of_squares</a></p>
<p>[viii] See Cox, Nick, 2013, reply to the <u>CrossValidated</u> thread “Are Goodness of Fit and Lack of Fit the Same?” on Aug. 2, 2013. Available at the web address <a href="http://stats.stackexchange.com/questions/66311/are-goodness-of-fit-and-lack-of-fit-the-same">http://stats.stackexchange.com/questions/66311/are-goodness-of-fit-and-lack-of-fit-the-same</a></p>
<p>[ix] As mentioned in that article, the original sources for the internal calculations included Hopkins, Will G., 2001, “Correlation Coefficient,” published at the <u>A New View of Statistics</u> web address <a href="http://www.sportsci.org/resource/stats/correl.html">http://www.sportsci.org/resource/stats/correl.html</a>; the Dummies.Com webpage “How to Calculate a Regression Line” at <a href="http://www.dummies.com/how-to/content/how-to-calculate-a-regression-line.html">http://www.dummies.com/how-to/content/how-to-calculate-a-regression-line.html</a>; the <u>Wikipedia</u> page “Mean Squared Error” at <a href="http://en.wikipedia.org/wiki/Mean_squared_error">http://en.wikipedia.org/wiki/Mean_squared_error</a>; and the <u>Wikipedia</u> page “Lack-of-Fit Sum of Squares” at <a href="http://en.wikipedia.org/wiki/Lack-of-fit_sum_of_squares">http://en.wikipedia.org/wiki/Lack-of-fit_sum_of_squares</a>.</p>
<p>[x] I’ve seen competing equations in the literature, one based on residual sum-of-squares calculations and the other on squaring of the correlation coefficient. The wording often leads me to believe that they arrive at the same results through different methods, but I’m not yet certain of this.</p>
<p>[xi] See the <u>Wikipedia</u> page “Mean Squared Error” at <a href="http://en.wikipedia.org/wiki/Mean_squared_error">http://en.wikipedia.org/wiki/Mean_squared_error</a></p>
<p>[xii] See the <u>Wikipedia</u> page “Root-Mean-Square Deviation” at <a href="http://en.wikipedia.org/wiki/Root-mean-square_deviation" rel="nofollow">http://en.wikipedia.org/wiki/Root-mean-square_deviation</a></p>
<p>[xiii] It’s a buzzword, I know, but it’s the most succinct term I can use here.</p>
<p>[xiv] Some sources say up to a couple hundred records, and I’m not familiar enough with the topic to discern which limit applies in which cases. It’s a moot point, however, because we need such tests to work on datasets of several hundred million rows.</p>
<p>[xv] See the undated publication “Goodness of Fit in Linear Regression” retrieved from Lawrence Joseph’s course notes on Bayesian Statistics on Oct. 30, 2014, which are published at the website of the <u>McGill University Faculty of Medicine</u>. Available at the web address <a href="http://www.medicine.mcgill.ca/epidemiology/joseph/courses/EPIB-621/fit.pdf">http://www.medicine.mcgill.ca/epidemiology/joseph/courses/EPIB-621/fit.pdf</a>. No author is listed but I presume that Prof. Joseph wrote it. This is such a good source of information for the topic of this article that I couldn’t neglect to work in a mention of it.</p>
<p> </p>Goodness-of-Fit Testing with SQL Server, part 3.2: D’Agostino’s K-Squared Test
http://www.sqlservercentral.com/blogs/multidimensionalmayhem/2015/12/21/goodness-of-fit-testing-with-sql-server-part-32-dagostinos-k-squared-test/
Tue, 22 Dec 2015 01:07:49 UT/blogs/multidimensionalmayhem/2015/12/21/goodness-of-fit-testing-with-sql-server-part-32-dagostinos-k-squared-test/0http://www.sqlservercentral.com/blogs/multidimensionalmayhem/2015/12/21/goodness-of-fit-testing-with-sql-server-part-32-dagostinos-k-squared-test/#comments<p><strong>By Steve Bolton</strong></p>
<p><span style="font-size:10pt;color:white;">…………</span>In the last edition of this amateur series of self-tutorials on goodness-of-fit testing with SQL Server, we discussed the Jarque-Bera Test, a measure that unfortunately doesn’t scale well on datasets of the size that DBAs are accustomed to using. The problem is not with the usefulness of the statistics that it is composed of, because skewness and kurtosis are easy to interpret and valuable in their own right as measures of shape and for purposes of outlier detection. Usually scaling problems signify performance issues, but the resource consumption and execution time of the Jarque-Bera Test aren’t bad by any means; the issue is that the statistic itself increases to ungodly numbers that are difficult to interpret, precisely because it was designed with smaller datasets in mind. In this week’s installment, I’ll provide an alternative measure that also builds upon skewness and kurtosis and can be calculated in almost exactly the same amount of time as Jarque-Bera, but without the cumbersome scaling issue.<br />
<span style="font-size:10pt;color:white;">…………</span>The improved interpretability of D’Agostino’s K-Squared Test comes at the cost of more complicated internal calculations, which turn out to be trivial in comparison to the main computational costs, which consist almost exclusively of index seeks and sorts in the execution plan issued by the SQL Server query optimizer. This added complexity is only a problem if one wants to check to see what’s going on under the hood in these calculations, which is rarely necessary in most use cases after the code has been validated. As I pointed out at every opportunity in my earlier mistutorial series <a href="http://www.sqlservercentral.com/blogs/multidimensionalmayhem/2012/11/28/a-rickety-stairway-to-sql-server-data-mining-part-01-data-in-data-out/">A Rickety Stairway to SQL Server Data Mining</a>, most end users have about as much need to understand how such statistics are derived as the average driver needs to know the engineering details of their car; in many cases it is a mistake to overload them with superfluous information like incomprehensible math equations. That is why I haven’t posted any such formulas in the last few tutorial series I’ve published here. End users should understand enough to interpret the results in light of their domain knowledge, just as the average rush hour commuter needs to know how to read a gas gauge and transmission fluid stick properly. Those who write the computer code that implements these stats obviously need to grasp the inner workings at a much deeper level, but not to the point that they’re designing their own formulas; data mining programmers essentially occupy the middle zone halfway between end users and mathematicians, in the same way that garage mechanics reside in the niche between drivers and automotive engineers. 
It is my goal to learn the skills necessary to serve at this midpoint, but as I usually point out, I haven’t reached it yet; I hope to use blog posts of this kind to familiarize myself with these topics better, not because I already know the material well. And that is why I cannot explain in great detail <em>why</em> D’Agostino’s K-Squared Test (a.k.a. the D’Agostino-Pearson Omnibus Test) works as it does. Like a typical mechanic, I was able to get it running sufficiently well that it returns the expected results in a potentially reliable way, but I don’t have sufficient skill to comment on why it was designed as it was. Nevertheless I did pick up a few things while reading sources like D’Agostino, et al.’s 1990 paper in <em>The American Statistician</em>[1] and as usual, the Wikipedia[2] article on the topic, which may not be a professional source but qualifies as the most convenient repository for every math formula under the sun.<br />
<span style="font-size:10pt;color:white;">…………</span>As you can gather from the T-SQL code in Figure 1, the underlying equations I found in the former source are fairly complicated and involve the derivation of several intermediate statistics in between the sample skewness and kurtosis and the final metric. Although the latter source is only an introduction to the topic, it did provide some invaluable insights into the aim of these calculations, albeit without explaining why those particular calculations satisfied those aims. Apparently, the @Z1 and @Z2 measures are meant to bring the skewness and kurtosis in line with the standard normal distribution to solve their “frustratingly slow” approach to the distribution limit, which is a scaling issue of sorts.[3] The SELECT statement towards the end that assigns the final value to @K2Test combines these internal calculations into a single result so that the skewness and kurtosis can be measured together, in what is technically known as an “omnibus test.”[4] After all these esoteric calculations, that final assignment is actually quite simple. I’m sure the nitty-gritty details are in the original academic articles published by statisticians Ralph D’Agostino and E.S. Pearson in the early ‘70s, but I couldn’t find any publicly accessible copies; judging from the difficulty I had in following the 1990 paper, much of it would still have been over my head anyway. The important thing to know is that I was able to follow the equations sufficiently well that the code below returns the correct results for the examples provided by D’Agostino, et al.</p>
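<p>Before wading through the dynamic SQL, it may help to see the same chain of calculations in one place. The Python sketch below transliterates the published formulas — the skewness transformation from D’Agostino’s work plus the standard kurtosis branch — and runs them on hypothetical random samples; like my T-SQL, treat it as a suggested illustration for cross-checking, not validated statistical software:</p>

```python
import math
import random

def dagostino_k2(sample):
    # D'Agostino-Pearson omnibus test: sample skewness and kurtosis are each
    # transformed toward standard normal Z-scores, then K^2 = Z1^2 + Z2^2.
    n = float(len(sample))
    mean = sum(sample) / n
    m2 = sum((x - mean) ** 2 for x in sample) / n
    m3 = sum((x - mean) ** 3 for x in sample) / n
    m4 = sum((x - mean) ** 4 for x in sample) / n
    g1 = m3 / m2 ** 1.5          # sample skewness
    b2 = m4 / m2 ** 2            # sample kurtosis

    # Z1: transformed skewness
    y = g1 * math.sqrt(((n + 1) * (n + 3)) / (6 * (n - 2)))
    beta2 = (3 * (n ** 2 + 27 * n - 70) * (n + 1) * (n + 3)) / \
            ((n - 2) * (n + 5) * (n + 7) * (n + 9))
    w2 = math.sqrt(2 * (beta2 - 1)) - 1
    delta = 1 / math.sqrt(math.log(math.sqrt(w2)))
    alpha = math.sqrt(2 / (w2 - 1))
    z1 = delta * math.log(y / alpha + math.sqrt((y / alpha) ** 2 + 1))

    # Z2: transformed kurtosis
    e_b2 = 3 * (n - 1) / (n + 1)                      # expected kurtosis
    var_b2 = (24 * n * (n - 2) * (n - 3)) / \
             ((n + 1) ** 2 * (n + 3) * (n + 5))       # its variance
    x = (b2 - e_b2) / math.sqrt(var_b2)               # standardized kurtosis
    sqrt_beta1 = (6 * (n ** 2 - 5 * n + 2) / ((n + 7) * (n + 9))) * \
                 math.sqrt(6 * (n + 3) * (n + 5) / (n * (n - 2) * (n - 3)))
    a = 6 + (8 / sqrt_beta1) * (2 / sqrt_beta1 +
                                math.sqrt(1 + 4 / sqrt_beta1 ** 2))
    ratio = (1 - 2 / a) / (1 + x * math.sqrt(2 / (a - 4)))
    cube_root = math.copysign(abs(ratio) ** (1.0 / 3.0), ratio)  # real cube root
    z2 = math.sqrt(9 * a / 2) * (1 - 2 / (9 * a) - cube_root)

    return z1 ** 2 + z2 ** 2

random.seed(7)
k2_normal = dagostino_k2([random.gauss(0, 1) for _ in range(2_000)])
k2_skewed = dagostino_k2([random.expovariate(1) for _ in range(2_000)])
print(k2_normal, k2_skewed)
```

<p>On a near-normal sample the statistic stays small (it is approximately chi-squared with two degrees of freedom), while a skewed exponential sample of the same size drives it into the hundreds — yet unlike Jarque-Bera, the Z-scores themselves stay on the familiar standard normal scale regardless of the record count.</p>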
<p><strong><u>Figure 1: T-SQL Code for the D’Agostino-Pearson K-Squared Test<br />
</u></strong><span style="font-size:9.5pt;font-family:Consolas;color:blue;">CREATE</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:blue;">PROCEDURE</span> <span style="color:teal;">[Calculations]</span><span class="GramE"><span style="color:gray;">.</span><span style="color:teal;">[</span></span><span class="SpellE"><span style="color:teal;">NormalityTestDAgostinosKSquaredSP</span></span><span style="color:teal;">]<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:teal;">@DatabaseName</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:blue;">as</span> <span style="color:blue;">nvarchar</span><span style="color:gray;">(</span>128<span style="color:gray;">)</span> <span style="color:gray;">=</span> <span style="color:gray;">NULL,</span> <span style="color:teal;">@<span class="SpellE">SchemaName</span></span> <span style="color:blue;">as</span> <span style="color:blue;">nvarchar</span><span style="color:gray;">(</span>128<span style="color:gray;">),</span> <span style="color:teal;">@<span class="SpellE">TableName</span></span> <span style="color:blue;">as</span> <span style="color:blue;">nvarchar</span><span style="color:gray;">(</span>128<span style="color:gray;">),</span><span style="color:teal;">@<span class="SpellE">ColumnName</span></span> <span style="color:blue;">AS</span> <span style="color:blue;">nvarchar</span><span style="color:gray;">(</span>128<span style="color:gray;">),</span> <span style="color:teal;">@PrimaryKeyName</span> <span style="color:blue;">as</span> <span style="color:blue;">nvarchar</span><span style="color:gray;">(</span>400<span style="color:gray;">)<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:blue;">AS</span></p>
<p class="MsoNormal"><span style="font-size:9.5pt;font-family:Consolas;color:blue;">DECLARE</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:teal;">@<span class="SpellE">SchemaAndTableName</span></span> <span class="GramE"><span style="color:blue;">nvarchar</span><span style="color:gray;">(</span></span>400<span style="color:gray;">),</span><span style="color:teal;">@<span class="SpellE">SQLString </span></span><span style="color:blue;">nvarchar</span><span style="color:gray;">(</span><span style="color:fuchsia;">max</span><span style="color:gray;">)<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:blue;">SET</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:teal;">@<span class="SpellE">SchemaAndTableName</span></span> <span style="color:gray;">=</span> <span style="color:teal;">@DatabaseName</span> <span style="color:gray;">+</span> <span style="color:red;">‘.’</span> <span style="color:gray;">+</span> <span style="color:teal;">@<span class="SpellE">SchemaName </span></span><span style="color:gray;">+</span> <span style="color:red;">‘.’</span> <span style="color:gray;">+</span> <span style="color:teal;">@<span class="SpellE">TableName</span></span><br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:blue;">SET</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:teal;">@<span class="SpellE">SQLString</span></span> <span style="color:gray;">=</span> </span><span style="font-size:9.5pt;font-family:Consolas;color:red;">‘DECLARE </span><span style="font-size:9.5pt;font-family:Consolas;color:red;">@Mean float, @StDev float, @Count as bigint, </span><span style="font-size:9.5pt;font-family:Consolas;color:red;">@Alpha <span class="GramE">decimal(</span>38,37),<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;">@One float = 1, @Two float = 2, @Three float = 3, @Four float = 4, @Five float = 5,<br />
@Six float = 6, @Seven float = 7, @Eight float = 8, @Nine float = 9,<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;">@<span class="SpellE">TwentyFour</span> float = 24, @<span class="SpellE">TwentySeven </span>float = 27, @Seventy float = 70,<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;">@<span class="SpellE">RecpiprocalOfNSampleSize</span> float, </span><span style="font-size:9.5pt;font-family:Consolas;color:red;">@<span class="SpellE">DifferenceFromSampleMeanSquared</span> float, @<span class="SpellE">DifferenceFromSampleMeanCubed</span><br />
float, @<span class="SpellE">DifferenceFromSampleMeanFourthPower</span> float, </span><span style="font-size:9.5pt;font-family:Consolas;color:red;">@<span class="SpellE">SampleSkewness</span> float, </span><span style="font-size:9.5pt;font-family:Consolas;color:red;">@<span class="SpellE">SampleKurtosis</span> float,<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;">@Y float, </span><span style="font-size:9.5pt;font-family:Consolas;color:red;">@B2 float, </span><span style="font-size:9.5pt;font-family:Consolas;color:red;">@<span class="SpellE">WSquared</span> float, </span><span style="font-size:9.5pt;font-family:Consolas;color:red;">@Z1 float, </span><span style="font-size:9.5pt;font-family:Consolas;color:red;">@Z2 float, </span><span style="font-size:9.5pt;font-family:Consolas;color:red;">@Sigma float, </span><span style="font-size:9.5pt;font-family:Consolas;color:red;">@E float, </span><span style="font-size:9.5pt;font-family:Consolas;color:red;">@<span class="SpellE">VarianceKurtosis</span> float,<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;">@<span class="SpellE">StandardizedKurtosis</span> float, </span><span style="font-size:9.5pt;font-family:Consolas;color:red;">@<span class="SpellE">ThirdStandardizedMomentOfKurtosis</span> float, </span><span style="font-size:9.5pt;font-family:Consolas;color:red;">@A float, </span><span style="font-size:9.5pt;font-family:Consolas;color:red;">@K2Test float</span><span style="font-size:9.5pt;font-family:Consolas;"> </span></p>
<p class="MsoNormal"><span style="font-size:9.5pt;font-family:Consolas;color:red;">SELECT @Count = <span class="GramE">Count(</span>‘</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:gray;">+ </span><span style="color:teal;">@<span class="SpellE">ColumnName</span></span> <span style="color:gray;">+</span> <span style="color:red;">‘), @Mean = <span class="SpellE">Avg</span>(CAST(‘</span> <span style="color:gray;">+</span> <span style="color:teal;">@<span class="SpellE">ColumnName</span></span> <span style="color:gray;">+</span> <span style="color:red;">‘ AS float)), @StDev = StDev(‘</span> <span style="color:gray;">+</span> <span style="color:teal;">@<span class="SpellE">ColumnName</span></span> <span style="color:gray;">+</span> <span style="color:red;">‘)<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:red;">FROM ‘</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:gray;">+</span> <span style="color:teal;">@<span class="SpellE">SchemaAndTableName</span></span> <span class="GramE"><span style="color:gray;">+</span> <span style="color:red;">‘<br />
</span></span></span><span style="font-size:9.5pt;font-family:Consolas;color:red;">WHERE ‘</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:gray;">+</span> <span style="color:teal;">@<span class="SpellE">ColumnName</span></span> <span style="color:gray;">+</span> <span style="color:red;">‘ IS NOT NULL</span></span></p>
<p class="MsoNormal"><span style="font-size:9.5pt;font-family:Consolas;color:red;">SELECT @<span class="SpellE">RecpiprocalOfNSampleSize</span> = @One / @Count</span></p>
<p class="MsoNormal"><span style="font-size:9.5pt;font-family:Consolas;color:red;">DECLARE</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:red;">@<span class="SpellE">CountPlusOne</span> float = @Count + @One, </span></span><span style="font-size:9.5pt;font-family:Consolas;color:red;">@<span class="SpellE">CountPlusThree</span> float = @Count + @Three, </span><span style="font-size:9.5pt;font-family:Consolas;color:red;">@<span class="SpellE">CountPlusFive</span> float = @Count + @Five,<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;">@<span class="SpellE">CountPlusSeven</span> float = @Count + @Seven, </span><span style="font-size:9.5pt;font-family:Consolas;color:red;">@<span class="SpellE">CountPlusNine</span> float = @Count + @Nine, </span><span style="font-size:9.5pt;font-family:Consolas;color:red;">@<span class="SpellE">CountMinusTwo</span> float = @Count – @Two,<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;">@<span class="SpellE">CountMinusThree</span> float = @Count – @Three</span></p>
<p class="MsoNormal"><span style="font-size:9.5pt;font-family:Consolas;color:red;">DECLARE @<span class="SpellE">CountPlusOneTimesCountPlusThree</span> float <span class="GramE">=<span> </span>(</span>@Count + @One) * (@Count + @Three)</span><span style="font-size:9.5pt;font-family:Consolas;"> </span></p>
<p class="MsoNormal"><span class="GramE"><span style="font-size:9.5pt;font-family:Consolas;color:red;">SELECT<span> </span>@<span class="SpellE">DifferenceFromSampleMeanSquared </span></span></span><span style="font-size:9.5pt;font-family:Consolas;color:red;">= SUM(Power(<span class="SpellE">DifferenceFromSampleMean</span>, 2)) OVER (ORDER BY ‘</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:gray;">+</span> <span style="color:teal;">@PrimaryKeyName </span><span style="color:gray;">+</span> <span style="color:red;">‘ ASC),<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:red;"><span> </span><span class="GramE">@<span class="SpellE">DifferenceFromSampleMeanCubed</span><span> </span>=</span> SUM(Power(<span class="SpellE">DifferenceFromSampleMean</span>, 3)) OVER (ORDER BY ‘</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:gray;">+</span> <span style="color:teal;">@PrimaryKeyName</span> <span style="color:gray;">+</span> <span style="color:red;">‘ ASC),<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:red;"><span> </span><span class="GramE">@<span class="SpellE">DifferenceFromSampleMeanFourthPower</span><span> </span>=</span>SUM(Power(<span class="SpellE">DifferenceFromSampleMean</span>, 4)) OVER (ORDER BY ‘</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:gray;">+</span> <span style="color:teal;">@PrimaryKeyName</span> <span style="color:gray;">+</span> <span style="color:red;">‘ ASC)<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:red;">FROM (SELECT ‘</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:gray;">+</span> <span style="color:teal;">@PrimaryKeyName </span><span style="color:gray;">+</span> <span style="color:red;">‘, <span class="GramE">CAST(</span>‘ </span><span style="color:gray;">+</span> <span style="color:teal;">@<span class="SpellE">ColumnName </span></span><span style="color:gray;">+</span> <span style="color:red;">‘ AS float) – @Mean as <span class="SpellE">DifferenceFromSampleMean</span> — make a single pass across the table?<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:red;"><span> </span>FROM ‘</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:gray;">+</span> <span style="color:teal;">@<span class="SpellE">SchemaAndTableName </span></span><span class="GramE"><span style="color:gray;">+</span> <span style="color:red;">‘<br />
</span></span></span><span style="font-size:9.5pt;font-family:Consolas;color:red;"><span> </span>WHERE ‘</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:gray;">+</span> <span style="color:teal;">@<span class="SpellE">ColumnName </span></span><span style="color:gray;">+</span> <span style="color:red;">‘ IS NOT NULL) AS T1</span></span></p>
<p class="MsoNormal"><span style="font-size:9.5pt;font-family:Consolas;color:red;">SELECT @<span class="SpellE">SampleSkewness</span> = (@<span class="SpellE">RecpiprocalOfNSampleSize </span>* @<span class="SpellE">DifferenceFromSampleMeanCubed</span><span class="GramE">)<span> </span>/</span>(Power((@<span class="SpellE">RecpiprocalOfNSampleSize </span>* @<span class="SpellE">DifferenceFromSampleMeanSquared</span>), 1.5))</span></p>
<p class="MsoNormal"><span style="font-size:9.5pt;font-family:Consolas;color:red;">SELECT @<span class="SpellE">SampleKurtosis</span> = (@<span class="SpellE">RecpiprocalOfNSampleSize </span>* @<span class="SpellE">DifferenceFromSampleMeanFourthPower</span><span class="GramE">)<span> </span>/</span>(Power((@<span class="SpellE">RecpiprocalOfNSampleSize</span> * @<span class="SpellE">DifferenceFromSampleMeanSquared</span>), 2))</span></p>
<p class="MsoNormal"><span style="font-size:9.5pt;font-family:Consolas;color:red;">— perform operations on the Skewness<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;">SELECT @Y = @<span class="SpellE">SampleSkewness</span> * <span class="GramE">Power(</span>((@<span class="SpellE">CountPlusOneTimesCountPlusThree</span>) / (@<span class="SpellE">CountMinusTwo </span>* @Six)), 0.5) — do the brackets signify multiplication? ****<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;">SELECT @B2 = (@<span class="SpellE">CountPlusOneTimesCountPlusThree</span> * (@Three * ((<span class="GramE">Power(</span>@Count, 2) + (@<span class="SpellE">TwentySeven </span>* @Count)) -@Seventy))) / (@<span class="SpellE">CountMinusTwo</span> * @<span class="SpellE">CountPlusFive</span> * (@Count + @Seven) * (@Count + @Nine))<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;">SELECT @<span class="SpellE">WSquared</span> = <span class="GramE">Power(</span>@Two * (@B2 – @One), 0.5) – @One</span><span style="font-size:9.5pt;font-family:Consolas;"><br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;">SELECT @Alpha = <span class="GramE">Power(</span>Abs(@Two / (@<span class="SpellE">WSquared </span>– @One)), 0.5)<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;">SELECT @Sigma <span class="GramE">=<span> </span>@One</span> / (Power(Abs((Log(Abs(Power(@<span class="SpellE">WSquared</span>, 0.5))))), 0.5))<br />
— <span class="SpellE">Im</span> not sure if this sigma is related to StDev or not</span><span style="font-size:9.5pt;font-family:Consolas;"><br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;">SELECT @Z1 = @Sigma * <span class="GramE">Log(</span>(@Y / @Alpha) + Power((Power((@Y / @Alpha), 2) + @One), 0.5))</span><span style="font-size:9.5pt;font-family:Consolas;"> </span></p>
<p class="MsoNormal"><span style="font-size:9.5pt;font-family:Consolas;color:blue;">SET</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:teal;">@<span class="SpellE">SQLString</span></span> <span style="color:gray;">=</span> <span style="color:teal;">@<span class="SpellE">SQLString </span></span><span class="GramE"><span style="color:gray;">+</span> <span style="color:red;">‘</span></span></span><span style="font-size:9.5pt;font-family:Consolas;color:red;">— perform operations on the kurtosis<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;">SELECT <span class="GramE">@E<span> </span>=</span> (@Three * (@Count - @One)) / @<span class="SpellE">CountPlusOne</span> -- according to the paper, this is the mean for the kurtosis<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;">SELECT <span class="GramE">@<span class="SpellE">VarianceKurtosis</span><span> </span>=</span> (@<span class="SpellE">TwentyFour </span>* @Count * @<span class="SpellE">CountMinusTwo</span> * @<span class="SpellE">CountMinusThree</span>) / (Power(@<span class="SpellE">CountPlusOne</span>, 2) * @<span class="SpellE">CountPlusThree </span>* @<span class="SpellE">CountPlusFive</span>)<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;">SELECT <span class="GramE">@<span class="SpellE">StandardizedKurtosis</span><span> </span>=</span> (@<span class="SpellE">SampleKurtosis </span>- @E) / Power(@<span class="SpellE">VarianceKurtosis</span>, 0.5)<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;">SELECT <span class="GramE">@<span class="SpellE">ThirdStandardizedMomentOfKurtosis</span><span> </span>=</span> ((@Six * ((Power(@Count, 2) - (@Five * @Count)) + @Two)) / (@<span class="SpellE">CountPlusSeven</span> * @<span class="SpellE">CountPlusNine</span>)) *<br />
</span><span class="GramE"><span style="font-size:9.5pt;font-family:Consolas;color:red;">Power(</span></span><span style="font-size:9.5pt;font-family:Consolas;color:red;">(@Six * @<span class="SpellE">CountPlusThree</span> * @<span class="SpellE">CountPlusFive</span>) / (@Count * @<span class="SpellE">CountMinusTwo</span> *<span> </span>@<span class="SpellE">CountMinusThree</span>), 0.5)</span><span style="font-size:9.5pt;font-family:Consolas;"> </span></p>
<p class="MsoNormal"><span style="font-size:9.5pt;font-family:Consolas;color:red;">SELECT <span class="GramE">@A<span> </span>=</span> @Six + ((@Eight / @<span class="SpellE">ThirdStandardizedMomentOfKurtosis</span>) * ((@Two / @<span class="SpellE">ThirdStandardizedMomentOfKurtosis</span>) + Power(@One + (@Four / Power(@<span class="SpellE">ThirdStandardizedMomentOfKurtosis</span>, 2)), 0.5)))<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;">SELECT @Z2 = ((@One - (@Two / (@Nine * @A))) - <span class="GramE">Power(</span>(@One - (@Two / @A)) / (@One + (@<span class="SpellE">StandardizedKurtosis</span> * Power((@Two / (@A - @Four)), 0.5))), (@One / @Three))) / Power((@Two / (@Nine *<br />
@A)), 0.5)</span></p>
<p class="MsoNormal"><span style="font-size:9.5pt;font-family:Consolas;color:red;">SELECT @K2Test = <span class="GramE">Power(</span>@Z1, 2) + Power(@Z2, 2)</span></p>
<p class="MsoNormal"><span style="font-size:9.5pt;font-family:Consolas;color:red;">-- uncomment this to debug the internal calculations SELECT @Alpha, @Sigma, @Y AS T, @B2 AS B2, @<span class="SpellE">WSquared</span> AS <span class="SpellE">WSquared</span>,<span> </span>@E AS E, @<span class="SpellE">VarianceKurtosis </span>AS <span class="SpellE">VarianceKurtosis</span>, @<span class="SpellE">StandardizedKurtosis </span>AS <span class="SpellE">StandardizedKurtosis</span>, @<span class="SpellE">ThirdStandardizedMomentOfKurtosis</span><br />
AS <span class="SpellE">ThirdStandardizedMomentOfKurtosis</span>, @A AS A, @<span class="SpellE">DifferenceFromSampleMeanSquared</span>, @<span class="SpellE">RecpiprocalOfNSampleSize</span>, @<span class="SpellE">DifferenceFromSampleMeanCubed</span>, @<span class="SpellE">DifferenceFromSampleMeanFourthPower</span></span><span style="font-size:9.5pt;font-family:Consolas;"> </span></p>
<p class="MsoNormal"><span style="font-size:9.5pt;font-family:Consolas;color:red;">SELECT @K2Test AS<span> </span><span class="SpellE">KSquaredTest</span>, @<span class="SpellE">SampleSkewness</span> AS <span class="SpellE">SampleSkewness</span>, @<span class="SpellE">SampleKurtosis</span> AS <span class="SpellE">SampleKurtosis</span>, @Z1 as Z1, @Z2 as Z2, @Mean AS<span> </span>Mean, @StDev AS StDev</span><span style="font-size:9.5pt;font-family:Consolas;color:red;">‘</span><span style="font-size:9.5pt;font-family:Consolas;"> </span></p>
<p class="MsoNormal"><span style="font-size:9.5pt;font-family:Consolas;color:green;">--SELECT @<span class="SpellE">SQLString</span> -- <span class="GramE">uncomment this</span><br />
to debug dynamic SQL errors</span></p>
<p class="MsoNormal"><span style="font-size:9.5pt;font-family:Consolas;color:blue;">EXEC </span><span style="font-size:9.5pt;font-family:Consolas;color:gray;">(</span><span style="font-size:9.5pt;font-family:Consolas;color:teal;">@<span class="SpellE">SQLString</span></span><span style="font-size:9.5pt;font-family:Consolas;color:gray;">)</span></p>
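<p>Since the procedure’s math is buried inside dynamic SQL, it can be handy to cross-check the arithmetic outside the database. Below is a minimal Python sketch of the same D’Agostino-Pearson transformations, written from the published formulas the procedure follows; the function name is my own invention, and the signed cube root plays the same role as the ABS workarounds in the T-SQL.</p>

```python
import math

def dagostino_k2(values):
    """Z1 (skewness transform), Z2 (kurtosis transform) and the K-squared
    omnibus statistic, computed from full-population central moments."""
    n = float(len(values))
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    sqrt_b1 = m3 / m2 ** 1.5                  # sample skewness, sqrt(b1)
    b2 = m4 / m2 ** 2                         # sample kurtosis, b2

    # Transform sqrt(b1) into an approximately standard normal Z1.
    y = sqrt_b1 * math.sqrt((n + 1) * (n + 3) / (6 * (n - 2)))
    beta2 = (3 * (n ** 2 + 27 * n - 70) * (n + 1) * (n + 3)
             / ((n - 2) * (n + 5) * (n + 7) * (n + 9)))
    w2 = math.sqrt(2 * (beta2 - 1)) - 1
    sigma = 1 / math.sqrt(math.log(math.sqrt(w2)))
    alpha = math.sqrt(2 / (w2 - 1))
    z1 = sigma * math.log(y / alpha + math.sqrt((y / alpha) ** 2 + 1))

    # Transform b2 into an approximately standard normal Z2.
    e = 3 * (n - 1) / (n + 1)                 # expected value of b2
    var = 24 * n * (n - 2) * (n - 3) / ((n + 1) ** 2 * (n + 3) * (n + 5))
    std_kurt = (b2 - e) / math.sqrt(var)
    third = (6 * (n ** 2 - 5 * n + 2) / ((n + 7) * (n + 9))
             * math.sqrt(6 * (n + 3) * (n + 5) / (n * (n - 2) * (n - 3))))
    a = 6 + (8 / third) * (2 / third + math.sqrt(1 + 4 / third ** 2))
    inner = (1 - 2 / a) / (1 + std_kurt * math.sqrt(2 / (a - 4)))
    root = math.copysign(abs(inner) ** (1.0 / 3.0), inner)  # real cube root
    z2 = (1 - 2 / (9 * a) - root) / math.sqrt(2 / (9 * a))
    return z1, z2, z1 ** 2 + z2 ** 2
```

<p>A perfectly symmetric input drives Z1 to zero, a flat (platykurtic) one pushes Z2 negative, and skewed, heavy-tailed data inflates both, which is the sort of sanity check I’d run before trusting the dynamic SQL’s output.</p>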
<p><span style="font-size:10pt;color:white;">…………</span>A few explanations of why the code was written this way are in order. The five parameters allow users to run the test on any table or view in any database they have sufficient access to, while the first declaration assists in implementing this. The dynamic SQL differs from some of the procedures I’ve posted in the past in the sheer number of reciprocals and constants that must be precalculated early on, to avoid repeating the same operations. The length of the dynamic SQL also necessitates the second SET statement on the @SQLString, since such strings can’t be assigned in one big gulp past a certain character limit, but can thankfully be concatenated easily; keep in mind that if this step is left out, the dynamic SQL may be unexpectedly truncated. This procedure also differs in that I’ve chosen to use floats rather than the decimal data type, for the same reason I did in the article on Jarque-Bera: some of the internal calculations are performed on very small fractions, particularly the exponents and reciprocals, which SQL Server will sometimes round down to zero in the case of the decimal data type. Secondly, I substituted float-typed named variables for many of the constants, such as @CountPlusOne, which are declared near the beginning of the dynamic SQL. This guards against integer division: when both operands of a division are integers, SQL Server discards the fractional part of the result, which is why using integers as dividends triggers the problem. Consider this an experiment in discerning whether named variables are more legible than countless CAST operations, some of which would have to be buried deep within subqueries; by all means, feel free to copy and paste the constants back in if you find the opposite to be true. 
As with the Jarque-Bera Test, I’m not certain whether this K<sup>2</sup> Test would retain its validity if we substituted the simpler full-population calculations for the sample skewness and sample kurtosis, but those stats would be preferable if this were the case. As usual, I’ve provided a couple of lines of debugging code, both near the end of the procedure, that can be uncommented if you need to adjust or verify the code. Be aware that due to my difficulty in reading the original equations, @StandardizedKurtosis may need to serve as the root instead of 0.5 (the square root) in my calculation for @Z2, and also in the calculation for the third standardized moment, but I doubt it, since this would throw off the calculations quite a bit. I also added several ABS function calls to avoid Invalid Floating Point Operation errors on T-SQL functions like POWER that can’t handle imaginary roots, but this departure doesn’t seem to affect the final results.<br />
<span style="font-size:10pt;color:white;">…………</span>The bottom line is that I tested this against the same stem-and-leaf plot cholesterol data from the Framingham Heart Study that D’Agostino, et al. assessed their equations with and get pretty much the same results.[5] They got 1.02 and 4.58 for their sample skewness and kurtosis and 14.75 for the final K2 test statistic, which was derived from Z1 and Z2 values of 3.14 and 2.21 respectively; my results were 1.0235482596477, 4.57738778764656 and 14.7958406166879 for the sample skewness, kurtosis and test statistic respectively, which were derived from values of 3.13939241925962 and 2.22262364213628 for the intermediate Z1 and Z2 stats. It is possible that the slight differences are due to undiscovered errors in my code, but some departure is expected given that I used variables and constants of much higher precision, which would lead to rounding discrepancies. I then tested it against two datasets I’ve been using throughout the last two tutorial series, one on the Duchennes form of muscular dystrophy made publicly available by the <a href="http://biostat.mc.vanderbilt.edu/wiki/Main/DataSets">Vanderbilt University’s Department of Biostatistics</a> and another on the Higgs Boson that can be downloaded from the <a href="https://archive.ics.uci.edu/ml/datasets/HIGGS">University of California at Irvine’s Machine Learning Repository</a>. I derived the first resultset in Figure 2 from the query above it and the following two from queries like it on the first two float columns in the Higgs Boson dataset. Note that test statistic is much larger for the Higgs Boson results – mainly because that table has 11 million rows, compared to just 209 for the Duchennes table – but isn’t quite as inflated as in some of the Jarque-Bera results. One of them has seven digits to the left of the decimal point, which I’d wager is near the limit of numerical legibility for most people. 
Past that point, comparing numbers by eye requires the ungainly strategy of counting digits and mentally interpolating commas between every group of three.</p>
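<p>The float-versus-decimal and integer-truncation issues described above are easy to reproduce in miniature. This is a hedged Python analogue rather than T-SQL: Decimal quantization stands in for SQL Server’s fixed-scale decimal type, and <code>//</code> stands in for what happens when both operands of T-SQL’s division operator are integers.</p>

```python
from decimal import Decimal

# A fraction of the size the Z2 formula can produce, e.g. 2 / (9 * A).
tiny = 2.0 / (9.0 * 10_000_000.0)

# Pitfall 1: a fixed-scale type rounds the small fraction away entirely,
# while a float keeps enough significant digits to stay nonzero.
fixed_scale = Decimal(str(tiny)).quantize(Decimal("0.000001"))

# Pitfall 2: division with two integer operands discards the fraction.
count, count_plus_one = 7, 8
truncated = count // count_plus_one           # both operands integers
exact = count / float(count_plus_one)         # casting one operand fixes it
```

<p>Both failure modes motivate the float variables in the procedure: the fixed-scale value collapses to zero while the float survives, and the integer quotient loses its fractional part until one operand is cast.</p>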
<p><strong><u>Figure 2: Sample Results from the Duchennes and Higgs Boson Datasets<br />
</u></strong><span style="font-size:9.5pt;font-family:Consolas;color:blue;">EXEC</span><span style="font-size:9.5pt;font-family:Consolas;"><span> </span> <span style="color:teal;">[Calculations]</span><span style="color:gray;">.</span><span style="color:teal;">[NormalityTestDAgostinosKSquaredSP]<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:blue;"><span> </span></span><span style="font-size:9.5pt;font-family:Consolas;color:teal;">@DatabaseName</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:gray;">=</span> <span style="color:red;">N’DataMiningProjects’</span><span style="color:gray;">,<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;"><span> </span><span style="color:teal;">@SchemaName </span><span style="color:gray;">=</span> <span style="color:red;">N’Health’</span><span style="color:gray;">,<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;"><span> </span><span style="color:teal;">@TableName </span><span style="color:gray;">=</span> <span style="color:red;">N’DuchennesTable’,<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;"><span> </span><span style="color:teal;">@ColumnName </span><span style="color:gray;">=</span> <span style="color:red;">N’LactateDehydrogenase’</span><span style="color:gray;">,<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;"><span> </span><span style="color:teal;">@PrimaryKeyName </span><span style="color:gray;">=</span> <span style="color:red;">N’ID’<br />
<a href="https://multidimensionalmayhem.wordpress.com/2015/12/21/goodness-of-fit-testing-with-sql-server-part-3-2-dagostinos-k-squared-test/dagostino-pearson-result/" rel="attachment wp-att-538"><img class="alignnone size-full wp-image-538" src="https://multidimensionalmayhem.files.wordpress.com/2015/12/dagostino-pearson-result.jpg?w=604&h=56" alt="D'Agostino-Pearson Result" width="604" height="56" /></a><br />
<a href="https://multidimensionalmayhem.wordpress.com/2015/12/21/goodness-of-fit-testing-with-sql-server-part-3-2-dagostinos-k-squared-test/dagostino-pearson-result-2/" rel="attachment wp-att-539"><img class="alignnone size-full wp-image-539" src="https://multidimensionalmayhem.files.wordpress.com/2015/12/dagostino-pearson-result-2.jpg?w=604&h=55" alt="D'Agostino-Pearson Result 2" width="604" height="55" /></a> </span></span></p>
<p><span style="font-size:9.5pt;font-family:Consolas;"><span style="color:red;"><a href="https://multidimensionalmayhem.wordpress.com/2015/12/21/goodness-of-fit-testing-with-sql-server-part-3-2-dagostinos-k-squared-test/dagostino-pearson-result-3/" rel="attachment wp-att-540"><img class="alignnone size-full wp-image-540" src="https://multidimensionalmayhem.files.wordpress.com/2015/12/dagostino-pearson-result-3.jpg?w=604&h=58" alt="D'Agostino-Pearson Result 3" width="604" height="58" /></a><br />
</span></span></p>
<p><span style="font-size:10pt;color:white;">…………</span>The good news is that the procedure performed unexpectedly well; in fact, the first trial run took 3:43 on the first float column in the Higgs Boson table, i.e. exactly the same execution time as for the Jarque-Bera Test in the last tutorial. After all of those arcane calculations you’d expect to see a rather messy execution plan, but as Figure 3 shows, this procedure isn’t all that hard to follow. The main costs were incurred by two non-clustered index seeks and a sort. This is because almost all of the work occurs in retrieving the values and performing simple calculations for each row, not in the fancy math that occurs after they’ve been summarized, which turns out to have an inconsequential computation cost. The main burden of these calculations falls exactly where we want it: on the brains of the coders and testers, not on the end users, to whom the procedure will be a well-oiled black box after error-checking, validation and SQL injection protection code are added.</p>
<p><strong><u>Figure 3: Execution Plan for the D’Agostino-Pearson Omnibus Test<br />
<a href="https://multidimensionalmayhem.wordpress.com/2015/12/21/goodness-of-fit-testing-with-sql-server-part-3-2-dagostinos-k-squared-test/dagostino-pearson-execution-plan/" rel="attachment wp-att-541"><img class="alignnone size-full wp-image-541" src="https://multidimensionalmayhem.files.wordpress.com/2015/12/dagostino-pearson-execution-plan.jpg?w=604&h=147" alt="D'Agostino-Pearson Execution Plan" width="604" height="147" /></a></u></strong></p>
<p><span style="font-size:10pt;color:white;">…………</span>There’s more good news: since most of the performance cost occurs in the same seeks, sorts and initial calculations of skewness and kurtosis that the Jarque-Bera Test uses, there’s no real penalty incurred by computing it together with the D’Agostino-Pearson Omnibus Test. If we had to sacrifice one, however, it would be the Jarque-Bera Test, since I have heard anecdotes about statisticians preferring the D’Agostino-Pearson Test, but never the other way around. One of the reasons the K<sup>2</sup> is favored is the numerous studies (including some written by D’Agostino Sr.) demonstrating that it has better statistical power, which is a numerical measure of how often the actual effects of a variable are detected by a particular test.[6] The test is applicable to large sample sizes, unlike the Shapiro-Wilk Test[7], and can be used for both one-sided and two-sided hypothesis tests.[8] As I learn more about the field I’m shying further away from hypothesis tests, on the grounds that their small sample sizes and narrow focus aren’t suited to typical SQL Server user scenarios, like exploratory data mining on large datasets. Nevertheless, it doesn’t hurt to know that the D’Agostino-Pearson Test is flexible enough to be used for these purposes. Moreover, it can apparently be applied to goodness-of-fit testing on datasets that don’t follow the Gaussian or “normal” distribution, i.e. the bell curve, a flexibility that many competing goodness-of-fit tests lack. 
In fact, the authors of that 1990 study go so far as to say that “The extensive power studies just mentioned have also demonstrated convincingly that the old warhorses, the chi-squared test and the Kolmogorov test (1933), have poor power properties and should not be used when testing for normality.”[9] This is by no means the first time I’ve heard such sentiments expressed by statisticians about these two rival metrics, which still seem to be implemented far more frequently in practice despite such advice.<br />
<span style="font-size:10pt;color:white;">…………</span>Later on in this series I’ll explain how to implement both the Chi-Squared Test and Kolmogorov-Smirnov Test in T-SQL, but I’m going to skip over a couple of other measures related to skewness and kurtosis, at least for the time being. One of these is Mardia’s multivariate versions of skewness and kurtosis, which I will save for some far-flung future when grappling with the complexity added by dealing with multiple columns isn’t too overwhelming; perhaps someday I’ll tack a segment onto the end of this series for multivariate goodness-of-fit tests, like the Cox-Small Test and Smith and Jain’s Test.[10] I’ve organized this series in the order of how difficult the concepts and underlying code are, which brings us to the topic of regression-related methods of goodness-of-fit testing. As explained in the last article, skewness and kurtosis really aren’t that hard to grasp intuitively, and as I dealt with in <a href="http://www.sqlservercentral.com/blogs/multidimensionalmayhem/2013/01/08/a-rickety-stairway-to-sql-server-data-mining-algorithm-2-linear-regression/">A Rickety Stairway to SQL Server Data Mining, Algorithm 2: Linear Regression</a>, the core concepts behind regression aren’t that difficult either. The variants of regression can get quite complicated, but drawing a line on a graph based on the relationship between two variables is something every college freshman has been exposed to. The stats based on these lines can also vary in their intricacy; there is apparently even a version of Jarque-Bera for multiple regression[11], which I’ll skip over for now to avoid the added complexity of dealing with three or more variables. The code required to implement regression stats for purposes of normality testing can also require differing levels of sophistication, as we’ll see shortly after New Year’s.</p>
<p>[1] D’Agostino, Ralph B.; Belanger, Albert and D’Agostino Jr., Ralph B, 1990, “A Suggestion for Using Powerful and Informative Tests of Normality,” pp. 316–321 in <u>The American Statistician</u>. Vol. 44, No. 4. Available online at <a href="http://www.ohio.edu/plantbio/staff/mccarthy/quantmet/D'Agostino.pdf">http://www.ohio.edu/plantbio/staff/mccarthy/quantmet/D’Agostino.pdf</a></p>
<p>[2] See the <u>Wikipedia</u> article “D’Agostino’s K-Squared Test” at <a href="http://en.wikipedia.org/wiki/D'Agostino's_K-squared_test">http://en.wikipedia.org/wiki/D’Agostino’s_K-squared_test</a></p>
<p>[3] <em>IBID.</em></p>
<p>[4] “D’Agostino and Pearson (1973) presented a statistic that combines….to produce an omnibus test of normality. By omnibus, we mean it is able to detect deviations from normality due to either skewness or kurtosis.” See p. 318, D’Agostino, et al., 1990.</p>
<p>[5] <em>IBID.</em>, p. 318.</p>
<p>[6] For a better explanation of the term than I can give, see Hopkins, Will G., 2001, “Generalizing to a Population: ESTIMATING SAMPLE SIZE continued,” published at the <u>A New View of Statistics</u> web address <a href="http://www.sportsci.org/resource/stats/ssdetermine.html">http://www.sportsci.org/resource/stats/ssdetermine.html</a>. I highly recommend his website for those who are new to the field of stats, like me.</p>
<p>[7] p. 319, D’Agostino, et al., 1990.</p>
<p>[8] <em>IBID.</em>, p. 318.</p>
<p>[9] <em>IBID.</em>, p. 316.</p>
<p>[10] I don’t know anything about these tests, but I’ve seen them mentioned in sources like the <u>Wikipedia</u> article “Multivariate Normal Distribution” at <a href="http://en.wikipedia.org/wiki/Multivariate_normal_distribution">http://en.wikipedia.org/wiki/Multivariate_normal_distribution</a></p>
<p>[11] See the Wikipedia page “Jarque-Bera Test” at <a href="http://en.wikipedia.org/wiki/Jarque%E2%80%93Bera_test">http://en.wikipedia.org/wiki/Jarque%E2%80%93Bera_test</a></p>Goodness-of-Fit Testing with SQL Server, part 3.1: Skewness, Kurtosis and the Jarque-Bera Test
http://www.sqlservercentral.com/blogs/multidimensionalmayhem/2015/12/02/goodness-of-fit-testing-with-sql-server-part-31-skewness-kurtosis-and-the-jarque-bera-test/
Thu, 03 Dec 2015 05:49:22 UT<p><strong>By Steve Bolton</strong></p>
<p><span style="font-size:10pt;color:white;">…………</span>In the last installment of this series of amateur self-tutorials on using SQL Server to identify probability distributions, we saw how devices like probability plots can provide simple visual confirmation of a dataset’s shape. I considered doing a quick detour into Q-Q plots, but decided against it because of their simplicity; instead of putting values for the distribution being tested on the horizontal axis, Q-Q plots chop them up into partitions of equal size, a task that is obviously trivial to implement with NTILE. I’m more eager to discuss skewness and kurtosis, two of the oldest, tried-and-true measures of goodness-of-fit[1] – particularly for the normal or “Gaussian” distribution, i.e. the bell curve – precisely because they are often easy to spot with the naked eye. They are numerical measures rather than visualizations, but are often self-evident within graphics like histograms. For example, the third histogram in my recent post <a href="https://multidimensionalmayhem.wordpress.com/2015/04/21/outlier-detection-with-sql-server-part-6-1-visual-outlier-detection-with-reporting-services/">Outlier Detection with SQL Server Part 6.1 – Visual Outlier Detection with Reporting Services</a> is a striking example of a highly skewed column, while the one below it obviously follows a bell curve more closely and has relatively low skewness and kurtosis; later in this article, I’ll run some sample T-SQL code against the same data to derive hard numbers for both. I’ve seen several good explanations of the meanings of skewness and kurtosis in sources at various sites on the Internet, including one of my favorites, the <a href="http://www.itl.nist.gov/div898/handbook/eda/section3/eda35h1.htm">National Institute of Standards and Technology’s Engineering Statistics Handbook</a>, which defines them thus: “Skewness is a measure of symmetry, or more precisely, the lack of symmetry. 
A distribution, or data set, is symmetric if it looks the same to the left and right of the center point…Kurtosis is a measure of whether the data are peaked or flat relative to a normal distribution. That is, data sets with high kurtosis tend to have a distinct peak near the mean, decline rather rapidly, and have heavy tails. Data sets with low kurtosis tend to have a flat top near the mean rather than a sharp peak. A uniform distribution would be the extreme case.”[2] Another succinct explanation is given by Tompkins County Community College adjunct faculty member Stan Brown, who says that “The histogram can give you a general idea of the shape, but two numerical measures of shape give a more precise evaluation: skewness tells you the amount and direction of skew (departure from horizontal symmetry), and kurtosis tells you how tall and sharp the central peak is, relative to a standard bell curve.” [3]<br />
<span style="font-size:10pt;color:white;">…………</span>I already had some experience with both measures way back in <a href="https://multidimensionalmayhem.wordpress.com/2013/08/26/a-rickety-stairway-to-sql-server-data-mining-part-14-2-writing-a-bare-bones-plugin-algorithm/">A Rickety Stairway to SQL Server Data Mining, Part 14.2: Writing a Bare Bones Plugin Algorithm</a> and <a href="https://multidimensionalmayhem.wordpress.com/2013/11/28/a-rickety-stairway-to-sql-server-data-mining-part-14-6-custom-mining-functions/">A Rickety Stairway to SQL Server Data Mining, Part 14.6: Custom Mining Functions</a>, when I made crude attempts to implement skewness and kurtosis in SSDM in order to illustrate the capabilities of its custom algorithms and functions. That called for fairly simple stats which wouldn’t distract from the main mission; I didn’t really even make much of an effort to understand them, because it wasn’t germane to the lesson at hand. Since then I’ve discovered that it’s easier for me to grasp both stats by viewing them as numerical measures of lopsidedness on a histogram that is divided into imaginary stripes, in which skewness detects how uneven a distribution is from one vertical band to another, whereas kurtosis measures how squashed the distribution curve is on the horizontal axis. Either way you look at it, the measures are still simple enough to explain in layman’s terms, which is one of the strengths of the set of normality tests built from them.<br />
<span style="font-size:10pt;color:white;">…………</span>The most well-known extension of these somewhat forgotten stats is the Jarque-Bera Test, which only dates back to the 1970s despite being one of the earliest examples of normality testing. All of these measures have fallen out of favor with statisticians to some extent, for reasons that will be apparent shortly, but one of the side effects of this is that it is a little more difficult to find variations on them that are more suited to the unique needs of the SQL Server community. One of the strengths of data mining on database servers like SQL Server is that you typically have such an enormous number of records to draw from that you can actually perform calculations on the full population, or a proportion close to it. In ordinary statistics, however, you’re often limited to making inferences based on small samples of just a few dozen or a few hundred rows, out of a much larger population that is often of unknown size; the results can still be logically valid, but often only if other preconditions are met on the data (including normality tests, which are often not performed). For that reason, I usually prefer to leverage SQL Server’s fast set-based retrieval methods to quickly calculate statistics on full populations whenever possible, especially when there are simpler versions of the mathematical formulas available for the full dataset. Skewness and kurtosis are among those measures that can be computed in a simpler way when using the whole population[4], but I’ve opted here to use the more intensive formulas for sample skewness and sample kurtosis for one reason only: it might be possible to substitute population skewness and kurtosis for their sampling counterparts in the formulas for the Jarque-Bera Test, but I can’t find any online sources that mention such a swap. 
I suspect that it probably would be logically valid, but I took the more conservative approach in Figure 1 by employing the usual Jarque-Bera formula, which really isn’t much more difficult to compute.[5]</p>
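<p>To make the population-versus-sample trade-off concrete, here is a small Python sketch, with my own naming and written from the standard textbook formulas rather than lifted from the procedure, that computes both variants along with a Jarque-Bera statistic built from the population measures.</p>

```python
import math

def central_moments(values):
    """Full-population central moments m2, m3, m4."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    return n, m2, m3, m4

def population_skew_kurt(values):
    """Simpler full-population formulas: skewness g1 and excess kurtosis g2."""
    n, m2, m3, m4 = central_moments(values)
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3

def sample_skew_kurt(values):
    """Bias-adjusted sample counterparts G1 and G2 of the same measures."""
    n, m2, m3, m4 = central_moments(values)
    g1, g2 = m3 / m2 ** 1.5, m4 / m2 ** 2 - 3
    G1 = g1 * math.sqrt(n * (n - 1)) / (n - 2)
    G2 = ((n - 1) / ((n - 2) * (n - 3))) * ((n + 1) * g2 + 6)
    return G1, G2

def jarque_bera(values):
    """JB = (n / 6) * (g1^2 + g2^2 / 4), from the population measures."""
    n, _, _, _ = central_moments(values)
    g1, g2 = population_skew_kurt(values)
    return n / 6 * (g1 ** 2 + g2 ** 2 / 4)
```

<p>On symmetric data both skewness variants collapse to zero, while on skewed data the bias adjustment inflates the sample version slightly above the population one, which is exactly the kind of small discrepancy the swap discussed above would introduce.</p>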
<p><strong><u>Figure 1: Code for the Jarque-Bera Test Procedure<br />
</u></strong><span style="font-size:9.5pt;font-family:Consolas;color:blue;">ALTER</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:blue;">PROCEDURE</span> <span style="color:teal;">[Calculations]</span><span class="GramE"><span style="color:gray;">.</span><span style="color:teal;">[</span></span><span class="SpellE"><span style="color:teal;">NormalityTestJarqueBeraSkewnessAndKurtosisSP</span></span><span style="color:teal;">]<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:teal;">@DatabaseName</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:blue;">as</span> <span style="color:blue;">nvarchar</span><span style="color:gray;">(</span>128<span style="color:gray;">)</span> <span style="color:gray;">=</span> <span style="color:gray;">NULL,</span> <span style="color:teal;">@<span class="SpellE">SchemaName</span></span> <span style="color:blue;">as</span> <span style="color:blue;">nvarchar</span><span style="color:gray;">(</span>128<span style="color:gray;">),</span> <span style="color:teal;">@<span class="SpellE">TableName</span></span> <span style="color:blue;">as</span> <span style="color:blue;">nvarchar</span><span style="color:gray;">(</span>128<span style="color:gray;">),</span><span style="color:teal;">@<span class="SpellE">ColumnName</span></span> <span style="color:blue;">AS</span> <span style="color:blue;">nvarchar</span><span style="color:gray;">(</span>128<span style="color:gray;">),</span> <span style="color:teal;">@PrimaryKeyName</span> <span style="color:blue;">as</span> <span style="color:blue;">nvarchar</span><span style="color:gray;">(</span>400<span style="color:gray;">)<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:blue;">AS</span></p>
<p class="MsoNormal"><span style="font-size:9.5pt;font-family:Consolas;">DECLARE @SchemaAndTableName nvarchar(400), @SQLString nvarchar(max)<br />
SET @SchemaAndTableName = @DatabaseName + '.' + @SchemaName + '.' + @TableName<br />
SET @SQLString = 'DECLARE<br />
@Mean float,<br />
@StDev float,<br />
@Count bigint,<br />
@One float = 1,<br />
@ReciprocalOfNSampleSize float,<br />
@DifferenceFromSampleMeanSquared float,<br />
@DifferenceFromSampleMeanCubed float,<br />
@DifferenceFromSampleMeanFourthPower float,<br />
@SampleSkewness float,<br />
@SampleKurtosis float,<br />
@JarqueBeraTest float</span></p>
<p class="MsoNormal"><span style="font-size:9.5pt;font-family:Consolas;">SELECT @Count = Count(' + @ColumnName + '), @Mean = Avg(CAST(' + @ColumnName + ' AS float)), @StDev = StDev(' + @ColumnName + ')<br />
FROM ' + @SchemaAndTableName + '<br />
WHERE ' + @ColumnName + ' IS NOT NULL</span></p>
<p class="MsoNormal"><span style="font-size:9.5pt;font-family:Consolas;">SELECT @ReciprocalOfNSampleSize = @One / @Count</span></p>
<p class="MsoNormal"><span style="font-size:9.5pt;font-family:Consolas;">SELECT @DifferenceFromSampleMeanSquared = SUM(Power(DifferenceFromSampleMean, 2)) OVER (ORDER BY ' + @PrimaryKeyName + ' ASC),<br />
@DifferenceFromSampleMeanCubed = SUM(Power(DifferenceFromSampleMean, 3)) OVER (ORDER BY ' + @PrimaryKeyName + ' ASC),<br />
@DifferenceFromSampleMeanFourthPower = SUM(Power(DifferenceFromSampleMean, 4)) OVER (ORDER BY ' + @PrimaryKeyName + ' ASC)<br />
FROM (SELECT ' + @PrimaryKeyName + ', CAST(' + @ColumnName + ' AS float) - @Mean AS DifferenceFromSampleMean -- make a single pass across the table?<br />
FROM ' + @SchemaAndTableName + '<br />
WHERE ' + @ColumnName + ' IS NOT NULL) AS T1</span></p>
<p class="MsoNormal"><span style="font-size:9.5pt;font-family:Consolas;">SELECT @SampleSkewness = (@ReciprocalOfNSampleSize * @DifferenceFromSampleMeanCubed) / (Power((@ReciprocalOfNSampleSize * @DifferenceFromSampleMeanSquared), 1.5))</span></p>
<p class="MsoNormal"><span style="font-size:9.5pt;font-family:Consolas;">SELECT @SampleKurtosis = (@ReciprocalOfNSampleSize * @DifferenceFromSampleMeanFourthPower) / (Power((@ReciprocalOfNSampleSize * @DifferenceFromSampleMeanSquared), 2))</span></p>
<p class="MsoNormal"><span style="font-size:9.5pt;font-family:Consolas;">SELECT @JarqueBeraTest = CAST((CAST(@Count AS float) / CAST(6 AS float)) AS Decimal(38,12)) * CAST((Power(@SampleSkewness, 2) + (0.25 * Power((@SampleKurtosis - 3), 2))) AS Decimal(38,12))</span></p>
<p class="MsoNormal"><span style="font-size:9.5pt;font-family:Consolas;">SELECT @JarqueBeraTest AS JarqueBeraTest, @SampleSkewness AS SampleSkewness, @SampleKurtosis AS SampleKurtosis, @SampleKurtosis - 3 AS ExcessKurtosis, @Mean AS Mean, @StDev AS StDev<br />
-- to debug the internal calculations, uncomment the rest of this line: , @ReciprocalOfNSampleSize, @DifferenceFromSampleMeanSquared, @DifferenceFromSampleMeanCubed, @DifferenceFromSampleMeanFourthPower<br />
'</span></p>
<p class="MsoNormal"><span style="font-size:9.5pt;font-family:Consolas;color:green;">--SELECT @SQLString -- uncomment this to debug dynamic SQL errors</span><br />
<span style="font-size:9.5pt;font-family:Consolas;">EXEC (@SQLString)</span></p>
<p><span style="font-size:10pt;color:white;">…………</span>The end of the final SELECT can be uncommented to debug the internal calculations. I’ve also reserved the second-to-last line for a SELECT that can be uncommented to debug the dynamic SQL string, as is standard in most of my procedures. Much of the initial code ought to be familiar to readers of this series and the one on outliers, since I use many of the same parameters and internal variables and apply some of the usual preliminary SET operations to them. As usual, I calculate the values of some reusable internal stats and cache them in dynamic SQL variables so that we don’t have to recalculate them, as in the case of the reciprocal of the count and the deviation computation in the lowest-level subquery. I’m experimenting with declaring constants like 1 as high-precision data types to prevent situations where SQL Server truncates the values during calculations, which can lead to erroneous results or, at best, messy code full of casts deep within subqueries to avoid such errors. One departure from the norm is the use of floats rather than the decimal data type in the dynamic SQL calculations. The square, cube and fourth-power operations can produce extremely high and low values, which may in turn cause divide-by-zero errors if they’re rounded down to nothing or arithmetic overflows if they’re rounded too high, so I resorted to using float data types for the first time in any of my mistutorial series. This may entail some loss of precision in the internal calculations, but it shouldn’t have much of an effect on the final test statistic. It is not uncommon for this result to seem outlandishly high when the underlying distribution is abnormal.[6]</p>
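<p>The arithmetic the dynamic SQL performs can be sanity-checked outside SQL Server. The sketch below is a minimal Python rendition of the same formulas (reciprocal of the count, powered deviations from the mean, then the skewness, kurtosis and Jarque-Bera statistics); the function name is mine, not part of the procedure.</p>

```python
import math

def jarque_bera(values):
    """Sample skewness, sample kurtosis and the Jarque-Bera statistic,
    computed with the same formulas the T-SQL procedure uses."""
    n = len(values)
    mean = sum(values) / float(n)
    reciprocal_of_n = 1.0 / n
    d2 = sum((x - mean) ** 2 for x in values)  # squared deviations
    d3 = sum((x - mean) ** 3 for x in values)  # cubed deviations
    d4 = sum((x - mean) ** 4 for x in values)  # fourth-power deviations
    skewness = (reciprocal_of_n * d3) / math.pow(reciprocal_of_n * d2, 1.5)
    kurtosis = (reciprocal_of_n * d4) / math.pow(reciprocal_of_n * d2, 2)
    jb = (n / 6.0) * (skewness ** 2 + 0.25 * (kurtosis - 3) ** 2)
    return skewness, kurtosis, jb
```

<p>Comparing a handful of results like these against the procedure’s output is a quick way to catch transcription errors in the dynamic SQL string.</p>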
<p><strong><u>Figure 2: Sample Results from the Duchennes Table and HiggsBosonTable<br />
</u></strong><span style="font-size:9.5pt;font-family:Consolas;color:blue;">EXEC</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:teal;">[Calculations]</span><span style="color:gray;">.</span><span style="color:teal;">[NormalityTestJarqueBeraSkewnessAndKurtosisSP]</span><br />
<span style="color:teal;">@DatabaseName</span> <span style="color:gray;">=</span> <span style="color:red;">N'DataMiningProjects'</span><span style="color:gray;">,</span><br />
<span style="color:teal;">@SchemaName</span> <span style="color:gray;">=</span> <span style="color:red;">N'Health'</span><span style="color:gray;">,</span><br />
<span style="color:teal;">@TableName</span> <span style="color:gray;">=</span> <span style="color:red;">N'DuchennesTable'</span><span style="color:gray;">,</span><br />
<span style="color:teal;">@ColumnName</span> <span style="color:gray;">=</span> <span style="color:red;">N'LactateDehydrogenase'</span><span style="color:gray;">,</span><br />
<span style="color:teal;">@PrimaryKeyName</span> <span style="color:gray;">=</span> <span style="color:red;">N'ID'</span></span></p>
<p class="MsoNormal"><span style="font-size:10pt;"> </span></p>
<p><a href="https://multidimensionalmayhem.wordpress.com/2015/12/02/goodness-of-fit-testing-with-sql-server-part-3-1-skewness-kurtosis-and-the-jarque-bera-test/jarqueberaresult1/" rel="attachment wp-att-532"><img class="alignnone size-full wp-image-532" src="https://multidimensionalmayhem.files.wordpress.com/2015/12/jarqueberaresult1.jpg?w=604&h=82" alt="JarqueBeraResult1" width="604" height="82" /></a> <a href="https://multidimensionalmayhem.wordpress.com/2015/12/02/goodness-of-fit-testing-with-sql-server-part-3-1-skewness-kurtosis-and-the-jarque-bera-test/jarqueberaresult2/" rel="attachment wp-att-533"><img class="alignnone size-full wp-image-533" src="https://multidimensionalmayhem.files.wordpress.com/2015/12/jarqueberaresult2.jpg?w=604&h=82" alt="JarqueBeraResult2" width="604" height="82" /></a> <a href="https://multidimensionalmayhem.wordpress.com/2015/12/02/goodness-of-fit-testing-with-sql-server-part-3-1-skewness-kurtosis-and-the-jarque-bera-test/jarqueberaresult3/" rel="attachment wp-att-534"><img class="alignnone size-full wp-image-534" src="https://multidimensionalmayhem.files.wordpress.com/2015/12/jarqueberaresult3.jpg?w=604&h=77" alt="JarqueBeraResult3" width="604" height="77" /></a></p>
<p><span style="font-size:10pt;color:white;">…………</span>The query in Figure 2 produced the results in the graphic immediately below it, in which I tested the procedure on the LactateDehydrogenase column of a dataset on the Duchennes form of muscular dystrophy, which is made publicly available by <a href="http://biostat.mc.vanderbilt.edu/wiki/Main/DataSets">Vanderbilt University’s Department of Biostatistics</a>. The procedure also performed surprisingly well when deriving the other two result sets, clocking in at 3:43 and 3:42 on the first two float columns of the Higgs Boson Dataset, which I downloaded from the <a href="https://archive.ics.uci.edu/ml/datasets/HIGGS">University of California at Irvine’s Machine Learning Repository</a> and converted into a SQL Server table. It has 11 million rows and takes up about 6 gigabytes of the DataMiningProjects database I created for these tutorial series, which makes it ideal for stress-testing. Keep in mind that my <a href="http://www.youtube.com/watch?v=X-rkFaIPyL4">clunker</a> of a development machine hardly qualifies as a professional database server, so your results will probably be spectacularly better – especially after the code has been subjected to query tuning by one of the countless DBAs who know the ins and outs of T-SQL a lot better than I do. As evinced in Figure 3, the execution plan turned out to be a lot easier to interpret than some of the more sophisticated code I posted in the last tutorial series, with two seeks and two sorts taking up the bulk of the computational effort.</p>
<p><strong><u>Figure 3: Execution Plan for the Jarque-Bera Procedure<br />
</u></strong><a href="https://multidimensionalmayhem.wordpress.com/2015/12/02/goodness-of-fit-testing-with-sql-server-part-3-1-skewness-kurtosis-and-the-jarque-bera-test/jarqueberaexecutionplan/" rel="attachment wp-att-531"><img class="alignnone size-full wp-image-531" src="https://multidimensionalmayhem.files.wordpress.com/2015/12/jarqueberaexecutionplan.jpg?w=604&h=161" alt="JarqueBeraExecutionPlan" width="604" height="161" /></a></p>
<p><span style="font-size:10pt;color:white;">…………</span>The results in Figure 2 are a powerful illustration of one of the weaknesses of the Jarque-Bera Test, i.e. its lack of scaling. The larger the values of the column that accumulate in the internal calculations, the larger the test results may be; that is why the 209 rows of the LactateDehydrogenase column had much higher skewness and kurtosis scores than the results for Column1 and Column2 of the Higgs Boson table, yet had a Jarque-Bera score that was several orders of magnitude smaller. I’m sure that by now some statistician has developed a scaling mechanism to get around this problem, but I question whether it is worth it for our purposes, for “…it is not without weakness. It has low power for distributions with short tails, especially for bimodal distributions. Other authors have declined to include its data in their studies because of its poor overall performance.”[7] The latter wasn’t as much of an issue as expected in this example, but another problem frequently encountered in the last couple of tutorial series reared its head again: the lack of hard-and-fast cut-off points. I couldn’t find a clear winner among the competing criteria for when the Jarque-Bera stat disqualifies a dataset from being Gaussian (although that doesn’t mean one doesn’t exist, given that I lack experience with this field). They all seem to boil down to “rules of thumb”; of those, I’m most inclined to favor M.G. Bulmer’s: skewness values beyond -1 or +1 are highly skewed, those within 0.5 of zero are pretty much symmetric and the rest are moderately skewed.[8]<br />
<span style="font-size:10pt;color:white;">…………</span>It may be that we are better off without such hard limits, though, given that they force simplistic either-or choices. Confidence intervals are another common way of forcing the same kind of choice, when there might not be a real crying need for such a limit. If we use a continuous measure, we can ask questions about <em>how close</em> a dataset comes to a particular distribution, such as the Gaussian bell curve, but we lose all of that flexibility when we resort to arbitrary cut-off criteria. This is a problem we’ll probably see again as we work our way through the whole menagerie of goodness-of-fit tests, some of which blindly affix labels like “normal” and “not normal” in an almost manic-depressive, all-or-nothing way. It’s always good to keep in mind that when we assign labels and test results on a simple pass/fail basis, or perform binning and banding on the values within them, we’re sacrificing a lot of information. For our purposes, we’d probably be better off preserving the skewness and kurtosis values as measures of <em>how</em> skewed or kurtic a dataset is, as well as <em>how</em> normal it might be, rather than tossing out all the insights and details the full numbers provide. Skewness and kurtosis aren’t as useful in resolving the usual chicken-and-egg dilemma that accompanies outlier detection and goodness-of-fit testing, because we can’t determine whether a dataset follows a distribution closely but has too many outliers, or whether those outliers signify that a different distribution is a better match. Yet they do occupy a substantial niche in the matrix of use cases I hope to develop for goodness-of-fit tests, as I did for outlier detection methods in my last mistutorial series. 
They’re simple enough for a layman to understand and easy to visualize, and they represent really effective measures of the shape of a dataset, aside from whether or not that shape is applicable to goodness-of-fit testing. This makes them useful in their own right, in a sense, as primitive forms of data mining. I’m not as enthused about the Jarque-Bera Test, though, because it requires extra computational effort to derive results that lack adequate scaling, interpretation criteria and statistical power, even when implemented flawlessly by better programmers than myself. It may very well have valid uses in ordinary statistical applications, but its range of usefulness may be more constrained in the realm of database servers and Big Data. Perhaps D’Agostino’s K-Squared Test, an alternative goodness-of-fit measure also built upon skewness and kurtosis, will prove more useful than the Jarque-Bera Test in next week’s article.</p>
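<p>Bulmer’s rule of thumb is simple enough to encode directly, which also makes plain how much information the labels throw away compared to the raw skewness value. The function name and label strings below are mine, chosen only for illustration.</p>

```python
def describe_skewness(s):
    """Bulmer's rule of thumb as paraphrased in the text: beyond +/-1 is
    highly skewed, within 0.5 of zero is roughly symmetric, and anything
    in between is moderately skewed."""
    if abs(s) > 1:
        return "highly skewed"
    if abs(s) < 0.5:
        return "roughly symmetric"
    return "moderately skewed"
```

<p>Note how three coarse buckets replace a continuous measure; that trade-off is exactly the loss of information discussed above.</p>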
<p> </p>
<p>[1] See the <u>Wikipedia</u> page “Normality Test” at <a href="http://en.wikipedia.org/wiki/Normality_test">http://en.wikipedia.org/wiki/Normality_test</a></p>
<p>[2] See National Institute for Standards and Technology, 2014, “1.3.5.11 Measures of Skewness and Kurtosis,” published in the online edition of the <u>Engineering Statistics Handbook</u>. Available online at <a href="http://www.itl.nist.gov/div898/handbook/eda/section3/eda35h1.htm">http://www.itl.nist.gov/div898/handbook/eda/section3/eda35h1.htm</a></p>
<p>[3] Brown, Stan, 2012, “Measures of Shape: Skewness and Kurtosis,” published Dec. 27, 2012 at the <u>Tompkins Cortland Community College</u> website. Available online at <a href="http://www.tc3.edu/instruct/sbrown/stat/shape.htm">http://www.tc3.edu/instruct/sbrown/stat/shape.htm</a> .</p>
<p>[4] See Brown, Stan, 2012.</p>
<p>[5] I derived this code from the formulas at Brown’s webpage and the <u>Wikipedia</u> entry “Jarque-Bera Test” at <a href="http://en.wikipedia.org/wiki/Jarque%E2%80%93Bera_test">http://en.wikipedia.org/wiki/Jarque%E2%80%93Bera_test</a></p>
<p>[6] For example, see the thread posted by the user named ipadawan on Oct. 13, 2011 in the <u>CrossValidated</u> forums, titled “Appropriate Probability Threshold for Jarque-Bera Test,” which is available online at the web address <a href="http://stats.stackexchange.com/questions/16949/appropriate-probability-threshold-for-jarque-bera-test">http://stats.stackexchange.com/questions/16949/appropriate-probability-threshold-for-jarque-bera-test</a></p>
<p>[7] See the Wikipedia entry for “Normality Test” again, at <a href="http://en.wikipedia.org/wiki/Normality_test">http://en.wikipedia.org/wiki/Normality_test</a></p>
<p>[8] I’m paraphrasing Brown, 2012, who cites Bulmer, M. G., 1979, <u>Principles of Statistics</u>. Dover Publications: New York. I also agree with Brown when he says that “… GraphPad suggests a confidence interval for skewness….I would say, compute that confidence interval, but take it with several grains of salt — and the further the sample skewness is from zero, the more skeptical you should be.” I have no issue with GraphPad, which I’ve never used before, but I am not inclined to put much stock in hard confidence intervals anyway.</p>
<p> </p>Goodness-of-Fit Testing with SQL Server, part 2.1: Implementing Probability Plots in Reporting Services
http://www.sqlservercentral.com/blogs/multidimensionalmayhem/2015/11/03/goodness-of-fit-testing-with-sql-server-part-21-implementing-probability-plots-in-reporting-services/
Tue, 03 Nov 2015 22:30:26 UT/blogs/multidimensionalmayhem/2015/11/03/goodness-of-fit-testing-with-sql-server-part-21-implementing-probability-plots-in-reporting-services/0http://www.sqlservercentral.com/blogs/multidimensionalmayhem/2015/11/03/goodness-of-fit-testing-with-sql-server-part-21-implementing-probability-plots-in-reporting-services/#comments<p><strong>By Steve Bolton</strong></p>
<p><span style="font-size:10pt;color:white;">…………</span>In the first installment of this series of amateur self-tutorials, I explained how to implement the most basic goodness-of-fit tests in SQL Server. All of those produced simple numeric results that are trivial to calculate, but in terms of interpretability, you really can’t beat the straightforwardness of visual tests like Probability-Probability (P-P) and Quantile-Quantile (Q-Q) Plots. Don’t let the fancy names fool you, because the underlying concepts aren’t that difficult to grasp once the big words are subtracted. It is true that misunderstandings may sometimes arise over the terminology, since both types of visual goodness-of-fit tests are often referred to by the generic term “probability plots” – especially when we use the Q-Q Plot for the Gaussian or “normal” distribution, i.e. the bell curve, which is often called the “normal probability plot.”[1] Nevertheless, the meaning of either one is easy to grasp at a glance, even to an untrained eye: basically, we just build a scatter plot of data points, then compare it to a line that represents the ideal distribution of points for a perfect match. If they look like they follow the same path – usually a straight line – then we can conclude that the distribution we want to assess fits well. Visual analysis of this kind of course does not provide the kind of detail or rigor that more sophisticated goodness-of-fit tests can, but it serves as an excellent starting point, especially since it is relatively straightforward to implement scatter plots of this kind in Reporting Services.<br />
<span style="font-size:10pt;color:white;">…………</span>As I found out the hard way, the difficult part of implementing these visual aids is not representing the data in Reporting Services, but calculating the deceptively short formulas in T-SQL. For P-P Plots, we need to compare two cumulative distribution functions (CDFs). That may be a mouthful, but one that is not particularly difficult to swallow once we understand how to calculate probability density functions. PDFs[2] are easily depicted in histograms, where we can plot the probability of the occurrence of each particular value in a distribution from left to right to derive such familiar shapes as the bell curve. Since probabilities in stochastic theory always start at 0 and sum to 1, we can plot them a different way, by summing them in succession for each associated value until we reach that ceiling. Q-Q Plots are a tad more difficult because they involve comparing the inverses of the CDFs, using what are alternately known as quantile or percent point functions[3], but not terribly so. Apparently the raison d’etre for these operations is to distill distributions like the Gaussian down to the uniform distribution, i.e. a flat line in which all outcomes are equally likely, for easier comparison.[4]</p>
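<p>Both ideas can be sanity-checked in a few lines outside T-SQL: summing a discrete PDF into a CDF, and pairing empirical against theoretical CDF values to produce the coordinates a P-P Plot would scatter. The Python sketch below uses the exact erf-based Gaussian CDF rather than any approximation, and all function names are mine, chosen only for illustration.</p>

```python
import math

def cdf_from_pdf(probs):
    """Cumulative sums of a discrete PDF; the last entry should reach 1."""
    running, cdf = 0.0, []
    for p in probs:
        running += p
        cdf.append(running)
    return cdf

def normal_cdf(x, mean=0.0, sd=1.0):
    """Gaussian CDF via the error function (exact up to float precision)."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

def pp_points(sample):
    """(empirical CDF, fitted normal CDF) pairs; points hugging the
    diagonal suggest a good fit to the bell curve."""
    xs = sorted(sample)
    n = len(xs)
    mean = sum(xs) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in xs) / n)
    return [((i + 0.5) / n, normal_cdf(x, mean, sd)) for i, x in enumerate(xs)]
```

<p>Plotting the pairs returned by pp_points against the 45-degree line is all a P-P Plot amounts to, which is why the Reporting Services side of the job is the easy half.</p>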
<p style="text-align:center;"><strong>Baptism By Fire: Why the Most Common CDF is Also the Most Trying</strong></p>
<p> Most probability distributions have their own CDF and Inverse CDF, which means it would be time-consuming to implement them all in order to encompass every known distribution within a single testing program. The equations involved are not always terribly difficult – except when it comes to the most important distribution of all, the Gaussian. No exact closed-form solution is mathematically possible for our most critical, must-have use case, so we must rely on various approximations developed by mathematicians over the years. One of my key goals in writing self-tutorials of this kind is to acquire the ability to translate equations into T-SQL, Visual Basic and Multidimensional Expressions (MDX) quickly, but I got a baptism by fire when trying to decipher one of the symbols used in the error functions the normal distribution’s CDF depends upon. The assistance I received from the folks at CrossValidated (the Stack Exchange statistics forum) was also indispensable in helping me wrap my head around the formulas, which are apparently a common stumbling block for beginners like me.[5] For the Inverse CDFs I also had to learn the concept of order statistics, i.e. rankits, which I can probably explain a lot more succinctly than some of the unnecessarily prolix resources I waded through along the way. The mathematical operation is really no more difficult than writing down all of your values in order from lowest to highest, then folding the sheet of paper in half and adding the corresponding points together. The Wikipedia discussion page <a href="http://en.wikipedia.org/wiki/Talk%3ARankit#Expected_values_of_the_resulting_order_statistics">“Talk:Rankit”</a> helped tremendously; in fact, I ended up using the approximation for the R statistical package that is cited there in my implementation of the Gaussian Inverse CDF.[6]<br />
<span style="font-size:10pt;color:white;">…………</span>While slogging through the material, it began to dawn on me that it might not be possible to implement even a crude solution in T-SQL, at least for tables of the size SQL Server users encounter daily. Indeed, if it weren’t for a couple of workarounds like the aforementioned one for R, which I found scattered across the Internet, I wouldn’t have finished this article at all. Resorting to lookup tables of known values really doesn’t help us in the SQL Server world, because they simply don’t go high enough; this was one of the same stumbling blocks I often encountered while writing my last mistutorial series, in that the available lookup tables of known rankit values don’t go anywhere near high enough for the size of the tables used in SQL Server databases and cubes. For example, one compendium of statistical tables I consulted could only accommodate up to 50 values.[7]</p>
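<p>Calculating rankit-style values on the fly rather than looking them up usually starts from plotting positions of the general Blom form. The sketch below follows the rule R’s ppoints() function uses (offset 3/8 for n up to 10, 1/2 otherwise); feeding these through a normal Inverse CDF yields approximate rankits for any table size. Treat this as an illustration of the general technique, not necessarily the exact formula cited on the Talk:Rankit page, and the function name is mine.</p>

```python
def plotting_positions(n):
    """Blom-type plotting positions (i - a) / (n + 1 - 2a), using the
    offsets from R's ppoints(): a = 3/8 for n <= 10, a = 1/2 otherwise.
    Passing these through a normal inverse CDF approximates rankits."""
    a = 0.375 if n <= 10 else 0.5
    return [(i - a) / (n + 1 - 2 * a) for i in range(1, n + 1)]
```

<p>Because the positions are generated rather than tabulated, the 50-value ceiling of printed statistical tables disappears; n can be as large as the row count of the table being tested.</p>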
<p style="text-align:center;"><strong>In the Absence of Lookup Tables, Plan on Writing Intricate SQL</strong></p>
<p> This is merely a subset of the much broader issue of scaling statistical tests that were designed generations ago for much smaller sample sizes, of a few dozen or a few hundred records, to the thousands or millions of rows routinely seen in modern databases. In this case, I was forced to calculate the missing rankit values myself, which opened up a whole new can of worms. Another critical problem with implementing the CDF and Inverse CDF in code is that many of the approximations involve factorials, but those can only be calculated up to values around 170 without reaching the limit of the T-SQL float data type; this is actually quite good compared to other languages and statistical packages, which can often handle values only up to around 20.[8] Thankfully, Peter John Acklam published a decent approximation algorithm online, which can calculate Inverse CDFs for the normal distribution without factorials. It’s only good to a precision of 1.15 x 10<sup>-9</sup>, which may not be sufficient for some Big Analysis use cases, but this code ought to be enough to get a novice data miner started.[9]<br />
<span style="font-size:10pt;color:white;">…………</span>The complexity of implementing probability plots is further increased when we factor in the need to write separate code for each distribution; most of them aren’t as difficult as the Gaussian, which has no closed-form solution, but providing code for each of them would require dozens more articles. For that reason, I’ll stick to the bell curve for now; consequently, I also won’t get into a discussion of the lesser-known Probability Plot Correlation Coefficient (PPCC) Plot, which is only applicable to distributions like the Weibull that have shape parameters, unlike the bell curve.[10] Another complication we have to deal with when using CDFs, Inverse CDFs and PDFs is that different versions of each may be required, depending on whether you want to return a single value or a whole range, or whether such inputs as the mean, standard deviation and counts are already known or have to be computed on the fly. Later in this series we will probably have to make use of some of these alternate versions for more advanced fitness tests, so I’ve uploaded all 14 versions I’ve coded to date in one fell swoop to a central repository on DropBox; they are listed below:</p>
<p><a href="https://www.dropbox.com/s/79vqp4slf6otl2g/NormalDistributionCDFSP.sql?dl=0">NormalDistributionCDFSP.sql</a><br />
<a href="https://www.dropbox.com/s/x2lw7g57c2pwd2w/NormalDistributionCDFSupplyMeanAndStDevSP.sql?dl=0">NormalDistributionCDFSupplyMeanAndStDevSP.sql</a><br />
<a href="https://www.dropbox.com/s/yy5z4k307h3nqcd/NormalDistributionCDFSupplyMeanStDevAndRangeSP.sql?dl=0">NormalDistributionCDFSupplyMeanStDevAndRangeSP.sql</a><br />
<a href="https://www.dropbox.com/s/0cw31hcfyxzwmpu/NormalDistributionCDFSupplyTableParameterSP.sql?dl=0">NormalDistributionCDFSupplyTableParameterSP.sql</a><br />
<a href="https://www.dropbox.com/s/4gw7wp2bs78yciw/NormalDistributionInverseCDFFunction.sql?dl=0">NormalDistributionInverseCDFFunction.sql</a><br />
<a href="https://www.dropbox.com/s/tnld6780hags5ij/NormalDistributionPDFAndCDFSupplyMeanStDevAndRangeSP.sql?dl=0">NormalDistributionPDFAndCDFSupplyMeanStDevAndRangeSP.sql</a><br />
<a href="https://www.dropbox.com/s/1bcjwmmphncw2p6/NormalDistributionPDFSP.sql?dl=0">NormalDistributionPDFSP.sql</a><br />
<a href="https://www.dropbox.com/s/bxngq6jebtbybxy/NormalDistributionPDFSupplyMeanAndStDevSP.sql?dl=0">NormalDistributionPDFSupplyMeanAndStDevSP.sql</a><br />
<a href="https://www.dropbox.com/s/ejes2zoqz43kfgd/NormalDistributionPDFSupplyMeanStDevAndRangeSP.sql?dl=0">NormalDistributionPDFSupplyMeanStDevAndRangeSP.sql</a><br />
<a href="https://www.dropbox.com/s/i5ttvu839qqifbe/NormalDistributionRankitApproximationSP.sql?dl=0">NormalDistributionRankitApproximationSP.sql</a><br />
<a href="https://www.dropbox.com/s/mph0gmymvbxer3x/NormalDistributionSingleCDFFunction.sql?dl=0">NormalDistributionSingleCDFFunction.sql</a><br />
<a href="https://www.dropbox.com/s/3txwxq7g2hcp8ry/RankitApproximationFunction.sql?dl=0">RankitApproximationFunction.sql</a><br />
<a href="https://www.dropbox.com/s/1gg5ek7jmjktblf/RankitApproximationSP.sql?dl=0">RankitApproximationSP.sql</a><br />
<a href="https://www.dropbox.com/s/3y7zmiwjv01zaxy/RankitApproximationSupplyCountSP.sql?dl=0">RankitApproximationSupplyCountSP.sql</a><br />
<a href="https://www.dropbox.com/s/c9oeu1jlxlgicgk/SimpleFloatValueTableParameter.sql?dl=0">SimpleFloatValueTableParameter.sql</a></p>
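<p><span style="font-size:10pt;color:white;">…………</span>For readers who want a quick sanity check on what these routines return before deploying them, the three core calculations are compact enough to sketch outside the database. The Python below is my own illustration, not a translation of the uploaded T-SQL; note that the .sql files implement the inverse CDF with Acklam’s rational approximation, whereas this sketch merely bisects the CDF, which is slower but involves no hard-coded constants.</p>

```python
import math

def normal_pdf(x, mean=0.0, st_dev=1.0):
    """Probability density function (PDF) of the Gaussian distribution."""
    z = (x - mean) / st_dev
    return math.exp(-0.5 * z * z) / (st_dev * math.sqrt(2.0 * math.pi))

def normal_cdf(x, mean=0.0, st_dev=1.0):
    """Cumulative distribution function (CDF), computed through the
    standard library's error function."""
    return 0.5 * (1.0 + math.erf((x - mean) / (st_dev * math.sqrt(2.0))))

def normal_inverse_cdf(p, mean=0.0, st_dev=1.0):
    """Inverse CDF (quantile function). The Gaussian quantile has no
    closed form; rather than Acklam's rational approximation, this
    sketch simply bisects the CDF on the standard scale."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must fall strictly between 0 and 1")
    lo, hi = -40.0, 40.0  # standard scores far beyond any realistic quantile
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if normal_cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return mean + st_dev * (lo + hi) / 2.0

# The 97.5th percentile of the standard normal is the familiar 1.96
print(round(normal_inverse_cdf(0.975), 2))  # → 1.96
```

<p>Spot-checking a few values this way against the procedures’ output is a cheap form of validation, on top of the lookup tables and calculators mentioned below.</p>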
<p><span style="font-size:10pt;color:white;">…………</span>Keep in mind that, as usual, I’ve only done very basic testing on these stored procedures and functions, so they’ll probably require some troubleshooting before being put into a production environment; consider them examples of how a solution might be engineered, not finished products. I did some validation of the procedures against various CDF and inverse CDF lookup tables and calculators I found on the Web, but only for a handful of values.[11] The .sql file names are pretty much self-explanatory: for example, NormalDistributionPDFSupplyMeanAndStDevSP returns the PDF for the normal distribution if you supply the mean and standard deviation, whereas NormalDistributionSingleCDFFunction does just what it says by returning one value out of a set of CDF results. A few take table variables as inputs, so I’ve included the SimpleFloatValueTableParameter type I defined to supply them. I’ve followed my usual coding style by appending SP and Function to the ends of the names to denote what type of object they are. The NormalDistributionRankitApproximationSP, RankitApproximationSP and RankitApproximationSupplyCountSP procedures use the aforementioned approximation from R, while my implementation of Acklam’s approximation can be found in the NormalDistributionInverseCDFFunction.sql file. Some of the objects are dependent on the others, like the RankitApproximationFunction, which utilizes the NormalDistributionInverseCDFFunction.<br />
<span style="font-size:10pt;color:white;">…………</span>Some of the other procedures will be of use later in this tutorial series, but in this week’s installment, we’ll be feeding the output from DataMiningProjects.Distributions.NormalDistributionSingleCDFFunction listed above into a couple of SSRS line charts. As I pointed out in three previous articles from the tail end of my last tutorial series, there are plenty of better explanations of how to write reports and do other basic tasks in RS, so I won’t clutter this post with those extraneous details. Basically, the sample procedure below derives the CDF values for the horizontal axis and another set of values for the vertical axis called the Empirical Distribution Function (EDF), which is just a fancy way of saying the values actually found in the dataset. Anyone familiar with the style of sample code I’ve posted on this blog can tell that we’re just using dynamic SQL to calculate distinct counts, with the difficult computations hidden inside the CDF function; I reuse most of the same parameters, intermediate variable declarations and other code seen in past articles, like the SELECT @SQLString for debugging the procedure.</p>
<p><strong><u>Figure 1: Sample T-SQL to Build a Probability-Probability Plot</u></strong></p>
<pre>
CREATE PROCEDURE [GoodnessOfFit].[PPPlot]
@Database1 as nvarchar(128) = NULL, @Schema1 as nvarchar(128), @Table1 as nvarchar(128), @Column1 AS nvarchar(128)
AS
DECLARE @SchemaAndTable1 nvarchar(400), @SQLString nvarchar(max)
SET @SchemaAndTable1 = @Database1 + '.' + @Schema1 + '.' + @Table1
SET @SQLString = 'DECLARE @Mean as float,
@StDev as float,
@Count bigint

SELECT @Count = Count(CAST(' + @Column1 + ' as float)), @Mean = Avg(CAST(' + @Column1 + ' as float)), @StDev = StDev(CAST(' + @Column1 + ' as float))
FROM ' + @SchemaAndTable1 + '
WHERE ' + @Column1 + ' IS NOT NULL

DECLARE @EDFTable table
(ID bigint IDENTITY (1,1),
Value float,
ValueCount bigint,
EDFValue float,
CDFValue decimal(38,37),
EDFCDFDifference decimal(38,37))

INSERT INTO @EDFTable
(Value, ValueCount, EDFValue)
SELECT Value, ValueCount, CAST(SUM(ValueCount) OVER (ORDER BY Value ASC) as float) / @Count AS EDFValue
FROM (SELECT DISTINCT ' + @Column1 + ' AS Value, Count(' + @Column1 + ') OVER (PARTITION BY ' + @Column1 + ' ORDER BY ' + @Column1 + ') AS ValueCount
      FROM ' + @SchemaAndTable1 + '
      WHERE ' + @Column1 + ' IS NOT NULL) AS T1

UPDATE T1
SET CDFValue = T3.CDFValue, EDFCDFDifference = EDFValue - T3.CDFValue
FROM @EDFTable AS T1
      INNER JOIN (SELECT DistinctValue, DataMiningProjects.Distributions.NormalDistributionSingleCDFFunction(DistinctValue, @Mean, @StDev) AS CDFValue
      FROM (SELECT DISTINCT Value AS DistinctValue
      FROM @EDFTable) AS T2) AS T3
      ON T1.Value = T3.DistinctValue

SELECT ID, ROW_NUMBER() OVER (ORDER BY ID) AS RN, Value, ValueCount, EDFValue, CDFValue, EDFCDFDifference
FROM @EDFTable'

--SELECT @SQLString -- uncomment this to debug dynamic SQL errors

DECLARE @ResultTable table
(PrimaryKey sql_variant,
RN bigint,
Value float,
ValueCount bigint,
EDF float,
CDF float,
EDFCDFDifference float)

INSERT INTO @ResultTable
EXEC (@SQLString)

SELECT PrimaryKey, RN, Value, ValueCount, EDF, CDF, EDFCDFDifference
FROM @ResultTable
</pre>
<p><span style="font-size:10pt;color:white;">…………</span>If the distribution being tested by the CDF is a good match, then the coordinates ought to fall as close as possible to an imaginary center line running from (0,0) to (1,1), which are the boundaries of any EDF or CDF calculation. That’s obviously not the case in the first plot in Figure 2, where the coordinates are shifted far to the left and top; note that its horizontal axis is also skewed, with most of the values lopped off. The other three all have standard 0.1 intervals, including the second plot, which seems to be a good match. This is not surprising, given that I’ve already performed much more sophisticated goodness-of-fit tests on this data, which represents the second float column in the Higgs Boson Dataset I downloaded from <a href="https://archive.ics.uci.edu/ml/datasets/HIGGS">University of California at Irvine’s Machine Learning Repository</a> ages ago for practice data on this blog. The abnormal plot above it comes from the first float column in the same dataset, which routinely fails tests for the Gaussian/normal distribution. Note how thick the lines are in both: this is because there are 11 million rows in the practice dataset, with 5,001 distinct values for the second column alone. Most of the tests I’ll survey in this series perform well in the database engine, but trying to depict that many values in an SSRS report can obviously lead to congestion in the user interface. The first plot was particularly slow to load on my development machine. The third plot loaded quickly because it came from the Duchennes muscular dystrophy dataset[12] I’ve also been using for demonstration purposes, which has a mere 209 rows. The Lactate Dehydrogenase enzyme data embodied in the column I plugged into my procedure is probably not normally distributed, given how erratic it is at the tails and bowed at the center. 
The fourth plot comes from a time dataset that may be Gaussian despite its jagged appearance, which is caused by the discrete 15-minute intervals it tracks. It is in situations like this that knowing your data is an immense help in successful interpretation, which is, after all, the end goal of any data mining endeavor. In many other contexts, serrated shapes are often an indicator of abnormality; in this instance, the shape is simply dictated by the fixed width of the intervals being tracked.</p>
<p><strong><u>Figure 2: Four Sample Probability-Probability Plots Derived from T-SQL<br />
<a href="https://multidimensionalmayhem.files.wordpress.com/2015/11/higgs-column-1-pp-plot.jpg"><img class="alignnone size-full wp-image-524" src="https://multidimensionalmayhem.files.wordpress.com/2015/11/higgs-column-1-pp-plot.jpg?w=604&h=407" alt="Higgs Column 1 PP Plot" width="604" height="407" /></a> </u></strong></p>
<p> </p>
<p><strong><u><a href="https://multidimensionalmayhem.files.wordpress.com/2015/11/higgs-column-2-pp-plot.jpg"><img class="alignnone size-full wp-image-525" src="https://multidimensionalmayhem.files.wordpress.com/2015/11/higgs-column-2-pp-plot.jpg?w=604&h=397" alt="Higgs Column 2 PP Plot" width="604" height="397" /></a></u></strong></p>
<p><strong><u> <a href="https://multidimensionalmayhem.files.wordpress.com/2015/11/jagged-plot.jpg"><img class="alignnone size-full wp-image-526" src="https://multidimensionalmayhem.files.wordpress.com/2015/11/jagged-plot.jpg?w=604&h=392" alt="Jagged Plot" width="604" height="392" /></a></u></strong></p>
<p><strong><u> <a href="https://multidimensionalmayhem.files.wordpress.com/2015/11/lactate-dehyrogenase-pp-plot.jpg"><img class="alignnone size-full wp-image-527" src="https://multidimensionalmayhem.files.wordpress.com/2015/11/lactate-dehyrogenase-pp-plot.jpg?w=604&h=393" alt="Lactate Dehyrogenase PP Plot" width="604" height="393" /></a><br />
</u></strong></p>
<p><span style="font-size:10pt;color:white;">…………</span>It should be fairly obvious just from glancing at the results that P-P plots can serve as outlier detection methods in and of themselves; as the <a href="http://www.itl.nist.gov/div898/handbook/index.htm">National Institute of Standards and Technology’s Engineering Statistics Handbook</a> (one of my favorite online statistical references) points out, “In addition to checking the normality assumption, the lower and upper tails of the normal probability plot can be a useful graphical technique for identifying potential outliers. In particular, the plot can help determine whether we need to check for a single outlier or whether we need to check for multiple outliers.”[13] Nevertheless, I omitted them from my last tutorial series because they’re simply too crude to be effective in this capacity. If we were going to spot aberrant data points by eye in this manner, we might be better off comparing histograms like the ones I introduced in <a href="https://multidimensionalmayhem.wordpress.com/2015/04/21/outlier-detection-with-sql-server-part-6-1-visual-outlier-detection-with-reporting-services/">Outlier Detection with SQL Server Part 6.1: Visual Outlier Detection with Reporting Services</a> with the PDFs of the distributions we want to compare. Even then, we still run into the same chicken-and-egg problem that we encountered throughout the series on outliers: without goodness-of-fit testing, we can’t determine what the underlying distribution should be and therefore can’t tell if <em>any</em> records are outliers. If we force these fitness tests to do double duty, we end up sitting between two stools, as the old German proverb says, because then we can’t be sure of either the distribution or the aberrance of the underlying data points. Moreover, like most other outlier methods, this one doesn’t provide any information whatsoever on <em>why</em> a record is aberrant. 
Furthermore, some of the approximations the underlying functions use also intrinsically discount outliers, as Acklam’s does.[14] In the case of P-P plots and Q-Q plots, we’re more often than not better off using them in their original capacity as fitness tests. No harm is done if we spot an aberrant data point in the scatter plots and flag it for further investigation, but scaling this approach up to full-fledged automatic outlier detection would become problematic once we get into the thousands or millions of data points.<br />
<span style="font-size:10pt;color:white;">…………</span>This size issue also places a built-in limitation on the usefulness of these visual methods for fitness testing purposes. If all of the data points from a single table are crammed into one thick black line that obscures all of the underlying detail, then we can still draw a reasonable conclusion that the data fits the distribution we’re comparing it against. That approach is no longer tenable once we’re talking about one thousand out of a million records being off that line, which forces us to make a thousand judgment calls. Once we try to scale up these visual methods, we run into many of the same problems we encountered with the visual outlier detection methods surveyed in the last series, such as the use of binning and banding – not to mention the annoying restriction in Reporting Services against consuming more than a single resultset from each stored procedure, which forces us to discard any summary data that really ought to be calculated in T-SQL, MDX or DAX rather than in RS. These methods also have some specific inherent limitations, such as the inapplicability of P-P plots when the two distributions don’t have roughly similar center points (as measured by means, medians, modes, etc.).[15] At a much broader level, these tests don’t provide much information on <em>how well</em> a dataset fits a particular distribution, because that would involve half-conscious visual assessments of how much each outlier counts for or against the final verdict. For example, how are we to weigh seven outliers that are two quantiles off the mark against three that are half a quantile away? These tests are conveniences that allow users to make spot assessments of the fitness of distributions at a glance, with a minimum of interpretation and computational costs, but they simply don’t have much muscle. That is the unavoidable drawback of simplistic tests of this type. 
They amount to brute force, unconscious assessments that “if nothing looks out of place, the fitness of the distribution is not an issue we need to be worried about” – i.e. the flip side of visual outlier detection methods, which boil down to “if it looks out of place, we’ll look at it more closely.” Once the need arises for more definite confirmation of a dataset’s fit to a particular distribution, we have to resort to tests of greater sophistication, which invariably churn out numeric results rather than eye candy. If I don’t take a quick detour into Q-Q Plots next time around, then in the next installment we’ll climb another rung up this ladder of sophistication as we discuss skewness and kurtosis, which can provide greater detail about how closely a dataset fits its target distribution.</p>
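<p><span style="font-size:10pt;color:white;">…………</span>To preview how the eyeball test gives way to numbers, note that a P-P plot can be collapsed into a single statistic: the largest absolute gap between its EDF and CDF coordinates, which is in essence the Kolmogorov-Smirnov measure that more rigorous fitness tests formalize. The coordinates in this Python sketch are hypothetical, purely for illustration:</p>

```python
def max_edf_cdf_difference(points):
    """Collapse a P-P plot into one number: the largest absolute gap
    between the EDF and CDF coordinates. Values near 0 indicate a
    close fit; this is essentially the Kolmogorov-Smirnov statistic."""
    return max(abs(edf - cdf) for edf, cdf in points)

# Hypothetical (EDF, CDF) coordinates of the kind the procedure returns;
# a perfect fit would put every point exactly on the diagonal
points = [(0.125, 0.131), (0.375, 0.352), (0.625, 0.648), (1.0, 0.979)]
print(round(max_edf_cdf_difference(points), 3))  # → 0.023
```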
<p> </p>
<p>[1] See the <u>Wikipedia</u> articles “P-P Plot” and “Normal Probability Plot” respectively at <a href="http://en.wikipedia.org/wiki/P%E2%80%93P_plot">http://en.wikipedia.org/wiki/P%E2%80%93P_plot</a> and <a href="http://en.wikipedia.org/wiki/Normal_probability_plot">http://en.wikipedia.org/wiki/Normal_probability_plot</a> for mention of these conundrums.</p>
<p>[2] As pointed out in the last article, for the sake of convenience I’ll be using the term “probability distribution function” (PDF) to denote both probability density functions and the equivalent concept for distributions on discrete scales, probability mass functions (PMFs). This is sometimes done in the literature, but not often.</p>
<p>[3] See the <u>Wikipedia</u> article “Quantile Function” at <a href="http://en.wikipedia.org/wiki/Quantile_function">http://en.wikipedia.org/wiki/Quantile_function</a> for the terminology.</p>
<p>[4] See this comment at the <u>Wikipedia</u> page “Order Statistic” at <a href="http://en.wikipedia.org/wiki/Order_statistic">http://en.wikipedia.org/wiki/Order_statistic</a> :”When using probability theory to analyze order statistics of random samples from a continuous distribution, the cumulative distribution function is used to reduce the analysis to the case of order statistics of the uniform distribution.”</p>
<p>[5] See the <u>CrossValidated</u> thread “Cumulative Distribution Function: What Does t in \int\exp(-t^2)dt stand for?” at <a href="http://stats.stackexchange.com/questions/111868/cumulative-distribution-function-what-does-t-in-int-exp-t2dt-stand-for">http://stats.stackexchange.com/questions/111868/cumulative-distribution-function-what-does-t-in-int-exp-t2dt-stand-for</a></p>
<p>[6] Another source I found useful was Holmes, Susan, 1998, “Order Statistics 10/30,” published Dec. 7, 1998 at the Stanford University web address <a href="http://statweb.stanford.edu/~susan/courses/s116/node79.html">http://statweb.stanford.edu/~susan/courses/s116/node79.html</a></p>
<p>[7] pp. 59-61, Rohlf, F. James and Sokal, Robert R., 1995, <u>Statistical Tables</u>. Freeman: New York. Retrieved from the Google Books web address <a href="http://books.google.com/books?id=1ImWLlMxEzoC&pg=PA59&lpg=PA59&dq=rankits+example&source=bl&ots=fWnT_Gfhvy&sig=bXSLnrtWqlbmT07FXVnVKd5wqbY&hl=en&sa=X&ei=gNJFVJCmNIf2OqKNgMgF&ved=0CDkQ6AEwAg#v=onepage&q=rankits%20example&f=false">http://books.google.com/books?id=1ImWLlMxEzoC&pg=PA59&lpg=PA59&dq=rankits+example&source=bl&ots=fWnT_Gfhvy&sig=bXSLnrtWqlbmT07FXVnVKd5wqbY&hl=en&sa=X&ei=gNJFVJCmNIf2OqKNgMgF&ved=0CDkQ6AEwAg#v=onepage&q=rankits%20example&f=false</a></p>
<p>[8] Some sources I used when trying to implement the factorial formula include p. 410, Teichroew, D., 1956, “Tables of Expected Values of Order Statistics and Products of Order Statistics for Samples of Size Twenty and Less from the Normal Distribution,” pp. 410-426 in <u>The Annals of Mathematical Statistics</u>, Vol. 27, No. 2. Available at the <u>Project Euclid</u> web address <a href="http://projecteuclid.org/euclid.aoms/1177728266">http://projecteuclid.org/euclid.aoms/1177728266</a> as well as Weisstein, Eric W., 2014, “Order Statistic,” published at the Wolfram MathWorld web address <a href="http://mathworld.wolfram.com/OrderStatistic.html" rel="nofollow">http://mathworld.wolfram.com/OrderStatistic.html</a></p>
<p>[9] See Acklam, Peter John, 2010, “An Algorithm for Computing the Inverse Normal Cumulative Distribution Function,” published Jan. 21, 2010, at the <u>Peter’s Page</u> website. Available online at <a href="http://home.online.no/~pjacklam/notes/invnorm/">http://home.online.no/~pjacklam/notes/invnorm/</a> I made some corrections to my original implementation after consulting John Herrero’s VB example at <a href="http://home.online.no/~pjacklam/notes/invnorm/impl/herrero/inversecdf.txt" rel="nofollow">http://home.online.no/~pjacklam/notes/invnorm/impl/herrero/inversecdf.txt</a> and discovering that I had left off several minus signs from the constants; these might have been clipped off when I imported them.</p>
<p>[10] See the <u>Wikipedia</u> article “Probability Plot Correlation Coefficient Plot” at <a href="http://en.wikipedia.org/wiki/Probability_plot_correlation_coefficient_plot">http://en.wikipedia.org/wiki/Probability_plot_correlation_coefficient_plot</a></p>
<p>[11] I checked the inverse CDF values at p. 15, University of Glasgow School of Mathematics & Statistics, 2012, “Statistical Tables,” published June 21, 2012 at the <u>University of Glasgow School of Mathematics & Statistics</u> web address <a href="http://www.stats.gla.ac.uk/~levers/software/tables/" rel="nofollow">http://www.stats.gla.ac.uk/~levers/software/tables/</a></p>
<p>[12] I downloaded this long ago from <a href="http://biostat.mc.vanderbilt.edu/wiki/Main/DataSets">Vanderbilt University’s Department of Biostatistics</a><u>.</u></p>
<p>[13] See National Institute of Standards and Technology, 2014, “1.3.5.17 Detection of Outliers,” published in the online edition of the <u>Engineering Statistics Handbook</u>. Available online at <a href="http://www.itl.nist.gov/div898/handbook/eda/section3/eda35h.htm">http://www.itl.nist.gov/div898/handbook/eda/section3/eda35h.htm</a> . Also see</p>
<p>“1.3.3.26.10. Scatter Plot: Outlier” at <a href="http://www.itl.nist.gov/div898/handbook/eda/section3/scattera.htm">http://www.itl.nist.gov/div898/handbook/eda/section3/scattera.htm</a></p>
<p>[14] See Acklam, Peter John, 2010.</p>
<p>[15] See the aforementioned <u>Wikipedia</u> article “P-P Plot” at <a href="http://en.wikipedia.org/wiki/P%E2%80%93P_plot">http://en.wikipedia.org/wiki/P%E2%80%93P_plot</a></p>Goodness-of-Fit Testing with SQL Server, part 1: The Simplest Methods
http://www.sqlservercentral.com/blogs/multidimensionalmayhem/2015/10/17/goodness-of-fit-testing-with-sql-server-part-1-the-simplest-methods/
Sat, 17 Oct 2015 08:55:16 UT/blogs/multidimensionalmayhem/2015/10/17/goodness-of-fit-testing-with-sql-server-part-1-the-simplest-methods/1http://www.sqlservercentral.com/blogs/multidimensionalmayhem/2015/10/17/goodness-of-fit-testing-with-sql-server-part-1-the-simplest-methods/#comments<p><strong>By Steve Bolton</strong></p>
<p><span style="font-size:10pt;color:white;">…………</span>In the last series of mistutorials I published in this amateur SQL Server blog, the outlier detection methods I explained were often of limited usefulness because of a chicken-and-egg problem: some of the tests could tell us that certain data points did not fit a particular set of expected values, but not whether those records were aberrations from the correct distribution, or if our expectations were inappropriate. The problem is akin to trying to solve an algebra problem with too many variables, which often can’t be done without further information. Our conundrum can be addressed by adding that missing information through goodness-of-fit tests, which can give us independent verification of whether or not our data ought to follow a particular distribution; only then can we apply batteries of other statistical tests that require particular distributions in order to make logically valid inferences, including many of the outlier identification methods discussed previously in this blog.<br />
<span style="font-size:10pt;color:white;">…………</span>As I touched on frequently in that series, it is not uncommon for researchers in certain fields to fail to perform distribution testing, which thereby renders many of their published studies invalid. It is really an obvious problem that any layman can grasp: if we don’t have an expected pattern in mind, then it is difficult to define departures from it, which is essentially what outliers are. Goodness-of-fit tests also provide insights into data that are useful in and of themselves, as a sort of primitive form of data mining, which can be leveraged further to help us make informed choices about which of the more advanced (and concomitantly costly in terms of performance and interpretation effort) algorithms ought to be applied next in a data mining workflow. In fact, SSDM provides a Distribution property allowing users to specify whether a mining column follows a Log Normal, Normal or Uniform pattern, as I touched on briefly in <a href="http://www.sqlservercentral.com/blogs/multidimensionalmayhem/2012/11/15/a-rickety-stairway-to-sql-server-data-mining-part-00-an-introduction-to-an-introduction/">A Rickety Stairway to SQL Server Data Mining, Part 0.0: An Introduction to an Introduction</a>. In this series of mistutorials, I will be focusing more on the information that goodness-of-fit testing can give us about our data, rather than on the statistical tests (particularly on hypotheses) it typically serves as a prerequisite to. For all intents and purposes, it will be used as a ladder to future blog posts on more sophisticated data mining techniques that can be implemented in SQL Server, provided that we have some prior information about the distribution of the data.</p>
<p style="text-align:center;"><strong>Probability Distributions vs. Regression Lines</strong></p>
<p> Goodness-of-fit tests are also sometimes applicable to regression models, which I introduced in posts like <a href="http://www.sqlservercentral.com/blogs/multidimensionalmayhem/2013/01/08/a-rickety-stairway-to-sql-server-data-mining-algorithm-2-linear-regression/">A Rickety Stairway to SQL Server Data Mining, Algorithm 2: Linear Regression</a> and <a href="https://multidimensionalmayhem.wordpress.com/2013/01/23/a-rickety-stairway-to-sql-server-data-mining-algorithm-4-logistic-regression/">A Rickety Stairway to SQL Server Data Mining, Algorithm 4: Logistic Regression</a>. I won’t rehash the explanations here for the sake of brevity; suffice it to say that regressions can be differentiated from probability distributions by looking at them as line charts which point towards the predicted values of one or more variables, whereas distributions are more often represented as histograms representing the full range of a variable’s actual or potential values. I will deal with methods more applicable to regression later in this series, but in this article I’ll explain some simple methods for implementing the more difficult concept of a probability distribution. One thing that sets them apart is that many common histogram shapes associated with them have been labeled, cataloged and studied intensively for generations, in a way that the lines produced by regressions have not. In fact, it may be helpful for people with programming backgrounds (like many SQL Server DBAs) to look at them as objects, in the same sense as object-oriented programming. For example, some of them are associated with Location, Scale and Shape parameters and characteristics like the mode (i.e. the peak of the histogram) and median that can be likened to properties. 
For an excellent explanation of location and scale parameters that any layman could understand, see the <a href="http://www.itl.nist.gov/div898/handbook/eda/section3/eda364.htm">National Institute for Standards and Technology’s Engineering Statistics Handbook</a>, which is one of the most readable sources of information on stats that I’ve found online to date. Statisticians have also done an enormous amount of work studying every conceivable geometrical subsection of distributions and devised measures for them, such as skewness and kurtosis for the left and right corners or “tails” of a histogram. Each distribution has an associated set of functions, such as the probability density function (PDF) in the case of Continuous data types (as explained in <a href="http://www.sqlservercentral.com/blogs/multidimensionalmayhem/2012/11/15/a-rickety-stairway-to-sql-server-data-mining-part-00-an-introduction-to-an-introduction/">A Rickety Stairway to SQL Server Data Mining, Part 0.0: An Introduction to an Introduction</a>) or the probability mass function (PMF) in the case of Discrete types. “Probability distribution function” (PDF) is occasionally used for either one in the literature and will be used as a catch-all term throughout this series.[i] Other common functions associated with distributions include the cumulative distribution function (CDF); inverse cumulative distribution function (also known as the quantile function, percent point function, or ppf); hazard function; cumulative hazard function; survival function; inverse survival function; empirical distribution function (EDF); moment-generating function (MGF) and characteristic function (CF)[ii]. I’ll save discussions of more advanced functions for Fisher Information and Shannon’s Entropy that are frequently used in information theory and data mining for a future series, Information Measurement with SQL Server. 
Furthermore, many of these functions can have orders applied to them, such as rankits, which are a concept I’ll deal with in the next article. I don’t yet know what many of them do, but some of the more common ones like the PDFs and CDFs are implemented in the goodness-of-fit tests for particular distributions, so we’ll be seeing T-SQL code for them later in this series.<br />
<span style="font-size:10pt;color:white;">…………</span>I also don’t yet know what situations you’re liable to encounter particular data distributions in, although I aim to by the end of the series. I briefly touched on Student’s T-distribution in the last series, where it is used in some of the hypothesis-testing based outlier detection methods, but I’m not yet acquainted with some of the others frequently mentioned in the data mining literature, like the Gamma, Exponential, Hypergeometric, Poisson, Pareto, Tukey-Lambda, Laplace and Chernoff distributions. The Chi-Squared distribution is used extensively in hypothesis testing, the Cauchy is often used in physics[iii] and the Weibull “is used to model the lifetime of technical devices and is used to describe the particle size distribution of particles generated by grinding, milling and crushing operations.”[iv] What is important for our purposes, however, is that all of the above are mentioned often in the information theory and data mining literature, which means that we can probably put them to good use in data discovery on SQL Server tables.<br />
<span style="font-size:10pt;color:white;">…………</span>If you really want to grasp the differences between them at a glance, a picture is worth a thousand words: simply check out the page “<a href="http://www.itl.nist.gov/div898/handbook/eda/section3/eda366.htm">1.3.6.6 Gallery of Distributions</a>” at the aforementioned NIST handbook for side-by-side visualizations of 19 of the most common distributions. Perhaps the simplest one to grasp is the Uniform Distribution, which has a mere straight line as a histogram; in other words, all values are equally likely, as we would see in rolls of single dice. The runner-up in terms of simplicity is the Bernoulli Distribution, which is merely the distribution associated with Boolean yes-no questions. Almost all of the explanations I’ve seen for it to date have revolved around coin tosses, which any elementary school student can understand. Dice and coin tosses are invariably used to illustrate such concepts in the literature on probabilities because they’re so intuitive, but they also have an advantage in that we can calculate exactly what the results should be, in the absence of any empirical evidence. The problem we run into in data mining is that we’re trying to discover relationships that we can’t reason out in advance, using the empirical evidence provided by the billions of rows in our cubes and tables. Once we’ve used goodness-of-fit testing to establish that the data we’ve collected indeed follows a particular distribution, then we can use all of the functions, properties, statistical tests, data mining techniques and theorems associated with it to quickly make a whole series of new inferences.</p>
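To put a little flesh on the functions named above, here is a minimal Python sketch (purely an illustrative aside, separate from the T-SQL in this series) of the PDF, CDF and survival function for the normal distribution, built from nothing but the standard math module:

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Probability density function of the normal distribution."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def normal_cdf(x, mu=0.0, sigma=1.0):
    """Cumulative distribution function, expressed via the error function."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

def normal_survival(x, mu=0.0, sigma=1.0):
    """Survival function: simply the complement of the CDF."""
    return 1 - normal_cdf(x, mu, sigma)
```

The inverse CDF (quantile function) is conspicuously absent because it has no closed form for the normal distribution and is normally approximated numerically.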
<p style="text-align:center;"><strong>The “Normal” Distribution (i.e. the Bell Curve)</strong></p>
<p><span style="font-size:10pt;color:white;">…………</span>This is especially true of the Gaussian or “normal” distribution, which is by far the most thoroughly studied of them all, simply because an uncanny array of physical processes approximate it. The reasons for its omnipresence are still being debated to this day, but one of the reasons is baked right into the structure of mathematics through such laws as the Central Limit Theorem. Don’t let the imposing name scare you, because the concept is quite simple – to the point where mobsters, I’m told, used to teach themselves to instantly calculate gambling odds from it in order to run book-making operations. Once again, dice are the classic example used to explain the concept: there are obviously many paths through which one could roll a total of six from two dice, but only one combination apiece for snake eyes or boxcars. The results thus naturally form the familiar bell curve associated with the normal distribution. The most common version of it is the “standard normal distribution,” in which a mean of zero and standard deviation of one are plugged into its associated functions, which force it to form a clean bell curve centered on the zero mark in a histogram. The frequency with which the normal distribution pops up in nature is what motivates the disproportionate amount of research poured into it; even the Student’s T-distribution and the Chi-Square Distribution, for example, are used more often in tests of the normal distribution than as descriptions of a dataset in their own right.<br />
<span style="font-size:10pt;color:white;">…………</span>Unfortunately, one side effect of this lopsided devotion to one particular distribution is that there are far fewer statistical tests associated with its competitors – which tempts researchers into foregoing adequate goodness-of-fit testing, which can also be bothersome, expensive and a bit inconvenient if it disproves their assumptions. Without it, however, there is a gap in the ladder of logic needed to prove anything with hypothesis testing, or to discover new relationships through data mining. This step is disregarded with unnerving frequency – particularly in the medical field, where it can do the most damage – but ought not to be, when we can use set-based languages like T-SQL and modern data warehouse infrastructure to quickly perform the requisite goodness-of-fit tests. Perhaps some of the code I’ll provide in this series can even be used in automated testing on a weekly or monthly basis, to ensure that particular columns of interest still follow a particular distribution over time and don’t come uncoupled from it, as stocks, bonds, derivatives and other financial instruments do so frequently from other economic indicators. It is often a fair assumption that a particular dataset ought to follow a normal distribution, but it doesn’t always hold – nor can we say why in many of the cases where it actually does, since the physical processes captured in our billions of records are several orders of magnitude more complex than rolls of dice and coin tosses. Nor can we be certain that many of these complex processes will continue to follow a particular distribution over time, particularly when that most inscrutable of variables, human free will, is factored in.<br />
<span style="font-size:10pt;color:white;">…………</span>Luckily, there are many goodness-of-fit tests available for the normal distribution, which is fitting given that so much statistical reasoning is dependent on it. Most of the articles in this series will thus be devoted to normality testing, although we may encounter other distributions from time to time, not to mention the tangential topic of regression. I considered kick-starting this series with four incredibly easy methods of normality testing, but one of them turned out to be nowhere near as popular or simple to implement as I believed. The ratio between the min-max range of a column and its standard deviation is listed among the earliest normality tests at Wikipedia[v], but I decided against implementing it fully due to the lack of available comparison values. The concept is quite simple: you subtract the minimum value from a column’s maximum value, then divide by the standard deviation and compare it to a lookup table, but the only reference I could find (in Hartley et al.’s original paper[vi] from 1954) only went up to 1,000 records and only supplied values for 30 of them. We frequently encountered the same twin problems in the outlier detection series with methods based on hypothesis-testing: most of the lookup tables have massive gaps and are applicable to only a few hundred or thousand records at best, which means they are unsuited to the size of typical SQL Server tables or that popular buzzword, “Big Data.” In the absence of complete lookup tables ranging to very high values, the only alternative is to calculate the missing values ourselves, but I have not yet deciphered these particular formulas sufficiently well. Nor is there much point, given that this particular measure is apparently not in common use and might not be applicable to big tables for other reasons, such as the fact that the two bookend values in a dataset of 10 million records probably don’t have much significance. 
The code in Figure 1 runs fast and is easy to follow, but lacks meaning in the absence of lookup tables to judge what the resulting ratio ought to be for a Gaussian distribution.</p>
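As an aside, the dice illustration of the Central Limit Theorem a few paragraphs up is easy to verify by brute-force enumeration; this throwaway Python sketch (an illustration only) counts the combinations behind each total of two dice:

```python
from collections import Counter
from itertools import product

# Enumerate all 36 equally likely outcomes of two six-sided dice
# and count how many combinations produce each total.
ways = Counter(a + b for a, b in product(range(1, 7), repeat=2))

# Snake eyes (2) and boxcars (12) each arise in exactly one way, while the
# middle totals arise in many, which is what bends the histogram of totals
# toward the familiar bell shape as more dice are added.
```

Running it shows six combinations behind a total of seven, versus a single combination apiece for snake eyes and boxcars.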
<p><strong><u>Figure 1: Code to Derive the Ratio of the Range to Standard Deviation</u></strong></p>
<p class="MsoNormal"><span style="font-size:9.5pt;font-family:Consolas;color:blue;">CREATE</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:blue;">PROCEDURE</span> <span style="color:teal;">[Calculations]</span><span style="color:gray;">.</span><span style="color:teal;">[NormalityTestRangeStDevSP]<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:teal;">@DatabaseName</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:blue;">as</span> <span style="color:blue;">nvarchar</span><span style="color:gray;">(</span>128<span style="color:gray;">)</span> <span style="color:gray;">=</span> <span style="color:gray;">NULL,</span> <span style="color:teal;">@SchemaName</span> <span style="color:blue;">as</span> <span style="color:blue;">nvarchar</span><span style="color:gray;">(</span>128<span style="color:gray;">),</span> <span style="color:teal;">@TableName</span> <span style="color:blue;">as</span> <span style="color:blue;">nvarchar</span><span style="color:gray;">(</span>128<span style="color:gray;">),</span><span style="color:teal;">@ColumnName</span> <span style="color:blue;">AS</span> <span style="color:blue;">nvarchar</span><span style="color:gray;">(</span>128<span style="color:gray;">),</span> <span style="color:teal;">@DecimalPrecision</span> <span style="color:blue;">AS</span> <span style="color:blue;">nvarchar</span><span style="color:gray;">(</span>50<span style="color:gray;">)<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:blue;">AS</span></p>
<p class="MsoNormal"><span style="font-size:9.5pt;font-family:Consolas;color:blue;">DECLARE</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:teal;">@SchemaAndTableName</span> <span style="color:blue;">nvarchar</span><span style="color:gray;">(</span>400<span style="color:gray;">),</span><span style="color:teal;">@SQLString</span> <span style="color:blue;">nvarchar</span><span style="color:gray;">(</span><span style="color:fuchsia;">max</span><span style="color:gray;">)<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:blue;">SET</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:teal;">@SchemaAndTableName</span> <span style="color:gray;">=</span> <span style="color:teal;">@DatabaseName</span> <span style="color:gray;">+</span> <span style="color:red;">'.'</span> <span style="color:gray;">+</span> <span style="color:teal;">@SchemaName</span> <span style="color:gray;">+</span> <span style="color:red;">'.'</span> <span style="color:gray;">+</span> <span style="color:teal;">@TableName</span> <span style="color:green;">--I'll change this value one time, mainly for legibility purposes<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:blue;">SET</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:teal;">@SQLString</span> <span style="color:gray;">= </span></span><span style="font-size:9.5pt;font-family:Consolas;color:red;">'DECLARE @Count bigint, </span><span lang="PT" style="font-size:9.5pt;font-family:Consolas;color:red;">@StDev decimal('</span><span lang="PT" style="font-size:9.5pt;font-family:Consolas;"> <span style="color:gray;">+ </span><span style="color:teal;">@DecimalPrecision</span> <span style="color:gray;">+ </span><span style="color:red;">'), </span></span><span lang="PT" style="font-size:9.5pt;font-family:Consolas;color:red;">@Range decimal('</span><span lang="PT" style="font-size:9.5pt;font-family:Consolas;"> <span style="color:gray;">+ </span><span style="color:teal;">@DecimalPrecision</span> <span style="color:gray;">+ </span><span style="color:red;">')<br />
</span></span><span lang="PT" style="font-size:9.5pt;font-family:Consolas;color:red;">SELECT @Count=Count(CAST('</span><span lang="PT" style="font-size:9.5pt;font-family:Consolas;"> <span style="color:gray;">+ </span><span style="color:teal;">@ColumnName</span> <span style="color:gray;">+</span> <span style="color:red;">' AS Decimal('</span> <span style="color:gray;">+</span> <span style="color:teal;">@DecimalPrecision</span> <span style="color:gray;">+</span> <span style="color:red;">'))), @StDev = StDev(CAST('</span> <span style="color:gray;">+ </span><span style="color:teal;">@ColumnName</span> <span style="color:gray;">+</span> <span style="color:red;">' AS Decimal('</span> <span style="color:gray;">+</span> <span style="color:teal;">@DecimalPrecision</span> <span style="color:gray;">+</span> <span style="color:red;">'))),<br />
</span></span><span lang="PT" style="font-size:9.5pt;font-family:Consolas;color:red;">@Range = Max(CAST('</span><span lang="PT" style="font-size:9.5pt;font-family:Consolas;"> <span style="color:gray;">+ </span><span style="color:teal;">@ColumnName</span> <span style="color:gray;">+</span> <span style="color:red;">' AS decimal('</span> <span style="color:gray;">+</span> <span style="color:teal;">@DecimalPrecision</span> <span style="color:gray;">+</span> <span style="color:red;">'))) - Min(CAST('</span> <span style="color:gray;">+</span> <span style="color:teal;">@ColumnName</span> <span style="color:gray;">+</span> <span style="color:red;">' AS decimal('</span> <span style="color:gray;">+</span> <span style="color:teal;">@DecimalPrecision</span> <span style="color:gray;">+</span> <span style="color:red;">')))<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:red;">FROM '</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:gray;">+</span> <span style="color:teal;">@SchemaAndTableName</span> <span style="color:gray;">+ </span><span style="color:red;">'<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:red;">WHERE '</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:gray;">+</span> <span style="color:teal;">@ColumnName</span> <span style="color:gray;">+</span> <span style="color:red;">' IS NOT NULL</span></span></p>
<p class="MsoNormal"><span style="font-size:9.5pt;font-family:Consolas;color:red;">SELECT @Range / @StDev AS RangeStDevRatio'</span></p>
<p class="MsoNormal"><span style="font-size:9.5pt;font-family:Consolas;color:green;">--SELECT @SQLString -- uncomment this to debug string errors<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:blue;">EXEC </span><span style="font-size:9.5pt;font-family:Consolas;color:gray;">(</span><span style="font-size:9.5pt;font-family:Consolas;color:teal;">@SQLString</span><span style="font-size:9.5pt;font-family:Consolas;color:gray;">)</span><span style="font-size:10pt;"> </span></p>
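For anyone who wants to sanity-check the arithmetic of Figure 1 outside the database, the same ratio can be sketched in a couple of lines of Python (an illustration only; statistics.stdev uses the same sample-standard-deviation semantics as T-SQL’s StDev):

```python
import statistics

def range_stdev_ratio(values):
    """Ratio of the min-max range to the sample standard deviation,
    mirroring the calculation performed by the stored procedure above."""
    return (max(values) - min(values)) / statistics.stdev(values)
```

For example, range_stdev_ratio([1, 2, 3, 4, 5]) works out to 4 divided by the sample standard deviation of about 1.58, or roughly 2.53.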
<p><span style="font-size:10pt;color:white;">…………</span>Thankfully, we have better replacements available at the same low level of complexity. One of the most rudimentary normality tests that any DBA can easily implement and interpret is the 68-95-99.7 Rule, also known as the 3-Sigma Rule. The logic is very simple: if the data follows a normal distribution, then 68 percent of the values should fall within the first standard deviation, 95 percent within the second and 99.7 percent within the third. This can be verified with a simple histogram of distinct counts, of the kind I introduced at the tail end of the last tutorial series. To implement my version, all I did was tack the code in Figure 2 onto the last Select in the HistogramBasicSP stored procedure I posted in <a href="https://multidimensionalmayhem.wordpress.com/2015/04/21/outlier-detection-with-sql-server-part-6-1-visual-outlier-detection-with-reporting-services/">Outlier Detection with SQL Server, part 6.1: Visual Outlier Detection with Reporting Services</a>. I also changed the name to HistogramBasicPlusNormalPassFailSP to reflect the added capabilities; for brevity’s sake, I won’t repeat the rest of the code. A @NumberOfStDevsFromTheMean parameter can be added to this code and combined with a clause like SELECT 1 - (1 / POWER(@NumberOfStDevsFromTheMean, 2)) to calculate Chebyshev’s Rule, a less strict test that applies to almost any distribution, not just the normal. In practice, this signifies that for any distribution, at least three-quarters of the values will fall within two standard deviations of the mean, 88.9 percent within three, and 93.75 and 96 percent within four and five standard deviations respectively. 
The 3-Sigma Rule is closely related to the Law of Large Numbers, and Chebyshev’s Rule to its poor cousin, the Weak Law of Large Numbers; if your data fails the first test there’s no reason to hit the panic button, since it might not naturally follow a normal distribution, but failing Chebyshev’s Rule is cause to raise more than one eyebrow.</p>
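Since it is easy to garble the exponent in Chebyshev’s formula, the lower bound 1 - 1/k^2 is worth tabulating; a quick Python sketch (illustration only):

```python
def chebyshev_lower_bound(k):
    """Minimum fraction of any distribution's values that must lie within
    k standard deviations of the mean, by Chebyshev's inequality (k > 1)."""
    return 1 - 1 / k ** 2

# At least 75% of values fall within 2 standard deviations, ~88.9% within 3,
# 93.75% within 4 and 96% within 5, for any distribution whatsoever; the
# normal-distribution figures of 68-95-99.7 at k = 1, 2, 3 are much tighter.
bounds = {k: chebyshev_lower_bound(k) for k in (2, 3, 4, 5)}
```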
<p><strong><u>Figure 2: Code to Add to the HistogramBasicSP from the Outlier Detection Series</u></strong></p>
<p class="MsoNormal"><span style="font-size:9.5pt;font-family:Consolas;color:blue;">WHEN</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:teal;">@HistogramType</span> <span style="color:gray;">=</span> 4 <span style="color:blue;">THEN</span> <span style="color:red;">'</span></span><span style="font-size:9.5pt;font-family:Consolas;color:red;">SELECT *, </span><span style="font-size:9.5pt;font-family:Consolas;color:red;">''FirstIntervalTest'' =<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;"> CASE WHEN FirstIntervalPercentage BETWEEN 68 AND 100 THEN ''Pass''<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;">ELSE ''Fail'' </span><span style="font-size:9.5pt;font-family:Consolas;color:red;">END,<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;"> ''SecondIntervalTest'' = </span><span style="font-size:9.5pt;font-family:Consolas;color:red;"> CASE WHEN SecondIntervalPercentage BETWEEN 95 AND 100 THEN ''Pass''<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;">ELSE ''Fail'' </span><span style="font-size:9.5pt;font-family:Consolas;color:red;">END,<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;">''ThirdIntervalTest'' = </span><span style="font-size:9.5pt;font-family:Consolas;color:red;">CASE WHEN ThirdIntervalPercentage BETWEEN 99.7 AND 100 THEN ''Pass''<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;">ELSE ''Fail'' </span><span style="font-size:9.5pt;font-family:Consolas;color:red;">END<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;">FROM </span><span style="font-size:9.5pt;font-family:Consolas;color:red;">(SELECT TOP 1 CAST(@PercentageMultiplier *<br />
(SELECT Sum(FrequencyCount) FROM DistributionWithIntervalsCTE WHERE<br />
StDevInterval BETWEEN -1 AND 1) AS decimal(6,2)) AS FirstIntervalPercentage,<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;">CAST(@PercentageMultiplier * (SELECT<br />
Sum(FrequencyCount) FROM DistributionWithIntervalsCTE WHERE StDevInterval<br />
BETWEEN -2 AND 2) AS decimal(6,2)) AS SecondIntervalPercentage,<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;">CAST(@PercentageMultiplier * (SELECT<br />
Sum(FrequencyCount) FROM DistributionWithIntervalsCTE WHERE StDevInterval<br />
BETWEEN -3 AND 3) AS decimal(6,2)) AS ThirdIntervalPercentage<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;">FROM DistributionWithIntervalsCTE) AS T1'</span><span style="font-size:10pt;"> </span></p>
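The pass/fail logic of Figure 2 reduces to three cumulative percentages compared against 68, 95 and 99.7; here is a hedged Python paraphrase of that idea (an illustration only, not a drop-in replacement for the procedure):

```python
import statistics

def three_sigma_test(values):
    """Fraction of values within 1, 2 and 3 sample standard deviations of
    the mean, each paired with a pass/fail flag per the 68-95-99.7 Rule."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    thresholds = {1: 0.68, 2: 0.95, 3: 0.997}
    results = {}
    for k, threshold in thresholds.items():
        inside = sum(1 for v in values if abs(v - mean) <= k * sd)
        fraction = inside / len(values)
        results[k] = (fraction, fraction >= threshold)
    return results
```

Run against a uniformly distributed column of the integers 1 through 100, the first interval fails (only 58 percent of the values fall within one standard deviation) while the second and third trivially pass, which is exactly the kind of mixed verdict this test is meant to surface.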
<p><strong><u>Figure 3: Results of the HistogramBasicPlusNormalPassFailSP on the Hemopexin Column<br />
</u></strong><span style="font-size:9.5pt;font-family:Consolas;color:blue;">EXEC</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:teal;">Calculations</span><span style="color:gray;">.</span><span style="color:teal;">HistogramBasicPlusNormalPassFailSP<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:blue;"> </span><span style="font-size:9.5pt;font-family:Consolas;color:teal;">@DatabaseName</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:gray;">= </span><span style="color:red;">N'DataMiningProjects'</span><span style="color:gray;">,<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:teal;">@SchemaName </span><span style="color:gray;">=</span> <span style="color:red;">N'Health'</span><span style="color:gray;">,<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:teal;">@TableName</span> <span style="color:gray;">=</span> <span style="color:red;">N'DuchennesTable'</span><span style="color:gray;">,<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:teal;">@ColumnName </span><span style="color:gray;">=</span> <span style="color:red;">N'Hemopexin'</span><span style="color:gray;">,<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:teal;">@DecimalPrecision </span><span style="color:gray;">=</span> <span style="color:red;">'38,21',<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:teal;">@HistogramType </span><span style="color:gray;">=</span> 4</span></p>
<p><a href="https://multidimensionalmayhem.files.wordpress.com/2015/10/threesigmarule.jpg"><img class="alignnone size-full wp-image-520" src="https://multidimensionalmayhem.files.wordpress.com/2015/10/threesigmarule.jpg?w=604&h=55" alt="ThreeSigmaRule" width="604" height="55" /></a><br />
<span style="font-size:10pt;color:white;">…………</span>The results in Figure 3 are child’s play to interpret: the Hemopexin column (in a dataset on the Duchennes form of muscular dystrophy which I downloaded from the <a href="http://biostat.mc.vanderbilt.edu/wiki/Main/DataSets">Vanderbilt University’s Department of Biostatistics</a> and converted to a SQL Server table) does not quite fit a normal distribution, since the count of values for the first two standard deviations falls comfortably within the 68-95-99.7 Rule, but the third does not. Whenever I needed to stress-test the code posted in the last tutorial series on something more substantial than the Duchennes dataset’s mere 209 rows, I turned to the Higgs Boson dataset made available by the <a href="https://archive.ics.uci.edu/ml/datasets/HIGGS">University of California at Irvine’s Machine Learning Repository</a>, which now occupies close to 6 gigabytes of the same DataMiningProjects database. Hopefully in the course of one of these tutorial series (which I plan to keep writing for years to come, till I actually <em>know</em> something about data mining) I will be able to integrate practice datasets from the <a href="http://en.wikipedia.org/wiki/Voynich_manuscript">Voynich Manuscript</a>, an inscrutable medieval tome encrypted so well that no one has been able to crack it for the last half-millennium – even the National Security Agency (NSA). The first float column of the Higgs Boson dataset probably makes for a better performance test though, given that the table has 11 million rows, far more than the tens or hundreds of thousands of rows in the tables that I’ve currently compiled from the Voynich Manuscript. 
The good news is that this simple procedure gave us a quick and dirty normality test in just 4 minutes and 16 seconds on my six-core <a href="https://www.youtube.com/watch?v=1WqazleR3FE">Sanford and Son</a> version of a development machine – which hardly qualifies as a real server, so the results in a professional setting will probably blow that away.</p>
<p><strong><u>Figure 4: Code to Add to Derive the Ratio of Mean Absolute Deviation to Standard Deviation<br />
</u></strong><span style="font-size:9.5pt;font-family:Consolas;color:blue;">CREATE</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:blue;">PROCEDURE</span> <span style="color:teal;">[Calculations]</span><span style="color:gray;">.</span><span style="color:teal;">[NormalityTestMeanAbsoluteDeviationStDevRatioSP]<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:teal;">@DatabaseName</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:blue;">as</span> <span style="color:blue;">nvarchar</span><span style="color:gray;">(</span>128<span style="color:gray;">)</span> <span style="color:gray;">=</span> <span style="color:gray;">NULL,</span> <span style="color:teal;">@SchemaName</span> <span style="color:blue;">as</span> <span style="color:blue;">nvarchar</span><span style="color:gray;">(</span>128<span style="color:gray;">),</span> <span style="color:teal;">@TableName</span> <span style="color:blue;">as</span> <span style="color:blue;">nvarchar</span><span style="color:gray;">(</span>128<span style="color:gray;">),</span><span style="color:teal;">@ColumnName</span> <span style="color:blue;">AS</span> <span style="color:blue;">nvarchar</span><span style="color:gray;">(</span>128<span style="color:gray;">),</span> <span style="color:teal;">@DecimalPrecision</span> <span style="color:blue;">AS</span> <span style="color:blue;">nvarchar</span><span style="color:gray;">(</span>50<span style="color:gray;">)</span><br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:blue;">AS</span></p>
<p class="MsoNormal"><span style="font-size:9.5pt;font-family:Consolas;color:blue;">DECLARE</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:teal;">@SchemaAndTableName</span> <span style="color:blue;">nvarchar</span><span style="color:gray;">(</span>400<span style="color:gray;">),</span><span style="color:teal;">@SQLString</span> <span style="color:blue;">nvarchar</span><span style="color:gray;">(</span><span style="color:fuchsia;">max</span><span style="color:gray;">)<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:blue;">SET</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:teal;">@SchemaAndTableName</span> <span style="color:gray;">=</span> <span style="color:teal;">@DatabaseName</span> <span style="color:gray;">+</span> <span style="color:red;">'.'</span> <span style="color:gray;">+</span> <span style="color:teal;">@SchemaName</span> <span style="color:gray;">+</span> <span style="color:red;">'.'</span> <span style="color:gray;">+</span> <span style="color:teal;">@TableName</span> <span style="color:green;">--I'll change this ' + @ColumnName + ' one time, mainly for legibility purposes<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:blue;">SET</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:teal;">@SQLString</span> <span style="color:gray;">= '</span></span><span style="font-size:9.5pt;font-family:Consolas;color:red;">DECLARE @Mean decimal('</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:gray;">+ </span><span style="color:teal;">@DecimalPrecision</span> <span style="color:gray;">+ </span><span style="color:red;">'), </span></span><span style="font-size:9.5pt;font-family:Consolas;color:red;">@StDev decimal('</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:gray;">+</span> <span style="color:teal;">@DecimalPrecision</span> <span style="color:gray;">+</span> <span style="color:red;">')<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:red;">SELECT @Mean = Avg(CAST('</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:gray;">+ </span><span style="color:teal;">@ColumnName</span> <span style="color:gray;">+</span> <span style="color:red;">' AS Decimal('</span> <span style="color:gray;">+</span> <span style="color:teal;">@DecimalPrecision</span> <span style="color:gray;">+</span> <span style="color:red;">'))), @StDev = StDev(CAST('</span> <span style="color:gray;">+ </span><span style="color:teal;">@ColumnName</span> <span style="color:gray;">+</span> <span style="color:red;">' AS Decimal('</span> <span style="color:gray;">+</span> <span style="color:teal;">@DecimalPrecision</span> <span style="color:gray;">+</span> <span style="color:red;">')))<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:red;">FROM '</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:gray;">+</span> <span style="color:teal;">@SchemaAndTableName</span> <span style="color:gray;">+ </span><span style="color:red;">'<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:red;">WHERE '</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:gray;">+</span> <span style="color:teal;">@ColumnName</span> <span style="color:gray;">+</span> <span style="color:red;">' IS NOT NULL</span></span><span style="font-size:9.5pt;font-family:Consolas;"> </span></p>
<p class="MsoNormal"><span style="font-size:9.5pt;font-family:Consolas;color:red;">SELECT MeanAbsoluteDeviation / @StDev AS Ratio, 0.79788456080286535587989211986877 AS RatioTarget, MeanAbsoluteDeviation, @StDev as StandardDeviation<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;">FROM (SELECT Avg(Abs('</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:gray;">+ </span><span style="color:teal;">@ColumnName</span> <span style="color:gray;">+</span> <span style="color:red;">' - @Mean)) AS MeanAbsoluteDeviation<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:red;"> FROM '</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:gray;">+</span> <span style="color:teal;">@SchemaAndTableName</span> <span style="color:gray;">+ </span><span style="color:red;">'<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:red;"> WHERE '</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:gray;">+</span> <span style="color:teal;">@ColumnName</span> <span style="color:gray;">+</span> <span style="color:red;">' IS NOT NULL) AS T1'</span></span></p>
<p class="MsoNormal"><span style="font-size:9.5pt;font-family:Consolas;color:green;">--SELECT @SQLString -- uncomment this to debug dynamic SQL errors<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:blue;">EXEC </span><span style="font-size:9.5pt;font-family:Consolas;color:gray;">(</span><span style="font-size:9.5pt;font-family:Consolas;color:teal;">@SQLString</span><span style="font-size:9.5pt;font-family:Consolas;color:gray;">)</span></p>
<p><strong><u>Figure 5: Results for the Mean Absolute Deviation to Standard Deviation Ratio Test<br />
</u></strong><span style="font-size:9.5pt;font-family:Consolas;color:blue;">EXEC</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:teal;">@return_value</span> <span style="color:gray;">=</span> <span style="color:teal;">[Calculations]</span><span style="color:gray;">.</span><span style="color:teal;">[NormalityTestMeanAbsoluteDeviationStDevRatioSP]<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:blue;"> </span><span lang="ES" style="font-size:9.5pt;font-family:Consolas;color:teal;">@DatabaseName</span><span lang="ES" style="font-size:9.5pt;font-family:Consolas;"> <span style="color:gray;">= </span><span style="color:red;">N'DataMiningProjects',<br />
</span></span><span lang="ES" style="font-size:9.5pt;font-family:Consolas;"> <span style="color:teal;">@SchemaName </span><span style="color:gray;">=</span> <span style="color:red;">N'Physics'</span><span style="color:gray;">,<br />
</span></span><span lang="ES" style="font-size:9.5pt;font-family:Consolas;"> <span style="color:teal;">@TableName </span><span style="color:gray;">=</span> <span style="color:red;">N'HiggsBosonTable',<br />
</span></span><span lang="ES" style="font-size:9.5pt;font-family:Consolas;"> <span style="color:teal;">@ColumnName </span><span style="color:gray;">=</span> <span style="color:red;">N'Column1'</span><span style="color:gray;">,<br />
</span></span><span lang="ES" style="font-size:9.5pt;font-family:Consolas;"> </span><span style="font-size:9.5pt;font-family:Consolas;color:teal;">@DecimalPrecision</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:gray;">= </span><span style="color:red;">N'33,29'</span></span></p>
<p><a href="https://multidimensionalmayhem.files.wordpress.com/2015/10/meanabsolutedeviationtostdevratio.jpg"><img class="alignnone size-full wp-image-521" src="https://multidimensionalmayhem.files.wordpress.com/2015/10/meanabsolutedeviationtostdevratio.jpg?w=604&h=67" alt="MeanAbsoluteDeviationToStDevRatio" width="604" height="67" /></a></p>
<p><span style="font-size:10pt;color:white;">…………</span>If HistogramBasicPlusNormalPassFailSP is still too slow for your needs, it may be a relief to know that the code in Figure 4 took only two seconds to run on Column1 on the same machine and a mere five seconds on Column2, which wasn’t properly indexed at the time. The procedure really isn’t hard to follow, if you’ve seen some of the T-SQL code I posted in the last tutorial series. For consistency’s sake, I’ll be using many of the same parameters in this series as I did in the last, including @DecimalPrecision, which enables users to avoid arithmetic overflows by setting their own precision and scale for the internal calculations. As we saw in the Visual Outlier Detection with Reporting Services segment of the last series, this parameter can also be used to prevent a mystifying problem in which RS reports occasionally return blank results for some columns, if their precision and scale are set too high. The first four parameters allow users to perform the normality test on any numeric column in any database for which they have adequate access, while the next-to-last line allows users to debug the dynamic SQL.<br />
<span style="font-size:10pt;color:white;">…………</span>In between those lines it calculates the absolute deviation – i.e. the difference between each record’s value and the average of the whole column, which we encountered with Z-Scores and other outlier detection methods in the last series – for each row, then takes the average and divides it by the standard deviation. I haven’t yet found a good guide as to how far the resulting ratio should be from the target ratio (which is always the square root of two divided by pi) to disqualify a distribution from being Gaussian, but I know from experience that Column1 is highly abnormal, whereas Column2 pretty much follows a bell curve. The first had a ratio of 0.921093, as depicted in Figure 5, whereas Column2 scored 0.823127 in a subsequent test, so the latter ratio converged fairly close to the target as expected.[vii] In its current form, the test lacks precision because there is no definite cut-off criterion, which may have been published somewhere I’m unaware of – especially since I’m an amateur learning as I go, which means I’m unaware of <em>a lot</em> that goes on in the fields related to data mining. It is still useful, however, because as a general rule of thumb we can judge that the abnormality of a dataset is proportional to how far the ratio is from the constant target value.</p>
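To make the target ratio concrete: for a Gaussian distribution the mean absolute deviation divided by the standard deviation converges to the square root of two divided by pi, roughly 0.7979, while other distributions settle on different constants. Here is a hedged Python sketch of the same calculation the procedure performs in dynamic SQL (the function and variable names are mine, invented for illustration):

```python
import math
import random
import statistics

# For a Gaussian distribution, MAD / StDev converges to sqrt(2 / pi)
TARGET = math.sqrt(2.0 / math.pi)   # approximately 0.7978845608...

def mad_stdev_ratio(values):
    """Mean absolute deviation from the mean, divided by the sample
    standard deviation -- the ratio the stored procedure computes."""
    mean = statistics.fmean(values)
    mad = statistics.fmean([abs(v - mean) for v in values])
    return mad / statistics.stdev(values)

random.seed(0)
gaussian = [random.gauss(0.0, 1.0) for _ in range(50_000)]
uniform = [random.uniform(0.0, 1.0) for _ in range(50_000)]
ratio_gaussian = mad_stdev_ratio(gaussian)   # lands close to TARGET
ratio_uniform = mad_stdev_ratio(uniform)     # lands near 0.866, farther away
```

The uniform sample settles near 0.866, noticeably farther from the target than the Gaussian sample; that growing gap is exactly the rule-of-thumb signal the T-SQL test relies on.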
<p style="text-align:center;"><strong>Climbing the Ladder of Sophistication with Goodness-of-Fit</strong></p>
<p> I’m fairly sure that the second float column in the Higgs Boson Dataset is Gaussian and certain that the first is not, given the shapes of the histograms provided for both in Outlier Detection with SQL Server, part 6.1: Visual Outlier Detection with Reporting Services. Histograms represent the easiest visual test of normality you can find; it may take someone with more statistical training than I have to interpret borderline cases, but any layman can detect at a glance when a distribution is definitely following some other shape besides a bell curve. In the next installment of the series, I hope to explain how to use a couple of other visual detection methods like probability plots and Q-Q plots, which are more difficult to code and reside at the upper limit of what laymen can interpret at a glance. I had a particularly difficult time calculating the CDFs for the normal distribution, for example. After that I will most likely write something about skewness, kurtosis and the Jarque-Bera test, which are also still within the upper limit of what laymen can interpret; in essence, that group measures how lopsided a distribution is on the left or right side (or “tail”) of its histogram. I wrote code for some of those measures long ago, but after that I will be in uncharted territory with topics with imposing names like the Shapiro-Wilk, D’Agostino’s K-Squared, Hosmer–Lemeshow, Chi-Squared, G, Kolmogorov-Smirnov, Anderson-Darling, Kuiper’s and Lilliefors Tests. I have a little experience with the Likelihood Ratio Test Statistic, Coefficient of Determination (R<sup>2</sup>) and Lack-of-Fit Sum of Squares, but the rest of these are still a mystery to me.<br />
<span style="font-size:10pt;color:white;">…………</span>This brings me to my usual disclaimer: I’m publishing this series in order to learn the topic, since the act of writing helps me digest new topics a lot faster and forces me to think about them more explicitly. I mainly teach through misadventure; my posts often end up as cautionary tales that amount to, “Don’t go about this the way this guy did.” There’s still some value in that, but always take the word of a professional over mine if I say anything that contradicts them; my word may carry weight in topics I have expertise in (such as foreign policy history, which I could teach at the graduate school level at the drop of a hat) but data mining and the associated statistics are definitely not among them (yet). Hopefully by the end of the series I will have learned more about probability distributions and their associated tests and made some contributions towards coding them in T-SQL; I may post a coda at the end with a use case map that can help DBAs differentiate at a glance between the various goodness-of-fit tests and their proper applications for particular distributions. At present I plan to end the series with a group of six goodness-of-fit tests with wickedly cool names like the Cramér–von Mises, the Deviance, Focused, Hannan-Quinn, Bayesian and Akaike Information Criteria. The last two of these are mentioned frequently in the information theory literature, which will help provide another springboard towards a much more interesting series I’ve been planning to write for some time, Information Measurement with SQL Server. I already built two bridges to this potentially useful but really advanced series at the tail end of Outlier Detection with SQL Server, with my posts on Cook’s Distance and Mahalanobis Distance. 
My earlier tutorial series on <a href="http://www.sqlservercentral.com/blogs/multidimensionalmayhem/2012/11/28/a-rickety-stairway-to-sql-server-data-mining-part-01-data-in-data-out/">A Rickety Stairway to SQL Server Data Mining</a> also served as a bridge of sorts, since some of the algorithms are related to the methods of information measurement we’ll touch on in that future series. Understanding probability distributions and goodness-of-fit is a prerequisite to cutting-edge topics like Shannon’s Entropy, Minimum Description Length (MDL) and Kolmogorov Complexity that I’ll deal with in that series, which may be quite useful to miners of SQL Server data.</p>
<p>[i] For a discussion, see the Wikipedia article “Probability Density Function” at <a href="http://en.wikipedia.org/wiki/Probability_density_function">http://en.wikipedia.org/wiki/Probability_density_function</a>. I have seen “probability distribution function” used to denote both mass and density functions in other data mining and statistical literature, albeit infrequently.</p>
<p>[ii] See the <u>Wikipedia</u> article “Characteristic Function” at</p>
<p><a href="http://en.wikipedia.org/wiki/Characteristic_function_(probability_theory)">http://en.wikipedia.org/wiki/Characteristic_function_(probability_theory)</a></p>
<p>[iii] See the <u>Wikipedia</u> article “Cauchy Distribution” <a href="http://en.wikipedia.org/wiki/Cauchy_distribution">http://en.wikipedia.org/wiki/Cauchy_distribution</a></p>
<p>[iv] See the <u>Wikipedia</u> article “List of Probability Distributions” at <a href="http://en.wikipedia.org/wiki/List_of_probability_distributions">http://en.wikipedia.org/wiki/List_of_probability_distributions</a></p>
<p>[v] See the <u>Wikipedia</u> article “Normality Test” at <a href="http://en.wikipedia.org/wiki/Normality_test">http://en.wikipedia.org/wiki/Normality_test</a></p>
<p>[vi] See David, H. A.; Hartley, H. O. and Pearson, E. S., 1954, “The Distribution of the Ratio, in a Single Normal Sample, of Range to Standard Deviation,” pp. 482-493 in <u>Biometrika</u>, December 1954. Vol. 41, No. 3/4. I found the .pdf at the web address <a href="http://webspace.ship.edu/pgmarr/Geo441/Readings/David%20et%20al%201954%20-%20The%20Distribution%20of%20the%20Ratio%20of%20Range%20to%20Standard%20Deviation.pdf">http://webspace.ship.edu/pgmarr/Geo441/Readings/David%20et%20al%201954%20-%20The%20Distribution%20of%20the%20Ratio%20of%20Range%20to%20Standard%20Deviation.pdf</a> but it is apparently also available online at the JSTOR web address <a href="http://www.jstor.org/stable/2332728">http://www.jstor.org/stable/2332728</a>. I consulted other sources as well, like Dixon, W.J., 1950, “Analysis of Extreme Values,” pp. 488-506 in <u>The Annals of Mathematical Statistics</u>. Vol. 21, No. 4. Available online at the Project Euclid web address <a href="http://projecteuclid.org/euclid.aoms/1177729747">http://projecteuclid.org/euclid.aoms/1177729747</a> and Pearson, E.S. and Stephens, M. A., 1964, “The Ratio Of Range To Standard Deviation In The Same Normal Sample,” pp. 484-487 in <u>Biometrika</u>, December 1964. Vol. 51, No. 3/4. Published online at the JSTOR web address <a href="http://www.jstor.org/discover/10.2307/2334155?uid=2129&uid=2&uid=70&uid=4&sid=21105004560473">http://www.jstor.org/discover/10.2307/2334155?uid=2129&uid=2&uid=70&uid=4&sid=21105004560473</a></p>
<p>[vii] I verified the internal calculations against the eight-value example at the <u>MathBits.com </u>page “Mean Absolute Deviation,” which is available at the web address <a href="http://mathbits.com/MathBits/TISection/Statistics1/MAD.html" rel="nofollow">http://mathbits.com/MathBits/TISection/Statistics1/MAD.html</a></p>Outlier Detection with SQL Server, part 8: A T-SQL Hack for Mahalanobis Distance
http://www.sqlservercentral.com/blogs/multidimensionalmayhem/2015/09/12/outlier-detection-with-sql-server-part-8-a-t-sql-hack-for-mahalanobis-distance/
Sat, 12 Sep 2015 06:21:49 UT/blogs/multidimensionalmayhem/2015/09/12/outlier-detection-with-sql-server-part-8-a-t-sql-hack-for-mahalanobis-distance/0http://www.sqlservercentral.com/blogs/multidimensionalmayhem/2015/09/12/outlier-detection-with-sql-server-part-8-a-t-sql-hack-for-mahalanobis-distance/#comments<p><strong>By Steve Bolton</strong></p>
<p><span style="font-size:10pt;color:white;">…………</span>Longer code and substantial performance limitations were the prices we paid in return for greater sophistication with Cook’s Distance, the topic of the last article in this series of amateur self-tutorials on identifying outliers with SQL Server. The same tradeoff was even more conspicuous in this final installment – until I stumbled across a shortcut to coding Mahalanobis Distance that really saved my bacon. The incredibly cool moniker sounds intimidating, but the concepts and code required to implement it are trivial, as long as we sidestep the usual matrix math that ordinarily makes it prohibitively expensive to run on “Big Data”-sized tables. It took quite a while for me to blunder into a suitable workaround, but it was worthwhile, since Mahalanobis Distance merits a special place in the pantheon of outlier detection methods, by virtue of the fact that it is uniquely suited to certain use cases. Like Cook’s Distance, it can be used to find outliers defined by more than one column, which automatically puts both in a league the other methods surveyed in this series can’t touch; their competitors are typically limited to flagging values at the extreme low or high end of a single column, whereas Cook’s D and Mahalanobis Distance can flag unusual intermediate values defined across two columns. The latter, however, can also be extended to more than two columns. Better yet, it also accounts for distortions introduced by variance into the distances between data points, by renormalizing them on a consistent scale that is in many cases equivalent to ordinary Euclidean Distance. 
Best of all, efficient approximations can be derived through a shortcut that renders all of the complex matrix math irrelevant; since the goal in outlier detection is mainly to serve as an alarm bell to draw attention to specific data points that might warrant human intervention, we can sacrifice a little accuracy in return for astronomical performance gains.<br />
<span style="font-size:10pt;color:white;">…………</span>For both the cool name and the even cooler multidimensional outlier detection capabilities, we can thank Prasanta Chandra Mahalanobis (1893-1972), who was born in Calcutta at a time when Gandhi, Mother Teresa, distant tech support call centers, Bangladesh and the other things Westerners associate today with the region were still in the far-flung future. He and his grandfather may have acted as moderating influences in Brahmo Samaj, a 19<sup>th</sup> Century offshoot of Hinduism that has since apparently died out in Bangladesh; later in life he “served as secretary to Rabindranath Tagore, particularly during the latter’s foreign travels.”[i] Some of that polymath’s brilliance must have rubbed off, because Mahalanobis responded to a dilemma he encountered while trying to compare skull sizes by inventing an entirely new measure of similarity, which can be adapted to finding outliers based on how unalike they are. It has many applications, but for our purposes it is most appropriate for finding multidimensional outliers. If you want to find out how unusual a particular value is for a particular column, any of the detection methods presented earlier in this series may suffice, if all of their particular constraints are taken into account – save for Cook’s Distance, which is a comparison between two columns. Mahalanobis Distance takes the multicolumn approach one step further and represents one of the few means available for finding out whether a particular data point is unusual when compared to several columns at the same time.</p>
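For readers who want to see what the matrix math boils down to in the simplest multicolumn case, here is a pure-Python sketch of the two-column Mahalanobis Distance, with the 2×2 inverse covariance matrix expanded by hand rather than computed through matrix routines. The function name and sample data are my own, for illustration only:

```python
import statistics

def mahalanobis_2d(point, data):
    """Mahalanobis distance of an (x, y) point from the center of a
    two-column dataset, with the 2x2 inverse covariance matrix
    written out by hand instead of via matrix routines."""
    xs = [p[0] for p in data]
    ys = [p[1] for p in data]
    n = len(data)
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    vx, vy = statistics.variance(xs), statistics.variance(ys)
    cov = sum((x - mx) * (y - my) for x, y in data) / (n - 1)
    det = vx * vy - cov * cov              # determinant of the covariance matrix
    dx, dy = point[0] - mx, point[1] - my
    # quadratic form d' * inverse(Cov) * d, expanded for the 2x2 case
    return ((vy * dx * dx - 2.0 * cov * dx * dy + vx * dy * dy) / det) ** 0.5

sample = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
corner = mahalanobis_2d((1.0, 1.0), sample)   # a corner of the point cloud
center = mahalanobis_2d((0.5, 0.5), sample)   # the center scores zero
```

With more columns the same quadratic form simply grows more terms, which is where the matrix notation starts to earn its keep.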
<p style="text-align:center;"><strong>The Litmus Test: Comparing Outliers to the Shape of Your Data</strong></p>
<p> Think of it this way: instead of measuring the distance of a single data point or mean to a standard deviation or variance, as we do so often in statistics, we’re measuring several variables against an entire multidimensional matrix of multiple columns, as well as the variances, covariances and averages associated with them. These extra columns allow us to compare our data points against a shape that is more geometrically complex than the single center point defined by a simple average or median. That is why Mahalanobis Distance is intimately related to the field of Principal Components Analysis, i.e. the study of various axes that make up a multidimensional dataset. The metric also has distinctive interrelationships with the regression metric known as leverage[ii], the normal distribution (i.e. the bell curve)[iii], Fisher Information and Beta and Chi-Squared distributions[iv] that are still far above my head, but I was able to explain it to myself crudely in this way: the metric measures how many standard deviations[v] a data point is from a set of center points for all of the columns under consideration, which taken together form an ellipsoid[vi] that is transformed into a circle by the constituent mathematical operations. Don’t let the big word ellipsoid fool you, because it’s actually quite obvious that any normal scatter plot of data points will form a cloud in the shape of an irregular curve around the center of the dataset. It’s also obvious that the center will have a more complex shape than when we use a single variable, since if we have three means or medians taken from three columns we could make a triangle out of them, or a four-sided shape like a rectangle from similar measures applied to four columns, and so on; the only difference is that we have to do some internal transformations to the shape, which need not concern us. 
Suppose, for example, that you wanted to discover if a particular sound differs from the others in a set by its pitch; in this case, you could simply use a typical unidimensional outlier detection method that merely examines the values recorded for the pitch. You could get a more complete picture, however, by taking into account other variables like the length of the song it belongs to, the type of instrument that produced it and so on.<br />
<span style="font-size:10pt;color:white;">…………</span>The price of this more subtle and flexible means of outlier detection would be quite high in terms of both performance and code maintenance, if our implementation consisted of translating the standard matrix math notation into T-SQL. I programmed an entire series of T-SQL matrix procedures to do just that, which seemed to perform reasonably well and with greater accuracy than the method in Figure 1 – until I hit the same SQL Server internals barrier I did with Cook’s Distance in the last article. To put it bluntly, we can’t use recursive calls to table-valued parameters to implement this sort of thing, because we’ll hit the internal limit of 32 locking classes rather quickly, leading to “No more lock classes available from transaction” errors. This long-standing limitation is <a href="https://connect.microsoft.com/SQLServer/feedback/details/589999/error-no-more-lock-classes-available-from-transaction-with-simple-recursion-in-table-valued-function">by design</a>, plus there are apparently no plans to change it and no widely publicized workarounds, so that’s pretty much the end of the line (unless factorization methods and similar matrix math workarounds I’m not familiar with might do the trick).</p>
<p style="text-align:center;"><strong>Bypassing the Matrix Math</strong></p>
<p> I had to put Mahalanobis Distance on the back burner for months until I stumbled across a really simple version expressed in ordinary arithmetic notation (summation operators, division symbols, that sort of thing) rather than matrix operations like transposes and inverses. Unfortunately, I can’t remember where I found this particular formula to give adequate credit, but it allowed me to chop at least two lengthy paragraphs out of this article, which I had originally included to explain the inner workings of the gigantic matrix math routines I wrote; otherwise, I might’ve set a SQL Server community record for the lengthiest T-SQL sample code ever crammed into a single blog post. Instead, I present Figure 1, which is short enough to be self-explanatory; the format ought to be familiar to anyone who’s been following this series, since it features similar parameter names and dynamic SQL operations. The good news is that the results derived from the Duchennes muscular dystrophy dataset I’ve been using for practice data throughout this series aren’t substantially different from those derived through the matrix math method. There are indeed discrepancies, but this approximation is good enough to get the job done without any noteworthy performance hit at all.<br />
<span style="font-size:10pt;color:white;">…………</span>Keep in mind that the results of outlier detection methods are rarely fed into other calculations for further refinement, so perfect accuracy is not mandatory as it might be with hypothesis testing or many other statistical applications. The point of all of the T-SQL sample code in this series is to automate the detection of outliers, whereas their handling requires human intervention; all we’re doing is flagging records for further attention, so that experts with domain knowledge can cast trained eyes upon them, looking for relevant patterns or perhaps evidence of data quality problems. The goal is inspection, not perfection. A few years back I read some articles on how the quality of being “good enough” is affecting software economics (although I can’t rustle up the citations for those either) and this hack for Mahalanobis Distance serves as a prime example. It’s not as pretty as a complete T-SQL solution that matches the more common matrix formula exactly, but it serves the purposes of end users just as well – or perhaps even better, considering the short code is more easily maintained and the performance is stellar. This sample code runs in about 20 milliseconds on my desktop computer (which could hardly be confused with a real server), compared to 19 for the Cook’s D procedure in the last tutorial. The cool thing is that it scales much better. My implementation of Cook’s D can’t be run at all on the Higgs Boson Dataset I’ve been using to stress-test my code in this series[vii], because the regression stats would have to be recalculated for each of the 11 million rows, thereby leading to exponential running times and the need to store 121 trillion regression rows in TempDB. That’s not happening on anyone’s server, let alone my wheezing Frankenstein of a desktop. My Mahalanobis hack ran in a respectable 3:43 on the same Higgs Boson data. 
The lesson I learned from coding Mahalanobis and Cook’s Distances in T-SQL is that arithmetic formulas ought to be preferred to ones defined in matrix notation, whenever possible, even if that entails resorting to approximations of this kind. The difficulty consists of finding them, perhaps hidden in the back of some blog post or journal article in the dark corners of the Internet.</p>
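Stripped of the dynamic SQL plumbing, the arithmetic shortcut in Figure 1 reduces to a one-liner per row: square the gap between the two columns, rescale it by the first column’s variance, and take the square root. Here is a hedged Python sketch of that formula (the function name and sample columns are invented for illustration, not taken from the procedure):

```python
import statistics

def mahalanobis_shortcut(col1, col2):
    """The two-column arithmetic shortcut: for each row,
    sqrt((col1 - col2)^2 / Var(col1)) -- the gap between the columns,
    rescaled by the first column's variance."""
    var1 = statistics.variance(col1)
    return [((a - b) ** 2 / var1) ** 0.5 for a, b in zip(col1, col2)]

col1 = [1.0, 2.0, 3.0, 4.0, 5.0]
col2 = [1.1, 2.0, 2.9, 4.0, 10.0]   # the last row disagrees sharply
distances = mahalanobis_shortcut(col1, col2)
```

The row whose columns disagree the most receives the largest distance, which is all an outlier alarm bell really needs to deliver.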
<p><strong><u>Figure 1: Code for the Mahalanobis Distance Procedure<br />
</u></strong><span style="font-size:9.5pt;font-family:Consolas;color:blue;">CREATE</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:blue;">PROCEDURE</span> <span class="SpellE"><span style="color:teal;">Calculations</span><span style="color:gray;">.</span><span style="color:teal;">OutlierDetectionMahalanobisDistanceSP<br />
</span></span></span><span style="font-size:9.5pt;font-family:Consolas;color:teal;">@Database1</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:blue;">nvarchar</span><span style="color:gray;">(</span>128<span style="color:gray;">),</span> <span style="color:teal;">@Schema1</span><span> </span><span style="color:blue;">nvarchar</span><span style="color:gray;">(</span>128<span style="color:gray;">),</span> <span style="color:teal;">@Table1</span><span> </span><span style="color:blue;">nvarchar</span><span style="color:gray;">(</span>128<span style="color:gray;">),</span> <span style="color:teal;">@PrimaryKeyName</span> <span style="color:blue;">AS</span> <span style="color:blue;">nvarchar</span><span style="color:gray;">(</span>400<span style="color:gray;">),</span> <span style="color:teal;">@Column1</span> <span style="color:blue;">AS</span> <span style="color:blue;">nvarchar</span><span style="color:gray;">(</span>128<span style="color:gray;">),</span> <span style="color:teal;">@Column2</span> <span style="color:blue;">AS</span> <span style="color:blue;">nvarchar</span><span style="color:gray;">(</span>128<span style="color:gray;">)<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:blue;">AS</span></p>
<p class="MsoNormal"><span style="font-size:9.5pt;font-family:Consolas;color:blue;">DECLARE</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:teal;">@SchemaAndTable1</span> <span class="GramE"><span style="color:blue;">nvarchar</span><span style="color:gray;">(</span></span>400<span style="color:gray;">),</span><span style="color:teal;">@<span class="SpellE">SQLString </span></span><span style="color:blue;">nvarchar</span><span style="color:gray;">(</span><span style="color:fuchsia;">max</span><span style="color:gray;">)<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:blue;">SET</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:teal;">@SchemaAndTable1</span> <span style="color:gray;">=</span> <span style="color:teal;">@Database1</span> <span style="color:gray;">+</span> <span style="color:red;">'.'</span> <span style="color:gray;">+</span> <span style="color:teal;">@Schema1</span> <span style="color:gray;">+</span> <span style="color:red;">'.'</span> <span style="color:gray;">+</span> <span style="color:teal;">@Table1</span> </span></p>
<p class="MsoNormal"><span style="font-size:9.5pt;font-family:Consolas;color:blue;">SET</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:teal;">@<span class="SpellE">SQLString</span></span> <span class="GramE"><span style="color:gray;">=</span> <span style="color:red;">'</span></span></span><span style="font-size:9.5pt;font-family:Consolas;color:red;">DECLARE @Var1 <span class="GramE">decimal(</span>38,32)</span></p>
<p class="MsoNormal"><span style="font-size:9.5pt;font-family:Consolas;color:red;">SELECT @Var1 = <span class="SpellE"><span class="GramE">Var</span></span><span class="GramE">(</span>CAST('</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:gray;">+</span> <span style="color:teal;">@Column1</span> <span style="color:gray;">+</span> <span style="color:red;">' AS decimal(38,32)))<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:red;">FROM '</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:gray;">+</span> <span style="color:teal;">@SchemaAndTable1 </span><span class="GramE"><span style="color:gray;">+</span> <span style="color:red;">'<br />
</span></span></span><span style="font-size:9.5pt;font-family:Consolas;color:red;">WHERE '</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:gray;">+</span> <span style="color:teal;">@Column1 </span><span style="color:gray;">+</span> <span style="color:red;">' IS NOT NULL AND ' </span><span style="color:gray;">+</span> <span style="color:teal;">@Column2</span> <span style="color:gray;">+</span> <span style="color:red;">' IS NOT NULL</span></span></p>
<p class="MsoNormal"><span style="font-size:9.5pt;font-family:Consolas;color:red;">SELECT ‘</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:gray;">+</span> <span style="color:teal;">@PrimaryKeyName </span><span style="color:gray;">+</span> <span style="color:red;">‘ AS <span class="SpellE">PrimaryKey</span>, ‘</span> <span style="color:gray;">+</span> <span style="color:teal;">@Column1</span> <span style="color:gray;">+</span> <span style="color:red;">‘ AS Value1, ‘</span> <span style="color:gray;">+</span> <span style="color:teal;">@Column2</span> <span style="color:gray;">+</span> <span style="color:red;">‘ AS Value2,<br />
</span></span><span class="GramE"><span style="font-size:9.5pt;font-family:Consolas;color:red;">Power(</span></span><span style="font-size:9.5pt;font-family:Consolas;color:red;">Power(‘</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:gray;">+</span> <span style="color:teal;">@Column1</span> <span style="color:gray;">+</span> <span style="color:red;">‘ – ‘</span> <span style="color:gray;">+</span> <span style="color:teal;">@Column2</span> <span style="color:gray;">+</span> <span style="color:red;">‘, 2) / (@Var1), 0.5) AS <span class="SpellE">MahalanobisDistance<br />
</span></span></span><span style="font-size:9.5pt;font-family:Consolas;color:red;">FROM ‘</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:gray;">+</span> <span style="color:teal;">@SchemaAndTable1 </span><span class="GramE"><span style="color:gray;">+</span> <span style="color:red;">‘<br />
</span></span></span><span style="font-size:9.5pt;font-family:Consolas;color:red;">WHERE ‘</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:gray;">+</span> <span style="color:teal;">@Column1 </span><span style="color:gray;">+</span> <span style="color:red;">‘ IS NOT NULL AND ‘ </span><span style="color:gray;">+</span> <span style="color:teal;">@Column2</span> <span style="color:gray;">+</span> <span style="color:red;">‘ IS NOT NULL<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:red;">ORDER BY <span class="SpellE">MahalanobisDistance</span> DESC’</span></p>
<p class="MsoNormal"><span style="font-size:9.5pt;font-family:Consolas;color:green;">--SELECT @<span class="SpellE">SQLString</span> -- <span class="GramE">uncomment this </span>to debug dynamic SQL errors</span></p>
<p class="MsoNormal"><span style="font-size:9.5pt;font-family:Consolas;color:blue;">EXEC </span><span style="font-size:9.5pt;font-family:Consolas;color:gray;">(</span><span style="font-size:9.5pt;font-family:Consolas;color:teal;">@<span class="SpellE">SQLString</span></span><span style="font-size:9.5pt;font-family:Consolas;color:gray;">)</span></p>
<p><strong><u>Figure 2: Results for the Mahalanobis Distance Procedure</u></strong></p>
<p class="MsoNormal"><span style="font-size:9.5pt;font-family:Consolas;color:blue;">EXEC</span><span style="font-size:9.5pt;font-family:Consolas;"> <span class="SpellE"><span style="color:teal;">Calculations</span><span style="color:gray;">.</span><span style="color:teal;">OutlierDetectionMahalanobisDistanceSP<br />
</span></span></span><span style="font-size:9.5pt;font-family:Consolas;color:blue;"><span> </span></span><span style="font-size:9.5pt;font-family:Consolas;color:teal;">@Database1</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:gray;">=</span> <span class="SpellE"><span style="color:red;">N'DataMiningProjects</span></span><span style="color:red;">'</span><span style="color:gray;">,<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;"><span> </span><span style="color:teal;">@Schema1 </span><span style="color:gray;">=</span> <span class="SpellE"><span style="color:red;">N'Health</span></span><span style="color:red;">'</span><span style="color:gray;">,<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;"><span> </span><span style="color:teal;">@Table1 </span><span style="color:gray;">=</span> <span class="SpellE"><span style="color:red;">N'DuchennesTable</span></span><span style="color:red;">'</span><span style="color:gray;">,<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;"><span> </span><span style="color:teal;">@PrimaryKeyName </span><span style="color:gray;">=</span> <span style="color:red;">N'ID'</span><span style="color:gray;">,</span><br />
</span><span style="font-size:9.5pt;font-family:Consolas;"><span> </span><span style="color:teal;">@Column1 </span><span style="color:gray;">=</span> <span class="SpellE"><span style="color:red;">N'CreatineKinase</span></span><span style="color:red;">'</span><span style="color:gray;">,</span><br />
</span><span style="font-size:9.5pt;font-family:Consolas;"><span> </span><span style="color:teal;">@Column2 </span><span style="color:gray;">=</span> <span style="color:red;">'Hemopexin'</span></span></p>
<p><a href="https://multidimensionalmayhem.files.wordpress.com/2015/09/quick-version-of-mahalanobis-distance.jpg"><img class="alignnone size-full wp-image-516" src="https://multidimensionalmayhem.files.wordpress.com/2015/09/quick-version-of-mahalanobis-distance.jpg?w=604" alt="Quick Version of Mahalanobis Distance" /></a></p>
<p><span style="font-size:10pt;color:white;">…………</span>There are still some issues left to be worked out with this approximation of Mahalanobis Distance. I’m not yet sure under which conditions we can expect better accuracy, or conversely greater discrepancies from the matrix version. I know Mahalanobis Distance can also be extended to more than two columns, unlike Cook’s D, but I have yet to engineer a solution. Moreover, I have yet to wrap my head around all of the subtle cases where Mahalanobis is less applicable; for example, it apparently isn’t as appropriate when the relationships are nonlinear.[viii] As I’ve come to realize through reading up on statistical fallacies, these tricky situations can make all of the difference in the world between a mining model that helps end users to make informed decisions and one that can mislead them into disastrous mistakes. Deriving the numbers certainly isn’t easy, but it is even harder to attach them to the wider scaffolding of hard logic in a meaningful way. As many statisticians themselves decry, that is precisely where a lot of science and public discourse go awry. Thankfully, these issues aren’t life-and-death matters in outlier detection, where the goal is to act as an alarm bell to alert decision-makers, rather than as a decision in and of itself; as I’ve pointed out throughout this series ad infinitum, ad nauseam, these detection methods only tell us that a data point is aberrant, but say nothing about <em>why</em>. This is why knee-jerk reactions like simply deleting outliers are not only unwarranted, but can be and often are used for deceptive purposes, particularly when money, reputations and sacred cows are on the line. The frequency with which this sort of chicanery still happens is shocking, as I mentioned earlier in the series. 
As I’ve learned along the way, perhaps the second-most critical problem dogging outlier detection is the lack of methods capable of dealing with “Big Data”-sized databases, or even the moderately sized tables of a few thousand or a few million rows that we see routinely in SQL Server. Most of them simply choke, and a few become invalid or incalculable. It might be useful to develop new ones more suited to these use cases, or track down and popularize any that might’ve already been published long ago in the math literature.<br />
<span style="font-size:10pt;color:white;">…………</span>Despite such subtle interpretive risks, Mahalanobis Distance is the only statistic I’m aware of in my minimal experience that can be applied to the case of multidimensional outliers, beyond the two columns Cook’s D is limited to. In this capacity it acts as a dissimilarity measure, but can also be used for the converse purpose as a measure of similarity. Its scale-invariance and status as a “dimensionless quantity,” i.e. a pure number attached to no particular system of unit measurement, apparently have their advantages as well.[ix] It can be used in other capacities in data mining, such as in feature selection in Bayesian analysis.[x] I don’t necessarily understand a good share of the data mining and machine learning literature I’ve read to date, but can tell by the frequency with which it crops up that Mahalanobis Distance has diverse uses beyond mere outlier detection. In a future mistutorial series, I intend to demonstrate just how little I know about several dozen other metrics commonly used in the field of data mining, like Shannon’s Entropy, Bregman’s Divergence, the Akaike Information Criterion, Sørensen’s Similarity Index and the Lyapunov Exponent. I’ll also include a whole segment on probabilistic applications of distance measures, such as the popular Kullback-Leibler Divergence, all of which turned out to be easier to code and explain than Cook’s D and Mahalanobis Distance. It only gets easier from here on in, at least in terms of common distance measures. I have no timetable for finishing the dozens of such metrics I intend to survey (if all goes according to plan, I will be posting data mining code on this blog for many years to come) but by the time I’m finished with the series tentatively titled Information Measurement with SQL Server, it should be easier than ever before to quantify just how much information there is in every table. 
We’ll also be able to measure such properties as randomness among a column’s values. Before diving into it, however, I might post a quick wrap-up of this series that includes a makeshift use diagram that classifies all of the outlier detection methods covered in this series, as well as a makeshift method of detecting interstitial outliers that I cooked up to meet some specific use cases, one that allowed me to spot a data quality issue in my own databases. I’ll also take a quick detour into coding goodness-of-fit tests in SQL Server, since these seem to have quite a bit of overlap with some of the outlier detection methods mentioned earlier in this series. Knowing which probability distribution one is dealing with can sometimes tell us an awful lot about the underlying processes that produced it, so these tests can be indispensable tools in DIY data mining on SQL Server.</p>
<p>[i] I glanced at the biography at the Wikipedia page “Prasanta Chandra Mahalanobis,” at the web address <a href="http://en.wikipedia.org/wiki/Prasanta_Chandra_Mahalanobis">http://en.wikipedia.org/wiki/Prasanta_Chandra_Mahalanobis</a></p>
<p>[ii] See the <u>Wikipedia</u> page “Mahalanobis Distance” at <a href="http://en.wikipedia.org/wiki/Mahalanobis_distance">http://en.wikipedia.org/wiki/Mahalanobis_distance</a></p>
<p>[iii] <em>IBID.</em></p>
<p>[iv] “When an infinite training set is used, the Mahalanobis distance between a pattern measurement vector of dimensionality D and the center of the class it belongs to is distributed as a chi2 with D degrees of freedom. However, the distribution of Mahalanobis distance becomes either Fisher or Beta depending on whether cross validation or resubstitution is used for parameter estimation in finite training sets. The total variation between chi2 and Fisher, as well as between chi2 and Beta, allows us to measure the information loss in high dimensions. The information loss is exploited then to set a lower limit for the correct classification rate achieved by the Bayes classifier that is used in subset feature selection.” Ververidis, D. and Kotropoulos, C., 2009, “Information Loss of the Mahalanobis Distance in High Dimensions: Application to Feature Selection,” pp. 2275-2281 in IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 31, No. 12. See the abstract available at <a href="http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=4815271&url=http%3A%2F%2Fieeexplore.ieee.org%2Fiel5%2F34%2F5291213%2F04815271.pdf%3Farnumber%3D4815271">http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=4815271&url=http%3A%2F%2Fieeexplore.ieee.org%2Fiel5%2F34%2F5291213%2F04815271.pdf%3Farnumber%3D4815271</a></p>
<p>[v] “One interesting feature to note from this figure is that a Mahalanobis distance of 1 unit corresponds to 1 standard deviation along both primary axes of variance.” See the <u>Jenness Enterprises</u> webpage titled “Description” at <a href="http://www.jennessent.com/arcview/mahalanobis_description.htm">http://www.jennessent.com/arcview/mahalanobis_description.htm</a>.</p>
<p>[vi] See the post by jjepsuomi titled “Distance of a Test Point from the Center of an Ellipsoid,” published Jun 24, 2013 in the StackExchange Mathematics Forum, as well as the reply by Avitus on the same date. Available online at <a href="http://math.stackexchange.com/questions/428064/distance-of-a-test-point-from-the-center-of-an-ellipsoid">http://math.stackexchange.com/questions/428064/distance-of-a-test-point-from-the-center-of-an-ellipsoid</a>. Also see jjepsuomi’s post titled “Bottom to Top Explanation of the Mahalanobis Distance,” published June 19, 2013 in the <u>CrossValidated</u> forums. Available online at <a href="http://stats.stackexchange.com/questions/62092/bottom-to-top-explanation-of-the-mahalanobis-distance">http://stats.stackexchange.com/questions/62092/bottom-to-top-explanation-of-the-mahalanobis-distance</a>. The folks at <u>CrossValidated</u> gave me some help on Aug. 14, 2014 with these calculations in the thread titled “Order of Matrix Operations in Mahalanobis Calculations,” which can be found at <a href="http://stats.stackexchange.com/questions/111871/order-of-matrix-operations-in-mahalanobis-calculations" rel="nofollow">http://stats.stackexchange.com/questions/111871/order-of-matrix-operations-in-mahalanobis-calculations</a></p>
<p>[vii] I downloaded this dataset from the <a href="https://archive.ics.uci.edu/ml/datasets/HIGGS">University of California at Irvine’s Machine Learning Repository</a> a long time ago and converted it into a SQL Server table of about 6 gigabytes.</p>
<p>[viii] See Rosenmai, Peter, 2013, “Using Mahalanobis Distance to Find Outliers,” posted Nov. 25, 2013 at the <u>EurekaStatistics.com </u>web address <a href="http://eurekastatistics.com/using-mahalanobis-distance-to-find-outliers">http://eurekastatistics.com/using-mahalanobis-distance-to-find-outliers</a></p>
<p>[ix] See the <u>Wikipedia</u> pages “Mahalanobis Distance” and “Scale Invariance” at <a href="http://en.wikipedia.org/wiki/Mahalanobis_distance">http://en.wikipedia.org/wiki/Mahalanobis_distance</a> and <a href="http://en.wikipedia.org/wiki/Scale_invariance">http://en.wikipedia.org/wiki/Scale_invariance</a></p>
<p>[x] See Ververidis and Kotropoulos.</p>
Outlier Detection with SQL Server, part 7: Cook’s Distance
http://www.sqlservercentral.com/blogs/multidimensionalmayhem/2015/08/24/outlier-detection-with-sql-server-part-7-cooks-distance/
Tue, 25 Aug 2015 04:58:21 UT/blogs/multidimensionalmayhem/2015/08/24/outlier-detection-with-sql-server-part-7-cooks-distance/2http://www.sqlservercentral.com/blogs/multidimensionalmayhem/2015/08/24/outlier-detection-with-sql-server-part-7-cooks-distance/#comments<p><strong>By Steve Bolton</strong></p>
<p><span style="font-size:10pt;color:white;">…………</span>I originally intended to save Cook’s and Mahalanobis Distances to close out this series not only because the calculations and concepts are more difficult yet worthwhile to grasp, but also in part to serve as a bridge to a future series of tutorials on using information measures in SQL Server, including many other distance measures. The long and short of it is that since I’m learning these topics as I go, I didn’t know what I was getting myself into and ended up finishing almost all of the other distance measures before Cook’s and Mahalanobis. Like the K-Means algorithm I recapped in the last tutorial and had already explained in greater depth in <a href="https://multidimensionalmayhem.wordpress.com/2013/02/12/a-rickety-stairway-to-sql-server-data-mining-algorithm-7-clustering/">A Rickety Stairway to SQL Server Data Mining, Algorithm 7: Clustering</a>, these two are intimately related to ordinary Euclidean Distance, so how hard could they be? Some other relatively common outlier detection methods are also based on K-Means relatives (like K-Nearest Neighbors), and thus ultimately on Euclidean Distance, so I won’t belabor the point by delving into them further here. There are also distance metrics in use today that are based on mind-bending alternative systems like affine, hyperbolic, elliptic and kinematic geometries in which these laws do not necessarily hold, after relaxing some of the Euclidean postulates; for example, the affine type of non-Euclidean geometry is useful in studying parallel lines, while the hyperbolic version is useful with circles.[1] Some of them are exotic, but others are quite useful in DIY data mining, as we shall see in a whole segment on probabilistic distances (like the Kullback-Leibler Divergence) in that future mistutorial series. 
What tripped me up in the case of Cook’s and Mahalanobis is that the most common versions of both rely on matrix math, which can present some unexpected stumbling blocks in SQL Server. In both cases I had to resort to alternative formulas, after running into performance and accuracy issues using the formulas based on standard notation. They’re entirely worthwhile to code in T-SQL, because they occupy an important niche in the spectrum of outlier detection methods. All of the methods introduced in this series allow us to automatically flag outliers for further inspection, which can be quite useful for ferreting out data quality issues, finding novel patterns and the like in large databases – where we don’t want to go around deleting or segregating thousands of records without some kind of intelligent examination first. Cook’s and Mahalanobis, however, stand out because they’re among the few standard ways of finding aberrant data points defined by more than one column. This also makes them capable of detecting unusual two-column values in cases where neither column is at the extreme low or high end, although that doesn’t happen often. These outlier detection methods are thus valuable to have on hand, despite the fact that “Cook’s D,” as it is often known, is still prohibitively costly to run on “Big Data”-sized databases, unlike my workaround for Mahalanobis Distance. The “D” may stand for “disappointing,” although it can still be quite useful on small and medium-sized datasets.<br />
<span style="font-size:10pt;color:white;">…………</span>Cook’s Distance is suitable as the next stepping stone because we can not only draw upon the concept of distances between data points drawn from the K-Means version of the SSDM Clustering algorithm, but also make use of the lessons learned in <a href="https://multidimensionalmayhem.wordpress.com/2013/01/08/a-rickety-stairway-to-sql-server-data-mining-algorithm-2-linear-regression/">A Rickety Stairway to SQL Server Data Mining, Algorithm 2: Linear Regression</a>. Like so many other metrics discussed in this series, it made its debut in the American Statistical Association journal <em>Technometrics</em>, in this case in a paper published in 1977 by University of Minnesota statistician R. Dennis Cook, which I was fortunate enough to find a copy of.[2] The underlying equation[3] is not necessarily trivial, but the concepts underpinning it really shouldn’t be too difficult for anyone who can already grasp the ideas underpinning regression and Z-Scores, which have been dealt with in previous posts. I found it helpful to view some of the difference operations performed in Cook’s Distance (and the Mean Square Error (MSE) it depends upon) as a sort of twist on Z-Scores, in which we subtract data points from the data points predicted by a simple regression, rather than data points from the mean as we would in the deviation calculations that Z-Scores depend upon. 
After deriving each of these differences, we square them and sum them – just as we would in many other outlier detection calculations performed earlier in this tutorial series – then finally multiply by the reciprocal of the count.[4] The deviation calculation in the dividend of a Z-Score can in fact be seen as a sort of crude distance metric in its own right, in which we are measuring how far each data point is from the center of a dataset, as defined in a mean or median; in the MSE, we are performing a similar distance comparison, except between a predicted value and actual value for a data point. To calculate Cook’s Distance we multiply the MSE by the count of parameters, i.e. the number of columns we’re predicting, which is limited to just one in my code for now. The result forms the divisor in the final calculations, but the dividend is more complex. Instead of comparing a prediction to an actual value, we recalculate a new prediction for each data point in which the regression has been recalculated with that specific data point omitted, then subtract the result from the prediction made for that data point by the full regression model with no points omitted. The dividend is formed by squaring each of those results and summing them, in a process quite similar to the calculation of MSE and Z-Scores. The end result is a measuring stick that we can compare two-column data points against, rather than just one as we have with all of the other outlier detection methods in this series.<br />
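The calculation just described can be sketched in ordinary code. The following Python rendition (with hypothetical helper names and made-up sample data, not the article’s datasets) brute-forces the leave-one-out refits exactly as described: a fresh regression is fit with each data point omitted, the squared shifts in the fitted values are summed to form the dividend, and the count of parameters times the MSE forms the divisor. I use the conventional n − p denominator for the MSE here, which may differ slightly from the procedure’s scaling.<br />

```python
def simple_fit(xs, ys):
    """Ordinary least squares with one predictor: returns (intercept, slope)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope

def cooks_distance(xs, ys):
    """Brute-force Cook's D: refit the model with each row deleted, sum the
    squared shifts in the fitted values, then divide by p * MSE."""
    n, p = len(xs), 2                       # p = count of fitted parameters
    b0, b1 = simple_fit(xs, ys)
    fitted = [b0 + b1 * x for x in xs]
    mse = sum((y - f) ** 2 for y, f in zip(ys, fitted)) / (n - p)
    distances = []
    for i in range(n):
        c0, c1 = simple_fit(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        shift = sum((f - (c0 + c1 * x)) ** 2 for x, f in zip(xs, fitted))
        distances.append(shift / (p * mse))
    return distances

# The last point is a deliberate outlier; its Cook's D dwarfs the others.
print(cooks_distance([1.0, 2.0, 3.0, 4.0, 10.0], [1.0, 2.0, 3.0, 4.0, 0.0]))
```

The nested refits are exactly the performance problem discussed below; this sketch is O(n²) in the row count, just as the T-SQL version is.<br />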
<span style="font-size:10pt;color:white;">…………</span>The difficulty in all of this is not the underlying concept, which is sound, but the execution, given that we have to recalculate an entirely new regression model for each data point. The dilemma is analogous to the one we faced in previous articles on the <a href="https://multidimensionalmayhem.wordpress.com/2015/02/14/outlier-detection-with-sql-server-part-3-5-the-modified-thompson-tau-test/">Modified Thompson Tau Test</a> and <a href="https://multidimensionalmayhem.wordpress.com/2015/02/28/outlier-detection-with-sql-server-part-3-6-chauvenets-criterion/">Chauvenet’s Criterion</a>, where we had to perform many of the computations recursively in order to recalculate the metrics after simulating the deletion of data points. Each of the difference operations we perform below tells us something about how important each record is within the final regression model, rather than how many outliers there might be if that record was eliminated, but it still presents a formidable performance problem. This drawback is magnified by the fact that we can’t use SQL Server’s exciting new windowing functions to solve the recursion issue with sliding windows, as we did in the articles on the Modified Thompson Tau test and Chauvenet’s Criterion. In fact, Step 5 in Figure 1 would be an ideal situation for the EXCLUDE CURRENT ROW clause that Itzik Ben-Gan, the renowned expert on SQL Server windowing functions, wants to see Microsoft add to the T-SQL specification.[5] As I discovered to my horror, you can’t combine existing clauses like ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING with ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING to get the same effect. As a result, I had to perform the recalculations of the regressions in a series of table variables that are much less readable and efficient than an EXCLUDE CURRENT ROW clause might be, albeit more legible than the last remaining alternative, a zillion nested subqueries. 
I’m not yet fluent enough in T-SQL to say if these table variables cause more of a performance impact than subqueries in contexts like this, but this is one case in which they’re appropriate because readability is at a premium. It may also be worthwhile to investigate temporary tables as a replacement; so far, this method does seem to be faster than the common table expression (CTE) method I originally tried. I initially programmed an entire series of matrix math functions and stored procedures to derive both Cook’s and Mahalanobis Distances, since both are often defined in terms of matrix math notation, unlike many other distances used for data mining purposes. That method worked well, except that it ran into a brick wall: SQL Server has an internal limitation of 32 locking classes, which often leads to “No more lock classes available from transaction” error messages with recursive table-valued parameters. This is <a href="https://connect.microsoft.com/SQLServer/feedback/details/589999/error-no-more-lock-classes-available-from-transaction-with-simple-recursion-in-table-valued-function">by design</a> and I have yet to see any workarounds posted or any glimmer of hope that Microsoft intends to ameliorate it in future upgrades, which means no matrix math using table-valued parameters for the foreseeable future.<br />
<span style="font-size:10pt;color:white;">…………</span>Yet another issue I ran into was interpreting the notation for Cook’s Distance, which can be arrived at from two different directions: the more popular method seems to be the series of calculations outlined two paragraphs above, but the same results can be had by first calculating an intermediate figure known as Leverage. This can be derived from what is known as a Hat Matrix, which is easy to compute in the course of calculating standard regression figures like MSE, predictions, residuals and the like. Unlike most other regression calculations, which are defined in terms of standard arithmetic operations like division, multiplication, etc., the notation for deriving Leverage is almost always given in terms of matrices, since it’s derived from a Hat Matrix. It took me a lot of digging to find an equivalent expression of Leverage in terms of arithmetic operations rather than matrix math, which I couldn’t use due to the locking issue. It was a bit like trying to climb a mountain using a map of the other side; I was able to easily code all of the stats in the @RegressionTable in Figure 1, alongside many other common regression figures, but couldn’t tell exactly which of them could be used to derive the Hat Matrix and Leverage from the opposite direction. As usual, the folks at CrossValidated (StackExchange’s data mining forum) saved my bacon.[6] While forcing myself to learn to code the intermediate building blocks of common mining algorithms in T-SQL, one of the most instructive lessons I’ve learned is that translating notations can be a real stumbling block, one that even professionals encounter. Just consider that a word to the wise, for anyone who tries to acquire the same skills from scratch as I’m attempting to do. Almost all of the steps in Figure 1 revolve around common regression calculations, i.e. 
intercepts, slopes, covariance and the like, except that fresh regression models are calculated for each row. The actual Cook’s Distance calculation isn’t performed until Step #6. At that point it was trivial to add a related stat known as DFFITS, which can be converted back and forth from Cook’s D; usually when I’ve seen DFFITS mentioned (in what little literature I’ve read), it’s in conjunction with Cook’s, which is definitely a more popular means of measuring the same quantity.[7] For the dividend, we use the difference between the prediction for each row and the prediction when that row is left out of the model; for the divisor, we use the standard deviation of the model when that row is omitted, times the square root of the leverage. I also included the StudentizedResidual and the global values for the intercept, slope and the like in the final results, since it was already necessary to calculate them along the way; it is trivial to calculate many other regression-related stats once we’ve derived these table variables, but I’ll omit them for the sake of brevity since they’re not directly germane to Cook’s Distance and DFFITS.</p>
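<p>Since tracking down an arithmetic (rather than matrix) expression for Leverage was the sticking point, it may help to spell one out: for a single predictor with an intercept, the diagonal of the Hat Matrix reduces to h = 1/n + (x − mean)² / Σ(x − mean)². The Python sketch below (hypothetical function names and made-up data, not the article’s procedure) derives Leverage that way and then computes DFFITS from brute-force deleted refits, per the description above.</p>

```python
import math

def leverage(xs):
    """Hat-matrix diagonal for one predictor plus intercept, in plain
    arithmetic: h_i = 1/n + (x_i - mean)^2 / sum((x_j - mean)^2)."""
    n = len(xs)
    mx = sum(xs) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    return [1.0 / n + (x - mx) ** 2 / sxx for x in xs]

def fit(xs, ys):
    """Ordinary least squares with one predictor: returns (intercept, slope)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - slope * mx, slope

def dffits(xs, ys):
    """DFFITS_i: the shift in row i's fitted value when row i is deleted,
    divided by the deleted-model standard deviation times sqrt(leverage)."""
    n, p = len(xs), 2
    hs = leverage(xs)
    b0, b1 = fit(xs, ys)
    results = []
    for i in range(n):
        xs_i, ys_i = xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:]
        c0, c1 = fit(xs_i, ys_i)
        s_i = math.sqrt(sum((y - (c0 + c1 * x)) ** 2
                            for x, y in zip(xs_i, ys_i)) / (n - 1 - p))
        results.append(((b0 + b1 * xs[i]) - (c0 + c1 * xs[i]))
                       / (s_i * math.sqrt(hs[i])))
    return results
```

<p>A handy sanity check on the Leverage half is that its values always sum to the number of fitted parameters, two in this case.</p>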
<p><strong><u>Figure 1: T-SQL Sample Code for the Cook’s Distance Procedure<br />
</u></strong><span style="font-size:9.5pt;font-family:Consolas;color:blue;">CREATE</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:blue;">PROCEDURE</span> <span class="SpellE"><span style="color:teal;">Calculations</span><span style="color:gray;">.</span><span style="color:teal;">CooksDistanceSP<br />
</span></span></span><span style="font-size:9.5pt;font-family:Consolas;color:teal;">@Database1</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:blue;">nvarchar</span><span style="color:gray;">(</span>128<span style="color:gray;">),</span> <span style="color:teal;">@Schema1</span> <span style="color:blue;">nvarchar</span><span style="color:gray;">(</span>128<span style="color:gray;">),</span> <span style="color:teal;">@Table1</span> <span style="color:blue;">nvarchar</span><span style="color:gray;">(</span>128<span style="color:gray;">),</span> <span style="color:teal;">@Column1</span> <span style="color:blue;">AS</span> <span style="color:blue;">nvarchar</span><span style="color:gray;">(</span>128<span style="color:gray;">),</span> <span style="color:teal;">@Column2</span> <span style="color:blue;">AS</span> <span style="color:blue;">nvarchar</span><span style="color:gray;">(</span>128<span style="color:gray;">)<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:blue;">AS</span></p>
<p class="MsoNormal"><span style="font-size:9.5pt;font-family:Consolas;color:blue;">DECLARE</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:teal;">@SchemaAndTable1</span> <span class="GramE"><span style="color:blue;">nvarchar</span><span style="color:gray;">(</span></span>400<span style="color:gray;">),</span><span style="color:teal;">@SQLString1</span> <span style="color:blue;">nvarchar</span><span style="color:gray;">(</span><span style="color:fuchsia;">max</span><span style="color:gray;">),</span><span style="color:teal;">@SQLString2</span> <span style="color:blue;">nvarchar</span><span style="color:gray;">(</span><span style="color:fuchsia;">max</span><span style="color:gray;">)<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:blue;">SET</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:teal;">@SchemaAndTable1</span> <span style="color:gray;">=</span> <span style="color:teal;">@Database1</span> <span style="color:gray;">+</span> <span style="color:red;">'.'</span> <span style="color:gray;">+</span> <span style="color:teal;">@Schema1</span> <span style="color:gray;">+</span> <span style="color:red;">'.'</span> <span style="color:gray;">+</span> <span style="color:teal;">@Table1</span><br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:blue;">SET</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:teal;">@SQLString1</span> <span class="GramE"><span style="color:gray;">=</span> <span style="color:red;">'</span></span></span><span style="font-size:9.5pt;font-family:Consolas;color:red;">DECLARE<br />
@<span class="SpellE">MeanX</span> <span class="GramE">decimal(</span>38,21),@<span class="SpellE">MeanY</span> decimal(38,21), </span><span style="font-size:9.5pt;font-family:Consolas;color:red;">@<span class="SpellE">StDevX</span> <span class="GramE">decimal(</span>38,21), @<span class="SpellE">StDevY</span> decimal(38,21), </span><span class="GramE"><span style="font-size:9.5pt;font-family:Consolas;color:red;">@Count bigint</span></span><span style="font-size:9.5pt;font-family:Consolas;color:red;">,<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;">@Correlation <span class="GramE">decimal(</span>38,21),<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;">@Covariance <span class="GramE">decimal(</span>38,21),<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;">@Slope <span class="GramE">decimal(</span>38,21),<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;">@Intercept <span class="GramE">decimal(</span>38,21),<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;">@<span class="SpellE">MeanSquaredError</span> <span class="GramE">decimal(</span>38,21),<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;">@<span class="SpellE">NumberOfFittedParameters</span> bigint</span></p>
<p class="MsoNormal"><span style="font-size:9.5pt;font-family:Consolas;color:red;">SET @<span class="SpellE">NumberOfFittedParameters</span> = 2<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;">DECLARE @<span class="SpellE">RegressionTable</span> table<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;">(ID bigint IDENTITY (1<span class="GramE">,1</span>),<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;">Value1 <span class="GramE">decimal(</span>38,21),<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;">Value2 <span class="GramE">decimal(</span>38,21),<br />
L</span><span class="SpellE"><span style="font-size:9.5pt;font-family:Consolas;color:red;">ocalSum</span></span><span style="font-size:9.5pt;font-family:Consolas;color:red;"> bigint,<br />
</span><span lang="ES" style="font-size:9.5pt;font-family:Consolas;color:red;">LocalMean1 decimal(38,21),<br />
</span><span lang="ES" style="font-size:9.5pt;font-family:Consolas;color:red;">LocalMean2 decimal(38,21),<br />
</span><span lang="ES" style="font-size:9.5pt;font-family:Consolas;color:red;">LocalStDev1 <span class="GramE">decimal(</span>38,21),<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;">LocalStDev2 <span class="GramE">decimal(</span>38,21),<br />
</span><span class="SpellE"><span style="font-size:9.5pt;font-family:Consolas;color:red;">LocalCovariance</span></span><span style="font-size:9.5pt;font-family:Consolas;color:red;"> <span class="GramE">decimal(</span>38,21),<br />
</span><span class="SpellE"><span style="font-size:9.5pt;font-family:Consolas;color:red;">LocalCorrelation</span></span><span style="font-size:9.5pt;font-family:Consolas;color:red;"> <span class="GramE">decimal(</span>38,21),<br />
</span><span class="SpellE"><span class="GramE"><span style="font-size:9.5pt;font-family:Consolas;color:red;">LocalSlope</span></span></span><span class="GramE"><span style="font-size:9.5pt;font-family:Consolas;color:red;"> AS </span></span><span style="font-size:9.5pt;font-family:Consolas;color:red;"><span class="SpellE">LocalCorrelation</span> * (LocalStDev2 / LocalStDev1),<br />
</span><span class="SpellE"><span style="font-size:9.5pt;font-family:Consolas;color:red;">LocalIntercept</span></span><span style="font-size:9.5pt;font-family:Consolas;color:red;"> <span class="GramE">decimal(</span>38,21),<br />
</span><span class="SpellE"><span style="font-size:9.5pt;font-family:Consolas;color:red;">PredictedValue</span></span><span style="font-size:9.5pt;font-family:Consolas;color:red;"> <span class="GramE">decimal(</span>38,21),<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;">Leverage <span class="GramE">decimal(</span>38,21),<br />
</span><span class="SpellE"><span style="font-size:9.5pt;font-family:Consolas;color:red;">AdjustedPredictedValue </span></span><span style="font-size:9.5pt;font-family:Consolas;color:red;"><span class="GramE">decimal(</span>38,21),<br />
</span><span class="SpellE"><span style="font-size:9.5pt;font-family:Consolas;color:red;">GlobalPredictionDifference </span></span><span style="font-size:9.5pt;font-family:Consolas;color:red;">AS Value2 – <span class="SpellE">PredictedValue</span>,<br />
</span><span class="SpellE"><span style="font-size:9.5pt;font-family:Consolas;color:red;">AdjustmentDifference </span></span><span style="font-size:9.5pt;font-family:Consolas;color:red;">AS <span class="SpellE">PredictedValue</span> – <span class="SpellE">AdjustedPredictedValue<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:red;">)</span><span style="font-size:9.5pt;font-family:Consolas;"> </span></p>
<p class="MsoNormal"><span style="font-size:9.5pt;font-family:Consolas;color:red;">INSERT INTO @<span class="SpellE">RegressionTable<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:red;">(Value1, Value2)<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;">SELECT ‘</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:gray;">+</span> <span style="color:teal;">@Column1 </span><span style="color:gray;">+</span> <span style="color:red;">‘, ‘</span> <span style="color:gray;">+</span> <span style="color:teal;">@Column2</span> <span class="GramE"><span style="color:gray;">+</span> <span style="color:red;">‘<br />
</span></span></span><span style="font-size:9.5pt;font-family:Consolas;color:red;">FROM ‘</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:gray;">+</span> <span style="color:teal;">@SchemaAndTable1 </span><span class="GramE"><span style="color:gray;">+</span> <span style="color:red;">‘<br />
</span></span></span><span style="font-size:9.5pt;font-family:Consolas;color:red;">WHERE ‘</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:gray;">+</span> <span style="color:teal;">@Column1 </span><span style="color:gray;">+</span> <span style="color:red;">‘ IS NOT NULL AND ‘ </span><span style="color:gray;">+</span> <span style="color:teal;">@Column2</span> <span style="color:gray;">+</span> <span style="color:red;">‘ IS NOT NULL</span></span><span style="font-size:9.5pt;font-family:Consolas;"> </span></p>
<p class="MsoNormal"><span style="font-size:9.5pt;font-family:Consolas;color:red;">— STEP #1 – <span class="GramE">RETRIEVE</span> THE GLOBAL AGGREGATES NEEDED FOR OTHER CALCULATIONS<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;">SELECT @Count=<span class="GramE">Count(</span>CAST(Value1 AS Decimal(38,21))),<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;">@<span class="SpellE">MeanX</span> = <span class="SpellE"><span class="GramE">Avg</span></span><span class="GramE">(</span>CAST(Value1 AS Decimal(38,21))), @<span class="SpellE">MeanY </span>= <span class="SpellE">Avg</span>(CAST(Value2 AS Decimal(38,21))),<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;">@<span class="SpellE">StDevX</span> = <span class="GramE">StDev(</span>CAST(Value1 AS Decimal(38,21))), @<span class="SpellE">StDevY</span> = StDev(CAST(Value2 AS Decimal(38,21)))<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;">FROM @<span class="SpellE">RegressionTable</span></span></p>
<p class="MsoNormal"><span style="font-size:9.5pt;font-family:Consolas;color:red;">— STEP #2 – CALCULATE THE CORRELATION (BY FIRST GETTING THE COVARIANCE)<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;">SELECT @Covariance = <span class="GramE">SUM(</span>(Value1 – @<span class="SpellE">MeanX</span>) * (Value2 – @<span class="SpellE">MeanY</span>)) / (@Count – 1)<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;">FROM @<span class="SpellE">RegressionTable</span></span></p>
<p class="MsoNormal"><span style="font-size:9.5pt;font-family:Consolas;color:red;">— <span class="GramE">once</span> <span class="SpellE">weve</span> got the covariance, its trivial to calculate the correlation<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;">SELECT @Correlation = @Covariance / (@<span class="SpellE">StDevX</span> * @<span class="SpellE">StDevY</span>)</span></p>
<p class="MsoNormal"><span style="font-size:9.5pt;font-family:Consolas;color:red;">— STEP #3 – CALCULATE THE SLOPE AND INTERCEPT AND MAKE PREDICTIONS<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;">SELECT @Slope = @Correlation * (@<span class="SpellE">StDevY</span> / @<span class="SpellE">StDevX</span>)<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;">SELECT @Intercept = <span class="GramE">@<span class="SpellE">MeanY</span></span> – (@Slope * @<span class="SpellE">MeanX</span>)<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;">UPDATE @<span class="SpellE">RegressionTable</span></span></p>
<p class="MsoNormal"><span style="font-size:9.5pt;font-family:Consolas;color:red;">SET <span class="SpellE">PredictedValue</span> = (Value1 * @Slope) + @Intercept<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;">— STEP #4 – CALCULATE THE MEAN SQUARED ERROR<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;">— subtract the actual values from the <span class="SpellE">PredictedValues </span>and square them; add em together; then multiple the <span class="GramE">result </span>by the reciprocal of the count<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;">— <span class="GramE">as</span> defined at the Wikipedia page “Mean Squared Error” <a href="http://en.wikipedia.org/wiki/Mean_squared_error" rel="nofollow">http://en.wikipedia.org/wiki/Mean_squared_error</a></span></p>
<p class="MsoNormal"><span style="font-size:9.5pt;font-family:Consolas;color:red;">SELECT @<span class="SpellE">MeanSquaredError</span> = <span class="GramE">SUM(</span>Power((<span class="SpellE">PredictedValue</span> – Value2), 2)) / CAST(@Count – @<span class="SpellE">NumberOfFittedParameters </span>AS float<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;"> FROM @<span class="SpellE">RegressionTable</span></span><span style="font-size:9.5pt;font-family:Consolas;"><br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;">— STEP #5 – NOW CALCULATE A SLIDING WINDOW<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;">— recalculate alternate regression models for each row, plus the leverage from intermediate steps<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;">— <span class="GramE">none</span> of this is terribly complicated; <span class="SpellE">theres</span> just a lot to fi<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;">— <span class="GramE">the</span> outer select is needed here because aggregates <span class="SpellE">arent</span> allowed in the main UPDATE statement (silly limitation)</span></p>
<p class="MsoNormal"><span style="font-size:9.5pt;font-family:Consolas;color:red;">UPDATE T0<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;">SET LocalMean1 = T3.LocalMean1, LocalMean2 = T3.LocalMean2, </span><span style="font-size:9.5pt;font-family:Consolas;color:red;">LocalStDev1 = T3.LocalStDev1, LocalStDev2 = T3.LocalStDev2<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;">FROM @<span class="SpellE">RegressionTable</span> AS T0<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;">INNER JOIN<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;"> (SELECT T1.ID AS ID, <span class="SpellE"><span class="GramE">Avg</span></span><span class="GramE">(</span>T2.Value1) AS LocalMean1,</span><span style="font-size:9.5pt;font-family:Consolas;color:red;"> <span class="SpellE"><span class="GramE">Avg</span></span><span class="GramE">(</span>T2.Value2) AS LocalMean2,</span><span style="font-size:9.5pt;font-family:Consolas;color:red;"> <span class="GramE">StDev(</span>T2.Value1) AS LocalStDev1, StDev(T2.Value2) AS LocalStDev2<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;"> FROM @<span class="SpellE">RegressionTable</span> AS T1<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;"> INNER JOIN @<span class="SpellE">RegressionTable </span>AS T2<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;"> ON T2.ID > T1.ID OR T2.ID < T1.ID<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;"> GROUP BY T1.ID) AS T3<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;"> ON T0.ID = T3.ID<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;">‘</span></p>
<p class="MsoNormal"><span style="font-size:9.5pt;font-family:Consolas;color:blue;">SET</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:teal;">@SQLString2</span> <span class="GramE"><span style="color:gray;">=</span> <span style="color:red;">‘</span></span></span><span style="font-size:9.5pt;font-family:Consolas;color:red;">UPDATE T0<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;">SET <span class="SpellE">LocalCovariance</span> = T3.LocalCovariance, <span class="SpellE">LocalCorrelation</span> = T3.LocalCovariance / (LocalStDev1 * LocalStDev2), <span class="SpellE">LocalSum</span> = T3.LocalSum<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;">FROM @<span class="SpellE">RegressionTable</span> AS T0<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;"> INNER JOIN (SELECT T1.ID AS ID, <span class="GramE">SUM(</span>(T2.Value1 – T2.LocalMean1) * (T2.Value2 – T2.LocalMean2)) / (@Count – 1) AS <span class="SpellE">LocalCovariance,<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:red;"> <span class="GramE">SUM(</span>Power(T2.Value1 – T2.LocalMean1, 2)) AS <span class="SpellE">LocalSum<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:red;"> FROM @<span class="SpellE">RegressionTable</span> AS T1<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;"> INNER JOIN @<span class="SpellE">RegressionTable </span>AS T2<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;"> ON T2.ID > T1.ID OR T2.ID < T1.ID<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;"> GROUP BY T1.ID) AS T3<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;"> ON T0.ID = T3.ID</span><span style="font-size:9.5pt;font-family:Consolas;"> </span></p>
<p class="MsoNormal"><span style="font-size:9.5pt;font-family:Consolas;color:red;">UPDATE T0<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;">SET Leverage = T3.Leverage<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;">FROM @<span class="SpellE">RegressionTable</span> AS T0<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;"> INNER JOIN (SELECT ID, Value1<span class="GramE">, 1</span> / CAST(@Count AS float) + (CASE WHEN Dividend1 = 0 THEN 0 ELSE Divisor1 / Dividend1 END) AS Leverage<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;"> FROM (SELECT ID, Value1, <span class="GramE">Power(</span>Value1 – LocalMean1, 2) AS Divisor1, <span class="SpellE">LocalSum</span> AS Dividend1, Power(Value2 – LocalMean2, 2) AS Divisor2<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;"> FROM @<span class="SpellE">RegressionTable</span>) AS T2) AS T3<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;"> ON T0.ID = T3.ID</span><span style="font-size:9.5pt;font-family:Consolas;"> </span></p>
<p class="MsoNormal"><span style="font-size:9.5pt;font-family:Consolas;color:red;">UPDATE @<span class="SpellE">RegressionTable<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:red;">SET <span class="SpellE">LocalIntercept</span> = LocalMean2 – (<span class="SpellE">LocalSlope </span>* LocalMean1)</span><span style="font-size:9.5pt;font-family:Consolas;"> </span></p>
<p class="MsoNormal"><span style="font-size:9.5pt;font-family:Consolas;color:red;">UPDATE @<span class="SpellE">RegressionTable<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:red;">SET <span class="SpellE">AdjustedPredictedValue</span> = (Value1 * <span class="SpellE">LocalSlope</span>) + <span class="SpellE">LocalIntercept</span> </span></p>
<p class="MsoNormal"><span style="font-size:9.5pt;font-family:Consolas;color:red;">— #6 <span class="GramE">RETURN</span> THE RESULTS<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;">SELECT ID, Value1, Value2, <span class="SpellE">StudentizedResidual<span class="GramE">,Leverage,CooksDistance,DFFITS<br />
</span></span></span><span style="font-size:9.5pt;font-family:Consolas;color:red;">FROM (SELECT ID, Value1, Value2, </span><span class="SpellE"><span style="font-size:9.5pt;font-family:Consolas;color:red;">GlobalPredictionDifference </span></span><span style="font-size:9.5pt;font-family:Consolas;color:red;">/ LocalStDev1 AS <span class="SpellE">StudentizedResidual</span>, Leverage,<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;">(<span class="GramE">Power(</span><span class="SpellE">GlobalPredictionDifference</span>, 2) / (@<span class="SpellE">NumberOfFittedParameters</span> * @<span class="SpellE">MeanSquaredError</span>)) * (Leverage / <span class="GramE">Power(</span>1 – Leverage, 2)) AS <span class="SpellE">CooksDistance</span>, </span><span class="SpellE"><span style="font-size:9.5pt;font-family:Consolas;color:red;">AdjustmentDifference </span></span><span style="font-size:9.5pt;font-family:Consolas;color:red;">/ (LocalStDev2 * <span class="GramE">Power(</span>Leverage, 0.5)) AS DFFITS<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;">FROM @<span class="SpellE">RegressionTable</span>) AS T1<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;">ORDER BY <span class="SpellE">CooksDistance</span> DESC</span></p>
<p class="MsoNormal"><span style="font-size:9.5pt;font-family:Consolas;color:red;">— <span class="GramE">also</span> return the global stats<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:red;">— SELECT @<span class="SpellE">MeanSquaredError</span> AS <span class="SpellE">GlobalMeanSquaredError</span>, @Slope AS <span class="SpellE">GlobalSlope</span>, @Intercept AS <span class="SpellE">GlobalIntercept</span>, @Covariance AS <span class="SpellE">GlobalCovariance</span>, @Correlation AS <span class="SpellE">GlobalCorrelation<br />
</span></span><span style="font-size:9.5pt;font-family:Consolas;color:red;">‘</span></p>
<p class="MsoNormal"><span style="font-size:9.5pt;font-family:Consolas;color:blue;">SET</span><span style="font-size:9.5pt;font-family:Consolas;"> <span style="color:teal;">@SQLString1</span> <span style="color:gray;">=</span> <span style="color:teal;">@SQLString1</span> <span style="color:gray;">+</span> <span style="color:teal;">@SQLString2</span> </span></p>
<p class="MsoNormal"><span style="font-size:9.5pt;font-family:Consolas;color:green;">–SELECT @SQLString1 — <span class="GramE">uncomment this</span> to debug dynamic SQL errors<br />
</span><span style="font-size:9.5pt;font-family:Consolas;color:blue;">EXEC </span><span style="font-size:9.5pt;font-family:Consolas;color:gray;">(</span><span style="font-size:9.5pt;font-family:Consolas;color:teal;">@SQLString1</span><span style="font-size:9.5pt;font-family:Consolas;color:gray;">)</span></p>
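<p>The self-join in Step 5 (ON T2.ID &gt; T1.ID OR T2.ID &lt; T1.ID) is just a leave-one-out aggregation: for each row, the local mean and standard deviation are computed over every other row. A minimal Python sketch of that idea, purely for illustration (the function name is mine, not part of the procedure):</p>

```python
from statistics import mean, stdev

def leave_one_out_aggregates(values):
    """For each row, compute the mean and sample standard deviation over
    every OTHER row, mirroring the T-SQL self-join
    ON T2.ID > T1.ID OR T2.ID < T1.ID in Step 5."""
    results = []
    for i in range(len(values)):
        others = values[:i] + values[i + 1:]
        results.append((mean(others), stdev(others)))
    return results

# Dropping 1.0 leaves [2.0, 3.0], whose mean is 2.5, and so on.
print(leave_one_out_aggregates([1.0, 2.0, 3.0]))
```

<p>Note that this loop, like the T-SQL self-join, does O(n) work per row, which is exactly where the quadratic cost discussed below comes from.</p>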
<p><span style="font-size:10pt;color:white;">…………</span>Each of the procedures I’ve posted in previous articles has made use of dynamic SQL similar to that in Figure 1, but in this case there’s simply a lot more of it; it helps to at least have the operations presented sequentially, in a series of updates to the @RegressionTable variable, rather than have them bubble up from the center of a set of nested subqueries. The first three steps in Figure 1 are fairly straightforward: we retrieve the global aggregates we need as usual, then calculate the covariance (a more expensive operation that involves another scan or seek across the table) from them, followed by the slope and intercept in succession.[8] The MSE calculation in Step 4 requires yet another scan or seek across the whole table. Step 5 accounts for most of the performance costs, since we cannot reuse the aggregates derived in Step 1 for the new regression models we have to build for each data point. It was necessary to break the dynamic SQL into two chunks via the SET @SQLString1 = @SQLString1 + @SQLString2 statement, which works around a bug (or “feature”) that apparently limits the size of strings that can be assigned in any one statement, even with nvarchar(max).[9] Various thresholds are sometimes baked into the algorithm to flag “influential points,” but I decided to let users add their own, in part to shorten the code and in part because there’s apparently no consensus on what those thresholds ought to be.[10]<br />
<span style="font-size:10pt;color:white;">…………</span>Aside from the lengthy computations, the Cook’s Distance procedure follows much the same format as the other T-SQL solutions I’ve posted in this series. One of the few differences is the extra Column parameter, which lets the user compare two columns in any database for which they have the requisite access, since Cook’s Distance involves a comparison between two columns rather than a test of a single column as in previous tutorials. The @DecimalPrecision parameter is still available so that users can avoid arithmetic overflows by manually setting a precision and scale appropriate to the columns they’ve selected. To simplify things I omitted the usual @OrderByCode for sorting the results and set a default of 2 for @NumberOfFittedParameters. As usual, the procedure resides in a Calculations schema and there is no code to handle validation, SQL injection or spaces in object names. Uncommenting the next-to-last line allows users to debug the dynamic SQL.</p>
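<p>To make the arithmetic in Steps 1 through 4 concrete: the slope is the correlation rescaled by the ratio of the standard deviations, the intercept follows from the two means, and the MSE divides the summed squared residuals by the count less the number of fitted parameters. A hypothetical Python sketch of the same calculations, for sanity-checking outside of T-SQL (none of these names comes from the procedure):</p>

```python
import math

def global_fit(xs, ys, fitted_parameters=2):
    """Steps 1-4 of the procedure: global aggregates, covariance,
    correlation, slope/intercept, and the mean squared error."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    stdev_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    stdev_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    covariance = sum((x - mean_x) * (y - mean_y)
                     for x, y in zip(xs, ys)) / (n - 1)
    correlation = covariance / (stdev_x * stdev_y)
    slope = correlation * (stdev_y / stdev_x)
    intercept = mean_y - slope * mean_x
    predictions = [slope * x + intercept for x in xs]
    mse = sum((p - y) ** 2
              for p, y in zip(predictions, ys)) / (n - fitted_parameters)
    return slope, intercept, mse

# A perfectly linear relationship: slope 2, intercept 0, MSE 0.
print(global_fit([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]))
```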
<p><strong><u>Figure 2: Results for the Cook’s Distance Query</u></strong></p>
<pre>
EXEC Calculations.CooksDistanceSP
     @Database1 = N'DataMiningProjects',
     @Schema1 = N'Health',
     @Table1 = N'DuchennesTable',
     @Column1 = N'PyruvateKinase',
     @Column2 = N'Hemopexin'
</pre>
<p><a href="https://multidimensionalmayhem.files.wordpress.com/2015/08/cooks-distance-results.jpg"><img class="alignnone size-full wp-image-511" src="https://multidimensionalmayhem.files.wordpress.com/2015/08/cooks-distance-results.jpg?w=604&h=425" alt="Cook's Distance Results" width="604" height="425" /></a></p>
<p><span style="font-size:10pt;color:white;">…………</span>As I have in many previous articles, I ran the first test query against a 209-row dataset on the Duchennes form of muscular dystrophy, which I downloaded from <a href="http://biostat.mc.vanderbilt.edu/wiki/Main/DataSets">Vanderbilt University’s Department of Biostatistics</a>. As the results in Figure 2 show, the protein Hemopexin had the greatest influence on the Pyruvate Kinase enzyme at the 126<sup>th</sup> record. The Cook’s Distance there was 0.081531, about four times higher than the sixth-highest Cook’s Distance (the record with a bigint primary key of 23), so we may safely conclude that this record is an outlier, unless existing domain knowledge suggests that this particular point is supposed to contain such extreme values. Be warned that for a handful of value pairs, my figures differ from those obtained in other mining tools (which, believe it or not, also have discrepancies among themselves), but I strongly suspect this depends on how nulls and divide-by-zeros are dealt with, since there is no standard method of handling them in Cook’s D. These minor discrepancies are not of critical importance, however, since outlier detection figures are rarely plugged into other calculations, nor is it wise to act on them without further inspection.<br />
<span style="font-size:10pt;color:white;">…………</span>The procedure executed in 19 milliseconds on the table I imported the Duchennes data into, but don’t let that paltry figure deceive you: on large tables, the cost rises quadratically to the point where it becomes prohibitive. There were only a handful of operators, including two Index Seeks which accounted for practically the entire cost of the query, which means it may be difficult to gain much performance value from optimizing the execution plans. This brings us to the bad news: the procedure simply won’t run against datasets of the same size as the Higgs Boson dataset I downloaded from the <a href="https://archive.ics.uci.edu/ml/datasets/HIGGS">University of California at Irvine’s Machine Learning Repository</a> and have been using to stress-test my sample T-SQL throughout this series. Since we need to recalculate a new regression model for each of the 11 million rows, we’re talking about 11 million squared, or roughly 121 trillion rows of intermediate regression data, in order to derive 11 million separate Cook’s Distances. Strictly speaking that is quadratic rather than exponential growth, but it is prohibitive all the same; without an EXCLUDE CURRENT ROW windowing clause or some other efficient method of calculating the leave-one-out aggregates in a single pass, I know of no way to bring the running time down to something tractable. I’m weak in GROUP BY operations, so perhaps another workaround can be derived through those; if not, we’re up the proverbial creek without a paddle. Even if you can wait the lifetime of the universe, or however long it takes to run the 11,000,000<sup>2</sup> regression operations, it is unlikely that you’ll have enough spare room in TempDB for 121 trillion rows. The price to be paid for the more sophisticated insights Cook’s Distance provides is that it simply cannot be run against Big Data-sized datasets, at least in its current form.<br />
<span style="font-size:10pt;color:white;">…………</span>As we’ve seen so many times in this series, scaling up existing outlier detection methods to Big Data sizes doesn’t merely present performance issues, but logical ones; in the case of Cook’s Distance, omitting a single observation is only going to have an infinitesimal impact on a regression involving 11 million records, no matter how aberrant the data point might be. Since it is derived from linear least squares regression, Cook’s Distance shares some of its limitations, like “the shapes that linear models can assume over long ranges, possibly poor extrapolation properties, and sensitivity to outliers.”[11] We’re trying to harness that sensitivity when performing outlier detection, but the sheer size of the regression lines generated from Big Data may render it too insensitive to justify such intensive computations. When you factor in the performance costs of recalculating a regression model for that many rows, the usefulness of this outlier identification method obviously comes into question. On the other hand, the procedure did seem to identify outliers with greater accuracy when run against other tables I’m very familiar with, which consisted of a few thousand rows apiece. There may be a happy medium at work here, in which Cook’s Distance is genuinely useful for a certain class of moderately sized tables, in situations where the extra precision of this particular metric is needed. When deciding whether or not the extra computational cost is worth it for a particular table, keep in mind that the performance costs are magnified in my results because I’m running them on a wimpy eight-core semblance of an AMD workstation that has more in common with <a href="https://www.youtube.com/watch?v=1WqazleR3FE">Sanford and Son’s truck </a>than a real production environment server. 
Furthermore, the main uses in this field for outlier detection of any kind are in exploratory data mining and data quality examinations, which don’t require constant, ongoing combing of the database for outliers; these are issues of long-term importance, not short-term emergencies like a relational query that has to be optimized perfectly because it may have to run every day, or even every second. Tests like this can be left for off-peak hours on a weekly or monthly basis, so as not to interfere with normal operations. Cook’s Distance might be preferred when searching for a specific type of outlier, i.e. those that could throw off predictive modeling, just as Benford’s Law is often selected when identifying data quality problems is paramount, especially the intentional data quality issue we call fraud. Cook’s Distance might also prove more useful in cases where the relationship between two variables is at the heart of the questions the tester chooses to ask. Cook’s D and DFFITS can also apparently be used to convert back and forth from another common stat I haven’t yet learned to use, the Wald Statistic, which is apparently used for ferreting out the values of unknown parameters.<a href="#_edn12" name="_ednref12">[12]</a> If there’s one thing I’ve learned while writing this series, it’s that there’s a shortage of outlier detection methods appropriate to the size of the datasets that DBAs work with. Thankfully, the workaround I translated into T-SQL for my next column allows us to use Mahalanobis Distance to find outliers across columns, without the cosmic computational performance hit incurred by calculating Cook’s D on large SQL Server databases. As with Cook’s D, there are some minor accuracy issues, but these are merely cosmetic when looking for outliers, where detection can be automated but handling ought to require human intervention.</p>
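<p>For readers who want to double-check individual rows in Figure 2, the final SELECT of the procedure boils down to the textbook formula D_i = (e_i^2 / (p * MSE)) * (h_i / (1 - h_i)^2), where e_i is the residual, h_i the leverage and p the number of fitted parameters. A small Python sketch of just that formula (the variable names are mine):</p>

```python
def cooks_distance(residual, leverage, mse, fitted_parameters=2):
    """Cook's D as formed in the procedure's final SELECT:
    (residual^2 / (p * MSE)) * (leverage / (1 - leverage)^2)."""
    return (residual ** 2 / (fitted_parameters * mse)) * (
        leverage / (1.0 - leverage) ** 2)

# With a residual of 2, an MSE of 1, p = 2 and leverage 0.5:
# (4 / 2) * (0.5 / 0.25) = 4.0
print(cooks_distance(2.0, 0.5, 1.0))
```

<p>The (1 - leverage)^2 term in the denominator is also why high-leverage points blow up the statistic: as h_i approaches 1, D_i grows without bound.</p>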
<p> </p>
<p>[1] For a quick run-down, see the <u>Wikipedia</u> page “Non-Euclidean Geometry” at <a href="http://en.wikipedia.org/wiki/Non-Euclidean_geometry" rel="nofollow">http://en.wikipedia.org/wiki/Non-Euclidean_geometry</a></p>
<p>[2] Cook, R. Dennis, 1977, “Detection of Influential Observations in Linear Regression,” pp. 15-18 in <u>Technometrics</u>, February 1977. Vol. 19, No. 1. A .pdf version is available at the Universidad de São Paulo’s Instituto de Matematica Estatística web address <a href="http://www.ime.usp.br/~abe/lista/pdfWiH1zqnMHo.pdf">http://www.ime.usp.br/~abe/lista/pdfWiH1zqnMHo.pdf</a></p>
<p>[3] I originally retrieved it from the <u>Wikipedia</u> page “Cook’s Distance” at <a href="http://en.wikipedia.org/wiki/Cook%27s_distance">http://en.wikipedia.org/wiki/Cook%27s_distance</a> , but there’s no difference between it and the one in Cook’s paper.</p>
<p>[4] I used the formula defined at the <u>Wikipedia</u> page “Mean Squared Error,” at the web address <a href="http://en.wikipedia.org/wiki/Mean_squared_error">http://en.wikipedia.org/wiki/Mean_squared_error</a>. The same page states that there are two more competing definitions, but I used the one that the Cook’s Distance page linked to (The <u>Wikipedia</u> page “Residual Sum of Squares” at <a href="http://en.wikipedia.org/wiki/Residual_sum_of_squares">http://en.wikipedia.org/wiki/Residual_sum_of_squares</a> may also be of interest.):</p>
<blockquote><p> “In regression analysis, the term mean squared error is sometimes used to refer to the unbiased estimate of error variance: the residual sum of squares divided by the number of degrees of freedom. This definition for a known, computed quantity differs from the above definition for the computed MSE of a predictor in that a different denominator is used. The denominator is the sample size reduced by the number of model parameters estimated from the same data, (n-p) for p regressors or (n-p-1) if an intercept is used.[3] For more details, see errors and residuals in statistics. Note that, although the MSE is not an unbiased estimator of the error variance, it is consistent, given the consistency of the predictor.”</p>
<p>“Also in regression analysis, “mean squared error”, often referred to as mean squared prediction error or “out-of-sample mean squared error”, can refer to the mean value of the squared deviations of the predictions from the true values, over an out-of-sample test space, generated by a model estimated over a particular sample space. This also is a known, computed quantity, and it varies by sample and by out-of-sample test space.”</p></blockquote>
<p>[5] p. 47, Ben-Gan, Itzik, 2012, <u>Microsoft SQL Server 2012 High-Performance T-SQL Using Window Functions</u> . O’Reilly Media, Inc.: Sebastopol, California.</p>
<p>[6] See the CrossValidated thread titled “Is It Possible to Derive Leverage Figures Without a Hat Matrix?”, posted by SQLServerSteve on June 26, 2015 at <a href="http://stats.stackexchange.com/questions/158751/is-it-possible-to-derive-leverage-figures-without-a-hat-matrix">http://stats.stackexchange.com/questions/158751/is-it-possible-to-derive-leverage-figures-without-a-hat-matrix</a> . Also see the reply by the user Glen_B to the CrossValidated thread titled “Which of these points in this plot has the highest leverage and why?” on July 9, 2014 at <a href="http://stats.stackexchange.com/questions/106191/which-of-these-points-in-this-plot-has-the-highest-leverage-and-why/106314#106314" rel="nofollow">http://stats.stackexchange.com/questions/106191/which-of-these-points-in-this-plot-has-the-highest-leverage-and-why/106314#106314</a></p>
<p>[7] See the formula at the Wikipedia page “DFFITS” at <a href="https://en.wikipedia.org/wiki/DFFITS">https://en.wikipedia.org/wiki/DFFITS</a></p>
<p>[8] I retrieved this formula from the most convenient source, the Dummies.com page “How to Calculate a Regression Line” at the web address <a href="http://www.dummies.com/how-to/content/how-to-calculate-a-regression-line.html" rel="nofollow">http://www.dummies.com/how-to/content/how-to-calculate-a-regression-line.html</a></p>
<p>[9] See the response by the user named kannas at the StackOverflow thread, “Nvarchar(Max) Still Being Truncated,” published Dec. 19, 2011 at the web address <a href="http://stackoverflow.com/questions/4833549/nvarcharmax-still-being-truncated" rel="nofollow">http://stackoverflow.com/questions/4833549/nvarcharmax-still-being-truncated</a></p>
<p>[10] See the <u>Wikipedia</u> page “Cook’s Distance” at <a href="http://en.wikipedia.org/wiki/Cook%27s_distance" rel="nofollow">http://en.wikipedia.org/wiki/Cook%27s_distance</a></p>
<p>[11] See National Institute for Standards and Technology, 2014, “4.1.4.1.Linear Least Squares Regression,” published in the online edition of the<u> Engineering Statistics Handbook</u>. Available at <a href="http://www.itl.nist.gov/div898/handbook/pmd/section1/pmd141.htm" rel="nofollow">http://www.itl.nist.gov/div898/handbook/pmd/section1/pmd141.htm</a></p>
<p>[12] See the Wikipedia pages “Cook’s Distance,” “DFFITS” and “Wald Test” at <a href="http://en.wikipedia.org/wiki/Cook%27s_distance">http://en.wikipedia.org/wiki/Cook%27s_distance</a>, <a href="http://en.wikipedia.org/wiki/DFFITS">http://en.wikipedia.org/wiki/DFFITS</a> and <a href="http://en.wikipedia.org/wiki/Wald_test">http://en.wikipedia.org/wiki/Wald_test</a> respectively.</p>
<p> </p>Integrating Other Data Mining Tools with SQL Server, Part 2.2: Minitab vs. SSDM and Reporting Services
http://www.sqlservercentral.com/blogs/multidimensionalmayhem/2015/07/08/integrating-other-data-mining-tools-with-sql-server-part-22-minitab-vs-ssdm-and-reporting-services/
Thu, 09 Jul 2015 02:20:34 UT/blogs/multidimensionalmayhem/2015/07/08/integrating-other-data-mining-tools-with-sql-server-part-22-minitab-vs-ssdm-and-reporting-services/0http://www.sqlservercentral.com/blogs/multidimensionalmayhem/2015/07/08/integrating-other-data-mining-tools-with-sql-server-part-22-minitab-vs-ssdm-and-reporting-services/#comments<p><strong>By Steve Bolton</strong></p>
<p><span style="font-size:10pt;color:white;">…………</span>Professional statistical software like Minitab can fill some important gaps in SQL Server’s functionality, as I addressed in the last post of this occasional series of pseudo-reviews. I’m only concerned here with assessing how well a particular data mining tool might fit into a SQL Server user’s toolbox, not with its usefulness in other scenarios; that is why I made comparisons solely on the ability of various SQL Server components to compete with Minitab’s functionality, wherever the two overlapped. Most of the use cases for Minitab (and possibly its competitors, most of which I have yet to try) come under the rubric of statistics, which falls into the gap between T-SQL aggregates and the “Big Data”-sized number-crunching power of SQL Server Analysis Services (SSAS) and SQL Server Data Mining (SSDM). For example, as I mentioned last time around, Minitab implements many statistical functions, tests and workflows that are not available in SSAS or SSDM, but which can be coded in T-SQL; whether or not it is profitable to do so varies by the simplicity of each particular stat and the skill level of the coder in translating the math formulas into T-SQL (something I’m hell-bent on acquiring). In this installment, I’ll cover some of Minitab’s implementations of more advanced algorithms that we’d normally use SSDM for, but which are sometimes simple enough to still be implemented in T-SQL. So far in this haphazard examination of Microsoft’s competitors, the general rule of thumb seems to be that SSDM is to be preferred, particularly on large datasets, except when it doesn’t offer a particular algorithm out-of-the-box. That happens quite often, given that there are literally thousands of algorithms in the literature, far more than any single company could ever implement. Minitab offers a wider and more useful selection of these alternative algorithms than WEKA, an open source tool profiled in the first couple of articles. 
In cases where SQL Server and Minitab compete head-to-head, SSDM wins hands down in both performance and usability. As we shall see, the same is true in comparisons of Minitab’s visualizations to SQL Server Reporting Services (SSRS), where the main dividing line is between out-of-the-box functionality and customizable reports.<br />
<span style="font-size:10pt;color:white;">…………</span>Minitab’s data mining capabilities differ from SQL Server’s mainly in that it implements algorithms of lower sophistication, but with a wider array of really useful variations and enhancements. The further we get from ordinary statistical tasks like hypothesis testing and analysis of variance (ANOVA) towards machine learning and other examples of “soft computing,” the more the balance shifts back to SSDM. I couldn’t find any reference in Minitab’s extensive Help files to topics that are often associated with pure data mining, like neural nets, fuzzy sets, entropy, decision trees, pattern recognition or the Kullback-Leibler Divergence. Nor is there any mention of information, at least as the term is used in the professional sense of information theory, or of the many measures associated with such famous names in the field as Claude Shannon or Andrey Kolmogorov.[1] Given that, it’s not surprising that there’s no mention of information geometry, which is apparently a bleeding-edge topic in data mining and knowledge discovery. On the other hand, Minitab implements four of the nine algorithms found in SSDM, as discussed in my earlier amateur tutorial series, <a href="http://www.sqlservercentral.com/blogs/multidimensionalmayhem/2012/11/28/a-rickety-stairway-to-sql-server-data-mining-part-01-data-in-data-out/">A Rickety Stairway to SQL Server Data Mining</a>. Out of these four, Minitab clearly has the advantage in terms of features when it comes to Linear Regression – but definitely not when it comes to performance.<br />
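For readers who haven’t run into the information-theory measures just mentioned, the Kullback-Leibler Divergence is easy to state. Below is an illustrative Python sketch of the textbook definition for two discrete distributions, in bits; the function name and sample distributions are mine, not anything from Minitab or SSDM.

```python
from math import log2

# Kullback-Leibler Divergence D(P || Q) between two discrete distributions,
# in bits - one of the information measures Minitab's Help makes no mention of.
def kl_divergence(p, q):
    # Terms with p_i = 0 contribute nothing by convention.
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# A fair coin measured against a biased reference distribution:
d = kl_divergence([0.5, 0.5], [0.25, 0.75])
```

The divergence is zero only when the two distributions match, which is what makes it useful as a measure of how badly one model approximates another.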
<span style="font-size:10pt;color:white;">…………</span>As depicted in Figure 1, many more types of regression are available in Minitab, like nominal, ordinal, orthogonal, nonlinear, partial least squares and Poisson. Each of these has its own set of options and parameters which greatly enhance their usefulness, most of which are not available in SSDM. For example, it is easier to access the resulting regression equations and related stats in the output of ordinary regression routines, which can return additional metrics like the Durbin-Watson Test that are not available in SQL Server at all. On top of these myriad enhancements, Minitab has entire classes of algorithms that SSDM does not provide out-of-the-box. As shown in Figure 3, many different functions can be plugged into Minitab’s version of nonlinear regression, thereby making it into an entire family of related algorithms, many of which can be quite useful in analysis. There’s no reason why Microsoft could not have implemented all of these algorithms in SSDM, but as I lamented often in the Rickety series, the top brass is slowly squandering an entire market through almost a decade of pointless neglect. It is a shame that Microsoft doesn’t know how good its own product is, given that SSDM still blows away its rivals, at least in areas where the same functionality competes head-to-head.<br />
<span style="font-size:10pt;color:white;">…………</span>As mentioned in the last article, Minitab worksheets are limited to just 10 million rows, which means that displaying all 11 million rows in the Higgs Boson dataset[2] I’ve been using as practice data for the last couple of tutorial series is out of the question. In SQL Server Management Studio (SSMS) this is no problem, but the real issue here is not one of display; it’s that we can’t perform calculations on that many records. When I tried to run a regression on the first 10 million rows, it ran on one core for 16 minutes and ended up gobbling up 2 gigs of memory. It crashed during the loading phase before even initiating the regression calculations, with the error message: “Insufficient memory to complete operation. Minitab ran out of memory and was unable to recover. Close other applications to reduce memory and then press Retry. If this error continues you may need to exit Minitab and restart your system. If you select Abort, Minitab will terminate and you may lose work you have not saved.” In contrast, SSDM was able to run a regression on the same dataset in just 3 minutes and 54 seconds. SSDM’s version of Logistic Regression was able to process the whole table in just 3:32. Given that Minitab can’t even load that many records into a worksheet, let alone compute the regressions, the edge in performance definitely goes to SQL Server. This was accomplished without any of the myriad server options that can be used to enhance performance in SQL Server, none of which are available in Minitab; the same rule essentially holds when we compare T-SQL relational solutions to Minitab’s functionality, which doesn’t offer any indexing, tracing, tuning or other such tweaks that we take for granted. Furthermore, SSDM can better handle marking columns as inputs, outputs or both in its mining models (i.e. Predict, PredictOnly, etc.). 
On the other hand, SSDM lacks a good regression viewer; we’re limited to the Decision Trees and Generic Content viewers, when what we really need is a regression plot of the kind that Minitab returns out-of-the-box, like the Fitted Line Plot in Figure 4.[3] Since SSDM doesn’t implement this, I would either write a plug-in visualization of the kind I wrote about in <a href="http://www.sqlservercentral.com/blogs/multidimensionalmayhem/2014/02/11/a-rickety-stairway-to-sql-server-data-mining-part-15-the-grand-finale-custom-data-mining-viewers/">A Rickety Stairway to SQL Server Data Mining, Part 15, The Grand Finale: Custom Data Mining Viewers</a>, or write an SSRS report with a line graph. When mining large datasets using existing algorithms, I would first perform the calculations in SSDM, then display the regression lines in an SSRS report or custom mining viewer. I would integrate Minitab into this workflow by performing calculations on large samples of the data, in order to derive the extra regression stats it provides. In cases of small datasets, tight deadlines or algorithms that SSDM doesn’t have, I’d go with Minitab, at least in situations where T-SQL solutions would also be beyond my skill level or would take too much time to write and test.</p>
<p><strong><u>Figures 1 and 2: The Regression and Time Series Menus<br />
<a href="https://multidimensionalmayhem.files.wordpress.com/2015/07/minitab-regression-menu.jpg"><img class="alignnone size-full wp-image-498" src="https://multidimensionalmayhem.files.wordpress.com/2015/07/minitab-regression-menu.jpg?w=604" alt="Minitab Regression Menu" /></a><br />
<a href="https://multidimensionalmayhem.files.wordpress.com/2015/07/minitab-time-series-menu.jpg"><img class="alignnone size-full wp-image-499" src="https://multidimensionalmayhem.files.wordpress.com/2015/07/minitab-time-series-menu.jpg?w=604" alt="Minitab Time Series Menu" /></a></u></strong></p>
<p><strong><u>Figure 3: The Many Options for Nonlinear Regression<br />
<a href="https://multidimensionalmayhem.files.wordpress.com/2015/07/minitab-nonlinear-regression-options.jpg"><img class="alignnone size-full wp-image-500" src="https://multidimensionalmayhem.files.wordpress.com/2015/07/minitab-nonlinear-regression-options.jpg?w=604&h=256" alt="Minitab Nonlinear Regression Options" width="604" height="256" /></a></u></strong></p>
<p><strong><u>Figure 4: An Example of a Fitted Line Plot<br />
<a href="https://multidimensionalmayhem.files.wordpress.com/2015/07/minitab-fitted-line-plot.jpg"><img class="alignnone size-full wp-image-501" src="https://multidimensionalmayhem.files.wordpress.com/2015/07/minitab-fitted-line-plot.jpg?w=604" alt="Minitab Fitted Line Plot" /></a></u></strong></p>
<p><span style="font-size:10pt;color:white;">…………</span>The same principles essentially apply to Minitab’s version of Time Series, which is also accessible through the Stat menu. Figure 2 shows that Minitab obviously provides a lot of functionality that SSDM does not, like Trend Analysis (which includes some useful Seasonal Analysis choices), Decomposition and Winters’ Method. Some of these options return accuracy measures like Mean Absolute Percentage Error (MAPE), Mean Absolute Deviation (MAD) and Mean Squared Deviation (MSD) and other stats that SSDM does not provide. One advantage is that Minitab can calculate Time Series using linear, quadratic, exponential growth and “S-Curve (Pearl-Reed logistic)” models. The gap in functionality is not as wide as with regression, however, given that it is not terribly difficult to implement various types of lags, autocorrelations, differences and smoothing operations with T-SQL windowing functions that scale better. SSDM and Minitab have competing implementations of ARIMA, but I strongly prefer the Microsoft version on the strength of its user interface; the Minitab version is mainly useful for making some of the intermediate stats readily available, like the residuals and Modified Box-Pierce (Ljung-Box) results. Time Series in Minitab is hobbled, however, by the fact that it can only calculate one variable per Time Series, unlike SSDM, which can plot them all. The Minitab Time Series Plot is also bland in comparison to the Microsoft Time Series Viewer. Once again, I would use Minitab’s Time Series only to supplement SSDM with additional stats or for cases where there’s a need for alternative algorithms, like Winters’ Method. SSDM would be my go-to tool for any functionality they implement in common, especially when any serious heavy lifting is called for. For low-level stats like autocorrelation and moving averages, I would bypass Minitab altogether in favor of my homegrown T-SQL and SSRS reports.</p>
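The three accuracy measures mentioned above are simple enough to compute by hand. Here is an illustrative Python sketch of their standard definitions (the blog’s own implementations would be in T-SQL, and the function and variable names here are my own); each compares a series of actual values against the fitted values a time series model produced.

```python
# MAPE, MAD and MSD - the accuracy measures Minitab reports for its
# Time Series tools - computed from actual vs. fitted values.

def accuracy_measures(actual, fitted):
    n = len(actual)
    errors = [a - f for a, f in zip(actual, fitted)]
    mape = 100.0 * sum(abs(e / a) for e, a in zip(errors, actual)) / n
    mad = sum(abs(e) for e in errors) / n        # Mean Absolute Deviation
    msd = sum(e ** 2 for e in errors) / n        # Mean Squared Deviation
    return mape, mad, msd

mape, mad, msd = accuracy_measures([100.0, 110.0, 120.0], [98.0, 112.0, 117.0])
```

MSD squares the errors, so it punishes a single large miss far more than MAD does, while MAPE scales each error by the actual value, which is why it misbehaves when the series passes near zero.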
<p><strong><u>Figure 5: How to Access Minitab’s Clustering Algorithms<br />
<a href="https://multidimensionalmayhem.files.wordpress.com/2015/07/minitab-multivariate-and-clustering.jpg"><img class="alignnone size-full wp-image-502" src="https://multidimensionalmayhem.files.wordpress.com/2015/07/minitab-multivariate-and-clustering.jpg?w=604" alt="Minitab Multivariate and Clustering" /></a></u></strong></p>
<p><span style="font-size:10pt;color:white;">…………</span>One of Minitab’s main strengths is that it meets some use cases tangential to data mining, such as Principal Components Analysis, Maximum Likelihood and other Multivariate items and subitems. SSDM doesn’t do any of that, but it does Clustering and it does it well. Minitab doesn’t implement the subtype I discussed in <a href="http://www.sqlservercentral.com/blogs/multidimensionalmayhem/2013/02/21/a-rickety-stairway-to-sql-server-data-mining-algorithm-8-sequence-clustering/">A Rickety Stairway to SQL Server Data Mining, Algorithm 8: Sequence Clustering</a> or the Expectation Maximization (EM) method mentioned in <a href="http://www.sqlservercentral.com/blogs/multidimensionalmayhem/2013/02/12/a-rickety-stairway-to-sql-server-data-mining-algorithm-7-clustering/">A Rickety Stairway to SQL Server Data Mining, Algorithm 7: Clustering</a>, but both implement the most common flavor, K-Means. There are literally thousands of extant clustering algorithms available in the research literature, each of which is useful for specific use cases, so no single product is going to be capable of implementing them all. Even if the top brass at Microsoft were fully committed to SSDM, they’d never be able to incorporate them all, which means that clustering software doesn’t necessarily compete head-to-head. In this case, Minitab has the advantage in terms of enhancements, such as choices of Linkage Methods like Average, Centroid, Complete, McQuitty, Median, Single and Ward, or distance metrics like Euclidean, Manhattan, Pearson, Squared Euclidean and Squared Pearson. Aside from these options and a couple of related stats, however, SSDM outclasses Minitab. In terms of performance, processing a K-Means mining model on all of the columns of the 5-gigabyte Higgs Boson table took only 1:36:42 on my wheezing old development machine. 
As noted in the earlier discussion on regression, Minitab can’t even load datasets of this size without choking. That’s not surprising, given that it’s intended mainly for statistical analysis on datasets of small or moderate size, not heavy number-crunching on Big Data. In terms of visualization, the SSDM Cluster Viewer is light years ahead of the simple text output and dendrograms available in Minitab. Clustering is an inherently visual task, but the graphics in Figures 6 and 7 simply don’t convey information as concisely and efficiently as the Cluster Viewer, which also has the advantage of being an interactive tool.</p>
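For reference, the K-Means flavor both products implement boils down to two alternating steps: assign each point to its nearest centroid, then move each centroid to the mean of its cluster. This bare-bones Python sketch runs those steps over a single numeric column with squared Euclidean distance; it is illustrative only (names are mine), with none of the linkage or distance options discussed above.

```python
# A bare-bones K-Means pass of the kind both SSDM and Minitab implement,
# here over one column with squared Euclidean distance.

def k_means(points, centroids, iterations=10):
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid's cluster.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: (p - centroids[i]) ** 2)
            clusters[nearest].append(p)
        # Update step: each centroid moves to its cluster's mean
        # (an empty cluster keeps its old centroid).
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# Two well-separated groups of points, seeded with rough starting centroids:
centroids, clusters = k_means([1.0, 1.2, 0.8, 9.0, 9.5, 10.0], [0.0, 5.0])
```

Production implementations differ mainly in how the starting centroids are seeded and when iteration stops, which is where most of the quality and performance differences between tools come from.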
<p><strong><u>Figures 6 and 7: Sample Session Output and Dendrogram for Minitab Clustering<br />
<a href="https://multidimensionalmayhem.files.wordpress.com/2015/07/minitab-clustering-text-output.jpg"><img class="alignnone size-full wp-image-503" src="https://multidimensionalmayhem.files.wordpress.com/2015/07/minitab-clustering-text-output.jpg?w=604&h=546" alt="Minitab Clustering Text Output" width="604" height="546" /></a> <a href="https://multidimensionalmayhem.files.wordpress.com/2015/07/minitab-dendrogram.jpg"><img class="alignnone size-full wp-image-504" src="https://multidimensionalmayhem.files.wordpress.com/2015/07/minitab-dendrogram.jpg?w=604&h=417" alt="Minitab Dendrogram" width="604" height="417" /></a></u></strong></p>
<p><strong><u> Figure 8: The Minitab Graph Menu<br />
<a href="https://multidimensionalmayhem.files.wordpress.com/2015/07/minitab-graphics-menu.jpg"><img class="alignnone size-full wp-image-505" src="https://multidimensionalmayhem.files.wordpress.com/2015/07/minitab-graphics-menu.jpg?w=604" alt="Minitab Graphics Menu" /></a></u></strong></p>
<p><span style="font-size:10pt;color:white;">…………</span>Many of the individual statistical tests, functions, algorithms and Assistant workflows return various plots in separate windows, alongside the data returned in the worksheets and text output in the Session window. Most of these scattered visualizations are collected in the Graph menu depicted above, or can be found in the Graphical Analysis Assistant mentioned in my last post. Other common visualizations like run charts and Pareto charts are available from the Quality Tools menu, while the Control Charts item on the Stat menu provides access to plots for some simple stats like moving averages. The advantages of all of the above can be summed up in one word: convenience. They’re all implemented out-of-the-box, thereby eliminating the need to write your own reports. On the other hand, someone with the skill to code their own SSRS reports will quickly find themselves chafing at the limitations of these canned graphics, which offer less in the way of customization. For example, the Line Plot…command implements a graphic not available out-of-the-box in SQL Server, which allows users to view associations across the variables in a dataset. It quickly becomes cluttered when there are many distinct values, which is an obstacle that SSRS could deal with far more efficiently by programmatically changing the colors, shapes, sizes and other properties of the graphic elements as needed. Users are basically stuck with the format Minitab provides, with some minor customizations available through such means as right-clicking the graphic elements, as in the sample histogram in Figure 9. Sometimes that’s good enough to get the job done; whether or not it suffices for a particular analyst’s needs is in part dictated by the data and problems at hand, and in part is a highly individual choice dependent on their skills.<br />
<span style="font-size:10pt;color:white;">…………</span>The Dotplot is rather ugly and the Stem-and-Leaf is output as text; coding the latter in T-SQL and hooking it up to Reporting Services isn’t terribly difficult, and the results look much better, as I’ve discovered first-hand. Histograms can be returned with many of the statistical functions mentioned in my last blog post, plus many of the mining algorithms mentioned here. As I demonstrated in <a href="https://multidimensionalmayhem.wordpress.com/2015/04/21/outlier-detection-with-sql-server-part-6-1-visual-outlier-detection-with-reporting-services/">Outlier Detection with SQL Server, part 6.1 – Visual Outlier Detection with Reporting Services</a> though, these can be implemented fairly quickly in SSRS with a lot more eye candy and customizability. Probability plots are also returned by many functions and tests, but only for certain distributions, like the Gaussian (i.e. “normal”), lognormal, smallest extreme value, largest extreme value and various takes on the log-logistic, exponential, gamma and Weibull. I will demonstrate how to include some of these in SSRS reports in a future article on goodness-of-fit testing with SQL Server. The concept of empirical distribution functions (EDFs) will also be introduced in articles on the Kolmogorov-Smirnov and Lilliefors Tests in that future series. I like their Matrix Plots, but it’s nothing that can’t be done in SSRS. The scatter, bubble, bar and pie charts are all definitely inferior to their SSRS counterparts, as are the 3D versions of the scatter plot. I prefer SSDM’s Time Series visualizations to Minitab’s, although that’s more of a judgment call. 
I figured that the box and interval plots would have an advantage over SQL Server reports in their ability to overcome the display issues I mentioned in <a href="https://multidimensionalmayhem.wordpress.com/2015/04/29/outlier-detection-with-sql-server-part-6-2-finding-outliers-visually-with-reporting-services-box-plots/">Outlier Detection with SQL Server, part 6.2: Visual Outlier Detection with Box Plots in Reporting Services</a>. Basically, SSRS only allows one resultset from each stored procedure, thereby limiting its ability to display summary statistics alongside individual records without doing client-side calculations – which just isn’t going to happen on cubes, mining models and Big Data-sized relational tables. Unfortunately, I received my first crashes on both, on a practice dataset of just 1,715 records; Minitab started running on one core (no others were in use), with no discernible disk activity and no growth in memory usage; in fact, I had to kill the process after a couple of minutes, given that the memory use wasn’t budging at all. There is apparently no Escape command in Minitab, which is something that really comes in handy in SQL Server for runaway queries. The area, marginal, probability distribution and individual value plots are just really simple special cases of some of these aforementioned plots, so I’ll skip over them. Perhaps the only two Minitab visualizations I’d use for any purpose other than convenience are the interval plots mentioned above, plus the contour and 3D surface plots depicted below. The latter has some cool features, such as wireframe display.</p>
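Since empirical distribution functions will figure prominently in the upcoming goodness-of-fit articles, here is an illustrative Python sketch of an EDF and the Kolmogorov-Smirnov statistic built from it. The function names are mine, and the Uniform(0,1) reference CDF is just a stand-in for whichever distribution is being tested; the series itself would do this in T-SQL.

```python
# An empirical distribution function (EDF) and the Kolmogorov-Smirnov
# statistic D: the largest gap between the EDF and a reference CDF.

def edf(sample):
    ordered = sorted(sample)
    n = len(ordered)
    # At each ordered value, the EDF steps up by 1/n.
    return lambda x: sum(1 for v in ordered if v <= x) / n

def ks_statistic(sample, cdf):
    # Check the gap just before and at each step of the EDF, since the
    # supremum over a step function occurs at one of those two places.
    ordered = sorted(sample)
    n = len(ordered)
    return max(max(abs(i / n - cdf(x)), abs((i - 1) / n - cdf(x)))
               for i, x in enumerate(ordered, start=1))

# A sample bunched near 0.5, tested against the Uniform(0,1) CDF F(x) = x:
d = ks_statistic([0.4, 0.45, 0.5, 0.55, 0.6], lambda x: x)
```

A large D simply says the sample's step function strays far from the hypothesized curve somewhere; the Kolmogorov-Smirnov and Lilliefors tests then ask whether that gap is too large to blame on sampling error.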
<p><strong><u>Figure 9: An Example of a Minitab Histogram<br />
<a href="https://multidimensionalmayhem.files.wordpress.com/2015/07/minitab-histogram-example.jpg"><img class="alignnone size-full wp-image-506" src="https://multidimensionalmayhem.files.wordpress.com/2015/07/minitab-histogram-example.jpg?w=604&h=416" alt="Minitab Histogram Example" width="604" height="416" /></a></u></strong></p>
<p><strong><u>Figures 10 and 11: Examples of Contour and 3D Surface Plots in Minitab<br />
<a href="https://multidimensionalmayhem.files.wordpress.com/2015/07/minitab-contour-plot-example.jpg"><img class="alignnone size-full wp-image-507" src="https://multidimensionalmayhem.files.wordpress.com/2015/07/minitab-contour-plot-example.jpg?w=604&h=418" alt="Minitab Contour Plot Example" width="604" height="418" /></a><br />
<a href="https://multidimensionalmayhem.files.wordpress.com/2015/07/minitab-surface-plot-example.jpg"><img class="alignnone size-full wp-image-508" src="https://multidimensionalmayhem.files.wordpress.com/2015/07/minitab-surface-plot-example.jpg?w=604&h=437" alt="Minitab Surface Plot Example" width="604" height="437" /></a></u></strong></p>
<p><span style="font-size:10pt;color:white;">…………</span>It is good to keep in mind when reading these pseudo-reviews that I’m an amateur posting my initial reactions, not an expert with years of experience in these third-party tools. In the case of Minitab, we’re talking about an expensive professional tool with many nooks and crannies I never got to explore and a lot of functionality I’m not familiar with at all, like the Six Sigma and other engineering-specific tools I mentioned in the last article. I barely scratched the surface of a very big topic. That became crystal clear to me while writing these final paragraphs, when I discovered quite late in the game that more customization was available through that context menu in Figure 9. I’ve undoubtedly short-changed Minitab somewhere along the way, as I’m sure I did with WEKA a few articles ago. These articles are intended solely to provide introductions to these tools to SQL Server users, not expert advice to a general audience. Based on this limited experience, my general verdict is that I’d use Minitab as a go-to tool for functionality that SQL Server doesn’t provide out-of-the-box, like ANOVA, discriminant analysis, hypothesis testing and some of the alternative mining algorithms mentioned in this article. This is especially true when speaking of the helpful workflow Assistants Minitab provides for such tasks, particularly hypothesis testing and the unfamiliar engineering processes.<br />
<span style="font-size:10pt;color:white;">…………</span>The less complex the functionality is, the more I’d lean towards T-SQL solutions, while the more complicated the underlying formulas become, the more I’d lean towards SSDM. Whenever SQL Server competes with Minitab head-on, it wins hands down, except in the area of supplemental stats; if only Microsoft had updated SSDM regularly over the years instead of abandoning the market, it might have been able to extend this advantage over Minitab to additional areas. This advantage is twice as strong whenever performance, tracing, higher precision data types and tweaks like indexing are paramount. In terms of graphical capabilities, Minitab’s edge is in convenience, whereas SSRS definitely offers more power. Because the human mind processes most of its information visually, eye candy cannot be overlooked as a key step in conveying the complex information derived from mining tasks to end users. Perhaps Excel would be a worthy competitor in Minitab’s bread-and-butter, which is performing the kinds of common statistical tests that lie somewhere between the simple aggregates of T-SQL and the sophistication of SSDM algorithms. I’m ignorant of a lot that goes on with Excel, but it seems like more of a general purpose spreadsheet than Minitab, which is a specialized program that just happens to use a spreadsheet interface; it’s no accident that I’ve so far found Minitab easier to use for statistical testing, given that this is its raison d’etre.<br />
<span style="font-size:10pt;color:white;">…………</span>Perhaps there are other statistical packages that would perform the same tasks in a SQL Server environment much better than Minitab; maybe tomorrow I will run into a competitor that performs the same functions at half the price. Until then, however, I will leave Minitab a big space in my toolbox in comparison to WEKA, which in turn outperformed the sloppy Windows versions of DB2 and Oracle, as I discussed in <a href="https://multidimensionalmayhem.wordpress.com/2012/03/31/thank-god-i-chose-sql-server-part-i-the-tribulations-of-a-db2-trial/">Thank God I Chose SQL Server part I: The Tribulations of a DB2 Trial</a> and <a href="https://multidimensionalmayhem.wordpress.com/2012/05/01/thank-god-i-chose-sql-server-part-ii-how-to-improperly-install-oracle-11gr2/">Thank God I Chose SQL Server part II: How to Improperly Install Oracle 11gR2</a>. Data mining is a taxing topic that simply doesn’t leave much time and mental energy left for the hassles of unprofessional interfaces. Usability is one of the many categories I will take into consideration throughout this occasional, open-ended series, along with performance, the quality and availability of algorithms, visualizations, documentation, error-handling and crashes and portability, not to mention security, extensibility, logging and tracing. I have many of Minitab’s competitors in my cross-hairs, including RapidMiner, R, Pentaho, Autobox, Clementine, SAS and Predixion Software, a company founded by SSDM developers Jamie MacLennan and Bogdan Crivat. Which one I will examine next is still up in the air, and I don’t yet know what I’ll find when I finally try them out. My misadventures with DB2 and Oracle taught me not to delve into these topics with preconceived notions, because there are surprises lurking out there in the data mining marketplace – such as the Cinderella story of WEKA, the free tool which beat DB2 and Oracle hands-down in terms of reliability. 
The most pleasant surprise with Minitab was how smoothly the GUI interface worked, making it trivial to perform many advanced statistical tests effortlessly.</p>
<p>[1] Kolmogorov is only mentioned in connection with the Kolmogorov-Smirnov goodness-of-fit test.</p>
<p>[2] I downloaded this last year from the <a href="https://archive.ics.uci.edu/ml/datasets/HIGGS">University of California at Irvine’s Machine Learning Repository</a><u> and converted it to a SQL Server table of about 5 gigs, which now resides in the sham DataMiningProjects database I’ve been using for practice purposes for the last few tutorial series.</u></p>
<p>[3] This example displays data from the same Duchenne muscular dystrophy dataset I’ve been using as practice data for the last several tutorial series, which I downloaded ages ago from <a href="http://biostat.mc.vanderbilt.edu/wiki/Main/DataSets">Vanderbilt University’s Department of Biostatistics</a>.</p>Integrating Other Data Mining Tools with SQL Server, Part 2.1: The Minuscule Hassles of Minitab
http://www.sqlservercentral.com/blogs/multidimensionalmayhem/2015/06/30/integrating-other-data-mining-tools-with-sql-server-part-21-the-minuscule-hassles-of-minitab/
Tue, 30 Jun 2015 18:02:38 UT<p><strong>By Steve Bolton</strong></p>
<p><span style="font-size:10pt;color:white;">…………</span>It may be called Minitab, but SQL Server users can derive maximum benefits from the Windows version of this professional data mining and statistics tool – provided that they use it for tasks that SQL Server doesn’t do natively. This was one of the caveats I also observed when appraising WEKA in the first installments of this occasional series, in which I’ll pass on my misadventures with using various third-party data mining tools to the rest of the SQL Server community. These are intended less as formal reviews than as preliminary answers to the question, “How would these fit in a SQL Server data miner’s toolbox?”<br />
<span style="font-size:10pt;color:white;">…………</span>WEKA occupies a very small place in that toolbox, due to various shortcomings, including an inability to handle datasets that many SQL Server users would consider microscopic. In a recent trial with Minitab 17.1 I encountered many of the same limitations, but at much less serious levels – which really ought to be the case, given that WEKA is a free open source tool and Minitab costs almost $1,500 for a single-user license. I didn’t know what to expect going into the trial, since I had zero experience with it to that point, but I immediately realized how analysts could recoup the costs in a matter of weeks, provided that they encountered some specific use cases often enough. Minitab is useful for a much wider range of scenarios than WEKA, but the same principles apply to both: it is best to use SQL Server for any functionality Microsoft has provided out-of-the-box, but to use these third-party tools when their functionality can’t be coded quickly and economically in T-SQL and .Net languages like Visual Basic. Like most other analysis tools, Minitab only competes with SQL Server Data Mining (SSDM) tangentially; most of its functionality is devoted to statistical analysis, which neither SSDM nor SQL Server Analysis Services (SSAS) directly addresses. If I someday had enough clients with needs for activities like Analysis of Variance (ANOVA), experiment design or dozens of specific statistics that aren’t easily calculable in SQL Server, Minitab would be at the top of my shopping list (with the proviso that I’d also evaluate their competitors, which I have yet to do). I’m not a big Excel user, so I can’t speak at length on whether or not it compares favorably, but I personally found Minitab much easier to work with for statistical tasks like these. Minitab has some nice out-of-the-box visualizations which can be done with more pizzazz in Reporting Services, provided one has the need, skills and time to code them. 
One of Minitab’s shortcomings is that it simply doesn’t have the same “Big Data”-level processing capabilities as SQL Server. This was also the case with WEKA, but Minitab can at least perform calculations on hundreds of thousands of rows rather than a paltry few thousand. It doesn’t provide neural nets, sequence clustering or some of my other favorite SSDM algorithms from the <a href="https://multidimensionalmayhem.wordpress.com/2012/11/15/a-rickety-stairway-to-sql-server-data-mining-part-0-0-an-introduction-to-an-introduction/">A Rickety Stairway to SQL Server Data Mining</a> series, but it does deliver dozens of alternatives for lower-level data mining methods like regression and clustering which SSDM doesn’t provide. If given the opportunity and the need, I’d incorporate this into my workflows for the kind of hypothesis testing routines I spoke of in <a href="https://multidimensionalmayhem.wordpress.com/2014/10/28/outlier-detection-with-sql-server-part-2-1-z-scores/">Outlier Detection with SQL Server</a>, preliminary testing of statistical code, formula validation and certain data mining problems, when one of Minitab’s specialized algorithms is called for.<br />
<span style="font-size:10pt;color:white;">…………</span>One pitfall to watch out for when evaluating Minitab is that there are scammers out there selling counterfeit copies on some popular, above-board online shops. They’re just pricey enough to look like legitimate second-hand software being resold from surplus corporate inventory or the like; a Minitab support specialist politely advised me that resales are violations of the license agreement, so the only way to get a copy is to shell out the $1,500 fee for a single per-user license. Another Minitab rep was kind enough to extend my <a href="http://it.minitab.com/en-us/products/minitab/free-trial.aspx">trial software</a> for another 30 days after I simply ran out of time to collect information for these reviews, but I don’t think that will color the opinions I present here, which were already positive from my first hours of tinkering with it (except, that is, in moments where I considered the hefty price tag). One obvious plus in Minitab’s favor is that the installer worked right off the bat, which wasn’t the case in <a href="https://multidimensionalmayhem.wordpress.com/2012/03/31/thank-god-i-chose-sql-server-part-i-the-tribulations-of-a-db2-trial/">Thank God I Chose SQL Server part I: The Tribulations of a DB2 Trial</a>. In fact, I never did get Oracle’s installer to work at all in <a href="https://multidimensionalmayhem.wordpress.com/2012/05/01/thank-god-i-chose-sql-server-part-ii-how-to-improperly-install-oracle-11gr2/">Thank God I Chose SQL Server part II: How to Improperly Install Oracle 11gR2</a>, thanks to some Java errors that Oracle has chosen not to fix, rather than any novice inexperience on my part. Another plus is that the .chm Help section is crisply written and easy to follow even for non-experts, which is really critical when you’re dealing with advanced statistical topics that can quickly become mind-numbing. I didn’t run into the kinds of issues I did with SSDM and WEKA in terms of insufficient documentation. 
I was also pleasantly surprised to find that Minitab installed more than 350 practice datasets out of the box, far more than any other analytics or database-related product I can recall seeing, although I rarely use samples of this kind.</p>
<p style="text-align:center;"><strong>The Minitab GUI</strong></p>
<p> At first launch, it is immediately obvious that spreadsheets are the centerpiece of the GUI, in conjunction with a text output window that centralizes summary data for all worksheets when algorithms are run on them. That window can quickly become cluttered with data if you’re running a lot of different analyses in one session; I was relieved to discover that this text-only format is supplemented by many non-text visualizations, which I’ll cover in the next article. The user interface is obviously based on Microsoft’s old COM standard, not Windows Presentation Foundation (WPF), but it’s well-done for COM and definitely leaps and bounds ahead of the Java interfaces used in third-party mining suites like Oracle, DB2 and WEKA. Incidentally, Minitab has automation capabilities, but these are exposed through the old COM standard, which is of course a lot more difficult to work with than .Net. Greater emphasis is placed on macros, which involve learning Session Commands that have an Excel-like syntax. Although I was generally pleased with the usability of the interface, there were of course a few issues, especially when I unconsciously expected it to behave like SQL Server Management Studio (SSMS). Sorting is really cumbersome in comparison, for instance. You have to go up to a menu and choose a <strong>Sort… </strong>command, then make sure it’s manually applied to every column if you want the worksheet synchronized; the sorted data then has to be placed into a new worksheet, none of which would fly in a SQL Server environment. Most of the action takes place in dialog windows brought up through menu commands, where end users are expected to select a series of worksheet columns and enter appropriate parameters. One pitfall is that typing constants into the dialog boxes is often a non-starter; most of the time you need to select a worksheet column from a list on the left side, which can be counter-intuitive in some situations. 
A lesser annoyance is that sometimes the columns to the left in the selection boxes are blank until you click inside a textbox, which sometimes makes you wonder if the list is supposed to be greyed out to indicate unavailability. Another issue is that if you forget to change worksheets during calculations, Minitab will just dump rows from the table you’re doing computations on into whatever spreadsheet is topmost; as if to rub salt in our wounds, it’s not sorted either.<br />
<span style="font-size:10pt;color:white;">…………</span>Minitab can import data from many sources, but in this series we’re specifically concerned with integrating it into a SQL Server environment. This is done entirely through ODBC; apparently Minitab also has Dynamic Data Exchange (DDE) capabilities, but I didn’t bother to try to connect through this old Windows 2000-era medium, which I haven’t used since I got my MCSD in Visual Basic 6. From the File Menu, choose <strong>Query ODBC Database…</strong> as shown in Figure 1. If you don’t have a file or machine DSN set up yet, you will have to click New… in the Select Data Source window shown in Figure 2. The graphic after that depicts six windows you’ll have to navigate through to create one, most of which are self-explanatory; you basically select a SQL Server driver type, an existing server name and the type of authentication, plus a few connection options like the default database. Later in the process, you can test the connection in a window I left out for the sake of succinctness. There isn’t much going on here that’s terribly different from what we’re already used to with SQL Server; the only stumbling block I ran into was in the SQL Server Login windows in Figure 4, where I had to leave the Server SPN blank, just as I did in the DSN definition. I’m not up on Service Principal Names (SPNs), so there’s probably a sound reason I’m not aware of for leaving them out in this case.</p>
<p><strong><u> Figure 1: Using the Query Database (ODBC) Menu Command<br />
<a href="https://multidimensionalmayhem.files.wordpress.com/2015/06/menu-command-for-connecting.jpg"><img class="alignnone size-full wp-image-482" src="https://multidimensionalmayhem.files.wordpress.com/2015/06/menu-command-for-connecting.jpg?w=604" alt="Menu Command for Connecting" /></a><br />
</u></strong></p>
<p><strong><u>Figure 2: Selecting a DSN Data Source<br />
<a href="https://multidimensionalmayhem.files.wordpress.com/2015/06/select-data-source.jpg"><img class="alignnone size-full wp-image-483" src="https://multidimensionalmayhem.files.wordpress.com/2015/06/select-data-source.jpg?w=604" alt="Select Data Source" /></a><br />
</u></strong></p>
<p><strong><u>Figure 3: Six Windows Used to Set Up a SQL Server DSN<br />
<a href="https://multidimensionalmayhem.files.wordpress.com/2015/06/minitab-new-data-source-windows-1.jpg"><img class="alignnone wp-image-485" src="https://multidimensionalmayhem.files.wordpress.com/2015/06/minitab-new-data-source-windows-1.jpg?w=358&h=263" alt="Minitab New Data Source Windows (1)" width="358" height="263" /></a><a href="https://multidimensionalmayhem.files.wordpress.com/2015/06/minitab-new-data-source-windows-2.jpg"><img class="alignnone wp-image-486" src="https://multidimensionalmayhem.files.wordpress.com/2015/06/minitab-new-data-source-windows-2.jpg?w=350&h=267" alt="Minitab New Data Source Windows (2)" width="350" height="267" /></a></u></strong></p>
<p><a href="https://multidimensionalmayhem.files.wordpress.com/2015/06/minitab-new-data-source-windows-4.jpg"><img class="alignnone wp-image-487" src="https://multidimensionalmayhem.files.wordpress.com/2015/06/minitab-new-data-source-windows-4.jpg?w=353&h=246" alt="Minitab New Data Source Windows (4)" width="353" height="246" /></a><a href="https://multidimensionalmayhem.files.wordpress.com/2015/06/minitab-new-data-source-windows-5.jpg"><img class="alignnone wp-image-488" src="https://multidimensionalmayhem.files.wordpress.com/2015/06/minitab-new-data-source-windows-5.jpg?w=357&h=242" alt="Minitab New Data Source Windows (5)" width="357" height="242" /></a></p>
<p><a href="https://multidimensionalmayhem.files.wordpress.com/2015/06/minitab-new-data-source-windows-6.jpg"><img class="alignnone wp-image-489" src="https://multidimensionalmayhem.files.wordpress.com/2015/06/minitab-new-data-source-windows-6.jpg?w=344&h=247" alt="Minitab New Data Source Windows (6)" width="344" height="247" /></a><a href="https://multidimensionalmayhem.files.wordpress.com/2015/06/minitab-new-data-source-windows-7.jpg"><img class="alignnone wp-image-490" src="https://multidimensionalmayhem.files.wordpress.com/2015/06/minitab-new-data-source-windows-7.jpg?w=351&h=247" alt="Minitab New Data Source Windows (7)" width="351" height="247" /></a></p>
<p><strong><u>Figure 4: Logging in with SQL Server<br />
<a href="https://multidimensionalmayhem.files.wordpress.com/2015/06/two-sql-server-login-windows-2.jpg"><img class="alignnone size-full wp-image-484" src="https://multidimensionalmayhem.files.wordpress.com/2015/06/two-sql-server-login-windows-2.jpg?w=604" alt="Two SQL Server Login Windows (2)" /></a><br />
</u></strong></p>
<p><span style="font-size:10pt;color:white;">…………</span>One of my primary concerns was that Minitab wouldn’t be able to display as many rows as SSMS, especially after WEKA choked on just 5,500 records in my first two tutorials. Naturally, one of the first things I did was stress-test it using the 11-million-row Higgs Boson dataset I’ve been using for practice data for the last couple of tutorial series, which originally came from the <a href="https://archive.ics.uci.edu/ml/datasets/HIGGS">University of California at Irvine’s Machine Learning Repository</a> and now takes up about 5 gigs in a SQL Server table. SSMS can handle it no problem on my wheezing old development machine, but I didn’t know what to expect, given that Minitab is not designed with “Big Data”-sized relational tables and cubes in mind. I was initially happy with how fast it loaded the first two float columns, which took about a minute, during which mtb.exe ran on one core. Then I discovered that I couldn’t scroll past the 10 millionth row, although the distance to the end of the scrollbar was roughly proportional to the remaining million rows, i.e. about 10 percent. I then discovered the following limits in Minitab’s documentation, which SQL Server users might run into frequently given the size of the datasets we’re accustomed to:</p>
<blockquote><p> “Each worksheet can contain up to 4000 columns, 1000 constants, and 10,000,000 rows. The total number of cells depends on the memory of your computer, up to 150,000,000. This worksheet size limit applies to each worksheet in a Minitab project. For example, you could have two worksheets in your project, each with 150 million cells of data. Minitab does not limit the number of worksheets you can have in a project file. The maximum number of worksheets depends on your computer’s memory.”<a href="#_edn1" name="_ednref1">[1]</a></p></blockquote>
<p><span style="font-size:10pt;color:white;">…………</span>It is often said that SSMS is not intended for displaying data, yet DBAs, developers and others often use it that way anyway; I would regard it as something of a marketing failure on Microsoft’s part not to recognize that and deliberately upgrade the interface, rather than trying to force customers into a preconceived set of use cases. Despite this inattention, SSMS still gets the job of displaying large datasets done much better than Minitab; this may be a shortcoming that I just happen to notice more because I’m used to using SQL Server for data mining purposes, not the more popular use case of serving high transaction volumes. Comparisons of calculation speed and resource usage during heavy load are more appropriate, and in this area Minitab did better than expected. I wouldn’t use it to mine models of the size I used in the Rickety series, let alone terabyte-sized cubes, but it performed better than I expected on datasets of moderate size. Keep in mind, however, that it lacks almost all of the tweaks and options we can apply in SQL Server, like indexing, server memory parameters, dynamic management views (DMVs) to monitor resource usage, and tools like Resource Governor and Profiler – you name it. That’s because SQL Server is designed to meet a different set of problems that only overlap Minitab’s at certain points, mainly in the data mining field.<br />
<span style="font-size:10pt;color:white;">…………</span>Comparisons of stability during mining tasks are also more appropriate, and in this respect Minitab fared better than any of SSDM’s competitors I’ve tried to date. Despite being an open source tool, WEKA turned out to be more stable than DB2 and Oracle, but I’m not surprised that Minitab outclassed them all, given that all three are written in clunky Java ports to Windows. I had some crashes while using certain computationally intensive features, particularly while performing variations of ANOVA. One error on a simple one-way ANOVA and another while using Tukey’s multiple comparison method forced me to quit Minitab. A couple of these were runtime exceptions on Balanced ANOVA and Nested ANOVA tasks that didn’t force termination of the program. I encountered a rash of errors towards the end of the trial, including a plot that seized up Minitab and a freeze that occurred while trying to select from the Regression menu. On one of these occasions, I tried to kill the process in Task Manager, only to discover that I couldn’t close any windows at all in Explorer for a couple of minutes (there was no CPU usage, disk errors or other such culprits in this period, which was definitely triggered by the Minitab error). Perhaps the most troubling problem I encountered towards the end was increasingly lengthy delays in loading worksheets with a couple of hundred columns, but only about 1,500 rows; these were on the order of four or five minutes apiece, which is unacceptable. Overall, however, Minitab performed better and was more stable than any other mining tool I’ve used to date, except SSDM. The two tools are really designed for tangential use cases though, with the former specializing in statistical analysis and lower-level mining algorithms like regression, while SSDM is geared more towards serious number-crunching on large datasets, using higher-level mining methods like neural nets.</p>
<p style="text-align:center;"><strong>Weak Data Types but Unique Functions</strong></p>
<p> That explains why Minitab doesn’t hold a candle to SQL Server in terms of the range of its data types, which may become an issue in large datasets and calculations where high precision makes sense. Worksheets can only hold positive or negative numbers to a maximum of 10<sup>18</sup> in either direction, beyond which the values are tagged as missing and an error is raised.[2] It is possible to store values up to 80 decimal places long in the spreadsheet (scientific notation is not automatically invoked), but they may be treated as text, not numbers. The Fixed Decimal dialog box only allows users to select up to 30 decimal places. Worse still, only 17 digits can be entered to the left or right before truncation begins, whereas SQL Server’s decimal and numeric types can go as high as 38. Our floats can handle values as large as roughly 1.79E+308 – which sounds like overkill, until you start translating common statistical functions for use on mining large datasets and quickly exhaust all of this extra slack. The existing SQL Server data types are actually inadequate for implementing useful data mining algorithms on Big Data-sized models – so where does that leave Minitab, where the permissible ranges are many orders of magnitude smaller? Incidentally, another possible source of frustration with Minitab’s data type handling is its lack of an equivalent to identity columns; the same functionality can only be implemented awkwardly, through such methods as manually setting the same sort options for each column in a worksheet.<br />
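The SQL Server limits mentioned above are easy to confirm from a query window; this is a minimal sketch based on the documented maximums (the variable names are just illustrative):

```sql
-- DECIMAL/NUMERIC top out at 38 digits of precision, while float can
-- represent magnitudes up to roughly 1.79E+308 (with about 15 significant
-- digits of precision).
DECLARE @MaxPrecisionDecimal decimal(38,0) = 99999999999999999999999999999999999999; -- 38 nines
DECLARE @NearMaxFloat float = 1.79E+308;
SELECT @MaxPrecisionDecimal AS MaxPrecisionDecimal, @NearMaxFloat AS NearMaxFloat;
```
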
<span style="font-size:10pt;color:white;">…………</span>At present, I’m trying to acquire the math skills to translate statistical formulas into T-SQL, Visual Basic and Multidimensional Expressions (MDX), which in some cases can be done more efficiently in SQL Server. This DIY approach can take care of some of the use cases in between SQL Server’s and Minitab’s respective spheres of influence, but as the sophistication of the stats begins to surpass a developer’s skill level, the balance increasingly leans towards Minitab. One area where home-baked T-SQL solutions have the advantage is in the mathematical functions and constants that Minitab provides out-of-the-box. It has pretty much the same arithmetic, statistical, logical, trigonometric, logarithmic, text and date/time functions that SQL Server and Common Language Runtime (CLR) languages like Visual Basic and C# do, except that our versions have much higher precision. It is also trivial to use far more precise values of Pi and Euler’s Number in T-SQL than those provided in Minitab. On top of that, it is much easier to use one of these functions inside a set-based routine than it is to type it into a spreadsheet, which opens up a whole world of possibilities in SQL Server that simply cannot be matched in Minitab. There are Excel-like commands to Lag, Rank and Sort data, but they don’t hold a candle to T-SQL windowing functions and plain old ORDER BY statements.<br />
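As a quick illustration of the constants mentioned above, both are one-liners in T-SQL:

```sql
-- PI() is built into T-SQL; Euler's Number has no dedicated function,
-- but EXP(1) returns it at full float precision.
SELECT PI() AS Pi, EXP(1) AS EulersNumber;
```
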
<span style="font-size:10pt;color:white;">…………</span>Minitab provides a few functions that aren’t available out-of-the-box with SQL Server, but even here, the advantage resides with T-SQL solutions. It is trivial to implement stats like the sum of squares and geometric mean in T-SQL, where we have fine-grained control and can leverage our knowledge of all of SQL Server’s internal workings for better performance and encapsulation; a DBA can do things like write queries that do a single index scan and then calculate two similar stats from it afterwards at trivial added cost, but that’s not going to happen in statistical packages like Minitab. This is true even in terms of advanced statistical tests where Minitab’s implementation is probably the better choice; their Kolmogorov-Smirnov Test is certainly better than the crude attempt I’ll post in my next series, but you’re not going to be able to calculate Kuiper’s Test alongside it in a sort of two-for-the-price-of-one deal like I’ll do in that tutorial. In general, it is best to trust Minitab for such advanced tests unless there’s a need for tricks of that kind, but to use T-SQL solutions when they’d be easy to write and validate. Some critical cases in point include Minitab’s Combinations, Permutations and Gamma functions, which are severely restricted by the limitations of their data types. At 170 records, I was only able to get permutations and combinations results when I used a k value no higher than 8, but it only took me a couple of minutes to write T-SQL procedures that leveraged the size of SQL Server’s float data type to top out at 168 k. I was likewise able to write a factorial function that took inputs up to 170, but Minitab’s version only goes up to 19. In the same vein, their gamma function only accepts inputs up to 20. 
These limitations might not cut it for some statistical and data mining applications with high values or record counts; as I’ve found out the hard way over the last couple of tutorial series, some potentially useful algorithms and equations can’t even be implemented at all in many mainstream languages because they require permutations and other measures that are subject to combinatorial explosion. There are still a few Minitab functions I haven’t tried to implement yet, like Incomplete Gamma, Ln gamma, MOD, Partial Product, Partial Sum, Transform Count and Transform Population, in large part because they have narrower use cases I’m not familiar with, but I suspect the same observations hold there.<br />
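A function along the lines of the factorial routine described above takes only a few lines of T-SQL; this is a sketch rather than the code from my series, and the object name is hypothetical:

```sql
-- A minimal factorial sketch that leans on float's ~1.79E+308 ceiling;
-- inputs above 170 overflow, since 171! exceeds the float range.
CREATE FUNCTION dbo.FactorialFloat (@N int)
RETURNS float
AS
BEGIN
    DECLARE @Result float = 1, @Counter int = 2;
    WHILE @Counter <= @N
    BEGIN
        SET @Result = @Result * @Counter;
        SET @Counter = @Counter + 1;
    END
    RETURN @Result;
END
```

Calling `SELECT dbo.FactorialFloat(170);` returns a value on the order of 7.26E+306, just under the float ceiling, which is where the limit of 170 mentioned above comes from.
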
<span style="font-size:10pt;color:white;">…………</span>As the sophistication of the math under the hood increases, the balance shifts to Minitab over T-SQL solutions. For example, all of the probability functions I’ll code in T-SQL for my series Goodness-of-Fit Testing with SQL Server are provided out-of-the-box in Minitab 17, including probability density functions (PDFs), cumulative distribution functions (CDFs), inverse cumulative distribution functions and empirical distribution functions (EDFs) for many more distributions besides the Gaussian normal I was limited to. These include the Normal, Lognormal, 3-parameter lognormal, Gamma, 3-parameter gamma, Exponential, 2-parameter exponential, Smallest extreme value, Weibull, 3-parameter Weibull, Largest extreme value, Logistic, Loglogistic and 3-parameter loglogistic, which are the same ones available for Minitab probability plots. There is something to be said for coding these in T-SQL if you run into situations where higher precision data types, indexing, execution plans and the efficiency of windowing functions can make a difference, but for most use cases, you’re probably better off relying on the proven quality of the Minitab implementation. In fact, Minitab implements many of the same goodness-of-fit tests I’ll be covering in that series, like the Anderson-Darling, Kolmogorov-Smirnov, Ryan-Joiner, Chi-Squared, Poisson and Hosmer-Lemeshow, as well as the Pearson correlation coefficient. You’re probably much better off depending on the proven quality of their versions than taking the risk of coding your own – unless, of course, you have a special need for higher-precision results in Big Data scenarios, of the kind my mistutorial series demonstrated how to implement.</p>
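For readers curious what a home-baked CDF might look like, here is a hedged sketch of the standard normal CDF in T-SQL, built on the well-known Abramowitz &amp; Stegun polynomial approximation of the error function; the object name is illustrative and this is not the implementation from my series:

```sql
-- Standard normal CDF via the Abramowitz & Stegun 7.1.26 approximation
-- of erf, accurate to roughly 1.5E-7 -- good enough for most goodness-of-fit
-- work, though a professional package like Minitab will do better.
CREATE FUNCTION dbo.StandardNormalCDF (@X float)
RETURNS float
AS
BEGIN
    DECLARE @Z float = ABS(@X) / SQRT(2);
    DECLARE @T float = 1 / (1 + 0.3275911 * @Z);
    -- Horner's method on the five-term polynomial, then the exponential damping
    DECLARE @Erf float = 1 - (((((1.061405429 * @T - 1.453152027) * @T
        + 1.421413741) * @T - 0.284496736) * @T + 0.254829592) * @T) * EXP(-@Z * @Z);
    RETURN CASE WHEN @X >= 0 THEN 0.5 * (1 + @Erf) ELSE 0.5 * (1 - @Erf) END;
END
```
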
<p><strong><u>Figure 5: The Stat Menu<br />
<a href="https://multidimensionalmayhem.files.wordpress.com/2015/06/anova-menu.jpg"><img class="alignnone size-full wp-image-480" src="https://multidimensionalmayhem.files.wordpress.com/2015/06/anova-menu.jpg?w=604" alt="ANOVA Menu" /></a><br />
</u></strong><span style="font-size:10pt;color:white;">…………</span>That is doubly true when we’re talking about even more complex calculations, such as ANOVA tests, which are accessible from the Stat menu. Analysis of variance is only tangentially related to data mining per se, but its output can be useful in shedding light on the data from a different direction; to make a long story short, variance is partitioned in order to provide insight into the reasons why the mean values of multiple datasets differ. As depicted in Figure 5, Minitab includes many of the most popular tests, like Balanced, Fully Nested, General and One-Way ANOVA, plus One-Way Analysis of Means and a Test for Equal Variances; I’ve tried to code a couple of these myself and can attest that they’re around that boundary where a professional tool begins to make more sense than DIY T-SQL solutions. Some of the tests on the Nonparametrics submenu, like Friedman, Kruskal-Wallis, Mann-Whitney and the like, are fairly easy to do in T-SQL, as are some of the Equivalence Tests. A couple of routines are available to force data into a Gaussian or “normal” distribution, like the Box-Cox and Johnson Transformations, but I don’t have any experience with using them, let alone coding them in T-SQL. Minitab also has some limited matrix math capabilities available through other menus, but I’m on the fence so far as to whether I’d prefer a T-SQL or .Net solution for these. The Basic Statistics menu features stats that are easy to code or come out-of-the-box in certain SQL Server components, like variance, correlation, covariance and proportions, but it also has more advanced ones like Z and T tests, outlier detection and normality testing functions. There are also some related specifically to the Poisson distribution. 
The Table menu is home to the Chi-Square Test for Association and Cross-Tabulation, each of which isn’t particularly difficult to code in T-SQL either; the time, skills and energy required to program them all yourself begins to mount with each one you develop a need for though, till the point is eventually reached where Minitab (or perhaps one of its competitors) begins to justify its cost.<br />
<span style="font-size:10pt;color:white;">…………</span>Minitab really shines in the area of stats for specific engineering applications, like reports and templates for Six Sigma engineering, plus separate sections in Help explaining in depth how to use the Reliability and Survival Analysis and Quality Process and Improvement functionality on the Stat menu. The documentation for Design of Experiments (DOE) is excellent as well. This functionality is accessible through the DOE item on the Stat menu, which allows you to perform such helpful tasks as gauging how many records are required to meet your chosen confidence levels. Various factorial, mixture, Taguchi and response surface DOE designs are available. I’m not familiar with either DOE or these engineering applications, so I’d definitely use a third-party tool for these purposes instead of coding them in SQL Server or .Net. Some of the individual items on these menus include Distribution Analysis, Warranty Analysis, Accelerated Life Testing, Probit Analysis, Gage Study and Attribute Agreement Analysis, all of which are highly specialized. Most of the meat and potatoes in the program can be found on the Stat menu, but the Assistant menu also provides access to many prefabricated workflows that can really save a lot of time and hassle. In fact, I was able to learn something about the function of Capability Analysis and Measurement Systems Analysis just by looking at the available options on these workflows. The Regression Assistant is directly relevant to data mining, while the workflows for certain other activities like planning and interpreting experiments might prove just as useful. The hypothesis testing workflow in Figure 6 would probably come in handy for statistical tasks that are complementary to data mining.</p>
<p><strong><u>Figure 6: The Hypothesis Testing Assistant<br />
<a href="https://multidimensionalmayhem.files.wordpress.com/2015/06/hypothesis-testing-assistant.jpg"><img class="alignnone size-full wp-image-481" src="https://multidimensionalmayhem.files.wordpress.com/2015/06/hypothesis-testing-assistant.jpg?w=604&h=480" alt="Hypothesis Testing Assistant" width="604" height="480" /></a><br />
</u></strong></p>
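A workflow like the one in Figure 6 ultimately funnels down to a test statistic. As a rough illustration of the &#8220;compare two means&#8221; branch, the sketch below computes the Welch two-sample t statistic, which does not assume equal variances; the sample data is invented and this is only one of the many paths the Assistant covers.

```python
import math

def welch_t(a, b):
    """Welch two-sample t statistic for comparing the means of two
    independent samples without assuming equal variances."""
    def mean_var(xs):
        m = sum(xs) / len(xs)
        # sample variance, with the n - 1 (Bessel) correction
        v = sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
        return m, v

    m1, v1 = mean_var(a)
    m2, v2 = mean_var(b)
    return (m1 - m2) / math.sqrt(v1 / len(a) + v2 / len(b))

# Two hypothetical samples with the same spread, means one unit apart.
t_stat = welch_t([10, 12, 14, 16, 18], [11, 13, 15, 17, 19])
```

The resulting statistic would then be compared against a t distribution with the Welch-Satterthwaite degrees of freedom, which is the bookkeeping the workflow handles for you.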
<p><span style="font-size:10pt;color:white;">…………</span>The Graphical Analysis Assistant also helps centralize access to many of the disparate visualizations scattered throughout the GUI, like probability plots, histogram windows, contour plots, 3D surface plots and the like. Normally, these open up in separate windows when a task from the Stat menu is run. I’ll cover these in the next installment and address the question of whether it is better to buy off-the-shelf functionality like this or to develop your own Reporting Services solutions in-house. All of these visualizations can be coded in SQL Server – with the added benefit that RS reports can be customized, which is not the case with their Minitab counterparts. I’ll also delve into some of the Stat menu items that overlap SSDM’s functionality, like Regression and Time Series. Minitab features a wider range of clustering algorithms than SSDM, which are accessible from the Multivariate item. This item also includes Principal Components Analysis, Factor Analysis, Item Analysis and Discriminant Analysis, none of which I’m familiar enough with to code myself; the inclusion of principal components in data mining workflows, for example, is justified by the fact that they’re useful in selecting the right variables for analysis. I have no clue as to what Minitab’s competitors are capable of yet, but after my experience with it I’d definitely use a third-party tool in this class for tasks like these, plus hypothesis testing, ANOVA and DOE. Some of the highly specific engineering uses are beyond the use cases that SQL Server data miners are likely to encounter, but should the need arise, there they are. As with WEKA, Minitab’s chief benefits in a SQL Server environment are its unique mining algorithms, which I’ll introduce in a few weeks.</p>
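As a rough illustration of why principal components help with variable selection, the sketch below handles the two-variable case only, using the closed-form eigenvalues of a 2&#215;2 covariance matrix to report how much of the total variance the first component captures. The data is invented, and a real implementation would use a linear algebra library rather than this special case; it is emphatically not how Minitab does it.

```python
import math

def pc1_variance_share(xs, ys):
    """Share of total variance captured by the first principal
    component of two variables, via the closed-form eigenvalues
    of the 2x2 sample covariance matrix [[sxx, sxy], [sxy, syy]]."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    # Eigenvalues of a symmetric 2x2 matrix: mean of the diagonal
    # plus/minus the discriminant.
    mean = (sxx + syy) / 2
    disc = math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    lam1, lam2 = mean + disc, mean - disc   # largest first
    return lam1 / (lam1 + lam2)

# Perfectly correlated toy data: one component explains everything,
# so the second variable adds no information worth mining.
share = pc1_variance_share([1, 2, 3, 4], [2, 4, 6, 8])
```

When the share approaches 1.0, as here, the two variables are redundant and one of them can safely be dropped from the mining model, which is precisely the variable-selection service principal components provide.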