By Steve Bolton
…………Like many of my other series of amateur mistutorials, the section of my blog devoted to the broad topic of Information Measurement is ultimately meant to investigate whether or not set-based languages can efficiently implement popular Data Mining, Machine Learning and Artificial Intelligence algorithms. As “Big Data” becomes even Bigger, it’s going to become increasingly critical to think about attributes in higher-level representations like sets and to find optimal ways of aggregating over billions of rows. I strongly believe that the industry as a whole is missing opportunities to leverage languages like T-SQL and MDX for the probabilistic distances discussed earlier in this series and even more so when it comes to the topic of my Implementing Fuzzy Sets in SQL Server series. Along the way, however, I’ve encountered some negative examples, most notably in Information Measurement with SQL Server, Part 5.1: A Digression on the Unsuitability of Fisher Information for SQL Use Cases. In that instance, the stumbling blocks included the current lack of matrix support in SQL Server and the equation-solving involved in Maximum Likelihood Estimation (MLE), which is much better-suited to ordinary languages that aren’t set-based, like VB.Net and C#. I found out the hard way that the Information Gain metric used in the ID3 Decision Trees algorithm has its own hurdles and hindrances, which ultimately make it quite difficult to code in set-based languages. It in fact turned into one of the most difficult coding challenges I’ve ever faced, to the point of tempting me to procrastinate on regular updates to my blog – which will be undergoing a facelift soon, as I’ll explain in the last section.
…………Information Gain appears deceptively easy to implement: we merely take the statistic we calculated in Information Measurement with SQL Server, Part 2.1: The Uses and Abuses of Shannon’s Entropy and then subtract the same stats calculated across different slices of a dataset. Voilà, you have instant quantifications of how much more you might learn from one of those subordinate slices than another. This type of Information Gain is intimately related to Mutual Information and another type to the Kullback-Leibler Divergence, both of which were discussed at length earlier in this series. The former is the optimization metric used in the ID3 algorithm devised by data mining researcher John Ross Quinlan in the 1980s, which is still perhaps the most recognizable variant of Decision Trees. The gist of the algorithm is that we assign entropy figures to specific subsets of our data, select the ones with the highest Information Gain at a particular stratum and reject the rest, forming a tree structure that is easily translated into a series of If-Then statements. I’m not going to rehash all of the ins and outs of Decision Trees, a wide-ranging subject that I first delved into back in A Rickety Stairway to SQL Server Data Mining, Algorithm 3: Decision Trees. My purpose in this series is to explain how to derive our own DIY data mining algorithms and metrics right in T-SQL, although in this case it would normally be more efficient to use products like SQL Server Data Mining (SSDM).[ii] Attempting to mash these hierarchical data structures into tables and single-level sets added a level of complexity beyond the drawbacks listed in the next section, one which ultimately proved fatal to the experiment. As the SQL Server community already knows, trees are notoriously difficult to work with in SQL, so the exercise called for nimble dancing across several sets of derived tables with some outrageously complex T-SQL.
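Before diving into the T-SQL, the core calculation can be sketched in a few lines of ordinary procedural code. The Python snippet below is purely illustrative and is not part of the solution presented later; it computes Shannon’s Entropy and Information Gain over the familiar 14-row play-or-not dataset that circulates in the Decision Trees tutorials cited below, with the class column relabeled PlayFootball to match this article.

```python
from collections import Counter
from math import log2

# Columns: Outlook, Temperature, Humidity, Windy, PlayFootball
# (the classic play-tennis tutorial rows, relabeled for this article)
ROWS = [
    ("Sunny",    "Hot",  "High",   False, False),
    ("Sunny",    "Hot",  "High",   True,  False),
    ("Overcast", "Hot",  "High",   False, True),
    ("Rainy",    "Mild", "High",   False, True),
    ("Rainy",    "Cool", "Normal", False, True),
    ("Rainy",    "Cool", "Normal", True,  False),
    ("Overcast", "Cool", "Normal", True,  True),
    ("Sunny",    "Mild", "High",   False, False),
    ("Sunny",    "Cool", "Normal", False, True),
    ("Rainy",    "Mild", "Normal", False, True),
    ("Sunny",    "Mild", "Normal", True,  True),
    ("Overcast", "Mild", "High",   True,  True),
    ("Overcast", "Hot",  "Normal", False, True),
    ("Rainy",    "Mild", "High",   True,  False),
]

def entropy(labels):
    """Shannon's Entropy of a list of class labels, in bits."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attr_index, label_index=-1):
    """Parent entropy minus the weighted entropy of each attribute-value slice."""
    labels = [r[label_index] for r in rows]
    parent = entropy(labels)
    subtrahend = 0.0
    for value in set(r[attr_index] for r in rows):
        subset = [r[label_index] for r in rows if r[attr_index] == value]
        subtrahend += (len(subset) / len(rows)) * entropy(subset)
    return parent - subtrahend

# Outlook should come out on top, at roughly 0.247 bits
for i, name in enumerate(("Outlook", "Temperature", "Humidity", "Windy")):
    print(name, round(information_gain(ROWS, i), 3))
```

ID3 would pick Outlook for the root split here, since it posts the highest gain; the five-levels-of-granularity problem described below arises precisely because a set-based query has to compute all of these candidate subtractions, at every stratum of the tree, in one pass.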
The Information Gain metric is so trivial that it could be dispensed with in a few short paragraphs and a single subtraction operation; the difficulties only arise when we decide to derive it from complicated structures like directed graphs. Set-based languages generally weren’t designed to aggregate across five different levels of granularity, which made the code intensely difficult to write, debug, maintain and port.
Pros and Cons Common to All Implementations of Decision Trees
One weakness of Information Gain is that its simplicity shades into being too simplistic for many advanced data mining, Machine Learning and A.I. scenarios; it sacrifices predictive power in favor of interpretability. Decision Trees is one of the most popular algorithms precisely because the results are easily intelligible to non-specialists, not because of its effectiveness at ensnaring valuable information. The basic concepts behind all of its flavors are simple enough for laymen – who are ultimately the main consumers of most data mining products – to grasp without much fuss, unlike neural nets, which still seem like inscrutable black boxes despite recent advances in neural interpretation. My favorite workaround for this substandard prediction problem is to use more powerful techniques like neural nets and Support Vector Machines (SVMs) for prediction, then fit Decision Trees to replicate their results, thereby leveraging the trees’ interpretability and allowing us to have our cake and eat it too. I have seen Decision Trees algorithms predict well, but to date I have yet to see one exceed a properly selected neural net or SVM in prognostic power.
…………Yet another underappreciated issue with the whole family of Decision Trees algorithms is the assumption that parsimonious trees and shortest paths across them ought to be preferred. As pointed out in my SSDM series, Decision Trees may be particularly useful when we’re dealing with small numbers of attributes, processing speed is at a premium, the data naturally follows a hierarchical structure or easily communicable explanations of an algorithm are needed. On the other hand, there are many well-known issues with ID3 and other Decision Trees implementations, including the fact that they’re highly susceptible to overfitting (i.e. learning a single specific dataset in a way that’s not generalizable to other data, which is often accompanied by performance degradation); getting stuck in local minima (i.e. choosing suboptimal solutions); problems in handling continuous variables as easily as discrete ones; the inability to handle XOR decision points; and their classification as greedy algorithms, i.e. ones that naively choose the best answer at each decision point, possibly at the expense of a better overall solution. Some variants are also biased towards attributes with high numbers of distinct values.[iii] Various refinements have been developed for each of these issues, but the family’s greatest limitation usually goes unaddressed: a naïve dependence on the principle of Occam’s Razor. This form of abductive reasoning is interwoven in the fabric of a preponderance of data mining algorithms and metrics, yet is often wrongly applied, sometimes intentionally by those with an axe to grind; the Razor is sometimes used recklessly to slice both truth and falsehood away, like a fiend in a slasher movie. It does not merely mean selecting the shortest explanation to any question, but the shortest explanation that fits all the facts.
Yet even when the temptation to ignore that all-important caveat is resisted, there is still no guarantee that Occam’s Razor will give the correct answer to any particular question, since it is inherently probabilistic and probabilities are not certainties. More often than not, the shortest explanation that fits as many facts as possible is indeed correct, but this does not always hold true, so basing one’s decisions on the If-Then interpretations that Decision Trees lend themselves to is still fraught with risk. Detective novels offer obvious examples of how convoluted explanations occasionally turn out to be correct; perhaps they’re fictional, but as the saying goes, “truth is sometimes stranger than fiction.” Ultimately, all reasoning based on the results returned by data mining experiments must be connected to epistemology, of which Occam’s Razor is a quite useful but nevertheless imperfect principle.
…………The whole point of using Information Gain in Decision Trees is to compare the reduction in entropy from choosing one classification scheme over another.[iv] It is safe to assume that Shannon’s Entropy will give us “the minimum number of bits of information needed to encode the classification,”[v] but only as long as the additional assumption that we’re dealing with a uniform probability distribution holds true. Even if both of these conditions are met, there is no guarantee that a more complex branch won’t represent a better solution to a particular mining problem than the shortest, simplest one. The smallest possible tree is not necessarily going to give us the most actionable information in all situations, although that assumption will hold for most. Retaining branches with lower Information Gain might potentially be helpful in circumventing this inherent limitation of Occam’s Razor. Decision Trees is one of the most popular mining techniques because it is simple, but all of its weaknesses stem from its naïve assumptions. My preferred method for dealing with this is to first derive solutions using more sophisticated methods like neural nets, whose inner workings are unfortunately more difficult to communicate to end users, then use them as reference points to compare the results from simpler algorithms like Decision Trees that are easier to explain.
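For reference, the two quantities being subtracted can be written out explicitly. This is just the standard textbook formulation, using T for the set of cases, c for the number of classes and a for a candidate splitting attribute:

```latex
H(T) = -\sum_{i=1}^{c} p_i \log_2 p_i
\qquad\qquad
IG(T, a) = H(T) - \sum_{v \in \mathrm{vals}(a)} \frac{\lvert T_v \rvert}{\lvert T \rvert} \, H(T_v)
```

where p_i is the proportion of T belonging to class i and T_v is the subset of T on which attribute a takes the value v. The weighted sum on the right is the subtrahend discussed at length below; dropping the |T_v| / |T| weights is the classic route to the impossible negative Information Gain figures described later in this article.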
Figure 1: DDL for the Published Practice Data
CREATE TABLE Practice.DecisionTreeTable(
ID bigint IDENTITY(1,1) NOT NULL,
PlayFootball bit NOT NULL CONSTRAINT DF_DecisionTreeTable_PlayFootball DEFAULT ((0)),
Outlook nvarchar(50) NULL,
Humidity nvarchar(50) NULL,
Temperature nvarchar(50) NULL,
Windy bit NOT NULL CONSTRAINT DF_DecisionTreeTable_Windy DEFAULT ((0)),
CONSTRAINT PK_DecisionTreeTable PRIMARY KEY CLUSTERED
(ID ASC)
WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]) ON [PRIMARY]
Figure 2: Customary Practice Data Values
…………The novel solution provided here returns the correct tree structures and information gain figures for the sample data I’ve tested it on, which can be downloaded from University of Toronto Prof. Saed Sayad’s website[vi] and plugged into the SQL Server table defined in Figure 1. His sample data is apparently taken from the third chapter of Carnegie Mellon University Prof. Tom M. Mitchell’s 1997 book[vii] on Machine Learning, where judgments are made on whether or not to play tennis based on frequency tables of three columns of categorical data (Outlook, Temperature and Humidity) and a Boolean value for Windy. Sayad took the liberty of changing Mitchell’s PlayTennis column to PlayGolf, so I’ll do the same and alter it to PlayFootball (something my favorite team, the Buffalo Bills, hadn’t done in a generation at the time I put this experiment on the back burner). I considered changing the Temperature figures to something more appropriate for the NFL, like “Blizzard” and “Ice Bowl,” but wasn’t sure if non-fans would get the cultural references. I’m not crazy about cluttering T-SQL code tutorials with a flood of INSERT…VALUES statements, so I’ve substituted a screenshot of what the final table should look like in Figure 2. The same tutorial values are often recycled in the literature on Decision Trees, which makes debugging easier, since others have encountered and documented errors on the same familiar sets of attribute-value pairs. This was a silver lining to a very dark cloud, given that an additional round of debugging is necessary to cram the tree structures into OLTP and OLAP systems that were not designed to handle them. Data analysts already have their hands full addressing the shortcomings listed in the last section, so it is probably wise to calculate Information Gain in other software tools that do not come with this added baggage.
…………The results in Figure 4 match those of the intermediate and final entropy and Information Gain figures included in their tutorials, but as the length of the code in Figure 3 illustrates, arriving at these results in T-SQL is not a simple task. Working with trees is still notoriously awkward in any version of SQL, even after the addition of such helpful affordances as the hierarchyid data type, which is nowhere to be found in my sample code precisely because it can’t directly accommodate the kind of directed paths used in Decision Trees. Apart from the final SELECT and lengthy table variable definitions, we’re only speaking of 55 lines of code, which could obviously be compacted further by removing indentation. The length does not reflect the complexity of the derivation process, which included numerous alternative approaches that were ultimately scrapped, including several predicated on recursive common table expressions (CTEs). In revising my own code for algorithms of all types, I’ve noticed a general progression: CTEs are gradually upgraded to table variables, then to temporary tables, which can in turn be optimized with nonclustered indexes and the like when performance issues arise. Bottlenecks can be expected in the particular case of self-joins on CTEs, where the query optimizer often chooses to execute the exact same CTE queries twice. It also proved much simpler to debug table variables instead of inscrutable chains of CTEs; it would be trivial to change these to temp tables and perform additional fine-tuning, which I have not bothered to do here for reasons of code legibility and length.
…………The 55 lines of code may not reflect the difficulty involved, but still seem excessive for what started out as a misleadingly simple subtraction operation. It stands as yet another cautionary tale, demonstrating that T-SQL can operate on tree structures – but not easily, especially when we have to aggregate across five different granularity levels, in addition to coordinating the five columns of our practice dataset. This required a GROUP BY CUBE in addition to a GROUP BY CASE, plus a welter of complex logical conditions with hard-coded column names. The last limitation is obviously a serious barrier to portability across other datasets; we could write some excessively complex dynamic SQL to compensate for that, but it would only shift the code maintenance nightmare to an outer shell of arcane string manipulation. Note that the WHERE and GROUP BY conditions are not sophisticated enough to handle null values, which are not present in the practice data, so yet another layer of complexity would have to be added to accommodate them in each join and grouping statement. It would be much easier to pull off this overly ambitious T-SQL juggling act if there were a simpler way of visualizing the complex bends across these derived tables; this would be quite useful in many other SQL Server scenarios, so perhaps someday the folks at RedGate (hint, hint) will come up with such a visualization tool (assuming that’s possible, since it might require depicting more than three dimensions). This might ameliorate the debugging and maintenance issues some, but these are only two drawbacks among many. The code below qualifies as a proof-of-concept and ought not to be confused with an optimal solution.
The real question is whether it’s worth optimizing further; perhaps there are certain workloads where a T-SQL approach might show performance benefits over ordinary Decision Trees tools like SSDM, Accord.Net and ML.Net, but even identifying those limited circumstances would take up valuable analysis time. Users could easily add recursive CTEs to walk particular branches of the trees depicted in Figure 4 and translate the bit codes into IF-AND-THEN statement strings. I haven’t bothered to include them, first to keep the focus on the most important aspects of the code and second because I question whether the solution is valuable as anything more than a curiosity – at best the T-SQL equivalent of David Letterman’s Stupid Pet Tricks[viii] or at worst, a deterrent and word to the wise against repurposing set-based languages in a tortuous manner.
Figure 3: Calculating Information Gain on a Tree Structure
DECLARE @TableCount bigint
SELECT @TableCount = COUNT(*)
FROM Practice.DecisionTreeTable
DECLARE @MaxParentLevel bigint
-- STEP #1 - Derive a table of Frequencies using complex GROUP BY logic
DECLARE @InformationGainTable table
(ID bigint IDENTITY (1,1),
ParentID int,
ChildID int,
ParentLevel int,
IsPlayFootballGrouping1 bit,
IsOutlookGrouping1 bit,
IsWindyGrouping1 bit,
IsTemperatureGrouping1 bit,
IsHumidityGrouping1 bit,
IsPlayFootballGrouping2 bit,
IsOutlookGrouping2 bit,
IsWindyGrouping2 bit,
IsTemperatureGrouping2 bit,
IsHumidityGrouping2 bit, -- this *was* a concise solution to the bitmap reversal issue, till I ran into SQL Server's bizarre inability to perform arithmetic on bits
PlayFootballMinuend AS 1 + (CAST(~IsPlayFootballGrouping1 AS smallint) - CAST(~IsPlayFootballGrouping2 AS smallint)),
OutlookMinuend AS 1 + (CAST(~IsOutlookGrouping1 AS smallint) - CAST(~IsOutlookGrouping2 AS smallint)),
WindyMinuend AS 1 + (CAST(~IsWindyGrouping1 AS smallint) - CAST(~IsWindyGrouping2 AS smallint)),
TemperatureMinuend AS 1 + (CAST(~IsTemperatureGrouping1 AS smallint) - CAST(~IsTemperatureGrouping2 AS smallint)),
HumidityMinuend AS 1 + (CAST(~IsHumidityGrouping1 AS smallint) - CAST(~IsHumidityGrouping2 AS smallint)),
MinuendEntropy decimal(38, 15),
SubtrahendEntropy decimal(38, 15),
InformationGain AS MinuendEntropy - SubtrahendEntropy)
DECLARE @ValueFrequencyTable table
(ID bigint IDENTITY (1,1),
PlayFootball bit,
Outlook nvarchar(50),
Temperature nvarchar(50),
Humidity nvarchar(50),
Windy bit,
LocalCount bigint,
Level int,
IsPlayFootballGrouping bit,
IsOutlookGrouping bit,
IsWindyGrouping bit,
IsTemperatureGrouping bit,
IsHumidityGrouping bit,
GroupingID int,
ValueID bigint,
Probability decimal(16, 15) DEFAULT(1)) -- the default is just a convenience for the top-level node
INSERT INTO @ValueFrequencyTable
(PlayFootball, Outlook, Temperature, Humidity, Windy, LocalCount, Level,
IsPlayFootballGrouping, IsOutlookGrouping, IsWindyGrouping, IsTemperatureGrouping, IsHumidityGrouping, GroupingID, ValueID)
SELECT PlayFootball, Outlook, Temperature, Humidity, Windy, LocalCount, Level,
IsPlayFootballGrouping, IsOutlookGrouping, IsWindyGrouping, IsTemperatureGrouping, IsHumidityGrouping, GroupingID,
ROW_NUMBER() OVER (PARTITION BY GroupingID ORDER BY GroupingID) AS RN
FROM( SELECT PlayFootball, Outlook, Temperature, Humidity, Windy, LocalCount, Level,
IsPlayFootballGrouping, IsOutlookGrouping, IsWindyGrouping, IsTemperatureGrouping, IsHumidityGrouping, GroupingID
FROM ( SELECT PlayFootball, Outlook, Temperature, Humidity, Windy, Count(*) AS LocalCount,
GROUPING_ID(PlayFootball) + GROUPING_ID(Outlook) + GROUPING_ID(Windy) + GROUPING_ID(Temperature) + GROUPING_ID(Humidity) AS Level,
GROUPING_ID(PlayFootball) AS IsPlayFootballGrouping,
GROUPING_ID(Outlook) AS IsOutlookGrouping,
GROUPING_ID(Windy) AS IsWindyGrouping,
GROUPING_ID(Temperature) AS IsTemperatureGrouping,
GROUPING_ID(Humidity) AS IsHumidityGrouping, GROUPING_ID(PlayFootball, Outlook, Humidity, Temperature, Windy) AS GroupingID
FROM Practice.DecisionTreeTable
GROUP BY (), CUBE(PlayFootball, Outlook, Humidity, Temperature, Windy)) AS T1) AS T2
ORDER BY Level DESC, GroupingID
UPDATE T0
SET Probability = T0.LocalCount / CAST(@TableCount as float)
FROM @ValueFrequencyTable AS T0
INNER JOIN @ValueFrequencyTable AS T2
ON T0.Level = T2.Level - 1
AND (T2.IsPlayFootballGrouping =1 OR (T2.IsPlayFootballGrouping = 0 AND T2.PlayFootball = T0.PlayFootball))
AND (T2.IsOutlookGrouping =1 OR (T2.IsOutlookGrouping = 0 AND T2.Outlook = T0.Outlook))
AND (T2.IsTemperatureGrouping =1 OR (T2.IsTemperatureGrouping = 0 AND T2.Temperature = T0.Temperature))
AND (T2.IsWindyGrouping =1 OR (T2.IsWindyGrouping = 0 AND T2.Windy = T0.Windy))
AND (T2.IsHumidityGrouping =1 OR (T2.IsHumidityGrouping = 0 AND T2.Humidity = T0.Humidity))
INSERT INTO @InformationGainTable
(ParentID, ChildID, ParentLevel, IsPlayFootballGrouping1, IsOutlookGrouping1, IsWindyGrouping1, IsTemperatureGrouping1, IsHumidityGrouping1,
IsPlayFootballGrouping2, IsOutlookGrouping2, IsWindyGrouping2, IsTemperatureGrouping2, IsHumidityGrouping2, SubtrahendEntropy)
SELECT T1.GroupingID AS ParentID, T2.GroupingID AS ChildID, T1.Level, T1.IsPlayFootballGrouping, T1.IsOutlookGrouping, T1.IsWindyGrouping, T1.IsTemperatureGrouping, T1.IsHumidityGrouping,
T2.IsPlayFootballGrouping, T2.IsOutlookGrouping, T2.IsWindyGrouping, T2.IsTemperatureGrouping, T2.IsHumidityGrouping,
SUM(T1.Probability * CASE WHEN T2.LocalCount = T1.LocalCount THEN 0 ELSE -1 * (T2.LocalCount / CAST(T1.LocalCount AS float)) * Log((T2.LocalCount / CAST(T1.LocalCount AS float)), 2) END) AS Subtrahend
FROM @ValueFrequencyTable AS T1
INNER JOIN @ValueFrequencyTable AS T2
ON T1.Level = T2.Level + 1
WHERE (T1.IsPlayFootballGrouping =1 OR (T1.IsPlayFootballGrouping = 0 AND T1.PlayFootball = T2.PlayFootball))
AND (T1.IsOutlookGrouping =1 OR (T1.IsOutlookGrouping = 0 AND T1.Outlook = T2.Outlook))
AND (T1.IsTemperatureGrouping =1 OR (T1.IsTemperatureGrouping = 0 AND T1.Temperature = T2.Temperature))
AND (T1.IsWindyGrouping =1 OR (T1.IsWindyGrouping = 0 AND T1.Windy = T2.Windy))
AND (T1.IsHumidityGrouping =1 OR (T1.IsHumidityGrouping = 0 AND T1.Humidity = T2.Humidity))
GROUP BY T1.GroupingID, T2.GroupingID, T1.Level,
CASE WHEN T1.IsPlayFootballGrouping <> T2.IsPlayFootballGrouping THEN T1.IsPlayFootballGrouping
WHEN T1.IsOutlookGrouping <> T2.IsOutlookGrouping THEN T1.IsOutlookGrouping
WHEN T1.IsTemperatureGrouping <> T2.IsTemperatureGrouping THEN T1.IsTemperatureGrouping
WHEN T1.IsWindyGrouping <> T2.IsWindyGrouping THEN T1.IsWindyGrouping
WHEN T1.IsHumidityGrouping <> T2.IsHumidityGrouping THEN T1.IsHumidityGrouping END,
T1.IsPlayFootballGrouping, T1.IsOutlookGrouping, T1.IsWindyGrouping, T1.IsTemperatureGrouping, T1.IsHumidityGrouping,
T2.IsPlayFootballGrouping, T2.IsOutlookGrouping, T2.IsWindyGrouping, T2.IsTemperatureGrouping, T2.IsHumidityGrouping
UPDATE T0
SET MinuendEntropy = T2.MinuendEntropy
FROM @InformationGainTable AS T0
INNER JOIN (SELECT GroupingID, Level,IsPlayFootballGrouping, IsOutlookGrouping, IsWindyGrouping, IsTemperatureGrouping, IsHumidityGrouping,
SUM(CASE WHEN Probability = 1 THEN 0 ELSE -1 * Probability * Log(Probability, 2) END) AS MinuendEntropy
FROM @ValueFrequencyTable
GROUP BY GroupingID, Level, IsPlayFootballGrouping, IsOutlookGrouping, IsWindyGrouping, IsTemperatureGrouping, IsHumidityGrouping) AS T2
ON T0.PlayFootballMinuend = T2.IsPlayFootballGrouping
AND T0.OutlookMinuend = T2.IsOutlookGrouping
AND T0.WindyMinuend = T2.IsWindyGrouping
AND T0.TemperatureMinuend = T2.IsTemperatureGrouping
AND T0.HumidityMinuend = T2.IsHumidityGrouping
SELECT @MaxParentLevel = Max(ParentLevel)
FROM @InformationGainTable

SELECT ParentID, ChildID, ParentLevel, PlayFootballMinuend, OutlookMinuend, WindyMinuend, TemperatureMinuend, HumidityMinuend, MinuendEntropy, SubtrahendEntropy, InformationGain
FROM @InformationGainTable
WHERE ParentLevel <> @MaxParentLevel
ORDER BY ParentLevel DESC, MinuendEntropy, ParentID
Figure 4: Results from the Practice Table
…………In the first INSERT statement, we store the permutations of all attribute-value pairs for the five columns in our practice data. In the next step, we set the probabilities of each permutation based on joins to the Level above, which differentiate the number of attributes we’re examining at each stage. For example, at Level 4 we find a row where IsPlayFootballGrouping is the only one of the five columns set to zero, then divide its count by the counts of all the rows at Level 3 where IsPlayFootballGrouping and one of the other columns both equal 0 and whose PlayFootball values also match. In other words, we assign probabilities that tell what the odds are that the Outlook will be Sunny, Rainy or Overcast when PlayFootball is either true or false. When comparing Levels 3 and 2, we would also assign counts and probabilities for more complex permutations, like all possible combinations of the Temperature values (Hot, Cool and Mild) with Humidity (Normal and High) to ascertain the likelihood of IsWindy being true or false. At Level 0, all five bit columns in the @ValueFrequencyTable are 0 and the most fine-grained permutations are assigned values; at the highest Level (5 in this case, one for each column), all bit columns are 1.
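The interplay between GROUPING_ID, GroupingID and Level can be sketched outside of T-SQL. This Python fragment is an illustration of the bookkeeping, not a port of the query: it enumerates the grouping sets that GROUP BY CUBE produces over the five practice columns, where a bit of 1 means a column has been aggregated away at that grouping set.

```python
from itertools import product

COLUMNS = ("PlayFootball", "Outlook", "Humidity", "Temperature", "Windy")

grouping_sets = []
for bits in product((0, 1), repeat=len(COLUMNS)):
    # 1 = the column is aggregated away in this grouping set, which is
    # what GROUPING_ID(col) returns per column in the T-SQL above
    grouping_id = sum(b << (len(COLUMNS) - 1 - i) for i, b in enumerate(bits))
    level = sum(bits)  # the query's Level column: how many columns are grouped out
    grouping_sets.append((grouping_id, level))

print(len(grouping_sets))  # CUBE over five columns yields 32 grouping sets
print(max(level for _, level in grouping_sets))  # Level 5 = the grand total row
```

The 32 grouping sets fan out across Levels 0 through 5, with one grand total at Level 5 and one fully detailed grouping at Level 0, which is why the joins in the routine constantly have to specify which Level a count or probability belongs to.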
A Quagmire of Negative Information Gain and Combinatorial Explosion
The initial INSERT also involves assigning GroupingIDs, which are transformed into ParentID and ChildID identifiers in the @InformationGainTable table variable that hosts our tree structures. To cut down on the number of subqueries, we calculate Shannon’s Entropy and multiply it by the probability of the parent attribute-value set within the same SUM statement that yields the subtrahend. It became necessary to think explicitly about the minuend and subtrahend, to the point of identifying them by their formal names, in order to avoid a consistent pitfall in Decision Trees practice exercises: mismatched levels of granularity that inevitably yield negative Information Gain. Mathematical proofs show that Information Gain cannot possibly be less than 0, but it is nonetheless common to find threads in computing and data science forums where would-be coders encounter negative results, often on the very same practice data. This is always a clear sign that the programmer is aggregating over the wrong counts and by extension, the probabilities and entropies derived from them. Determining precisely where they went wrong can be exceedingly difficult, since we’re dealing with a quite complex set of joins and nested aggregations that become ever-more complicated as more columns and distinct values are added to the practice data.
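The granularity mismatch is easy to reproduce outside of SQL. In the hypothetical Python example below, the properly weighted subtraction behaves as the proofs promise, while dropping the child-size weights (the kind of mistake that mismatched aggregation levels produce) drives the "gain" below zero:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon's Entropy of a list of class labels, in bits."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

# A parent node of 14 cases partitioned into three child slices
parent = ["Y"] * 9 + ["N"] * 5
children = [["Y", "Y", "Y", "N", "N"],
            ["Y", "Y", "Y", "N", "N"],
            ["Y", "Y", "Y", "N"]]

# Correct subtrahend: each child's entropy weighted by its share of the parent
weighted = sum(len(c) / len(parent) * entropy(c) for c in children)
correct_gain = entropy(parent) - weighted

# The classic granularity bug: aggregating the child entropies unweighted
naive_gain = entropy(parent) - sum(entropy(c) for c in children)

print(round(correct_gain, 3))  # small, but non-negative as the proofs require
print(round(naive_gain, 3))    # negative - the telltale sign of miscounting
```

The same telltale negative appears in T-SQL whenever a SUM runs over counts drawn from the wrong Level, which is precisely why the minuend and subtrahend coordinates have to be tracked so carefully in the routine above.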
…………The best solution I found for this misalignment dilemma was to repurpose the Boolean columns returned by the original GROUP BY CUBE to identify each unique pattern as a series of bits. Unfortunately, to calculate the subtrahend (the entropy being subtracted) across two different levels of granularity (identified by the ParentID, ChildID and Level) we need to compare two patterns, thereby doubling the number of bit columns. That is why there are two “Is” columns in the @InformationGainTable for each column in the practice data. The complexity of the solution is underscored by the fact that there simply wasn’t room in Figure 4 to depict the 10 other bit columns that define the coordinates of the Parent and ChildIDs. The intricacy is increased further by the fact that we need a third set of bit columns to identify the Minuends, i.e. the entropies we subtract from, which exist at yet another level of granularity. I left the CAST statements in the @InformationGainTable to underline one of my pet peeves, the inability of SQL Server to handle simple bit arithmetic; I devised a concise solution for identifying these coordinates by reversing the other bit columns with simple ~ NOT operators, but had to encase them in these awkward conversions and additions to work around Microsoft’s artificial limitation on Boolean math. The associated MinuendEntropy is calculated in the second UPDATE, where we have yet another Shannon’s Entropy calculation encased in a SUM – except sans the multiplication with another probability, which was required in the subtrahend calculation.
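The algebra behind those computed Minuend columns is worth spelling out. Since ~ flips a bit, CAST(~g AS smallint) equals 1 - g, so each Minuend column reduces to 1 + (g2 - g1), where g1 and g2 are the parent and child grouping bits. The Python truth table below is an illustration of that arithmetic, not the T-SQL itself:

```python
# Each computed Minuend column in @InformationGainTable has the form
#   1 + (CAST(~g1 AS smallint) - CAST(~g2 AS smallint))
# Because ~ flips a bit, CAST(~g AS smallint) equals 1 - g, so the whole
# expression simplifies to 1 + (g2 - g1). Tabulating all four bit pairs:
table = {}
for g1 in (0, 1):
    for g2 in (0, 1):
        minuend = 1 + ((1 - g1) - (1 - g2))  # the bit-reversal workaround
        assert minuend == 1 + (g2 - g1)      # the simplified form
        table[(g1, g2)] = minuend

print(table)  # {(0, 0): 1, (0, 1): 2, (1, 0): 0, (1, 1): 1}
```

When a parent/child pair agrees on a grouping bit, the minuend coordinate lands back on 1; the parent-to-child transition (g1 = 1, g2 = 0) yields 0; and the impossible direction (g1 = 0, g2 = 1) yields 2, which can never match a bit column, so those combinations are silently excluded by the subsequent join on the Minuend columns.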
…………The actual formula for Information Gain is an entropy subtraction so trivial that there’s no use repeating it here, especially since there are many other useful tutorials like Sayad’s and Mitchell’s that discuss the mathematical derivation in an easily digestible manner. Some of the more common return values seen in their tutorials and other implementations of the same practice data are highlighted in red in Figure 4. The routine returns 75 rows, one for each possible branch, not all of which are depicted; I cannot guarantee that the other results beyond these common toy examples are accurate, but the fact that they’re no longer returning negative Information Gain is encouraging. Subtracting the right quantities is the real challenge here: it is quite easy to aggregate over the wrong levels and unconsciously make apples vs. oranges comparisons, which quickly return invalid results. Correctly matching the attribute-value permutations across so many levels of granularity demands a combinatorial explosion in the number of tracking columns and logical conditions that renders a solution in T-SQL impractical.
Beyond the Impasse: Fresh Datasets, Exciting New Algorithms, New Series and a Blog Makeover
I’m not ruling out the possibility that in certain circumstances there might be performance benefits in calculating Information Gain in T-SQL in narrow use cases on particular datasets, only that it’s unlikely due to the proliferation of coordinate matching statements. Porting any solution to other datasets is likely to be a real code maintenance hassle, especially if an increase in the number of columns exacerbates the combinatorial explosion. Ironically, I have long intended to devote the next segment of the Information Measurement series to a fresh set of statistics that define hard limits on how short computer programs can possibly be, among other uses; such measures as Kolmogorov Complexity, Minimum Description Length and Solomonoff Probability represent an entirely different type of information than we’ve discussed to date. Although precise measurements of MDL and its kin are technically incalculable, I would not be surprised if the minimum code length for a T-SQL solution to Information Gain turned out to be prohibitively high. This fresh set of information metrics is not only fascinating in its own right, but might be much more conducive to set-based solutions than Information Gain was, especially if we can leverage cross joins and the like in search of optimal codes.
…………I have yet to write that code, in part because the topic is more advanced, but also in part because of obstacles I let linger for too long, like the Information Gain conundrum. I learned some hard lessons about letting particularly difficult code sit, because I basically had to restart the debugging process from scratch. In the interregnum when my blog was being updated less frequently, I also scrapped plans for a segment on the Akaike, Bayesian and other Information Criterions. Writing this series helps me to absorb the material faster; these articles are a learning experience, sometimes a hard one in the case of Information Gain. I have little prior expertise with some of the measures I’m curious about, including the information criterions, which are typically used to judge the predictive power of regression and other mining models. I discovered that they are even less practical for set-based programming than any other algorithms I’ve written about to date – even Fisher Information. Not only is MLE involved, as in the Fisher Information Matrix, but the other terms of the information criterions are determined by the number of “parameters,” which is often little more complex than counting the number of variables in a regression equation or other math formula. This is a job for VB.Net or C#, not T-SQL or MDX, same as with likelihood functions.
…………We’ll move on to more promising targets, including supplemental series on Coding Natural Language Processing Metrics in T-SQL, Programming Swarm Optimization Algorithms in SQL Server and perhaps some additional installments of the older Integrating Other Data Mining Tools with SQL Server series to discuss the Accord.Net and ML.Net Machine Learning libraries. I’ll be putting the open-ended Information Measurement series on hold simply for a badly-needed change of pace, despite the fact that I have a keen interest in Kolmogorov Complexity and its relatives. This article at least succeeded in getting past the stale Higgs Boson dataset I’d been using for demonstration purposes for the last several years and I plan to leave it far behind, in favor of fresh challenges like cryptanalytic and engineering optimization datasets. Over the past two years I’ve posted some older code that doesn’t reflect my current level of coding skill, so readers can look forward to better solutions. In the months to come I hope to give this blog a long-overdue makeover, now that I’ve gotten past Information Gain and the Fisher Information Matrix – two of the most aggravating trials I’ve faced since I started programming Basic on an old Compucolor II back in ’81.
[i] See the Wikipedia page “ID3 Algorithm” at http://en.wikipedia.org/wiki/ID3_algorithm for more information.
[ii] See the Wikipedia article “Information Gain in Decision Trees” at https://en.wikipedia.org/wiki/Information_gain_in_decision_trees
[iv] Mitchell, Tom M., 1997, Machine Learning. McGraw-Hill: New York. The third chapter is available online at the Princeton University web address https://www.cs.princeton.edu/courses/archive/spring07/cos424/papers/mitchell-dectrees.pdf
[v] IBID., p. 56.
[vii] pp. 57-58, Mitchell.
[viii] I see that many well-known figures in the SQL Server community have already published “Stupid T-SQL Tricks” on their blogs – maybe I ought to try out for an episode.