Information Measurement with SQL Server, Part 4.1: The Kullback-Leibler Divergence

By Steve Bolton

…………This informal series of tutorials on how to implement a hodgepodge of information metrics in SQL Server is somewhat disorganized by necessity, given that I’m trying to cram in every noteworthy type of measure into one series while simultaneously acquainting myself with the topics along the way. That’s going to require some skipping around, but whenever possible, I try to provide what little structure I can by starting off each segment with the simplest and/or more fundamental measures. The formula underlying the Kullback-Leibler Divergence is not quite as easy as certain intuitive distance measures like Euclidean Distance, which belongs to a group of measures that are more appropriate for calculating the separation between data points in clustering algorithms, all of which I’ll deal with later on in this series if all goes according to plan. Yet neither is the Kullback-Leibler Divergence (KL Divergence) much of a burden in terms of comprehension and calculation, as long as we already have concepts like information entropy under our belts.
…………I kicked off this series with a segment on entropic measures because they are such elementary parts of information theory, whose subject matter often overlaps that of this amateur series. I saved Information Measurement with SQL Server, Part 2.4: Conditional and Joint Entropy for the tail end of that segment because it involved comparisons between two entropies rather than one, which is also the case with the topic of this week’s mistutorial; the good news is that it is actually takes a little less code and even performs better, with faster calculation times and fewer burdens on server resources. The KL Divergence and its relatives serve a different purpose though: instead of measuring the association between two probability distributions as Conditional and Joint Entropy do, they are used to assess how far apart an approximate distribution is from its target. To compute it, we merely need to divide each actual probability value by its approximate counterpart and plug the results into the LOG operation in Shannon’s Entropy, which I covered in a previous article. Various shades of meaning can be attached to the KL (as is always the case with the fundamental information metrics), but they all boil down to measuring some kind of information loss, in terms of how much we can learn from particular data values when we encounter them, i.e. of  “newsworthiness.”. The ultimate goal is often to select better approximations by minimizing the KL Divergence, which is useful in a wide range of applications.

The Cold War Origins of a Foundational Information Metric

This indispensable information measure was the brainchild of Solomon Kullback (1907-1994) and Richard Leibler (1914-2003), two more of those pioneering American cryptanalysts who never received proper credit for breaking the Japanese and Soviet ciphers in World War II and the early Cold War.[1] While working at the National Security Agency (NSA) at a time when it supposedly did not exist (hence the nickname “No Such Agency”[2]), the two code-breakers outlined this simple yet powerful measure in an classic 1951 paper in The Annals of Mathematical Statistics.[3] At the time Claude Shannon’s newborn measure of information entropy hadn’t yet been repurposed for discriminating between two distributions, so their groundbreaking idea opened up a whole new world of use cases for these entropic measures, which were then confined more to coding theory. Its inventors noted in the original paper how the measure can serve as a bridge between Shannon’s Entropy and Fisher Information, an important metric based on variances and covariances that I’ll have to explain in detail later in this series.
…………Both the Fisher Information and KL Divergence are vital to the field of information geometry, but we needn’t look to the cutting edge to find use cases. For example, it has applications in the study of “large deviations” where it acts as the rate function and can be used as an alternative measure of “surprise” in the kind of Bayesian statistics I introduced in the last article, if the approximation is treated as incorrect a priori knowledge.[4] The latter might be of more use to SQL Server DBAs and data miners than its original applications in coding theory, where it was interpreted either as a measure of compression or the count of bits needed to derive the incorrect code embodied in the approximation.[5] It can be adapted for statistical hypothesis testing[6], which as I noted in my Outlier Detection series is an uncommon use case in the SQL Server world; the same principles also make it attractive as a measure of the usefulness of mining models though, which users of SQL Server Data Mining (SSDM) might be able to leverage. Later in this long series I hope to cover the Akaike Information Criterion, which performs the same mining purposes by inferring the equivalent of the KL Divergence through calculations of statistical likelihood.[7] It has even served in recent years as an alternative to the distribution matching tasks I covered in the Goodness-of-Fit Testing with SQL Server series.
…………It has many deep relationships with other forms of entropy, hence the frequency of the nickname “Relative Entropy” in the information theory literature. In some situations it acts as the expected value of the Mutual Information[8] measure we covered a couple of articles ago, which is closely related to the type of information gain we’ll tackle in a future article on tree structures. Since the KL Divergence measures information loss, it makes sense that it would be useful in gauging information gain. Furthermore, the KL Divergence intersects in many complex ways with other measures of information distance; for example, when the main formula discussed in Information Measurement with SQL Server, Part 2.2: The Rényi Entropy and Its Kin is treated as a distance measure, it becomes the Rényi Divergence or a-Divergence, which the KL Divergence serves as a special case of[9]. It is also an F-Divergence derivable from a Bregman Divergence, all of which I’ll explain to the best of my inability later in this segment of the series. Given that it integrates with all of these diverse information measures in myriad ways, it is not surprising that the Kullback-Leibler is probably encountered more often in the data mining and information theory literature than any of these other information distances. These complex interrelationships work to our advantage, however, by enabling us to cast a web of metrics around whatever information we’re trying to capture in our tables and cubes. The KL Divergence is useful in large part because it integrates so well with many of its relatives, which makes it one of the strongest threads in the web.

Coding the KL

Since the KL Divergence can also be derived by subtracting the Shannon’s Entropy for the target distribution from another simple measure known as Cross Entropy[10], I included it in the sample T-SQL in Figure 1 as a check against my calculations. The code is structured like the six T-SQL samples in the segment on entropic measures, so it shouldn’t be too hard to follow despite the length of the text. In fact, I just cut and pasted a good part of it, particularly the derivation of probabilities from actual proportions found for the second float column in the same Higgs Boson dataset I’ve been using for practice material for ages[11] and their storage in an @EntropyTable table variable. The second set of probabilities are derived from a Calculations.NormalDistributionCDFFindProbabilityFunction, which is not depicted here but takes the mean and standard deviation for the column and plugs them into the Cumulative Distribution Function (CDF) for the Gaussian distribution (i.e. the bell curve). Note that I substituted an arbitrary value of 0.61 for the standard deviation in this case, simply because it generated non-zero probabilities for every ActualProportion; most other values did not, even when using the real standard deviation, which is a strong clue that Column2 would fail a normality test based on the Kullback-Leibler Divergence. In fact, the KL isn’t even defined in many cases like this. One of the restrictions on the Kullback-Leibler Divergence is that both distributions must be “absolutely continuous,” i.e. they sum to 1, which signifies that the counterpart of each non-zero probability must also be non-zero; this means that we can use the Value column as a sort of primary key for the @EntropyTable, but it’s actually an irritation that disqualifies many distributions from comparison. The first set of probabilities can never be zero because we’re only storing ones that have observed values in the dataset, but some nil values might sneak into the second set of probabilities if we populated it through a distribution that wasn’t absolutely continuous with the first. The @IsAbsolutelyContinuous variable acts as a check against this and sets the KL results to NULL when there are any non-zero values for the second probability. Both the ActualProportion and GaussianProbability indeed add up to 1 in this case, as can be verified by uncommenting the validation block in the middle.
…………The target probability (the ActualProportion in this case) is divided by the approximation (the GaussianProbability in this sample) and the result is bubbled up to what amounts to the Shannon’s Entropy formula,  featuring the usual LOG, multiplication and summation operations, except with a different input. I also calculate the Cross Entropy and Shannon’s Entropy in the same pass, in order to compute the simple check in the last SELECT against my calculation of the KL Divergence. Figure 2 depicts only slight discrepancies for this check at really small precisions, which could be due to rounding errors alone. The code could use some sprucing up (the CASE checks might not even be necessary) but it gets the point across and can serve as a springboard for anyone who wants to write production code based on it. Note how in Figure 2 the KL Divergence is much higher for Column1 in the top graphic than for Column2 in the bottom one. After using these two columns as practice data for the last few tutorial series, I know that Column2 is somewhat Gaussian and Column1 definitely is not, so I thought they’d make an ideal contrast. Since Column1 has a different distribution, I had to substitute a different fake standard deviation value to simulate the property of absolute continuity. I also had to set an arbitrary lag of Value – 0.1 to cover cases when the first proportion in the @EntropyTable generated a GaussianProbability of zero.

Figure 1: Sample T-SQL Code for the KL Divergence and Cross Entropy
DECLARE @LogarithmBase decimal(38,36)
SET @LogarithmBase = 2 2.7182818284590452353602874713526624977  10

DECLARE @KullbackLeiblerDivergence float, @CrossEntropy float, @ShannonsEntropy float, @IsAbsolutelyContinuous bit
–@JeffreysDivergence float,
–@KDivergence float,
–@TopsoeDistance float,
–@JensenShannonDivergence float,
–@JensenDifference float,

DECLARE @Count bigint, @Mean float, @StDev float

SELECT @Count=Count(*), @Mean = Avg(Column2), @StDev= 0.61 –, @StDev= StDev(Column2)
FROM Physics.HiggsBosonTable
WHERE Column2 IS NOT NULL

DECLARE @EntropyTable table
(Value decimal(33,29),
ActualProportion decimal(38,37),
GaussianProbability decimal(38,37),
SummationInputForEntropy float,
SummationInputForKullbackLeibler float,
SummationInputForCrossEntropy float
)

INSERT INTO @EntropyTable
(Value, ActualProportion, GaussianProbability, SummationInputForEntropy, SummationInputForKullbackLeibler, SummationInputForCrossEntropy)
SELECT Value, ActualProportion, GaussianProbability, ActualProportion *
Log(ActualProportion, @LogarithmBase) AS SummationInputForEntropy,
CASE WHEN GaussianProbability<= 0 THEN 0 ELSE ActualProportion * Log(ActualProportion / GaussianProbability, @LogarithmBase) END AS SummationInputForKullbackLeibler,
CASE WHEN GaussianProbability <= 0 THEN 0 ELSE ActualProportion * Log(GaussianProbability, @LogarithmBase) END AS SummationInputForCrossEntropy
FROM (SELECT Value, ActualProportion, Calculations.NormalDistributionCDFFindProbabilityFunction(ActualValueLag, Value, @Mean, @StDev) AS GaussianProbability
FROM (SELECT Value,  ActualProportion, Lag(Value, 1, Value 0.1) OVER (ORDER BY Value ASC) AS ActualValueLag
FROM (SELECT Value,  ValueCount / CAST(@Count as float) AS ActualProportion
FROM (SELECT Column2 AS Value, Count(*) AS ValueCount
FROM Physics.HiggsBosonTable
GROUP BY Column2) AS T1) AS T2) AS T3) AS T4

— VALIDATION – uncomment to debug
–SELECT SUM(CAST(ActualProportion AS float)), SUM(CAST(GaussianProbability AS float))
–FROM @EntropyTable

if this count is not = 0, then the distributions are not absolutely continuous and the KL Divergence is not defined
SELECT @IsAbsolutelyContinuous = CASE WHEN Count(*) > 0 THEN 0 ELSE 1 END
FROM @EntropyTable
WHERE GaussianProbability = 0

SELECT @ShannonsEntropy = ABS(SUM(SummationInputForEntropy)), @CrossEntropy = ABS(SUM(SummationInputForCrossEntropy)),
@KullbackLeiblerDivergence = SUM(SummationInputForKullbackLeibler)
FROM @EntropyTable

SELECT @IsAbsolutelyContinuous AS IsAbsolutelyContinuous, @ShannonsEntropy AS ShannonsEntropy,
@CrossEntropy AS CrossEntropy,
ABS(@CrossEntropy @ShannonsEntropy) AS CalculationCheck,
ABS(@CrossEntropy @ShannonsEntropy) @KullbackLeiblerDivergence AS DiscrepancyWithCrossEntropyCheck,
CASE WHEN @IsAbsolutelyContinuous = 1 THEN @KullbackLeiblerDivergence ELSE NULL END AS KullbackLeiblerDivergence

Figure 2: Results from the First and Second Float Columns in the Higgs Boson Dataset (click to enlarge)

…………The accuracy of the CrossEntropy check still needs some work, but the performance is stellar. It runs in just 3 seconds on my rusty abacus of a development machine, which hardly qualifies as a modern production server. The execution plans are quite similar to those for in the segment on entropic measures, albeit even simpler. The INSERT accounts for 98 percent of the batch costs, and one Sort operator accounts for 49 percent of its costs, with 45 percent more attributable to the actual Table Insert. That means it probably can’t optimized much further – unless of course you can have precalculated probabilities that can be read directly from a table, in which case most of this code and the associated costs in the INSERT can be sliced in half.  It does soak up more RAM and TempDB space than I’d like, but this is only a problem when calculating all of these probabilities on the fly on a little desktop machine.
…………You’ve probably noticed already how I commented out five other measures at the beginning of Figure 1. Since they’re all members of the same family of entropic distance measures I’ll be tackling them in the next article, which will involve merely pasting the same code and altering some of the math operations in the INSERT. None of them are particularly difficult to translate into T-SQL, so I’ll dispose of them in one fell swoop. This segment will basically consist of diving into a maelstrom of distance measures, many of which are easy to code and also have minimal performance costs; many of them are used in clustering algorithms, which may be the hidden reason why the SSDM functionality I covered in A Rickety Stairway to SQL Server Data Mining, Algorithm 7: Clustering ran so efficiently compared to other built-in algorithms. The overarching reason why theoreticians have develop so many of these distances metrics is that they exhibit a welter of different properties, some of which are more conducive to certain data mining and statistical tasks than others. For example, divergences don’t have to be symmetric, in the sense that measuring in reverse from the destination to the source is not guaranteed to return the same values as the original formula. Nor are they subadditive, in that the distances between each set of data points and a third reference point must always sum to greater than or equal to the distance between the two distributions. Technically speaking, “distances” exhibit symmetry but are not required to be subadditive (i.e. they do not have to obey the “triangle inequality”). Throughout this series I’ve been using the term “metric” in a casual way, but in mathematical parlance it also has the stricter meaning of a distance measure that exhibits both symmetry and subadditivity. The KL does neither, hence the need for some of the compensatory tweaks embodied in the divergences we’ll discuss next time around; the KL is still remarkably useful without these desirable properties though[12], to the extent that it is still more commonly used than any of these variants, the most well-known of which may be the Jensen-Shannon Divergence. After that, we’ll tackle divergences, distances and metrics that aren’t limited to probabilistic inputs. In the meantime, I highly recommend Cha Sung-Hyuk’s well-organized and highly readable taxonomy of distance measures, which explains better than I can how they fit together.[13] There’s such an abundance of data mining measures available in the mathematical literature that I doubt anybody will be able to corner them all in the analytics marketplace. By the end of the series, however, I hope to demonstrate how several dozen of the leading metrics can be “weaponized” in T-SQL form and put into action in cornering the uncertainty in our SQL Server tables and cubes from every possible direction. The most numerous section of our arsenal will consist of these divergences and distances, which will allow us to perform DIY data mining with greater skill than before.

[1] See the Wikipedia pages “Solomon Kullback” and “Richard Leibler” at http://en.wikipedia.org/wiki/Solomon_Kullback and http://en.wikipedia.org/wiki/Richard_Leibler respectively.

[2] Which was still in use long after the Church Committee publicly exposed its illegal domestic surveillance operations in the mid-‘70s.

[3] Kullback, Solomon and Leibler, Richard, 1951, “On Information and Sufficiency,” pp. 79-86 in The Annals of Mathematical Statistics, March 1951. Vol. 22, No. 1. Available online at the West Virginia University Lane Department of Computer Science and Electrical Engineering web address http://www.csee.wvu.edu/~xinl/library/papers/math/statistics/Kullback_Leibler_1951.pdf KL

[8] See the Wikipedia page “Information Gain in Decision Trees” at http://en.wikipedia.org/wiki/Information_gain_in_decision_trees

[9] See the Wikipedia article “Rényi Entropy” at http://en.wikipedia.org/wiki/R%C3%A9nyi_entropy