**By Steve Bolton**

…………Before closing out this long-delayed segment on stochastic information distances, I wanted to provide another example of how their complex interrelationships can be integrated together, essentially trapping the information we’re seeking in a web of sorts. Each article in this segment has scratched the surface of the ties between these metrics to some extent, but we also ran across a clear example in Information Measurement with SQL Server, Part 2.2: The Rényi Entropy and Its Kin of how they can sometimes be parameterized under a single formula. When that metric’s α-parameter is set to 2, it becomes Collision or Quadratic Entropy (often referred to simply as “the Rényi Entropy”). The formula causes a divide-by-zero error if a value of 1 is plugged in, but in the limit as the parameter approaches 1 it is equivalent to the title measure in Information Measurement with SQL Server, Part 2.1: The Uses and Abuses of Shannon’s Entropy. The equally unreachable limit at the other extreme, as the parameter approaches infinity, is the Min Entropy (a.k.a. Chebyshev Entropy), which acts as a cap on the other end. There is also a corresponding distance metric known as the Rényi Divergence (or α-Divergence), which at an alpha parameter value of 1 is equal to the important metric discussed in Information Measurement with SQL Server, Part 4.1: The Kullback-Leibler Divergence; with some tweaking it can also be used to derive the first measure in Information Measurement with SQL Server, Part 4.7: The Bhattacharyya and Hellinger Distances.[1] Another similar generalization is the f-Divergence, which can also be used to derive the Kullback-Leibler, the Hellinger Distance and the χ2 covered in Information Measurement with SQL Server, Part 4.5: The Squared L2 Family of Distances. It can also be used to derive the Total Information Distance mentioned in the Bhattacharyya-Hellinger article and, with some tweaks, the Rényi Divergence.[2] I’ll forego discussing the Rényi since it can be subsumed under the f-Divergence.
By the same token, I’ll skip over the f-Divergence, since it can likewise be subsumed under today’s topic, the Bregman Divergence – which is advantageous not only because it sits at the top of the hierarchy of stochastic information distances, but because it also has important links to specific probability distributions, thereby giving us more links to other points in the scaffolding of data mining mathematics.
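For readers who want the parameterization spelled out, here are the standard textbook forms of the Rényi Entropy and Rényi Divergence (generic notation from the literature, not anything specific to my sample code). Setting α = 2 yields the Collision Entropy, the α → 1 limits recover Shannon’s Entropy and the KL-Divergence, and the α → ∞ limit of the entropy gives the Min Entropy:

```latex
H_{\alpha}(P) = \frac{1}{1-\alpha} \log \sum_{i=1}^{n} p_i^{\alpha}
\qquad\qquad
D_{\alpha}(P \,\|\, Q) = \frac{1}{\alpha - 1} \log \sum_{i=1}^{n} \frac{p_i^{\alpha}}{q_i^{\,\alpha - 1}}
```

The 1/(1−α) and 1/(α−1) factors are precisely where the divide-by-zero error at α = 1 comes from.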

…………It may sound like the title of a Frederick Forsyth novel, but the Bregman Divergence is really just a useful concept for organizing the various information distances in a way that helps narrow down their use cases, while also helping to make them more modular and interchangeable. It not only incorporates the f-Divergence and the Rényi by extension, but also two of the most important distance metrics I’ve covered in my DIY data mining articles, Squared Euclidean Distance and Mahalanobis Distance. Naturally, it is also inextricably linked to a generalized version of the Kullback–Leibler, which is one of the most ubiquitous metrics in data mining. Since I have yet to cover Itakura-Saito Distance, another important metric that can be derived from the Bregman Divergence, I’ll implement it efficiently in today’s T-SQL sample code. All of these measures qualify as instances of the Bregman Divergence[3] because they all satisfy the same general equation that defines it: take the value of a convex function at a point drawn from the first probability distribution, then subtract the function’s first-order Taylor Expansion around the corresponding point on the second distribution, evaluated at the first point.[4]
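In symbols, that verbal definition corresponds to the standard general formula from the literature, where F is a strictly convex, differentiable “generator” function and the bracketed term is exactly the first-order Taylor Expansion of F around the point q, evaluated at p:

```latex
D_{F}(p, q) \;=\; F(p) \;-\; \bigl[\, F(q) + \langle \nabla F(q),\; p - q \rangle \,\bigr]
```

Convexity guarantees that F always lies above its tangent planes, which is why the result can never be negative.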

…………Barnabás Póczos, a machine learning expert at Carnegie Mellon University, has published a handy table of the various instances of the Bregman Divergence, which also includes members that are even lesser-known than the Itakura-Saito, like the Generalized I-Divergence, Squared Loss and Logistic Loss, none of which I’d heard of before.[5] As Póczos points out, some of these Bregman Divergences are generalizable to various types of matrices: the Squared Euclidean to the squared Frobenius, the KL-Divergence (a.k.a. Relative Entropy) to the von Neumann or Quantum Relative Entropy and the Itakura-Saito Divergence to the Burg or LogDet Divergence.[6] Those uses are too advanced for our purposes, especially since matrix math isn’t always well-suited to T-SQL or other set-based languages, but the linkages of those distances to particular probability distributions are quite helpful. In a publicly available slide show, Inderjit S. Dhillon, the director of the Center for Big Data Analytics at the University of Texas at Austin, illustrates how some of the most well-known probability distributions correspond to specific instances of the Bregman Divergence, such as the Spherical Gaussian to Squared Euclidean Distance, the Multinomial to the Kullback-Leibler Divergence and the Exponential to Itakura-Saito Distance.[7] It is possible to develop Bregman formulas for all of the other members of the Exponential Family of distributions, to which the Gaussian (the “normal distribution,” a.k.a. the bell curve), Multinomial and Exponential also belong.[8] Póczos provides another handy table showing additional links to the Bernoulli, Binomial, regular Gaussian and Gamma, among others.[9]
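Those tables boil down to swapping out the generator function F. As a quick sanity check (sketched in Python for compactness rather than T-SQL, with function names of my own invention), plugging three textbook generators into the general one-dimensional Bregman formula recovers three familiar closed-form distances:

```python
import math

def bregman(F, dF, p, q):
    # One-dimensional Bregman Divergence: F(p) minus the first-order
    # Taylor Expansion of F around q, evaluated at p
    return F(p) - F(q) - dF(q) * (p - q)

p, q = 0.3, 0.7
# Generator F(x) = x^2 yields the Squared Euclidean Distance
se = bregman(lambda x: x * x, lambda x: 2 * x, p, q)
# Generator F(x) = x*log(x) yields the generalized KL-Divergence
kl = bregman(lambda x: x * math.log(x), lambda x: math.log(x) + 1, p, q)
# Generator F(x) = -log(x) yields the Itakura-Saito Distance
isd = bregman(lambda x: -math.log(x), lambda x: -1.0 / x, p, q)
```

Each result matches the corresponding closed form, e.g. (p − q)² for the Squared Euclidean and p/q − log(p/q) − 1 for the Itakura-Saito.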

**The Properties of the Bregman Divergences**

As discussed throughout this segment of the series, mathematicians craft these metrics in order to derive particular mathematical properties. I sometimes use terms like “metric” and “distance” in a loose way, but mathematicians apply them in rigorously defined ways, based on which properties particular formulas possess. Metrics technically exhibit both symmetry and subadditivity, meaning that measuring in reverse from the destination to the source is guaranteed to return the same value, and that the distances between each pair of data points and a third reference point must always sum to greater than or equal to the distance between the first two points (the familiar triangle inequality). Distances are only required to have the first quality, i.e. symmetry. Divergences lack them both, which is usually looked upon as a drawback; one of the reasons the metrics in Information Measurement with SQL Server, Part 4.2: The Jensen–Shannon Divergence and Its Relatives and other more recent innovations were developed was to address this shortcoming of the KL-Divergence.
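Spelled out in symbols, a true metric d must satisfy both of the following properties for all points p, q and r, while a distance need only satisfy the first:

```latex
\text{Symmetry:}\quad d(p, q) = d(q, p)
\qquad\qquad
\text{Subadditivity:}\quad d(p, r) \;\leq\; d(p, q) + d(q, r)
```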

…………Since they’re divergences, none of the instances of the Bregman exhibit subadditivity (i.e. the triangle inequality) or symmetry. On the other hand, the general formula for the Bregman Divergence guarantees certain other properties which can be leveraged for particular use cases, including non-negativity (the values are never negative), linearity (a Bregman Divergence built from a weighted sum of convex generator functions, with non-negative coefficients, is simply the same weighted sum of the individual divergences) and duality. Of more importance is the fact that the Bregman formula guarantees that the mean always “minimizes total squared error,” which gives data miners another reference point to zero in on important information, particularly when constructing Bayesian estimators.[10] It makes it easier to minimize the error by simply using the mean, whereas it might not be so straightforward with other distances. Perhaps the most important property is convexity in the first argument, which causes the formula to curve upwards on a graph and therefore places boundaries on the range of values the results can take. Convexity is widely mentioned in the data mining literature, often alongside the Holy Grail of finding global optimums to particular data mining problems, in a world where it’s easy to get stuck in local minima. Most of the math is still beyond me, but it is clear that convexity can be leveraged to derive global optima, by stringing together a lot of lemmas and theorems that attach it to the wider foundations of calculus. I picture it as wrapping a curve around the information in our databases, trapping it within certain bounds, so to speak. L.M. Bregman’s original paper is highly technical, but clearly points to convexity as the key property to leverage.[11] For example, a cyclical projection of a Bregman Divergence can converge to a global optimum, whereas doing the same with the trusty old K-Means algorithm used by so many data miners is actually NP-Hard.[12] That puts it in the same forbidding category as Alan Turing’s famous halting problem, which is likewise NP-Hard.[13]
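The mean-as-minimizer property is easy to verify numerically. In this toy Python sketch (my own hypothetical data, not the Higgs Boson table), a brute-force grid search never finds a representative point with a lower total Itakura-Saito Divergence than the plain arithmetic mean:

```python
import math

def itakura_saito(x, s):
    # Itakura-Saito Divergence between a data point x and a candidate s
    return x / s - math.log(x / s) - 1

# Toy positive-valued data (hypothetical, for illustration only)
data = [0.5, 1.2, 2.0, 3.3, 0.9]
mean = sum(data) / len(data)

def total_divergence(s):
    # Total divergence from every data point to the representative point s
    return sum(itakura_saito(x, s) for x in data)

# Brute-force search over a fine grid centered on the mean
grid = [mean + (i - 500) / 1000.0 for i in range(1001)]
best = min(grid, key=total_divergence)
```

The same experiment works for any other Bregman Divergence, which is precisely what makes the property so handy.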

…………It is not my purpose here to recap all of the explanations for why it works; as I’ve pointed out before, people writing data mining code don’t necessarily need to develop algorithms, only to implement them for end users, just as motorcycle mechanics don’t need to engineer their own Harleys from scratch. It doesn’t hurt, but it isn’t always necessary, as long as we have some idea of when to apply a particular formula. The really useful thing about the 1-to-1 correspondence between specific probability distributions and instances of the Bregman is that if we can figure out which distribution our data follows – perhaps through some kind of Goodness of Fit Testing with SQL Server – then we can make a good guess as to which specific distance to use. Dhillon provides some more advanced criteria for choosing a particular Bregman Divergence, but it’s a highly technical explanation based on matrix ranks and comparisons to advanced topics like the Frobenius Norm; just be aware that more rigorous, advanced criteria exist in case you need them.[14] A sizeable share of the use cases mentioned by authors like Dhillon[15] involve such tasks as matrix estimation and optimization, but as I’ve touched on in past mistutorials, there’s a point at which matrix math becomes prohibitively difficult to implement in T-SQL and other set-based languages. Transposes and the like are quite easy, but the symbols for determinants and eigenvalues are usually red flags that traditional .Net languages like VB.Net or C# are called for.

**Clustering Algorithms and Specific Probability Distributions**

The leading, most obvious use of any Bregman Divergence would be in clustering algorithms, which can easily be implemented in T-SQL; given the issue I’ve raised before about how modern data mining tools are in some ways 20 or more years behind the research, writing your own DIY mining code may be your only option if you want to apply any of the scores of clustering innovations available in the literature. With the aid of a Bregman Divergence, we can also make a smart choice of distance metrics to plug into them. The popularity of Euclidean Distance in clustering is obvious, but other Bregman formulas are used in various clustering schemes, such as K-nearest neighbors.[16] If you intend to plug your own Bregman into a DIY clustering algorithm, I’d strongly recommend consulting a paper Dhillon co-wrote with Arindam Banerjee and Joydeep Ghosh on “Clustering with Bregman Divergences.”[17] There are several sections full of mathematical proofs and lemmas that programmers can skip over, but also hard reasoning for applying it to clustering of various types – including the popular Expectation Maximization (EM) method that Microsoft leverages in SSDM, which I discussed back in A Rickety Stairway to SQL Server Data Mining, Algorithm 8: Sequence Clustering. Banerjee et al. discuss in depth how the Bregman Divergence can be applied in both “hard” and “soft” clustering scenarios, which are basically differentiated by whether or not overlap is allowed between clusters.[18] These algorithms essentially minimize the loss of Bregman Information, which is the Bregman counterpart of the other types of information surveyed so far in this series, like Shannon, Lautum, Mutual and Shared Information (all of which are best viewed as quantifying “newsworthiness,” not “knowledge”).[19] They also provide empirical studies of Bregman implementations on various datasets[20], which means that their discussion is not based merely on theory and conjecture, and gives us another reference point to start our own clustering efforts from. Their conclusion is simple: “the quality of the clustering depends on the appropriateness of the Bregman divergence.”[21]
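A minimal sketch of that hard clustering scheme, written in Python for brevity (the function names and toy data are my own, not taken from the paper): the assignment step uses whichever Bregman Divergence you choose (Itakura-Saito here), while the centroid update is always the plain arithmetic mean, precisely because the mean minimizes the loss of Bregman Information within each cluster:

```python
import math

def itakura_saito(x, c):
    # Itakura-Saito Divergence between a point x and a centroid c
    return x / c - math.log(x / c) - 1

def bregman_hard_cluster(points, centroids, iterations=20):
    """Hard clustering in the style of Banerjee et al.: any Bregman
    divergence for assignment, the arithmetic mean for the update."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for x in points:
            nearest = min(range(len(centroids)),
                          key=lambda j: itakura_saito(x, centroids[j]))
            clusters[nearest].append(x)
        centroids = [sum(c) / len(c) if c else centroids[j]
                     for j, c in enumerate(clusters)]
    return centroids, clusters

# Toy data: two well-separated groups of positive values
points = [0.10, 0.12, 0.11, 5.0, 5.2, 4.9]
centroids, clusters = bregman_hard_cluster(points, [0.5, 3.0])
```

Swapping in a different divergence function changes only the assignment step, which is exactly the kind of modularity the Bregman framework buys us.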

…………Thanks to the 1-to-1 correspondence between probability distributions from the Exponential Family and instances of the Bregman Divergence, it will often be appropriate to test or anticipate the expected distribution, then simply choose the right distance. Typically that would mean using the Itakura-Saito Distance I’ve coded in Figure 1 in conjunction with the Exponential Distribution, which is an instance within the Exponential Family. It has found specific applications in signal processing[22] and efficient speech coding in particular.[23] In Figure 1 I apply it to the precalculated table of probabilities I built earlier in this segment upon two float columns of the 11-million-row Higgs Boson Dataset, which I downloaded long ago from the University of California at Irvine’s Machine Learning Repository for stress-testing the code in my tutorials. No stress is involved at all in deriving the distance number in Figure 2, since all of the grunt work occurs in deriving these probabilities. Some cautions are in order though, as usual, since I’m an amateur at this, not the least of which is the fact that examples with sample data to work through are few and far between in the cutting-edge data mining literature. I may be mishandling the divide-by-zero check in the inner SELECT; it may have to be set to 0 or to the value of Probability1 instead of 1. Likewise, it may be useful to include an ABS within the LOG operation to avoid Invalid Floating Point Operation errors for similar reasons, since they’re triggered when negative or zero values are plugged into a logarithm. The formula cited by Póczos[24] is rather simple: all we need to do is divide the probability of the first value by that of the second, take that quotient and subtract its logarithm, then subtract 1 and tally ’em up.
That seems to be the standard way of calculating Itakura-Saito, but be aware that the Wikipedia entry also divides the summation result (they use the integration sign instead, but a summation does the job) by two pi.[25]
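Before trusting my own T-SQL on 11 million rows, I found it helpful to mirror the row-by-row logic in a few lines of Python as a cross-check (a hypothetical sketch, with the zero-handling caveats above left intact):

```python
import math

def itakura_saito_total(rows):
    # Mirror of the T-SQL row logic below: skip any row where either
    # probability is zero, then sum p/q - ln(p/q) - 1 over what remains
    total = 0.0
    for p1, p2 in rows:
        if p1 == 0 or p2 == 0:
            continue  # matches the WHERE clause filter
        ratio = p1 / p2
        total += ratio - math.log(ratio) - 1
    return total
```

Identical probability pairs contribute nothing to the sum, which is a handy smoke test: the distance between a distribution and itself should be zero.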

**Figure 1: The Itakura-Saito Flavor of Bregman Divergence in T-SQL**

```sql
; WITH CalculationCTE
(InnerOperationResult)
AS
(
       SELECT DivisionResult - LOG(DivisionResult) - 1 AS InnerOperationResult
       FROM (SELECT CASE WHEN Probability2 = 0 THEN 1
              ELSE Probability1 / CAST(Probability2 AS float) END AS DivisionResult
              FROM Physics.HiggsBosonProbabilityTable
              WHERE Probability1 <> 0 AND Probability2 <> 0) AS T1
)

SELECT SUM(InnerOperationResult) AS ItakuraSaitoDistance
FROM CalculationCTE
```

__Figure 2: Results from the Higgs Boson Probability Data__

…………After all that discussion, the actual code is really quite simple. So is the calculation in Figure 2, which takes a fraction of a second if we precalculate the probabilities. I suppose I could have calculated the probabilities in the HiggsBosonProbabilityTable by ranges instead of specific float values, which means that the Probability1 <> 0 AND Probability2 <> 0 condition strains out almost all of the records, but that would only affect the structure of the table the sample data resides in, not the distance code above. The code’s so short in comparison to the routines posted in my previous articles on information distances that I was tempted to plug the KL-Divergence and Squared Euclidean in for comparison purposes, but they’re calculated much differently, both from each other and from the Itakura-Saito, so it would’ve cluttered the presentation. I organized this segment around Cha Sung-Hyuk’s brilliant article “Comprehensive Survey on Distance/Similarity Measures between Probability Density Functions,” which organizes various stochastic distance measures by the manner in which they’re calculated. This allowed me to succinctly code groups of them together by applying the same operations, usually buried deep in subqueries and common table expressions (CTEs).[26] That basically allows us to calculate whole families of distance metrics together in a 10-for-the-price-of-1 deal, but since the KL-Divergence and Squared Euclidean come from other families, that bargain would be null and void. They’re linked to the Itakura-Saito because they obey the general Bregman formula and the properties it produces, not due to commonalities in their equations that a programmer can easily exploit.

…………The catch with the Bregman, Rényi and f-Divergences is remembering that they’re general formulas, which many widely varied equations can fit into. We could plug in many other equations as long as they obey certain restrictions (some of which I may have omitted in my amateur ignorance), although connecting them to specific probability distributions in a useful way might be difficult without a better mathematical understanding of the properties and their implications. That is a task for mathematicians; ours is to await their next big discoveries and be ready to implement them on SQL Server in an efficient way for DIY data mining purposes. To illustrate just how fast-moving the innovations in this field are, while checking the formula for Itakura-Saito I discovered that it is equivalent to a Beta Divergence – which I’d never heard of before – with a ß-parameter of zero.[27] The KL is returned by a ß value of 1 and the Squared Euclidean by a value of 2, while the ß-Divergence itself is a special case of the Bregman; experimentation is also apparently ongoing with ß-Divergence-based clustering.[28] I suspect that there are probably many other useful parameterizations out in the literature waiting to be discovered that could neatly organize all of these overlapping information measures in a comprehensive way, plus intersect with the sturdy scaffolding of specific entropies, types of information and probability distributions. Ideally, they’d possess helpful properties that divergences do not, such as symmetry and subadditivity; perhaps the information distances of tomorrow will also be able to zero in on global optima (probably leveraging convexity) with greater efficiency. Right now, the extant mining software is decades behind these developments, so to stay one step ahead of the competition, we have to keep one eye open. The knowledge that set-based languages can calculate these measures so efficiently puts us another step ahead of the curve.
In the next segment I’ll address how various information criteria related to regression are also best handled in languages like T-SQL, on a Big Data-capable platform like SQL Server – as well as explain why the versions of them geared towards parameter estimation and Maximum Likelihood Estimation (MLE) are *not* well-suited to our use cases at all. For those purposes, we’re better off turning to tools like R (or better yet, VB.Net or C#).
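Those ß-Divergence relationships are easy to verify numerically. In this hypothetical Python sketch I use the parameterization given by Hennequin et al., under which ß = 2 actually returns half of the Squared Euclidean Distance; the formula is undefined at ß = 0 and ß = 1, so the Itakura-Saito and generalized KL emerge as limits, which we can approach by plugging in nearby values:

```python
import math

def beta_divergence(x, y, b):
    # Beta Divergence for b outside {0, 1}; the b -> 0 and b -> 1 limits
    # recover the Itakura-Saito and generalized KL-Divergence respectively
    return (x ** b + (b - 1) * y ** b - b * x * y ** (b - 1)) / (b * (b - 1))

x, y = 0.4, 0.9
itakura_saito = x / y - math.log(x / y) - 1
generalized_kl = x * math.log(x / y) - x + y
near_zero = beta_divergence(x, y, 1e-6)     # approaches Itakura-Saito
near_one = beta_divergence(x, y, 1 + 1e-6)  # approaches generalized KL
at_two = beta_divergence(x, y, 2)           # half the Squared Euclidean
```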

[1] Technically, setting the value to 1 would cause a divide-by-zero error in the formula, but it can easily be amended with a CASE statement (or similar construct in some other language) to set it equal to the KL-Divergence at this point.

[2] I’ve read about the f-Divergence at other, more professional sources, but for a quick introduction to the topic, see the __Wikipedia__ entry “F-Divergence” at https://en.wikipedia.org/wiki/F-divergence

[3] The famous reference by the Dezas mentions two Bregman “functions,” but I’m having a hard time following their notation, so I’m not sure if they’re equivalent to any of the metrics discussed today. See pp. 171-172, 182-183, Deza, Elena and Deza, Michel Marie, 2006, __Dictionary of Distances__. Elsevier: Amsterdam.

[4] There are plenty of sources for the general formula, but for a quick reference, see the __Wikipedia__ page “Bregman Divergence” at https://en.wikipedia.org/wiki/Bregman_divergence

[5] p. 6, Póczos, Barnabás, 2008, “Bregman Divergences,” RLAI Tea Talk at the University of Alberta at Edmonton on Aug 5, 2008. Available online at the __Carnegie Mellon University School of Computer Science__ web address https://www.cs.cmu.edu/~bapoczos/other_presentations/Bregman_Divergences_08_05_2008.pdf

[6] *IBID*., p. 16.

[7] Dhillon, Inderjit S. 2007, “Learning with Bregman Divergences: Machine Learning and Optimization,” published Jan. 18, 2007 by the Banff International Research Station. Available online at the __University of Texas at Austin__ web address http://www.cs.utexas.edu/users/inderjit/Talks/bregtut.pdf

[8] *IBID*., pp. 25-30.

[9] p. 15, Póczos.

[10] No source I’ve run across yet puts this as succinctly, without an excess of jargon, than the __Wikipedia__ entry “Bregman Divergence” at https://en.wikipedia.org/wiki/Bregman_divergence

[11] Bregman, L.M., 1967, “The Relaxation Method of Finding the Common Point of Convex Sets and Its Application to the Solution of Problems in Convex Programming,” pp. 200–217 in __USSR Computational Mathematics and Mathematical Physics__, Vol. 7, No. 3.

[12] pp. 35-36, Póczos.

[13] See the reply by the user named Jason at the __StackOverflow__ thread “What are the Differences Between NP, NP-Complete and NP-Hard?” on Dec. 7, 2009 at http://stackoverflow.com/questions/1857244/what-are-the-differences-between-np-np-complete-and-np-hard

[14] p. 53, Dhillon.

[15] *IBID*., p. 41.

[16] *IBID*., p. 3.

[17] Banerjee, Arindam and Dhillon, Inderjit S. and Ghosh, Joydeep, 2005, “Clustering with Bregman Divergences,” pp. 1705-1749 in __Journal of Machine Learning Research__, Vol. 6.

[18] *IBID*., p. 1706.

[19] *IBID*., p. 1707.

[20] *IBID*., p. 1736.

[21] *IBID*.

[22] p. 11, Dhillon.

[23] Banerjee, et al., p. 1736.

[24] pp. 5-6, Póczos.

[25] See the __Wikipedia__ entry for “Itakura-Saito Distance” at https://en.wikipedia.org/wiki/Itakura%E2%80%93Saito_distance

[26] Sung-Hyuk, Cha, 2007, “Comprehensive Survey on Distance/Similarity Measures between Probability Density Functions,” pp. 300-307 in __International Journal of Mathematical Models and Methods in Applied Sciences__, Vol. 1, No. 4.

[27] Hennequin, Romain; David, Bertrand and Badeau, Roland, 2011, “Beta-divergence as a subclass of Bregman Divergence,” pp.83-86 in __IEEE Signal Processing Letters__, Vol. 18, No. 2.

[28] See the as-yet unreleased manuscript “Clustering with Beta Divergences” at the __Cornell University Library__ web address http://arxiv.org/pdf/1510.05491.pdf