Information Measurement with SQL Server Part 4.7: The Bhattacharyya and Hellinger Distances

,

By Steve Bolton

…………Over the course of this segment of the series, we’ve slowly assembled a classification system for some distances measures that can be used for DIY data mining scenarios, particularly in clustering. The most succinct way I can put it is that the underlying equations determine the mathematical properties of the results, which in turn dictate which use cases they’ll be ideal for. Sometimes these formulas (which I rarely post, for the sake of brevity and simplicity) can be subtracted from 1 to switch back and forth between similarity and dissimilarity measures, making them interchangeable in most of cases. Many of these measures have a confusing array of nicknames, variant formulas and related coefficients and indexes, of which I can only mention a handful; although we can plug other inputs like sets, vectors and raw values into many of these alternate versions, I’m sticking mainly to a well-organized, highly readable taxonomy published by Prof. Cha Sung-Hyuk, who deals mainly with distances between probability distributions.[1] In today’s article I’ll introduce what he terms the “Fidelity family or Squared-chord family,” whose most widely used members are the Bhattacharyya and Hellinger Distances. The properties of this family arise from commonalities in the underlying formulas I’ve omitted for the sake of simplicity, that boil down to the usage of sums of square roots. Some of these properties in turn make some of these measures useful for such disparate applications as feature selection, correspondence analysis and even detecting human emotions in photographs. Churning out the numbers is actually quite easy in SQL Server, even on large datasets, whereas discerning which use cases are appropriate for particular distances is not as straightforward.
…………As mentioned in the last article (back in July; my plan for monthly blog updates was interrupted again by another string of strange misadventures), the Fidelity measure depends upon the geometric mean; this necessitates multiplying the two values together and taking the square root (or cube root etc. when more than two are involved[2]) and is apparently used often when fractional values are involved.[3] Squared-Chord Distance has two interchangeable forms, one of which is also based on the root of the product, the other of which begins by taking the difference between the roots of both distributions[4]. As we’ve seen with some of the other families surveyed recently, these simple measures are mainly useful as building blocks. I found one hit on CrossValidated, StackExchange’s data mining forum, for Squared Chord Distance and none that seemed relevant for the keyword Fidelity. In comparison, I found 48 results for Hellinger Distance, 28 for Bhattacharyya Distance and 7 for Matusita, which sometimes goes by the name Jeffries-Matusita. Like most other distance measures the Matusita is used in clustering, but seems to be a favorite when spectral analysis of crops and forest fire damage is the problem at hand. For our purposes, it may be helpful to know that it “tends to suppress high separability values, whilst overemphasizing low separability values” when used for feature selection in neural nets.[5] The paper that quote comes from includes a handy chart comparing the classification capabilities of Matusita Distance to five other such measures, including Bhattacharyya and the plain “Divergence” measure introduced a couple of articles ago. One of the limitations is that it is specifically geared towards probability distributions, not other types of inputs like vectors of raw values. The good news is that unlike the Bhattacharyya, the Hellinger Distance exhibits subadditivity, a desirable property guaranteeing that the distances between any two points will be less than or equal to the third (a.k.a., the triangle inequality). Another desirable property is that it is not dependent on the units used.[6] Hellinger Distance also intersects with Total Variation Distance[7], a measure I haven’t yet covered that in turn traverses the Kullback-Leibler Divergence we discussed awhile back, thereby allowing us to connect them all together to “trap” the information in our databases. The bad news is that it might also be hobbled by absolute continuity, i.e. the requirement that every non-zero probability for one distribution also be non-zero in the other. As we saw awhile back in Information Measurement with SQL Server, Part 4.1: The Kullback-Leibler Divergence, this can disqualify a lot of distributions from comparison; thanks to my limited grasp of the math, however, I cannot say for certain that it is required.

 Matching Use Cases to Distances in the Fidelity/Squared Chord Family

              The Hellinger Distance has an interchangeable alternate formula based on the Fidelity measure instead of the Squared-Chord, which in turns links it to the Bhattacharyya Distance. Fortunately, the latter is definitely not haunted by the specter of absolute continuity[8], although it lacks both boundedness and subadditivity. Its inventor, West Bengalese statistician Anil Kumar Bhattacharyya (1915-1996), was a colleague of P. C. Mahalanobis,[9] so it’s not surprising that the famous measures named after both men share much in common. The best clue I can find on how to pronounce his striking name can be found here. Aside from typical uses for distances in clustering and pattern recognition, Bhattacharyya Distance is one of the most prominent measures in feature selection[10]. As I touched on in A Rickety Stairway to SQL Server Data Mining, Part 0.1: Data In, Data Out, SSDM uses alternate methods to perform the same task, which essentially amounts to straining irrelevant column-value pairs out of mining models. This may be why it is also useful in signal optimization[11], a topic that overlaps coding theory and in turn, the entropic information measures covered earlier in this series. Like many other popular information metrics, it is widely used and therefore has accumulated some diverse and sometimes exotic implementations, including the clustering of cell phones[12], security concerns with stenography[13] and discriminating between the emotions evident in speech content.[14] I have also seen it referred to in texts on quantum mechanics and know it has some niche uses in Time Series.[15] In contrast, the applications for the Hellinger seem to be a bit more diffuse, ranging from correspondence analysis to Decision Trees to clustering. As with other popular measures, confusion can arise because of its many variants: the Bhattacharyya Distance has an associated coefficient by the same name and an alternate formula for statistical measures.[16] The distance is also related to an inequality that gives rise to the Bhattacharyya Bound, which is apparently related to the Cramér–Rao Bound[17], a vital measure in Maximum Likelihood Estimation (MLE) and Fisher Information, two advanced topics I dismissed in Information Measurement with SQL Server, Part 5.1: A Digression on the Unsuitability of Fisher Information for SQL Use Cases. It simultaneously provides a bound on Bayesian misclassification.[18] The measure I covered way back in Outlier Detection with SQL Server, part 8: A T-SQL Hack for Mahalanobis Distance is a special case in which the standard deviations of the data being compared are equal.[19] The multiplicity of such relationships may be a bit dizzying, but each one actually works to our advantage by allowing us to stitch together a net of related information metrics, many of which can be calibrated against each other.
…………The important thing is to ascertain the principles of why it meets these use cases, so that we can decide if it would match some of our own; as mentioned in previous articles, this is often more challenging than writing or running the trivial code. The most succinct explanation of why it exhibits these properties I’ve seen to date is a comment by Ryerson University Prof. Konstantinos G. Derpanis, to the effect that it “has a simple geometric interpretation as the cosine of the angle between the N-dimensional vectors.” [20]  For the uninitiated (like myself), suffice it to say that it induces a space that is measured across angles rather than the straight lines we’re used to with ordinary Euclidean Distance. This was also the case with the Cosine Similarity measure discussed in the last article. To understand how such higher-level structures can be useful, think back to the older topic of Manhattan Distance, which quantifies a problem many of us encounter every day: measuring across city blocks that we simply can’t cut across to take the shortest distance between two points. Translating these into usage criteria on particular SQL Server datasets would require a more explicit understanding of this angularity, but at least we have a starting point. One of the drawbacks of the Bhattacharyya is that it does not qualify as a metric, although workarounds have recently been developed that would add the missing properties.[21] My long-standing contention that the analytics marketplace is in some respects decades behind the research is justified by the fact that established measures like Bhattacharyya Distance and the Kullback-Leibler Divergence are used far more often to this day than the adjustments available for them, few of which have percolated down into general usage. The sample code in Figure 1 and the rest of the series is aimed mainly at demonstrating how we can code these things ourselves in SQL Server, without having to wait years for commercial software packages that will certainly cost us an arm and a leg.

Figure 1: Sample SQL Server code for the Squared-Chord Family of Distances
DECLARE @FidelityDistance float,
@BhattacharyyaDistance float,
@HellingerDistance float,
@MatusitaDistance float,
@SquaredChordDistance float

DECLARE @Count bigint

SELECT @Count=Count(*)
FROM Physics.HiggsBosonTable
WHERE Column1 IS NOT NULL AND Column2 IS NOT NULL
 

; WITH CrossProductCTE
(StandardizedValue, SquareRootOfProduct,DifferenceOfSquareRoots, SquareOfDifferenceOfSquareRoots)
AS (SELECT StandardizedValue, SquareRootOfProduct, CAST(DifferenceOfSquareRoots AS float),
Power(DifferenceOfSquareRoots, 2) AS SquareOfDifferenceOfSquareRoots
       FROM   (SELECT StandardizedValue, SquareRootOfProduct, DifferenceOfSquareRoots
             FROM (SELECT StandardizedValue, CASE WHEN Probability1 = 0 OR Probability2 = 0 OR Sign(Probability1 * Probability2) = 1 THEN 0 ELSE Power(Probability1 * Probability2, 0.5) END AS SquareRootOfProduct,
             CASE WHEN Probability1 <= 0 AND Probability2 > 0 THEN 0 Power(Probability2, 0.5)
             WHEN Probability1 > 0 AND Probability2 <= 0 THEN Power(Probability1, 0.5)
             WHEN Probability1 <= 0 AND Probability2 <= 0 THEN 0
             ElSE Power(Probability1, 0.5) Power(Probability2, 0.5) END AS DifferenceOfSquareRoots
             FROM Physics.HiggsBosonProbabilityTable) AS T1) AS T2
)

SELECT @FidelityDistance = SumOfSquareRootOfProduct,
@BhattacharyyaDistance = 1 * Log(SumOfSquareRootOfProduct),
@HellingerDistance= Power(2 * SumofSquareOfDifferenceOfSquareRoots, 0.5), — CASE WHEN 1 – SumOfSquareRootOfProduct <= 0 THEN 0 ELSE 2 * Power(1 – SumOfSquareRootOfProduct, 0.5) END,
@MatusitaDistance = Power(SumofSquareOfDifferenceOfSquareRoots, 0.5), –CASE WHEN 2 * (2 – SumOfSquareRootOfProduct) <= 0 THEN 0 ELSE Power(2 * (2 – SumOfSquareRootOfProduct), 0.5) END,
@SquaredChordDistance = 2 * SumDifferenceOfSquareRoots 1
FROM (SELECT SUM(SquareRootOfProduct) AS SumOfSquareRootOfProduct, SUM(DifferenceOfSquareRoots) AS SumDifferenceOfSquareRoots, SUM(SquareOfDifferenceOfSquareRoots) AS SumofSquareOfDifferenceOfSquareRoots
       FROM CrossProductCTE) AS T1

SELECT @FidelityDistance AS FidelityDistance, @BhattacharyyaDistance AS BhattacharyyaDistance, @HellingerDistance AS HellingerDistance, @MatusitaDistance AS MatusitaDistance, @SquaredChordDistance AS SquaredChordDistance,  1 @SquaredChordDistance AS SquaredChordSimilarity

Figure 2: Results for the First Two Float Columns in the Higgs Boson Dataset (click to enlarge)

…………The sample T-SQL in Figure 1 is structured almost exactly the same as the code for the last few articles on distance measures, except for a few changes to the math operations. Once of the few additions is a CAST to the float data type, which is necessary on the DifferenceOfSquareRoots to prevent an odd problem where we couldn’t SUM > 10 without an arithmetic overflow (probably because the underlying data type is a decimal). Once again, the CrossProductCTE retrieves probability values from the Physics.HiggsBosonProbabilityTable, which were precalculated from the same Higgs Boson Dataset I’ve been using to stress-test my code for several tutorial series now[22]. The code I used to precalculate and standardize these values is available in the End Notes of Information Measurement with SQL Server Part 4.4: Sørensen Distance and Its Relatives, which surprisingly runs 20 seconds faster than simpler calculations on the plain raw values in Information Measurement with SQL Server, Part 4.3: Euclidean Distance and the Minkowski Family. To that measly 3 seconds we can tack on the mere 23 milliseconds it takes Figure 1 to execute. Even that can be chopped down to 3 milliseconds on a third or fourth use, perhaps because SQL Server is caching the @Count. The sample code for the last few articles had identical executions times and execution plans, in which the retrieval of the @Count strangely accounted for 99 percent of the batch costs. Given that I was working with a rickety desktop machine, we can expect further dramatic reductions in execution time on a real server run by DBAs with real experience in query tuning, possibly using columnstore indexes.
…………The infinitesimal performance costs of such queries signifies that we can experiment to our heart’s content until we find a set of distance measures best suited to our data. The difficulty still consists in figuring out in advance which measures those are likely to be, for the underlying principles are not widely published – nor are they apparently clear even to the brilliant theoreticians who think them up, given that they’re still publishing fresh discoveries of mathematical properties and pioneering applications for old distance metrics, even as they invent new ones to supersede them. To illustrate just how rapidly changing the field is, I only discovered a bounded version of the Bhattacharyya at the last minute.[23] Note how there is no obvious upper bound to the results for the Bhattacharyya in Figure 2. It is typically more convenient to uses measures bounded between 0 and 1, since this can be easily integrated with many other mathematical theories that use the same scale, beginning with stochastics and ending in fuzzy sets, Decision Theory[24] and neural nets. For this reason, I’d probably prefer the new bounded variant of the Bhattacharyya when matching members of the Fidelity/Squared-Chord family to applications, assuming that its works well in practice. It may actually be worthwhile to decide what properties you’d like your output to possess, in order to solve a particular SQL Server data mining problem, then peruse the math journals to find one that best suits your needs. Despite the known weaknesses of established metrics like the Hellinger, Kullback-Leibler and the like, discussions on these topics typically revolve around them rather than the plethora of upgrades published in the math journals. It may be decades before any of them are commercially available. That is where the ability to program them yourself in T-SQL can put you light years ahead of the competition; this is doubly true, given that little recognition is given to set-based languages, which may be ideal for solving a lot of these problems. The efficiency with which they perform in my sloppy code testifies to that. In the next installment, we’ll cover a family characterized by MIN and MAX operations rather than square roots, which aren’t as widely used as the other measures covered in this segment, yet perform just as well.

[1] Sung-Hyuk Cha, 2007, “Comprehensive Survey on Distance/Similarity Measures between Probability Density Functions,” pp. 300-307 in International Journal of Mathematical Models and Methods in Applied Sciences, Vol. 1, No. 4. Another indispensable source is Deza, Elena and Deza, Michel Marie, 2006, Dictionary of Distances. Elsevier: Amsterdam.

[2] See the Wikipedia page “Geometric Mean” at http://en.wikipedia.org/wiki/Geometric_mean

[3] See the Quora.com thread “When is It Most Appropriate to Take the Arithmetic Mean vs. Geometric Mean vs. Harmonic Mean?” posted Sept 7, 2014. by the user named Data Geek at http://www.quora.com/When-is-it-most-appropriate-to-take-the-arithmetic-mean-vs-geometric-mean-vs-harmonic-mean

[4] p. 302-303, Sung-Hyuk.

[5] See the reply by the user named rroowwllaanndd to the CrossValidated thread “Pros of Jeffries Matusita Distance,” posted on June 11, 2014 at http://stats.stackexchange.com/questions/102810/pros-of-jeffries-matusita-distance He in turn cites Kavzoglu, Taskin and Mather, Paul M., 2000, “The Use of Feature Selection Techniques in the Context of Artificial Neural Networks,” Proceedings of the 26th Annual Conference of the Remote Sensing Society (CD-ROM). Conference Date Sept. 12-14 in Leicester, UK. No page numbers listed.

[6] p. 61, Pollard, David, 2002, A User’s Guide to Measure Theoretic Probability. Cambridge University Press: New York.

[7] See the Wikipedia article “Hellinger Distance” at  http://en.wikipedia.org/wiki/Hellinger_distance

[8] p. 2, Roman, Ahmed; Jolad, Shivakumar and Shastry, Mahesh C., 2012, “Bounded Divergence Measures Based on Bhattacharyya Coefficient.” Apparently this was published in relation to some conference, but I have yet to see a citation that gives the details, or even the journal name. It is nonetheless available online at many websites, including the Cornell Univeristy Library address http://xxx.tau.ac.il/pdf/1201.0418v3.pdf

[9] See the Wikipedia article “Anil Kumar Bhattacharya “ at http://en.wikipedia.org/wiki/Anil_Kumar_Bhattacharya

[10] See articles like Lee, Chulhee and Hong, Daesik, 1997, “Feature Extraction Using the Bhattacharyya Distance,” pp. 2147-2150 in Systems, Man, and Cybernetics: 1997 IEEE International Conference on Computational Cybernetics and Simulation, Vol. 3. Conference date Oct. 12-15, 1997 in Orlando, Florida. I at least skimmed all of the abstracts or full articles listed here for the Bhattacharyya.

[11] Kailath, T., 1967, “The Divergence and Bhattacharyya Distance Measures in Signal Selection,” pp. 52-60 in IEEE Transactions on Communication Technology, Feb. 1967. Vol. 15, No. 1.’

[12] Mak, Brian and Etienne Barnard, Brian, 1996, “Phone Clustering Using The Bhattacharyya Distance,” pp. 2005-2008 in Proceedings of the International Conference on Spoken Language Processing, Vol. 4. Conference Date Oct. 3-6, 1996. Available online at the University of Delaware Applied Science and Engineering Laboratories web address http://www.asel.udel.edu/icslp/cdrom/vol4/281/a281.pdf

[13] Korzhik, Varley; Imai, Hideki; Shikata, Junji; Morales-Luna, Guillermo, and Gerling, Ekaterina, 2008, “On the Use of Bhattacharyya Distance as a Measure of the Detectability of Steganographic Systems,” pp. 23-32 in Transactions on Data Hiding and Multimedia Security III. Springer-Verlag: Berlin.

[14] Lay New, Tin; Hieu, Nguyen Trung and Limbu, D.K., 2013, “Bhattacharyya Distance Based Emotional Dissimilarity Measure for Emotion Classification,” pp. 7512-7516 in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013. Conference date May 26-31, 2013.

[15] See the Encyclopedia of Mathematics webpage “Bhattacharyya Distance” at http://www.encyclopediaofmath.org/index.php/Bhattacharyya_distance

[16] See the Encyclopedia of Mathematics webpage “Bhattacharyya Distance” at http://www.encyclopediaofmath.org/index.php/Bhattacharyya_distance

[17] See publications like Ghosh, J.K. and Purkayastha, Sumitra, 2010, “Sequential Cramér-Rao and Bhattacharyya Bounds:Work of G.R. Seth and Afterwards,” pp. 137-144 in Journal of the Indian Society of Agricultural Statistics, Vol. 62, No. 2.

[18] pp. 302-303, Sung-Hyuk.

[19] See the Wikipedia page “Bhattacharyya Distance” at http://en.wikipedia.org/wiki/Bhattacharyya_distance

[20] Derpanis, Konstantinos G., 2008, The Bhattacharyya Measure. Published March 20, 2008 at the York University web address http://www.cse.yorku.ca/~kosta/CompVis_Notes/bhattacharyya.pdf

[21] Derpanis, Konstantinos G., 2008, The Bhattacharyya Measure. Published March 20, 2008 at the York University web address http://www.cse.yorku.ca/~kosta/CompVis_Notes/bhattacharyya.pdf

[22] This was downloaded ages ago from University of California at Irvine’s Machine Learning Repository and now resides in a nearly 6-gigabyte database labeled DataMiningProjects.

[23] See Roman, et al.

[24] As discussed in my Implementing Fuzzy Sets in SQL Server series.

 

Original post (opens in new tab)
View comments in original post (opens in new tab)

Rate

Share

Share

Rate