Information Measurement with SQL Server Part 4.8: The Intersection Family of Distances

,

By Steve Bolton

…………Each group of distance measures surveyed in this segment of the series was defined by affinities between the underlying equations. For example, the common denominator for the distances discussed in Information Measurement with SQL Server Part 4.3: Euclidean Distance and the Minkowski Family is a simple subtraction between two sets of probabilities or other inputs, while the Kullback-Leibler and Jensen-Shannon Divergences have the same kind of logarithm operations found in Shannon’s Entropy. The L1 family built upon the difference used in the Minkowski family by adding a divisor (usually involving the corresponding sum), while the L2 was based on the square of the same difference, i.e. the ordinary Euclidean Distance; in the same manner, the Inner Product family discussed in Information Measurement with SQL Server Part 4.6: The Jaccard Distance and Its Relatives had a simple multiplication in common, whereas square roots were prominent in the Fidelity family covered last time around. Although there are hundreds of other distances we could cover, in addition to innumerable variants, coefficients, indexes and aliases, only one remains from the taxonomy I leaned on to organize this brief introduction: the Intersection family, which is distinguished by the use of MIN and MAX operations in their defining formulas.
…………I rarely include the underlying formulas in order to avoiding burdening readers with details that are more relevant to theoreticians, but the best source I have found to date is Prof. Cha Sung-Hyuk’s handy guide to distances between probability distributions.[1] The key principle to keep in mind is that the equations are designed to produce different mathematical properties, which can be leveraged to meet special use cases when doing DIY data mining with SQL Server. It’s essentially a chain in which formulas produce properties, some desirable and other deleterious, which in turns determines the applications they’ll be useful for. The counter-intuitive aspect is that matching distance measures to the right use cases is much more challenging than coding them in T-SQL, which does a fantastic job in returning the results efficiently. Given that many of these distances aren’t available in commercial software and probably won’t be in a cost-effective way for some time means that this lead in efficiency is pretty much absolute; if there’s only one horse in a race, the smart money is on them. Some experimentation with the Intersection family will be necessary to determine which measures are a good fit to the problems SQL Server users encounters, but the performance hit for testing them out is pretty much trivial.

The Desuetude of the Intersection Family

               On the other hand, I won’t dwell long on the Intersection family, since its constituent measures seem to be less frequently used than any of the others in Sung-Hyuk’s study. A casual search of CrossValidated, Stack Exchange’s best resource for data mining answers, turned up just 13 hits for Tanimoto Distance, 4 for the Kulczynski Similarity and none for the Wave-Hedges, Motyka and Ruzicka Distances. I was surprised to find just one off-hand mention of the Czekanowski Similarity, given that it overlaps the Sørensen Distance I covered a few articles ago, to the point of being identical in some circumstances.[2] It is equal to half of the Motyka Similarity, which differs from the Kulczynski Similarity only in the use of a sum over each set of values rather than a difference; the resemblance to the L1 family discussed in Information Measurement with SQL Server, Part 4.4: Sørensen Distance and Its Relatives is obvious at a glance, so it is not surprising that Sung-Hyuk includes most of the Intersection family as a subset of L1.[3] I don’t have the qualifications to write on this topic – which is precisely why I’m writing on it, in order to assimilate the material a lot faster – but Tanimoto Distance was the only one I recognized from the literature I read before Sung-Hyuk’s study. The name is often used interchangeably with Jaccard, but they’re not quite the same thing, since it incorporates the same MIN and MAX operations found in the rest of the Intersection family, but not the Inner Product group the Jaccard belongs to. I found 22 hits on CrossValidated for the terms “intersection” and “distance” together, but the plainness of both terms obviously has something to do with this. The curious thing is that in most other families, the simplest members are useful mainly as components of other, more popular members, with the Inner Product, Fidelity and Squared-Chord Distances being cases in point; in contrast, the plain Intersection Distance “ is a widely used form of similarity where the non-overlaps between two pdfs.”[4] As Sung-Hyuk, “it is nothing but the L1 divided by 2,” a property that we can leverage to convert the distances built upon it back and forth from the L1.[5] Don’t quote me on this, but the results in Figure 2 suggest that all of these measures are bounded from 0 to 1, except for the Wave-Hedges; if so, it is safe to use decimal(38,37) here instead of floats, as I have in the declaration section at the top of Figure1. Because I had a hard time coming up with references, validation is more of a problem than usual; my old disclaimer that my code is useful mainly as a starting point is in play, as always. The Kulczynski Similarity should theoretically be the reciprocal of the Kulczynski Distance discussed in the L1 family article and is in line here with the expected value, but that does not mean that other aggregation and equation transcription errors might not throw off the results of both, along with their kin.

Figure 1: Sample T-SQL For the Intersection Family of Distances

DECLARE @IntersectionSimilarity decimal(38,37),
@WaveHedgesDistance float,
@CzekanowskiSimilarity decimal(38,37),
@MotykaSimilarity decimal(38,37),
@KulczynskiSimilarity decimal(38,37),
@RuzickaSimilarity decimal(38,37),
@TanimotoDistance decimal(38,37)

DECLARE @Count bigint
SELECT @Count=Count(*)
FROM Physics.HiggsBosonTable
WHERE Column1 IS NOT NULL AND Column2 IS NOT NULL

; WITH CrossProductCTE
(DifferenceAmount, AdditionResult, LesserOfTwoValues, GreaterOfTwoValues, WaveHedgesInput)
AS
(
       SELECT DifferenceAmount, AdditionResult, LesserOfTwoValues, GreaterOfTwoValues,
1
(LesserOfTwoValues / GreaterOfTwoValues) AS WaveHedgesInput
       FROM (SELECT Probability1 Probability2 AS DifferenceAmount, Probability1 + Probability2 AS AdditionResult,
             CASE WHEN Probability1 <= Probability2 THEN Probability1
ELSE Probability2
END AS LesserOfTwoValues,
             CASE WHEN Probability1 <= Probability2 THEN Probability2
ELSE Probability1 END AS GreaterOfTwoValues
             FROM Physics.HiggsBosonProbabilityTable) AS T1
)

SELECT @IntersectionSimilarity = SumOfLesserOfTwoValues,
@WaveHedgesDistance = WaveHedgesSum,
@CzekanowskiSimilarity= (2 * SumOfLesserOfTwoValues) / SumOfAdditionResult,
@MotykaSimilarity = (SumOfLesserOfTwoValues) / SumOfAdditionResult,
@KulczynskiSimilarity = (SumOfLesserOfTwoValues) / SumOfAbsoluteDifferenceAmount,
@RuzickaSimilarity = SumOfLesserOfTwoValues / SumGreaterOfTwoValues,
@TanimotoDistance = SumOfMinMaxDifferenceAmount / SumGreaterOfTwoValues
FROM (SELECT SUM(Abs(DifferenceAmount)) AS SumOfAbsoluteDifferenceAmount,
      SUM(AdditionResult) AS SumOfAdditionResult,
      SUM(LesserOfTwoValues) AS SumOfLesserOfTwoValues,
      SUM(GreaterOfTwoValues) AS SumGreaterOfTwoValues,
      SUM(GreaterOfTwoValues LesserOfTwoValues) AS SumOfMinMaxDifferenceAmount,
      SUM(WaveHedgesInput) AS WaveHedgesSum
       FROM CrossProductCTE) AS T1

SELECT @IntersectionSimilarity AS IntersectionSimilarity,
1 @IntersectionSimilarity AS IntersectionDistance,
@WaveHedgesDistance AS WaveHedgesDistance,
@CzekanowskiSimilarity AS CzekanowskiSimilarity,
1 @CzekanowskiSimilarity AS CzekanowskiDistance,
@MotykaSimilarity AS MotykaSimilarity,
1 @MotykaSimilarity AS MotykaDistance,
@KulczynskiSimilarity AS KulczynskiSimilarity,
@RuzickaSimilarity AS RuzickaSimilarity,
@TanimotoDistance AS TanimotoDistance

 Figure 2: Results for the First Two Float Columns in the Higgs Boson Dataset (click to enlarge)

…………The sample code in Figure 1 is structured almost identically to that of the last four self-tutorials. The CrossProductCTE is used to retrieve some probabilities precalculated from the 11-million-row Higgs Boson Dataset[6], which I have been using to stress-test my tutorial code for ages now; some simple sums and differences are bubbled up to the outer SELECT; then a SUM aggregate and a couple of other simple math operations are applied in the SELECT after the common table expression (CTE). The subtractions from 1 in the final SELECT are a tried-and-true method of switching back a forth between similarities and dissimilarities, which measure the same quantities from opposite directions. Since Figure 1 is so similar to other code I’ve posted recently, it is no longer a surprise that it runs in just 3 milliseconds on a third or fourth use. It might seem like cheating to precalculate the probabilities, but this only takes 3 seconds, compared to the 23 seconds it took to run a simpler set of measurements on the plain raw values in Information Measurement with SQL Server, Part 4.3: Euclidean Distance and the Minkowski Family. I still don’t understand the incredible leap forward in performance – although the fact that 99 percent of the batch costs are taken up by the single @Count retrieval – but I won’t look a gift horse in the mouth.
…………Given the paucity of the information in the literature on this family, I’d wager that it is less likely to be of use for SQL Server DBAs and data miners who want to do a little DYI statistical spelunking. This is not due just to the obvious tendency of these measures to reach the illegible extremes evident in Figure 2 when fed millions of rows (a problem analogous to the exploding and vanishing gradients that plagued neural nets till the invention of LSTMs). I’m posting this in part to round out the sample code for Sung-Hyuk’s taxonomy and in part because there’s no telling just when these particular measures might be of use. There really aren’t any widely accessible guides on matching distances to the most popular applications of SQL Server, to things like inventory, CRM, order-taking, payroll, human resources, marketing programs and the like. In fact, there really isn’t a standard guide listing the properties of all of the hundreds of metrics in use today, let alone one that succinctly identifies what kinds of problems they’re best suited for. Deza’s famous Dictionary of Distances is an invaluable resource, but it doesn’t include this information for each measure; I don’t think that’s the case with their follow-up encyclopedia either, but don’t quote me.[7] I’m hardly familiar with all of the literature, but the best plain-English guide may be a widely available monograph by Christian Hennig and Bernhard Hausdorf titled “Design of Dissimilarity Measures: A New Dissimilarity between Species Distribution Areas.” [8] The title obviously shows that the authors had biology in mind, but the organizing principles they mention are useful to most other fields as well. Hamidreza Mostafaei’s short article “Probability Metrics and their Applications” is also helpful but is heavy on the math.[9] While I was writing the tail end of the last article I discovered a recent upgrade that adds the helpful properties of boundedness and subadditivity to the Bhattacharyya Distance. Based on that experience, I’d suggest coming up with a list of properties that would be helpful for the problem at hand, then check the literature to see if one has already been developed; keep in mind that the analytics marketplace is in some respects decades behind the extant data mining research, so no one’s going to provide them all to us any time soon, at any price. In fact, most discussions among professionals still revolve around the original distance measures these variants were designed to upgrade, like the Kullback-Leibler Divergence with its oftentimes fatal flaw of absolute continuity.

The Right Tool for the Job: Differentiating Distance Use Cases

               The stellar performance of the sample code in this segment – which occurred on a large dataset, through inexpert hands on a desktop machine – suggests that there is something to be said for using a set-based language like T-SQL to implement these things. Among the other guiding principles I had to slowly glean through trial-and-error is that properties like symmetry and subadditivity are almost always desirable (since they allow us to measure the same quantities in opposite directions) and that others, particularly absolute continuity, are not. Measures bounded between 0 and 1 are typically preferable because they integrate so well with other mathematical theories on the same scales, like probabilities, neural nets, Decision Theory and the topic of my series of articles on Implementing Fuzzy Sets with SQL Server, among others. The properties that make one metric useful for particular use cases but not others can often be expressed in terms of geometry; for instance, the angularity introduced into the Cosine Similarity and the Bhattacharyya via square roots is useful for certain problems, whereas Manhattan Distance is applicable to problems where we have to calculate roundabout paths along right angles, as we would on a city block. The Euclidean Distance covered with the Minkowski family represents the ordinary measure of the shortest possible distance between two points, which makes is a good starting point for most investigations. Yet it is not applicable to all problems, to the point where it can fall into a trap known as Orloci’s Abundance Paradox “in datasets with many zeroes.”[10]
…………This is a specific instance of a wider problem I’ve tried to emphasize through all of these tutorial series, which boils down to the risk of logical disconnects between the numbers returned by data mining and statistical tools and their interpretations. Given the frequency with which statisticians themselves warn against fallacious interpretations of their work, this is no trivial consideration. I try to always keep in mind that the software will usually spit back numbers of some kind, but the number of potentially wrong or misleading ones is infinite. Various writers on data mining topics over-emphasize their own particular areas of expertise, such as data quality or perhaps buying scrubbed data from vendors, but we absolutely must have an unbroken chain of logic from beginning to end, beginning with plugging good data to the machinery of the right algorithms, followed by good interpretation, preferably with the aid of enlightening data visualizations. An error at any point in this chain can lead to erroneous, deceptive or even disastrous results – and these are mostly likely to occur in fallacious reasoning, one of the banes of the human race. The plethora of distance measures presents us with opportunities to make our data mining software spit back the most useful possible answers, but the same multiplicity also introduces commensurate risks of error. The number of ways we can rearrange these distance measures is practically infinite, but predicting how they’ll behave in advance would tax our finite intellectual and computational resources; that is one of the reasons why property-problem matching can be perplexing with these information metrics. The fact that we have an infinite haystack makes it likely that an optimal needle is contained somewhere within, but it also induces an infinite search space that makes it difficult to find just the right metric.
…………So far, researchers have accumulated hundreds of these distance, divergence, dissimilarity and similarity measures, many of which have associated variants and statistical indexes, like those for qualitative variation. Covering them all would be literally impossible, since their potential is infinite. In this segment I merely introduced some of the major families, after having covered the Cook’s, Mahalanobis and Shared Information Distances in previous tutorial series. I may revisit the topic later on (by which time I may finally actually know what I’m talking about) but I’ll close out this segment of the series with the Bregman Divergence, which subsumes or correlates many of these previous topics. After the Bregman I may start a new segment on the fascinating topic of measuring algorithmic information, using such metrics as Minimum Description Length (MDL), Minimum Message Length (MML) and Solomonoff Algorithmic Probability. This will include sample code for cutting-edge algorithms to compute the Kolmogorov Complexity – which amazingly enough, is technically incomputable. At a much later date I will introduce the Hamming Distance, a common but coarse measure often used in coding theory. Since it is also used for text analysis, I will save it to kick off a separate segment on semantic similarity measures, which quickly escalates into a much more advanced topic of quantifying meaning itself.

[1] Sung-Hyuk, Cha, 2007, “Comprehensive Survey on Distance/Similarity Measures between Probability Density Functions,” pp. 300-307 in International Journal of Mathematical Models and Methods in Applied Sciences, Vol. 1, No. 4.

[2] pp. 301-302, Sung-Hyuk.

[3] IBID., p. 301.

[4] IBID.

[5] IBID.

[6] The original source was the University of California at Irvine’s Machine Learning Repository.

[7] Another indispensable source is Deza, Elena and Deza, Michel Marie, 2006, Dictionary of Distances. Elsevier: Amsterdam.

[8] pp. 3-4, Hennig, Christian and Hausdorf, Bernhard , 2007, “Design of Dissimilarity Measures: A New Dissimilarity between Species Distribution Areas?” published in Jan. 2007 as the University College of London Department of Statistical Science Research Report No. 273. Available online at the University College of London web address https://www.ucl.ac.uk/statistics/research/pdfs/rr273.pdf

[9] Mostafaei, Hamidreza, 2011, “Probability Metrics and their Applications,” pp. 181-192 in Applied Mathematical Sciences, Vol. 5, No. 4. Available online at the Hikari, Ltd. web address http://www.m-hikari.com/ams/ams-2011/ams-1-4-2011/mostafaeiAMS1-4-2011.pdf I suspect the following source could be helpful as well, but it is beyond my budget: Babu, C.C., 1972, “On the Relationship Between the Equivocation and Other Probabilistic Distance Measures Used for Feature Selection,” pp. 1098-1099 in Proceedings of the IEEE, Sept. 1972. Vol. 60, No. 9.

[10] See the CrossValidated thread “Is It Possible to Use Hellinger Distance for Environmental Variables?” started July 18, 2013 by the user named Bea, which is available online at http://stats.stackexchange.com/questions/64769/is-it-possible-to-use-hellinger-distance-for-environmental-variables

 

Original post (opens in new tab)
View comments in original post (opens in new tab)

Rate

Share

Share

Rate