Information Measurement with SQL Server Part 4.6: The Jaccard Distance and Its Relatives

,

By Steve Bolton

…………Over the course of this inexpert exploration of the distance measures used for such data mining applications as clustering, we’ve gradually introduced a system of classification of sorts. For example, in most cases, we can also switch back and forth between dissimilarity and similarity measures by subtracting each version from 1, which also helps to reduce the complexity involved in tracking them all. There are many additional ways of cataloguing these measures, but I’ve adopted Prof. Cha Sung-Hyuk’s well-written taxonomy of probabilistic distances as the backbone.[1] His system flows from similarities in the underlying math formulas, so it makes sense that he included the popular Jaccard Distance in the Inner Product family, whose members all include a multiplication operation in their dividends. In some respects it also belongs to the L2 family covered in the last installment, many of which have addition operations in their divisors.[2] The good news is that all of the formulas in each of these formulas can be easily and efficiently implemented way in SQL Server, which means we can leverage them immediately for DIY data mining purposes, without having to wait a decades for them to be implemented in overpriced commercial software.

…………I’ve spoken in the past about how the main difficulty actually consists in matching the right distance measure to the right use case – all of which is dictated by the mathematical properties that emerge from their underlying equations –  but I’ve neglected another problem: the wild proliferation of names and variant formulas, which together complicate the task of matching metrics correctly. The Jaccard Distance is a perfect example. We can gauge the popularity by the number of alternate names is goes by, including the Kumar-Hassebrook (PCE) Distance and Tanimoto Distance, although the latter is also the name of a slightly different formula included in the Intersection family. The same is true of the related Dice Similarity, which also goes by the names like “Sorensen, Czekanowski, Hodgkin-Richards or Morisita,”[3] thereby leading to additional confusion with the distances in Information Measurement with SQL Server, Part 4.4: Sørensen Distance and Its Relatives, where I listed a small sample of the many nicknames for Sørensen Distance. It doesn’t help matters much that some of the Sørensen aliases include the name Czekanowski, which is also the title of a coefficient we’ll tackle in a later discussion of the Intersection family. You might say it’s a “Dicy” issue. To add to the confusion, the probabilistic version of Jaccard Distance is actually a souped-up version of the underlying Jaccard formula a / (a+b)[4], except enhanced with the summing of squares. Furthermore, it is one of those distance measures like the Sørensen-Dice Coefficient that can be adapted to use sets as input parameters. In such cases the term “index” seems to be more common, but not a universal rule. At some later date I may write a separate article on the Jaccard Index and other variants of distance measures that accept set inputs, particularly since a set-based language like T-SQL would probably be the ideal way to implement them. The formulas are similar, but they’re not quite the same as the probabilistic versions we’ll discuss today, since they substitute of unions and intersections for the addition, subtraction, multiplication and squaring.

…………The Jaccard Distance and Jaccard Index seem to be cited in the data mining literature more frequently than any of the measures I’ve covered over the last two articles – although that should be taken with a grain of salt, since I’ve read only a fraction of what I need to and am writing this in order to familiarize with the topic. The Jaccard is widely used for classification purposes, particularly as the “most widely used dissimilarity measure in biogeography”[5]  One of the advantages of the Jaccard Distance is that it is a true metric in the technical sense, i.e. its mathematical properties include subadditivity, unlike ordinary distances or “semi-metrics,” whereas divergences lack both that and symmetry. The latter property is desirable because it guarantees that measuring from point A to point B will return the same answer when measured from B to A, whereas the latter signifies that the sum of any two distances between three points will be less than or equal to the length of the third.[6] On the other hand, as I’ve discussed in previous articles, there’s no such thing as a perfect distance measure that will meet all use cases, just as there is no perfect lens that will serve the divergent biological needs of hawks, humans and rabbits. The Kulczynski Coefficient represents an alternative, since it addresses a distortion that inflates the Jaccard Coefficient when small categories are members of disproportionately larger ones; on the other hand, the Kulczynski and its corresponding distance, which we will take up in a future article on the Intersection family of distances, does not retain subadditivity.[7] The key to sorting out distance measures and grasping their raisons d’etre is to realize that such properties can be leveraged to meet different ranges of use cases, which no single measure can possibly satisfy. That is why theoreticians are still hard at work developing new measures that can better capture the information needed to answer particular real-world problems. There is a method to the madness beneath the all of the arcane square roots and other symbols included in these formulas, which are geared toward producing specific properties, particularly symmetry and subadditivity.

The Common Properties and Derivation of the Inner Product Family

I almost always omit those underlying equations for the sake of communicating the overarching ideas to non-specialists with greater clarity and succinctness; suffice it to say that they all involve summing the product of the input values of both sets in the dividend, then adding the sums of the squares or square roots of both on the divisor. The Inner Product Similarity seems to be useful mainly to serve as the dividend in the others. The Harmonic Mean in turn incorporates the Inner Product in its dividend and an addition in its divisor, thereby setting the tone for the Cosine, Jaccard and Dice Distances, all of which have a similar structure except with different addition operations. It is apparently useful on its own for taking the average of rates,[8] which may be why I’ve seen examples of the Cosine Similarity being used for the same purpose. The Cosine apparently has also uses[9] for some of the variants of fuzzy sets I mentioned in Implementing Fuzzy Sets in SQL Server, Part 1: Membership Functions and the Fuzzy Taxonomy. I’ve also seen Cosine Similarity mentioned in texts on data mining and quantum information under one of its aliases, Ochiai Distance. When used to discriminate among documents, it leverages the fact that most of the content will not be identical.[10] Cosine has disparate other applications in fields like clustering and information retrieval and intersects with Euclidean Distance in certain circumstances, which I don’t fully grasp yet; such overlapping relationships work to our benefit though by allowing us to construct a stronger scaffolding of related metrics for better information detection, interpretability and validation. “It represents the angular distance of two vectors while ignoring their scale,”[11] which for practical purposes signifies that “if you have data that is very different in magnitude, but you do not want to take magnitude into account, then use cosine.”[12] In cases where the size of the values matters, Euclidean Distance would be preferred; keep in mind that the Euclidean is the ordinary measure of distance along a straight line, but the Cosine Similarity measures the distance along angles, in the same fashion that the Manhattan Distance discussed a few articles ago measures in a roundabout way, along the kind of right angles we encounter in city blocks. These properties may well emerge from the fact that Cosine Similarity is the only one in the Inner Product family that uses square roots rather than squares in their dividends. One of the drawbacks is that it apparently lacks subadditivity for certain definitions of space.[13]

…………Taneja Distance is actually a member of Sung-Hyuk’s Combination Family, but I included it in the sample T-SQL in Figure 1 partly because it builds upon the geometric mean[14] and partly because it has only two kindred measures in its family.[15] The Fidelity Similarity we’ll cover in my next post also makes use of it by summing the geometric means of all records.[16] For some truly helpful discussions of the differences between the ordinary geometric means, ordinary averages and harmonic means, see this thread at CrossValidated or this one at Quora.com. The latter suggests that ordinary averages may be best in cases where there are few outliers, whereas the geometric mean is useful when the data contains fractional values and the harmonic is best when fractions and numerous outliers are involved. This of course provides us with additional starting points for discriminating between the applications of these various distances. Taneja Distance has the most complex formula of all of the measures surveyed in this article, but you wouldn’t know it from Figure 1. The structure is almost identical to that of the last three articles, in which the inner SELECT of the first common table expression (CTE) retrieves probabilities precalculated from the Higgs Boson dataset. Long-time readers will recognize that I’ve been using the same two float columns in that 11-million-row dataset to stress-test my tutorial code for ages now.[17] Over the last few articles, I’ve resorted to retrieving the probabilities from the precalculated aggregates in the Physics.HiggsBosonProbabilityTable. When I populated this using the DDL and MERGE mentioned in the End Notes of the aforementioned Sørensen Distance article, I was stunned to find that it ran in just 3 seconds. This was a whopping 20 seconds faster than the nearly identical code for the article on Euclidean Distance, which retrieved raw values rather than probabilities. This was despite the extra overhead of standardization and calculating the proportions. Except for some slight tweaks to the math, the only difference between this and the code for Information Measurement with SQL Server, Part 4.3: Euclidean Distance and the Minkowski Family is the removal of a CTE that retrieved the raw values from the original Physics.HiggsBosonTable. I also substituted decimal(38,37) for the floats I usually default to my initial declaration section, since I believe the results retuned by all of these formulas are bounded from 0 to 1. That cosmetic change wasn’t responsible for the incredible performance of Figure 1, whose execution time was a trivial 23 milliseconds on my clunky old workstation. Just as in the last two articles, this was cut down further to a mere 3 milliseconds on a third or fourth use. I suspect that this boost occurs after SQL Server caches the @Count retrieval, which oddly takes up 99 percent of the batch costs in the execution plan.

Figure 1: Sample T-SQL for the Inner Product Family of Distances

DECLARE @InnerProductDistance decimal(38,37),

@HarmonicMeanDistance decimal(38,37),

@CosineDistance decimal(38,37),

@JaccardSimilarity decimal(38,37),

@DiceSimilarity decimal(38,37),

@TanejaDistance decimal(38,37)

DECLARE @Count bigint

SELECT @Count=Count(*)

FROM Physics.HiggsBosonTable

WHERE Column1 IS NOT NULL AND Column2 IS NOT NULL

; WITH CrossProductCTE

(Probability1, Probability2, Product, AdditionResult, Probability1Squared, Probability2Squared)

AS

(

SELECT Probability1, Probability2, Product, AdditionResult, Probability1Squared, Probability2Squared

FROM (SELECT Probability1, Probability2,   Probability1 – Probability2 AS DifferenceAmount,

Probability1 + Probability2 AS AdditionResult,

Probability1 * Probability2 AS Product,

Power(Probability1, 2) AS Probability1Squared,

Power(Probability2, 2) AS Probability2Squared

FROM Physics.HiggsBosonProbabilityTable) AS T1

),

InnerProductDistanceCTE

(InnerProductDistance)

AS

(

SELECT Sum(Product) AS InnerProductDistance

       FROM CrossProductCTE AS T1

)

SELECT @InnerProductDistance = InnerProductDistance,

@HarmonicMeanDistance = 2 * Sum(Product / AdditionResult),

@CosineDistance= InnerProductDistance  / (Power(SUM(Probability1Squared), 0.5) * Power(SUM(Probability2Squared), 0.5)),

@JaccardSimilarity = InnerProductDistance  / (SUM(Probability1Squared) + SUM(Probability2Squared) – InnerProductDistance),

@DiceSimilarity = 2 * InnerProductDistance / (SUM(Probability1Squared) + SUM(Probability2Squared)),

@TanejaDistance = SUM(InputToTanejaDistance)

FROM (SELECT (SELECT InnerProductDistance FROM InnerProductDistanceCTE) AS InnerProductDistance, Product, AdditionREsult, Probability1Squared, Probability2Squared,

CASE WHEN (2 * Product) = 0 THEN 0 ELSE (AdditionResult / 2) * Log(AdditionResult / (2 * Product)) END AS InputToTanejaDistance

       FROM CrossProductCTE) AS T1

GROUP BY InnerProductDistance

— there’s only one value, it’s just a placeholder to prevent a query error

SELECT @InnerProductDistance AS InnerProductDistance,

@HarmonicMeanDistance AS HarmonicMeanDistance,

@CosineDistance AS CosineDistance, — aka Ochiai [2,4] and Carbo

@JaccardSimilarity AS JaccardSimilarity,

1 – @JaccardSimilarity AS JaccardDistance,

@DiceSimilarity AS DiceSimilarity,

1 – @DiceSimilarity AS DiceDistance,

@TanejaDistance AS TanejaDistance

Figure 2: Results from the Higgs Boson Dataset (click to enlarge)

…………It took just 3 milliseconds to derive the results for the whole batch of measure sin Figure 2, across an 11-million-row dataset on a dusty desktop machine which had just six cores at the time. The good news is that with this kind of performance, we could test such metrics to our hearts’ content on much larger datasets. The bad news is that these particular results don’t seem to be as useful as some of the metrics we’ve previously covered, since the values are too infinitesimally small or just a hair away from the limit of 1. I wonder if this isn’t the result of applying these numerous multiplication and squaring operations across so many rows, which tends to either make us bang our heads on the ceilings of arithmetic overflows or watch the results dwindle down towards the netherworld of decimal places around zero. This reduces their practical utility. It is entirely possible, however, that it is a side effect of undiscovered errors in my code; I’m doing what has rarely or never been done before by implementing such cutting-edge measures in set-based languages, without the requisite experience or a second set of eyes, so my code should always be rechecked before putting into production. Given what I’ve learned while writing this, I would still give Jaccard Distance a shot when subadditive classification measures are called for. I’d probably also try the Cosine Similarity when the problem at hand called for measuring along angles rather than direct distances, or when I suspected the items being classified had little in common. The Taneja Distance and Harmonic Mean methods might be worth a shot in cases where many extreme values were expected.

…………This brings us back to the preeminent question hanging over this whole segment of the Information Measurement series: how do we know which of the hundreds of distances will best match our use cases? In SQL Server, this often means analyzing order-taking, CRM, financial and other such data often associated with our most common applications. When asking such questions, keep in mind that the theoreticians who think up these things often discover new properties and relationships many years after their development. Some experimentation is going to be necessary, because even experts who breathe this stuff day-in, day-out can’t give a definitive answer. I can only point you in the general direction of certain mathematicians and data mining specialists who have taken the time to think out some organizing principles. To date, the clearest explanation I have found to date of the principles underlying distance measure selection is Christian Hennig and Bernhard Hausdorf’s monograph on the subject of designing dissimilarities. As they elucidate, “A starting point for dissimilarity design is the question: how can the researcher’s (or the research group’s) idea of the similarity between objects given a certain research aim be translated into a formally defined function of the observed object characteristics?”[18] Some of the other principles they attach to this (in a quite comprehensible manner that doesn’t take a doctorate in stats to understand) are worth reading up on, for any amateurs like myself who want to integrate these measures into their data mining workflows. Also keep in mind that I’ve thus far focused mainly on subadditivity and symmetry, but these distances exhibit many other useful or detrimental characteristics that can make or break them when matching them to particular use cases. Other important properties I have yet to delve into include convexity, concavity, invariance and linkages to the Fisher-Rao metric, touched upon in an earlier post on Fisher Information. Some are undesirable, like absolute continuity; as we saw in Information Measurement with SQL Server, Part 4.1: The Kullback-Leibler Divergence, the requirement that every non-zero value for one probability must also be non-zero in the other disqualifies many data distributions from consideration. Next time around, we’ll find out which of these properties are exhibited by two measures I’ve seen mentioned as often as the Jaccard in the data mining literature, the Bhattacharyya and Hellinger Distances. Just to illustrate how problem-property matching trumps the performance issue, I can already give you the execution time on the Higgs Boson Dataset for the same workstation: just 3 milliseconds again. I have yet to do the really hard work of discerning exactly which applications would benefit from them – but the good news is that that consideration is entirely dependent on their mathematical properties, not performance. It is not often that SQL Server users can get away with paying little attention to the computational costs of a broad class of queries, even on relatively large datasets. Thankfully, this has been surprisingly but serendipitously true of entire families of distance measures covered in this segment of the Information Measurement series, which frees up users to focus instead on selecting the tools that are best-suited to the analytical tasks at hand.

 

[1] Sung-Hyuk Cha, 2007, “Comprehensive Survey on Distance/Similarity Measures between Probability Density Functions,” pp. 300-307 in International Journal of Mathematical Models and Methods in Applied Sciences, Vol. 1, No. 4. I also like to lean heavily on Deza, Elena and Deza, Michel Marie, 2006, Dictionary of Distances. Elsevier: Amsterdam.

[2] IBID., p. 303.

[3] p. 302, Sung-Hyuk.

[4] I lost my citation for this, but it’s a really simple formula that’s probably widely cited.

[5] p. 4, Hennig, Christian and Hausdorf, Bernhard , 2007,”Design of Dissimilarity Measures: A New Dissimilarity between Species Distribution Areas?” published in Jan. 2007 as the University College of London Department of Statistical Science Research Report No. 273. Available online at the University College of London web address https://www.ucl.ac.uk/statistics/research/pdfs/rr273.pdf

[6] See the MathOverflow thread “Is the Jaccard Distance a Distance?” started by the user named Rgrig on March 13, 2010. Available online at http://mathoverflow.net/questions/18084/is-the-jaccard-distance-a-distance

[7] pp. 4-5, Hennig and Hausdorf, 2007.

[8] See the Wikipedia page “Harmonic Mean” at http://en.wikipedia.org/wiki/Harmonic_mean

[9] Ye, Jun, 2011, “Cosine Similarity Measures for Intuitionistic Fuzzy Sets and Their Applications,” pp. 91–97 in Mathematical and Computer Modelling, Jan.2011. Vol. 53, Nos. 1-2. I can’t afford to hefty fee to read the article, but I know of its existence through the ScienceDirect webpage http://www.sciencedirect.com/science/article/pii/S0895717710003651

[10] I’m rather loosely paraphrasing Shanley, Philip, 2010, “Discussion of Similarity Metrics: Cosine Similarity,” posted at the Human Oriented Systems web address http://mines.humanoriented.com/classes/2010/fall/csci568/portfolio_exports/sphilip/cos.html. He in turn cites Pang-Ning, Tan; Steinbach, Michael and Kumar, Vipin, 2006, Introduction to Data Mining. Pearson Addison Wesley: Boston.

[11] See the undated Math.NET Numerics webpage “Distance Metrics” at http://numerics.mathdotnet.com/Distance.html

[12] See the reply by the user named Anony-Mousse to the CrossValidated thread  “Distance Measure of Angles Between Two Vectors, Taking Magnitude Into Account,” posted Oct. 2, 2013 at http://stats.stackexchange.com/questions/71614/distance-measure-of-angles-between-two-vectors-taking-magnitude-into-account

[13] See the MathOverflow thread “Cosine Similarity /  Distance and Triangle Equation” started by the user named Anony-Mousse on Jan. 27, 2012 at 9:48

[14] The formula is simpler than the harmonic mean, since it merely involves multiplying the two values together and taking the square root, or cube root and so forth when more than two are involved. See the Wikipedia page “Geometric Mean” at http://en.wikipedia.org/wiki/Geometric_mean

[15] p. 304, Sung-Hyuk.

[16] p. 302, Sung-Hyuk.

[17] The original source was the University of California at Irvine’s Machine Learning Repository. I converted it long ago into a nearly 6-gigabyte table in a sham SQL Server database called DataMiningProjects.

[18] p. 3, Hennig and Hausdorf.

Original post (opens in new tab)
View comments in original post (opens in new tab)

Rate

5 (1)

Share

Share

Rate

5 (1)