By Steve Bolton
…………In the last segment of this series of amateur self-tutorials, we discussed how to code various ways of quantifying how much we don’t know about the data in SQL Server tables and cubes. The various probabilistic entropies I translated into T-SQL in those six articles can be viewed as measures of a particular sort of information, which might best be interpreted as “newsworthiness,” since they tell us how much we might learn from each data point. The rest of this wide-ranging series will have to be somewhat haphazard, due to the sheer number of information metrics and the fact that I’m learning about them as I go; writing on these topics helps me absorb the material a lot faster, while hopefully helping other amateurs avoid my inevitable mistakes. I thought it fitting, however, to complement the topic of information entropy by next discussing means of quantifying what we do know about our data. Bayes Factors are mercifully easy to calculate, since all we need to do is divide one probability by another and plug the result into a logarithm. In fact, we can reuse some of the code from Information Measurement with SQL Server, Part 2.4: Conditional and Joint Entropy and simply strip out the LOG operations that transformed the conditional probabilities into entropies. This step could also be dispensed with, if one of the chief uses of Bayes Factors didn’t involve multiplying them by one conditional probability in order to derive another. This information metric can be interpreted as a sort of crude gauge of existing knowledge, but it is chiefly useful for adjusting probabilities in the light of new evidence, through a famous probability formula developed by Presbyterian minister Thomas Bayes (1701-1761) and extended to induction a few decades later by Pierre-Simon Laplace (1749-1827).
…………Bayes Factors are indeed simple to code, but making use of them requires a lot more context; this is basically the same paradox that made Information Measurement with SQL Server, Part 2.1: The Uses and Abuses of Shannon’s Entropy one of the longest articles in the last segment, although it was the simplest of the entropic metrics to code. This is due in part to the broad applications of Bayes’ Theorem, which essentially derives a conditional probability for one distinct event or value given another; this is arrived at by multiplying the probability of the first event by the conditional probability of the second given the first, then dividing by the overall probability of the second. The formula isn’t terribly difficult to follow even for an amateur like myself, but I don’t see any reason to post it and violate my longstanding ban on equations in this blog. Calculating a conditional probability in reverse in this way can be useful in some situations on its own, but can be harnessed for new uses by infusing it with additional meaning as a “degree of belief.” Basically, the probability of X given Y derived from this inverse probability is interpreted by Bayesian statisticians as a posterior distribution, or a degree of belief updated in light of the conditional probability of Y given X, which is known as a prior. The Bayes Factor merely represents the ratio between the other two elements in Bayes’ Theorem, the two overall probabilities for X and Y.
If we know the Bayes Factor and the prior conditional probability, we can derive the posterior conditional probability, hence the common refrain “posterior is proportional to prior times likelihood.” Sometimes the term “likelihood ratio” is used as a synonym for Bayes Factors, but this can lead to confusion, given that it is also used for likelihood-ratio tests, certain diagnostic tests and “the ratio of two likelihood functions.” Moreover, the term “likelihood” carries specific connotations in statistical parlance, whereas in ordinary speech it is often used interchangeably with “probability.” For those reasons, I’ll generally avoid the term in discussions of topics like Bayes Factors, even though they are directly related to likelihood.
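That updating rule is easy to demonstrate outside of T-SQL. In this quick Python sketch, the prior odds and the Bayes Factor are invented numbers chosen purely for illustration:

```python
from math import log10

def posterior_odds(prior_odds, bayes_factor):
    """Posterior odds are simply prior odds times the Bayes Factor."""
    return prior_odds * bayes_factor

# Hypothetical inputs: even prior odds of 1:1, plus evidence carrying
# a Bayes Factor of 5 in favor of the hypothesis.
odds = posterior_odds(1.0, 5.0)
print(odds)                        # 5.0, i.e. 5:1 after seeing the evidence
print(round(10 * log10(odds), 4)) # the same evidence expressed in decibans
```

Running the same function again on the new odds shows how evidence can be chained as it arrives, which is the sequential-updating property discussed throughout this article.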
From Inverse Probability to Decibans
I’ll also stick to the broader and older term “inverse probability” in place of the now-common “Bayesian probability,” in order to stay out of the knock-down, drag-out turf wars that occur among statisticians today over the topic; these methods don’t necessarily have to be used in ways often labeled “Bayesian” today, which, as we shall see, sometimes implies endorsement of controversial interpretations. The process was originally known as inverse probability, with good reason, as a few intuitive illustrations of its usage will demonstrate. Many of the intuitive examples given for inverse probability in the statistical literature are similar to one in Fazlollah Reza’s An Introduction to Information Theory, where the goal is to gauge the probability of particular combinations of red and black balls being taken from three urns. The catch is that the probability is derived after they are drawn, not before, as is typically the case in ordinary “frequentist” probability. This is summarized quite succinctly in Dan McNeill’s Fuzzy Logic, one of the most readable books on the topics covered in my Implementing Fuzzy Sets in SQL Server series. Basically, in ordinary probability we’re asking what color the next ball we draw is likely to be, but with inverse probability we’re transposing the question into, “What is the ratio of the balls in the urn?” given foreknowledge of those we’ve drawn so far. Following Laplace’s “principle of indifference” in the absence of extra information, we’d set the probabilities for the outcomes even, as in a uniform distribution; with each subsequent draw from the urns, Bayesians would update these probabilities until they gradually approached the true distribution.
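The updating process is simple enough to sketch in a few lines of Python; the urn compositions and the draw sequence below are invented for illustration, and the draws are treated as made with replacement to keep the arithmetic simple:

```python
def update(priors, likelihoods):
    # Posterior is proportional to prior times likelihood, then normalized
    # so the degrees of belief again sum to 1.
    unnormalized = [p * l for p, l in zip(priors, likelihoods)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

red_ratios = [0.25, 0.5, 0.75]   # three hypothetical urn compositions
beliefs = [1/3, 1/3, 1/3]        # principle of indifference: uniform prior
for draw in ['red', 'red', 'black', 'red']:     # an invented draw sequence
    likelihoods = [r if draw == 'red' else 1 - r for r in red_ratios]
    beliefs = update(beliefs, likelihoods)

# Belief shifts toward the red-heavy urn after a mostly red sequence.
print([round(b, 4) for b in beliefs])
```

With more draws, the belief vector would continue converging toward whichever hypothesis best matches the true ratio, just as the paragraph above describes.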
A previous Wikipedia article on Bayes Theorem or Bayes Rule contained a well-written example, in which the goal was to guess the gender of a person with long hair, given an even distribution of men and women and probabilities of each sex having long hair set to 15 and 75 percent respectively (with the remainder being unknown). In that example, the Bayes Factor would be a simple ratio between 75 and 15, which equals odds of 5:1. All we need to do is divide them and apply a LOG operation, which is normally done in base 10 and then multiplied by a factor of 10. These units are known as decibans, a variant of an information measure pioneered by famed cryptanalyst Alan Turing and alternately known as the hartley, in honor of the developer of the metric discussed in Information Measurement with SQL Server, Part 1: A Quick Review of the Hartley Function. Statisticians find decibans convenient because common odds ratios can be translated into them seamlessly, such as 100 to 1, which equals 20 decibans, a nice round number. Figure 1 also includes the calculations in base 2, in order to provide a comparison point with bits, the most common unit associated with entropic information measures. About 95 percent of the performance costs are incurred in the INSERT, which is identical to the initial code used in Information Measurement with SQL Server, Part 2.4: Conditional and Joint Entropy, except with the entropic LOG calculations stripped out. This simplification may be the reason why it executes in just 1:14 on the same two float columns of the 11-million-row Higgs Boson dataset I’ve been using throughout this series for demonstration purposes, compared to the 1:46 required for the Joint Entropy article.
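The arithmetic behind the long-hair example is trivial to verify. This Python sketch uses the 75 and 15 percent figures from the text:

```python
from math import log10, log2

# P(long hair | woman) = 0.75 and P(long hair | man) = 0.15,
# an odds ratio of 5:1 in favor of the person being a woman.
odds_ratio = 0.75 / 0.15
decibans = 10 * log10(odds_ratio)  # base 10, scaled by 10
bits = log2(odds_ratio)            # base 2, for comparison with entropic bits
print(round(decibans, 4), round(bits, 4))   # 6.9897 2.3219
```

Note that 5:1 odds come out to roughly 7 decibans, while the nice round figures of 10 and 20 decibans correspond to odds of 10:1 and 100:1 respectively.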
One of the conditional probabilities can be reconstructed from the other as long as its inverse and individual probabilities are available, or can at least be derived from statistical estimation techniques; in the database server world, we’re more likely to have the luxury of using actual proportions derived from extensive counts, whereas in other fields these inputs would have to be derived from probability distribution formulas and likelihood estimation methods. I included the individual proportions and conditional probabilities in the final SELECT merely to convey how Bayes’ Theorem can be used to derive them from each other in this manner.
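The reconstruction performed in the final SELECT of Figure 1 can be illustrated with a toy Python sketch; the joint and marginal proportions below are invented stand-ins for the actual counts a table would supply:

```python
# Invented proportions for illustration only.
p_x, p_y = 0.4, 0.25
p_joint = 0.1                      # proportion of rows where both values co-occur

p_y_given_x = p_joint / p_x        # conditional probability of Y given X
p_x_given_y_direct = p_joint / p_y           # straight from the counts
p_x_given_y_bayes = p_y_given_x * p_x / p_y  # recalculated via Bayes' Theorem

# Both routes arrive at the same figure, which is the point of the
# RecalculationOf... column in Figure 1.
print(p_x_given_y_direct, p_x_given_y_bayes)
```

In a database we would rarely need the Bayesian route, since the direct proportions are cheap to count; the value lies in fields where one of the conditionals is unobservable and must be recovered from the others.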
Figure 1: Deriving Bayes Factors with T-SQL
DECLARE @Count1 bigint, @Count2 bigint, @JointCount bigint

-- the schema and table name below are placeholders for the Higgs Boson practice table;
-- substitute your own source table and columns
SELECT @Count1 = Count(*) FROM Physics.HiggsBosonTable
WHERE Column1 IS NOT NULL

SELECT @Count2 = Count(*) FROM Physics.HiggsBosonTable
WHERE Column2 IS NOT NULL

SELECT @JointCount = Count(*) FROM Physics.HiggsBosonTable
WHERE Column1 IS NOT NULL OR Column2 IS NOT NULL

DECLARE @EntropyTable table
(Value1 decimal(38,29),
Value2 decimal(38,29),
ValueCount bigint,
Proportion1 float,
Proportion2 float,
JointProportion float)

INSERT INTO @EntropyTable
(Value1, Value2, ValueCount, Proportion1, Proportion2, JointProportion)
SELECT Value1, Value2, ValueCount, Proportion1, Proportion2, JointProportion
FROM (SELECT Value1, Value2, ValueCount,
       CASE WHEN GroupingIDColumn1 = 0 AND GroupingIDColumn2 = 1 THEN ValueCount / CAST(@Count1 AS float) ELSE NULL END AS Proportion1,
       CASE WHEN GroupingIDColumn1 = 1 AND GroupingIDColumn2 = 0 THEN ValueCount / CAST(@Count2 AS float) ELSE NULL END AS Proportion2,
       CASE WHEN GroupingIDColumn1 = 0 AND GroupingIDColumn2 = 0 THEN ValueCount / CAST(@JointCount AS float) ELSE NULL END AS JointProportion
       FROM (SELECT Column1 AS Value1, Column2 AS Value2, Count(*) AS ValueCount, GROUPING_ID(Column1) AS GroupingIDColumn1, GROUPING_ID(Column2) AS GroupingIDColumn2
              FROM Physics.HiggsBosonTable
              WHERE Column1 IS NOT NULL AND Column2 IS NOT NULL
              GROUP BY CUBE (Column1, Column2)) AS T1) AS T2

SELECT 10 * Log(Proportion1 / Proportion2, 10) AS BayesFactorForHypothesisTesting,
10 * ABS(Log(Proportion1 / Proportion2, 10)) AS BayesFactorInDecibans,
ABS(Log(Proportion1 / Proportion2, 2)) AS BayesFactorInBits,
Proportion1 / Proportion2 AS OddsRatio,
Proportion1, Proportion2, ConditionalProbabilityOfColumn2GivenColumn1, ConditionalProbabilityOfColumn1GivenColumn2,
(CAST(ConditionalProbabilityOfColumn2GivenColumn1 AS float) * CAST(Proportion1 AS float)) / CAST(Proportion2 AS float) AS RecalculationOfConditionalProbabilityOfColumn1GivenColumn2
FROM (SELECT Proportion1, Proportion2, JointProportion, JointProportion / Proportion1 AS ConditionalProbabilityOfColumn2GivenColumn1, JointProportion / Proportion2 AS ConditionalProbabilityOfColumn1GivenColumn2
FROM (SELECT JointProportion FROM @EntropyTable
       WHERE Value1 = -0.61253148317337036130000000000 AND Value2 = 0.54839861392974853520000000000) AS JointProportion,
       (SELECT Proportion1 FROM @EntropyTable
       WHERE Value1 = -0.61253148317337036130000000000 AND Proportion1 IS NOT NULL) AS Proportion1,
       (SELECT Proportion2 FROM @EntropyTable
       WHERE Value2 = 0.54839861392974853520000000000 AND Proportion2 IS NOT NULL) AS Proportion2) AS T1
…………At 0 decibans or bits, the probability is evenly split. When the odds ratio is less than 1:1, the value, event, hypothesis or model embodied in the divisor is more likely, whereas if it is greater than 1:1, the evidence is weighted in favor of the dividend; since an odds ratio in favor of the divisor results in a fraction less than 1, the LOG applied to translate this into a Bayes Factor will result in a negative value. This makes it difficult to compare them to entropic measures, most of which are non-negative by definition, hence the need for the two ABS functions in the last SELECT. On the other hand, the minus sign is useful when the ratio is interpreted as a ratio of the weights of evidence for the event or value in the dividend vs. the evidence for the contrary hypothesis embodied in the divisor, hence the need for the BayesFactorForHypothesisTesting column, sans an ABS operation. Bayes Factors are used directly in the Bayesian version of hypothesis testing, on the grounds that the odds can be interpreted as evidence in favor of one hypothesis or the other. The same sort of comparison can be made for the global figures, in which case they are useful in model selection, a use more closely associated with data mining than ordinary statistics. For this use case, “an advantage of the use of Bayes factors is that it automatically, and quite naturally, includes a penalty for including too much model structure. It thus guards against overfitting.” Furthermore, the ability to update probabilities in a sequential manner as new information is acquired can be leveraged to improve Design of Experiments (DOE). It is more common to use Bayes Factors for these tasks than as stand-alone, pure information measures, but they can be treated as a sort of crude gauge of a priori knowledge.
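The sign convention is easy to see in miniature. In this Python sketch the two probabilities are invented for illustration, but the pattern mirrors the BayesFactorForHypothesisTesting and BayesFactorInDecibans columns above:

```python
from math import log10

def bayes_factor_decibans(p_dividend, p_divisor):
    """Bayes Factor in decibans; negative values favor the divisor."""
    return 10 * log10(p_dividend / p_divisor)

print(round(bayes_factor_decibans(0.2, 0.6), 4))       # negative: favors divisor
print(round(bayes_factor_decibans(0.6, 0.2), 4))       # positive: favors dividend
print(round(abs(bayes_factor_decibans(0.2, 0.6)), 4))  # ABS() for comparison
                                                       # with entropic measures
```

Swapping dividend and divisor merely flips the sign, which is why the hypothesis-testing version keeps it while the ABS version discards it.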
The mathematical properties that make Bayes Factors attractive for hypothesis testing and model selection were worked out in detail long ago, but some of them – such as additivity and eventual convergence to a Gaussian or “normal” distribution – might be useful in terms of an information metric as well.
…………I’m sure that theoreticians have worked out their relationships to measures of information entropy already, but I haven’t gotten that far in the literature yet. I do know that the journal Entropy recently published a special edition on integrating Bayesian stats with important information theory axioms like the Principles of Maximum Entropy and Minimum Cross Entropy, so apparently it’s still a hot research topic. I strongly suspect that they share much in common with the Self-Information metric I introduced in the last article, given that all we have to do to derive the ratio of Self-Information is to apply the LOG operations first, then perform the division. I omitted those calculations in order to shrink the graphic in Figure 2, but calculating the ratio of the Self-Information of these two distinct values for Column1 and Column2 yields a figure of 1.06758789451323 for base 2, compared to 0.786263282058452 for the BayesFactorInBits depicted above. It would actually be quite economical to calculate Bayes Factors and Self-Information together, just as it wouldn’t cost us much more to derive the Conditional and Joint Entropies in the same pass across the original table in the INSERT. If we’re going to derive the two individual probabilities and one conditional probability across two columns, it’s really simple in T-SQL to derive the other conditional probability alongside them rather than calculating it backwards from Bayes’ Theorem.
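The distinction between the two figures is just the order of operations: the Self-Information ratio applies the LOG first and divides afterwards, while the Bayes Factor divides first and applies the LOG to the result. This Python sketch uses invented proportions, not the actual Higgs Boson values, to show that the two quantities generally differ:

```python
from math import log2

# Hypothetical proportions standing in for two distinct values' probabilities.
p1, p2 = 0.03, 0.05

# LOG first, then divide: the ratio of the two Self-Information values.
self_information_ratio = log2(p1) / log2(p2)

# Divide first, then LOG: the Bayes Factor in bits (ABS, as in Figure 1).
bayes_factor_bits = abs(log2(p1 / p2))

print(round(self_information_ratio, 4), round(bayes_factor_bits, 4))
```

Since the LOG of a ratio is a difference of LOGs rather than a quotient of them, the two measures only coincide in special cases, which is why it pays to compute both in the same pass.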
Uses and Interpretations of Bayes Factors
The popular examples of counting balls of different colors in separate urns would be easier to solve in SQL Server, given the extensive counts that are readily available in our tables and cubes. Bayesian probability would probably find wider use cases in terms of updating probabilities as new information comes in, which calls for interpreting them as “beliefs” in a manner similar to that introduced in Implementing Fuzzy Sets in SQL Server, Part 10.1: A Crude Introduction to Dempster-Shafer Evidence Theory. Bayesian stats can also be integrated with Decision Theory. In fact, it intersects with many other areas of statistics, which give rise to a wide array of use cases; for example, when Maximum Likelihood Estimates (MLEs) are used in Bayes Factors, it basically turns into an ordinary likelihood ratio test. Bayes Factors are thus useful in a smorgasbord of applications beyond serving as a pure information metric, which really doesn’t even qualify as the tip of the iceberg because it’s a somewhat obscure usage. Many of those applications occur in the field of data mining, including the Bayesian stats used in SSDM’s Naïve Bayes algorithm, which I covered in A Rickety Stairway to SQL Server Data Mining, Algorithm 1: Not-So-Naïve Bayes. Bayes Factors might be adapted for all kinds of DIY mining algorithms, but I wonder if Bayesian stats might be integrated in interesting ways with my pet topic, neural nets. The human brain cannot act on information it does not have, so it stands to reason that it must have a particularly efficient way of using priors to update beliefs, to help us quickly zero in on truth (or the fallacies we prefer out of pride, which has been one of mankind’s most tragic imperfections throughout history) out of an endless sea of wrong answers. It might daisy-chain priors together to update our internal knowledge bases with incoming sensory input every split second, perhaps by calibrating anticipation in the same manner theorized by artificial intelligence researcher Jeff Hawkins.
…………The limits of human perception of differences in degrees of belief apparently occur around 1 deciban. This corresponds to inserting a ratio of 1.258925411794168 into the LOG operation, or 0.332192809488737 bits using base 2. Various informal scales for gauging the strength of evidence have been developed over the years, including one by famed Bayesian Harold Jeffreys and a more recent one by statistics Profs. Robert E. Kass and Adrian E. Raftery. They’re not terribly different from the one mentioned in a .pdf document published by Biostatistics Prof. Leonhard Held, who says that odds of 1:1 to 3:1 are “barely worth mentioning,” 3:1 to 20:1 are “substantial” and 20:1 through 150:1 are “strong,” with anything beyond that constituting “very strong” evidence in favor of the hypothesis embodied in the dividend. The three fences of 3:1, 20:1 and 150:1 correspond to 4.77121254719662, 13.0102999566398 and 21.7609125905568 decibans respectively, or 1.58496250072116, 4.32192809488736 and 7.22881869049588 bits.
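Those conversions are easy to double-check. This short Python sketch translates the three fences from Held's scale into decibans and bits:

```python
from math import log10, log2

# Odds of 3:1, 20:1 and 150:1, converted to decibans and to bits.
fences = {}
for odds in (3, 20, 150):
    fences[odds] = (10 * log10(odds), log2(odds))
    print(odds, round(fences[odds][0], 4), round(fences[odds][1], 4))
```

The output matches the figures in the text: roughly 4.77, 13.01 and 21.76 decibans, or 1.58, 4.32 and 7.23 bits.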
Too Much of a Good Thing: The Danger of Fringe Interpretations of Bayesian Statistics
Beyond these somewhat subjective criteria, the inherent meaning of Bayesian statistics is hotly debated. As is often the case in fields related to information theory, the difficulty rests in the interpretation, not the calculations; in fact, interpretation of Bayesian stats has been the subject of incessant trench warfare among statisticians and researchers in related fields for the last few decades. Apparently the gist is that some Bayesians claim their techniques supersede all others, including those for describing uncertainty, which many other statisticians take issue with. Peter Cheeseman, a leading Bayesian, took note of the “nearly religious fervor” and “stand-up fights” that arose in the field as a result, as did many of his counterparts on the other side of the fence. This rivalry apparently spilled over into funding wars and appointments; for example, McNeill laments in Fuzzy Logic that Bayesians allegedly torpedoed fuzzy research in the U.S. in the 1980s. He points out, however, that at the time he wrote two decades ago, there were no Bayesian products in the statistical marketplace yet, but plenty that made use of fuzzy sets. George J. Klir, one of the authors of my favorite mathematical resource on fuzzy sets, criticizes Bayesianism on the grounds that it assumes degrees of belief when the corresponding numbers are in fact hazy. “It forces us to be certain about our uncertainty,” Klir says. “By making vague expectations precise, the Bayesians fail to know their ignorance, and by disdaining to model vagueness appropriately, they ignore available knowledge.” He then suggests that they read the wisdom of the Tao Te Ching: “Knowing ignorance is strength. Ignoring knowledge is sickness.”
…………Bayes Factors open up a whole world of Bayesian statistics that I have yet to set foot in, and do not intend to incorporate much of in this series, which is strictly concerned with coding information metrics in SQL Server for purposes of DIY data mining. Therefore, I won’t get into a holy war about something I don’t know enough about by joining one camp or another at this point. I’ll stick with my usual approach of finding ways of pragmatically incorporating these techniques in my toolbelt wherever I can and matching them up with appropriate use cases; I agree with some of the Bayesian criticisms of standard statistical tests involving P-values, confidence levels and the like, but do not think they warrant jettisoning the whole “frequentist” gamut of techniques. If the characterizations of some Bayesians as extremists who insist on throwing out many of the competing tools are accurate, then I would definitely have a problem with that kind of obsessiveness. I would also point out that the common Bayesian refrain that “certainty is simply total belief” does not necessarily wash, since a person can be totally certain of falsehoods, even to the point of willingly embracing bad evidence; as Bill W. and his disciples are wont to point out, “Denial just ain’t a river in Egypt.” Other misgivings about Bayesian inference would also have to be cleared up, such as the ramifications of using it to reason “backwards from the evidence to the hypothesis.” Counter-intuitive results are also known to arise in Bayesian interpretations, just as they do with the Evidence Theory belief and plausibility measures I covered in the fuzzy sets tutorial series. For example, in the well-known “Oliver’s Blood” problem, if two blood samples are left at a crime scene with Type O and Type AB blood, the evidence would be slightly against a person with Type O blood accounting for one of them, even though the former is more common than the latter among the general population by odds of about 60:1.
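The arithmetic behind Oliver's Blood is brief. This sketch uses the population frequencies from the MacKay/Downey statement of the problem, with Type O at 60 percent and the rarer type at 1 percent:

```python
# Oliver has Type O blood; two samples, one of each type, were found.
p_o, p_rare = 0.6, 0.01

# If Oliver left one sample, the other person must account for the rare type.
likelihood_oliver = p_rare
# If two unknown people left them, either could account for either type.
likelihood_not_oliver = 2 * p_o * p_rare

bayes_factor = likelihood_oliver / likelihood_not_oliver
print(round(bayes_factor, 4))   # about 0.8333 -- slightly against Oliver
```

Since the ratio falls just below 1, the evidence weighs slightly against Oliver despite Type O being far more common, which is exactly the counter-intuitive flavor the paragraph above describes.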
I’m not sure how Bayesians compensate for fallacious reasoning that often arises from ordinary probability problems, such as the common belief that priors have a direct, causative relationship in determining posteriors. This is an accurate statement in certain card games where draws without replacement are the rule (or against the rules, in the case of Ben Affleck), but not in coin tosses, where getting ten heads in a row has no effect whatsoever on the odds of getting an eleventh. This is one of the most dangerous aspects of probability theory, a trap I’d argue that some physicists have fallen into by overinterpreting quantum theory as arising from probabilities, rather than interpreting them as field effects.
…………The ultimate culprit there is the same as in extreme interpretations of Bayesianism: the idea that reality is subjective. Both Bayesian and frequentist models of probability are ultimately derived from combinatorics, as are Bayes Factors and entropic measures; this in turn gives them a rigid foundation in logic, but leavening it with the wrong kind of subjectivity might be tantamount to adding a fly to the ointment. It is critical to always keep in mind that subjectivity is fine when it is strictly a matter of gauging personal aesthetic preferences, including the questions we prefer to ask while data mining. “Is my data mine played out?” is an example of a subjectively selected question, which might be answered through such objective means as declining information gain and a progressive lack of informative priors. Given that probabilities do not determine outcomes, we might have to check every last square inch of ground to be certain that a particular mine does not hide the nuggets of gold we’re looking for, but these measures can serve as a reasonable guide on when to stop and look elsewhere. As I’ve pointed out many times in previous posts, subjectivity becomes problematic when confused with the answers to those questions, which are always a matter of hard, cold, objective reality. I’ve seen noted authors in the hard sciences and mathematics unconsciously let their definitions of these things drift across the course of their books, in ways which obscure this dividing line without justification. To put it bluntly, subjectivity in the wrong place leads to madness; when men have accepted the idea that their personal thoughts affect reality, they end up in institutions, and when the same ideas infect entire civilizations, they come crashing down.
It is a permanent temptation of mankind, a dangerous undercurrent that has reared its head time and again throughout human intellectual history in guises like phenomenalism, nominalism, solipsism (which is considered a marker of schizophrenia, unless it is championed by professors) and certain misuses of David Hume’s philosophy. It is dangerous enough when civilizations tamper with the definition of purely abstract ideas like paper money (which always leads to hyperinflation), but doubly so when subjectivity degrades a civilization’s capacity for wise decision-making. As I have pointed out from time to time in the fuzzy set tutorial series and A Rickety Stairway to SQL Server Data Mining, many of these mind-blowing concepts are apparently fraught with danger; perhaps that is why so many of the “rock star” mathematicians and physicists have gone mad or become degenerates, although I’d wager that overweening pride is more to blame. In the next installment I’ll pilot this series through less dangerous waters that I nevertheless really don’t have the expertise to navigate, in search of answers about the Kullback-Leibler Divergence, a tried-and-true distance measure used often in data mining.
 Pierre-Simon Laplace (1749-1827). See the Wikipedia page “Pierre Simon Laplace” at http://en.wikipedia.org/wiki/Pierre-Simon_Laplace for his birth and death dates.
 See Statistical, LLC, 2015, “Likelihood,” available at the Bayesian-Inference web address http://www.bayesian-inference.com/likelihood and the Wikipedia pages “Bayes’ Theorem” and “Bayes’ Rule” at http://en.wikipedia.org/wiki/Bayes%27_theorem and http://en.wikipedia.org/wiki/Bayes%27_rule
 See the Wikipedia page “Likelihood Ratio” at http://en.wikipedia.org/wiki/Likelihood_ratio
 p. 48, Reza, Fazlollah M., 1961, An Introduction to Information Theory. McGraw-Hill: New York.
 p. 178, McNeill, Dan, 1993, Fuzzy Logic. Simon & Schuster: New York.
 This dataset is made publicly available by University of California at Irvine’s Machine Learning Repository. I downloaded it a few tutorial series ago and converted it to a SQL Server table, which now takes up about 5 gigabytes of space in a sham DataMiningProjects database I’ve been using ever since for practice data.
 “In order to complete the definition of a Bayesian model, both the prior distributions and the likelihood must be approximated or fully specified.” See Statistical, LLC, 2015, “Likelihood,” available at the Bayesian-Inference web address http://www.bayesian-inference.com/likelihood
 pp. 45-46, Downey, Allen B., 2012, Think Bayes: Bayesian Statistics Made Simple, Version 1.03. Green Tea Press: Needham, Mass. Available online at the Green Tea Press web address http://www.greenteapress.com/thinkbayes/thinkbayes.pdf
 Held, Leonhard, 2011, “Introducing Bayes Factors,” monograph published Nov. 25, 2011 at the Vienna University for Economics and Business Institute for Statistics and Mathematics web address http://statmath.wu.ac.at/research/talks/resources/talkheld.pdf
 See the Wikipedia article “Bayes Factor” at http://en.wikipedia.org/wiki/Bayes_factor
 Good, I. J., 2011, “A List of Properties of Bayes-Turing Factors,” undated monograph released March 9, 2011 by the National Security Agency as a result of Freedom of Information Act request #58820. Available online at the NSA web address https://www.nsa.gov/public_info/_files/tech_journals/list_of_properties.pdf
 Giffin, Adom, 2013, “Special Issue ‘Maximum Entropy and Bayes Theorem’,” commentary published Jan. 7, 2013 at the Multidisciplinary Digital Publishing Institute web address http://www.mdpi.com/journal/entropy/special_issues/bayes-theorem
 pp. 147, 238, Hawkins, Jeff, 2004, On Intelligence. Times Books: New York.
 Good, I.J., 1985, “Weight of Evidence: A Brief Survey,” pp. 249-270 in Bayesian Statistics 2, Bernardo J.M.; DeGroot, M.H.; Lindley, D.V. and Smith, A.F.M., eds. Elsevier Science Publishers: North Holland. Available online at the California Water Boards web address http://www.waterboards.ca.gov/water_issues/programs/tmdl/docs/303d_policydocs/207.pdf . Originally cited at the Wikipedia webpage “Ban (Unit)” at http://en.wikipedia.org/wiki/Ban_(unit)
 Kass, Robert E. and Raftery, Adrian E., 1995, “Bayes Factors,” pp. 773-795 in Journal of the American Statistical Association, June 1995. Vol. 90, No. 430. Available online at the Carnegie Mellon web address http://www.andrew.cmu.edu/user/kk3n/simplicity/KassRaftery1995.pdf. I learned of it through the Wikipedia article “Bayes Factor” at http://en.wikipedia.org/wiki/Bayes_factor
 p. 2, Held.
 p. 181, McNeill.
 IBID., p. 181.
 IBID., p. 176.
 IBID., p. 190.
 IBID., pp. 188-189.
 IBID., pp. 177-178.
 p. 179, McNeill.
 pp. 45-46, Downey. The original source was MacKay, David J. C, 2003, Information Theory, Inference, and Learning Algorithms. Cambridge University Press: New York.
 You might be in better hands with Gene Wilder.