Statistics in SQL: Simple Linear Regressions

Although linear regressions can get complicated, most jobs involving the plotting of a trendline are easy. Simple Linear Regression is handy for the SQL Programmer in making a prediction of a linear trend and giving a figure for the level probability for the prediction, and what is more, they are easy to do with the aggregation that is built into SQL.

The series so far:

  1. Statistics in SQL: Pearson’s Correlation
  2. Statistics in SQL: Kendall’s Tau Rank Correlation
  3. Statistics in SQL: Simple Linear Regressions
  4. Statistics in SQL: The Kruskal–Wallis Test
  5. Statistics in SQL: The Mann–Whitney U Test
  6. Statistics in SQL: Student's T Test

Let’s imagine that we have two variables, X and Y which we then plot using scatter graph.

C:\Users\andrew.clarke\SkyDrive\Documents\scatterPlot.gif

It looks a bit like someone has fired a shotgun at a wall but is there a relationship between the two variables? If so, what is it? There seems to be a weak positive linear relationship between the two variables here so we can be fairly confident of plotting a trendline.

Here is the data, and we will proceed to calculate the slope and intercept. We will also calculate the correlation.

It gives the result…

Alpha was calculated by this expression, …

… Beta was calculated thus.

… and Rho by this

The Pearson’s Product Moment correlation can be used to calculate the probability of this being a significant correlation via the Fisher Transformation.

The slope (beta) and intercept (alpha) can be used to calculate any point on the trendline.

We can, for example calculate the value of Y when X is 100 by the equation
y’ = a + bx

Is, in our example

We can calculate the trendline X,Y points by slightly altering the SQL

We can then plot a trend line

C:\Users\andrew.clarke\SkyDrive\Documents\scatterPlot.gif

We are, of course, assuming a linear relationship and parametric data. With Pearson’s Rho and the number of X-Y pairs, we can go on to estimate the fisher transformation (t= 1.912) and probability of 0.061825.

The PowerShell to do the Gnuplot graphs is here