# Calculating odds ratio with T-SQL and R

• Comments posted to this topic are about the item Calculating odds ratio with T-SQL and R

• Thanks for the article!

• Perhaps my maths are a bit rusty from years of disuse, but it seems the formulas stated and the formulas coded for the confidence levels don't match. The article states:

`Lower confidence interval = Log(OR) - 1.96* Standard Error* LN(OR) `

`Upper confidence interval = Log(OR) + 1.96* Standard Error* LN(OR)`

but the examples are coded as if the formulas should be:

`Lower confidence interval = (Log(OR) - 1.96* Standard Error)* LN(OR) `

`Upper confidence interval = (Log(OR) + 1.96* Standard Error)* LN(OR)`

My apologies if I've missed something!

Thanks and regards,

Reece Watkins

• Thanks for the education.

• Are you blogging about this elsewhere? I know a small group of people who might enjoy the content without the technical stuff.

412-977-3526 call/text

• Sorry for missing the brackets there. Will get that corrected, good find, thank you.
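For anyone following along, the bracket question is easiest to see in code. The sketch below uses the standard textbook form of the interval, in which the bounds are built on the log scale and then exponentiated: exp(ln(OR) ± 1.96 × SE), with SE = sqrt(1/a + 1/b + 1/c + 1/d). The function name `odds_ratio_ci` and the 2x2 counts are invented for illustration; they are not the article's data or its T-SQL/R code.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and approximate confidence interval for a 2x2 table.

    a = exposed with outcome,   b = exposed without outcome
    c = unexposed with outcome, d = unexposed without outcome
    z = 1.96 corresponds to a 95% interval.
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cell counts must be positive")
    odds_ratio = (a * d) / (b * c)
    log_or = math.log(odds_ratio)
    # Standard error of ln(OR), not of OR itself
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    # Add/subtract on the log scale first, then transform back
    lower = math.exp(log_or - z * se)
    upper = math.exp(log_or + z * se)
    return odds_ratio, lower, upper


# Hypothetical counts, for illustration only
or_, lo, hi = odds_ratio_ci(a=120, b=380, c=70, d=430)
print(f"OR = {or_:.2f}, 95% CI = ({lo:.2f}, {hi:.2f})")
```

The point of the grouping is that `z * se` is added to or subtracted from `log_or` as a whole before anything else is applied, which is why the interval is symmetric around the odds ratio on the log scale but not on the original scale.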

• Hello, yes I do plan to write on the same lines at curiousaboutdba.com, my personal blog. I write one every week and there are a few there. Will submit more interesting ones to be published here. My posts will have technical stuff though as my goal is to educate and learn on T-SQL versus R too. I hope that is ok. For basic understanding you can skip that part. Thank you.

• That is exactly what I wanted to know. Thanks!

412-977-3526 call/text

• Yes it is. And I have a link to that in the article too. Thank you!!

• It seems to me that the odds ratio is equivalent to a percentage. n1/n2 only tells me how many of the bottom subgroup there are for each one of the top subgroup. The 'odds' is relative to the total population: n1/(n1+n2). This is the odds of someone being in group n1 relative to the total population/universe. So maybe I am mistaken, but the statistical math may have been laid out incorrectly here, based on using the term "odds" incorrectly. There could have been many subgroups here, not just two.

The definition of a confidence interval is not explained as well as it could be. It reads more like a refresher for those who already know this.

It is a good illustration of the basic syntactic structure of R, though.

----------------------------------------------------

• Hello. The difference between probability and odds ratio is best explained at http://mathforum.org/library/drmath/view/56706.html.

I intentionally did not go into too much math on the confidence interval. This is a very basic post; the SQL Server world is not full of statisticians, and many people get very intimidated by too much explanation with strange terms. Keeping it simple is wholly intentional, and if you know advanced statistics this will clearly be very low level for you.

Thank you.

• The odds ratio is better explained here, since there is a more detailed example: https://en.wikipedia.org/wiki/Odds_ratio

It makes better sense now, though I wouldn't necessarily call this simplistic.

"the sql server world is not full of statisticians and many people get very intimidated by too much explanation with strange terms."

Is this a fact or an assumption? (I am getting into statistics here 😀)

Still, for those who are new to stats, I think it would be nicer to have a more involved definition of terms. I would not underestimate the competence level of members here to pick up new knowledge.

----------------------------------------------------

• 'I would not underestimate the competence level of members here to pick up new knowledge.' Neither do I, but I do think the average SQL Server person generally does not need an in-depth understanding of confidence levels and other statistical terms. R programming is only just picking up, and the vast majority of SQL Server jobs do not demand it as a skill. The problem is where to draw the line between getting terribly statistical and keeping it relevant to readers who are basically SQL Server people, not statisticians. That is why I like to start from simple basics and keep new terms to a minimum, and to focus on data analysis and presenting findings rather than explaining a whole lot of raw statistics. Thanks in advance for your understanding; I would appreciate insightful comments and criticism on the other posts I have coming.

• In reply to Diligentdba 46159 (8/31/2016): No problem at all. I do appreciate the column and your feedback as well. If you'll allow me, though, I'd like to be more specific on a few items:

1.

"what can be deemed to be the most commonly used statistical concept - the odds ratio"

I still think I see probabilities far more often. I've not heard odds ratios used in election polls, for example.

2.

"Simply put, odds are expressed as ratios while probability is expressed as a fraction or a percentage of an outcome."

Here people can still be confused: what is the difference between a ratio and a fraction? A percentage value can be 400% (4/1), whereas a ratio is always >=0 and <=1, if I recall correctly. Is this right? I think it is worth an extra line or two of explanation to make the rest of the article easier to follow.

3.

"We have to be able to say that 95% of the time the correlation between smoking status and health is in the range of x and y, where x and y are considered upper and lower confidence intervals."

What is meant by "95% of the time"? This is what I was thinking of specifically when it came to the confidence interval. I think it means that if you repeated the experiment 100 times, the ratio would fall between your lower and upper bounds 95 times, so 95/100 is a strong case for the odds ratio suggested. I am not sure if it means 95/100 smokers will develop bad health... though I wouldn't doubt that either. 😛

Thanks again.

----------------------------------------------------

• 1 & 2: Fraction: chances for / total chances. Odds: chances for : chances against. I am not debating what you have personally heard of more. In my experience people use odds ratio a *lot*, and a lot of people find probability and math very intimidating.

That does not necessarily mean they use the right concept mathematically. I have seen many use chances as chances for / total chances without knowing they are technically using a probability ratio, but that is OK, in my opinion at least.
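The two definitions above can be put side by side in a few lines. This is only a sketch: the helper names (`probability`, `odds`, `odds_to_probability`) and the counts are made up for illustration, and exact fractions are used so the results are easy to read.

```python
from fractions import Fraction


def probability(chances_for, chances_against):
    """Chances for / total chances. Always between 0 and 1."""
    return Fraction(chances_for, chances_for + chances_against)


def odds(chances_for, chances_against):
    """Chances for : chances against. Can be any non-negative value."""
    return Fraction(chances_for, chances_against)


def odds_to_probability(o):
    """Convert odds back to a probability: o / (1 + o)."""
    return o / (1 + o)


# Hypothetical example: 1 chance for, 3 chances against
p = probability(1, 3)
o = odds(1, 3)
print(p, o, odds_to_probability(o))  # prints: 1/4 1/3 1/4

# With 4 chances for and 1 against, the odds exceed 1 but the probability does not
print(odds(4, 1), probability(4, 1))  # prints: 4 4/5
```

An odds ratio is then simply one group's odds divided by another group's odds, which is why it can be far from 1 even when both underlying probabilities are small.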

3: I am 95% confident that the chances of smokers getting cancer are between 1.82 and 2.10 times higher than those of a non-smoker.

95% is just a commonly used percentage in this context, like 20% sampling for SQL Server statistics.

http://www.mathbootcamps.com/interpreting-confidence-intervals/

"95% of the time, when we calculate a confidence interval in this way, the true mean will be between the two values. 5% of the time, it will not. Because the true mean (population mean) is an unknown value, we don’t know if we are in the 5% or the 95%. BUT 95% is pretty good so we say something like

“We are 95% confident that the mean time it takes all workers in this city to get to work is between 18.3 and 23.7 minutes.” This is a common shorthand for the idea that the calculations “work” 95% of the time."
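The "calculations work 95% of the time" idea in that quote can be checked with a small simulation. The sketch below is an illustration under assumptions of my own, not something from the article or the linked page: it draws repeated samples from a normal population with a known mean (the values 21.0 and 5.0 are invented), builds a mean ± 1.96 standard errors interval from each sample, and counts how often the interval contains the true mean.

```python
import math
import random
import statistics


def coverage(true_mean=21.0, sd=5.0, n=50, trials=2000, z=1.96, seed=1):
    """Share of simulated intervals that contain the true mean."""
    rng = random.Random(seed)  # fixed seed so reruns give the same result
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mean, sd) for _ in range(n)]
        sample_mean = statistics.mean(sample)
        half_width = z * statistics.stdev(sample) / math.sqrt(n)
        if sample_mean - half_width <= true_mean <= sample_mean + half_width:
            hits += 1
    return hits / trials


print(f"share of intervals containing the true mean: {coverage():.3f}")
```

The share should land close to 0.95. Any single interval either contains the true mean or it does not; the 95% describes how often the procedure succeeds over many repetitions.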
