I could use some help from anyone with a data science background.
I'm trying to learn R. I have a simple idea for proving the usefulness of R to a DBA, and I want to turn it into a blog post. In SQL Server 2014, the new cardinality estimator started using exponential backoff when combining the densities of compound column keys. This is because, in the most common cases, you're looking at data whose columns are related, not data whose columns are completely independent of each other. So, for example, the region and the salesperson on each sale are likely to be related rather than an arbitrary combination.
But there is always going to be exceptional data, so Microsoft provides a trace flag you can use to force the optimizer back to the old method of density * density.
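To make the difference concrete, here is a minimal sketch of the two calculations as I understand them. It is written in Python only because the arithmetic is language-neutral. The function names and the sample densities are made up for illustration, and the backoff rule (most selective value first, each later value raised to 1/2, 1/4, 1/8, at most four values) is my reading of how the new estimator combines values, not something lifted from the engine.

```python
def independence_estimate(*densities):
    """Old method: multiply the densities as if the columns were unrelated."""
    result = 1.0
    for d in densities:
        result *= d
    return result


def backoff_estimate(*densities):
    """Exponential backoff (assumed form): sort so the most selective
    (smallest) density comes first, then dampen each later one with
    exponents 1, 1/2, 1/4, 1/8. Only the first four values are used."""
    ordered = sorted(densities)[:4]
    result = 1.0
    for i, d in enumerate(ordered):
        result *= d ** (1.0 / 2 ** i)
    return result


# Hypothetical numbers: 10 regions (density 0.1), 100 salespeople (density 0.01).
region_density = 0.1
salesperson_density = 0.01

print("density * density  :", independence_estimate(region_density, salesperson_density))
print("exponential backoff:", backoff_estimate(region_density, salesperson_density))
```

With these made-up numbers the backoff estimate comes out larger than the plain product, which is the point: it assumes the second column narrows things down less than full independence would imply.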
However, wouldn't it be cool if there were a quick way to feed two columns from a query into an R function to see whether the data has the kind of relationship we're talking about between a region and a salesperson? Enter R.
I'm absolutely convinced this is an easy problem and I'm just missing the right formula. I've been looking at the Chi-Square Test of Independence and the Chi-Square Test of Homogeneity, but I don't think they're right. If I understand this (and I'm really stretching a bit), they're for two-way tables and the data I'm looking at is a one-way table.
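For comparison, a crude check that needs no statistics at all is simply to count distinct values. The sketch below (again Python purely for illustration, with a hypothetical function name and made-up sample rows) compares the density actually observed for the column pair against what density * density would predict. It is not a significance test, which is the part I'm missing.

```python
def compare_densities(rows):
    """rows: iterable of (region, salesperson) pairs, one per sale.

    Returns (observed, independent, ratio), where
      observed    = 1 / number of distinct (region, salesperson) pairs
      independent = (1 / distinct regions) * (1 / distinct salespeople)
      ratio       = observed / independent
    A ratio near 1 means every combination occurs, so the columns look
    independent; a ratio well above 1 means the columns are related.
    """
    pairs = {tuple(row) for row in rows}
    if not pairs:
        raise ValueError("need at least one row")
    regions = {region for region, _ in pairs}
    people = {person for _, person in pairs}
    observed = 1.0 / len(pairs)
    independent = (1.0 / len(regions)) * (1.0 / len(people))
    return observed, independent, observed / independent


# Made-up sample: each salesperson sells in exactly one region.
sales = [
    ("East", "Ann"), ("East", "Bob"),
    ("West", "Cal"), ("West", "Dee"),
    ("East", "Ann"),
]
observed, independent, ratio = compare_densities(sales)
print(observed, independent, ratio)
```

What this cannot say is whether a given ratio is large enough to matter or just noise from a small sample, and that is where I was hoping the right formula would come in.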
My question: which set of formulas should I be using? Could anyone give me a nudge in the right direction, please?