SQL Clone
SQLServerCentral is supported by Redgate
 
Log in  ::  Register  ::  Not logged in
 
 
 


Which Algorithm?


Which Algorithm?

Author
Message
Grant Fritchey
Grant Fritchey
SSC Guru
SSC Guru (237K reputation)SSC Guru (237K reputation)SSC Guru (237K reputation)SSC Guru (237K reputation)SSC Guru (237K reputation)SSC Guru (237K reputation)SSC Guru (237K reputation)SSC Guru (237K reputation)

Group: General Forum Members
Points: 237377 Visits: 33641
I recognize that cross posting is not generally considered good form, but this forum didn't exist yesterday. I'm adding my question here.

I could use some help from anyone with some data science in their background.

I'm trying to learn R. I have a simple idea to prove the usefulness of R to a DBA that I want to turn into a blog post. In 2014, the new cardinality estimator started using exponential backoff in the density for compound column keys. This is because in the most common cases, you're looking at data that has relationships, not data that is completely unique. So, for example, a region and a salesperson are likely to be related, not a unique relationship, in each sale.

But, there are going to be exceptional data, so Microsoft has a traceflag you can use to force the optimizer to use the old method of density * density.

However, wouldn't it be cool if there was a quick way to feed two columns into a query against an R function to see if the data has the types of relationships we're talking about between a region and a salesperson? Enter R

I'm absolutely convinced this is an easy problem and I'm just missing the right formula. I've been looking at the Chi-Square Test of Independence and the Chi-Square Test of Homogeneity, but I don't think they're right. If I understand this (and I'm really stretching a bit), they're for two-way tables and the data I'm looking at is a one-way table.

My question, which set of formulas should I be using? Could anyone give me a nudge in the right direction please?

----------------------------------------------------
The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood...
Theodore Roosevelt

The Scary DBA
Author of: SQL Server Query Performance Tuning and SQL Server Execution Plans
Product Evangelist for Red Gate Software
BuckWoody
BuckWoody
Right there with Babe
Right there with Babe (749 reputation)Right there with Babe (749 reputation)Right there with Babe (749 reputation)Right there with Babe (749 reputation)Right there with Babe (749 reputation)Right there with Babe (749 reputation)Right there with Babe (749 reputation)Right there with Babe (749 reputation)

Group: General Forum Members
Points: 749 Visits: 76
Not sure I have enough info to answer this, but I have a cheat-sheet referenced on this page that might help, along with some ideas: https://buckwoody.wordpress.com/2015/12/02/knowing-which-statistical-formula-to-use/


Buck Woody
MCDBA, MCSE, Novell and Sun Certified
mattmparker
mattmparker
SSC-Enthusiastic
SSC-Enthusiastic (116 reputation)SSC-Enthusiastic (116 reputation)SSC-Enthusiastic (116 reputation)SSC-Enthusiastic (116 reputation)SSC-Enthusiastic (116 reputation)SSC-Enthusiastic (116 reputation)SSC-Enthusiastic (116 reputation)SSC-Enthusiastic (116 reputation)

Group: General Forum Members
Points: 116 Visits: 10
So, for example, a region and a salesperson are likely to be related, not a unique relationship, in each sale.


I think you're on the right track, and just getting stuck on the terminology. I think if any of these questions is the question you're asking, a chi-squared test of independence is appropriate:

- Is there a relationship between region and salesperson?
- Given the name of the salesperson, could I make a better-than-chance guess at the region in which a sale happened? (and vice versa)
- Does the salesperson-region relationship in the data reflect the database's previous assumption (uniform distribution) or its new assumption (some kind of non-uniformity)?

I'm way out of my depth on all of this, so if any of this doesn't make sense, don't hesitate to let me know.
Grant Fritchey
Grant Fritchey
SSC Guru
SSC Guru (237K reputation)SSC Guru (237K reputation)SSC Guru (237K reputation)SSC Guru (237K reputation)SSC Guru (237K reputation)SSC Guru (237K reputation)SSC Guru (237K reputation)SSC Guru (237K reputation)

Group: General Forum Members
Points: 237377 Visits: 33641
BuckWoody (12/16/2015)
Not sure I have enough info to answer this, but I have a cheat-sheet referenced on this page that might help, along with some ideas: https://buckwoody.wordpress.com/2015/12/02/knowing-which-statistical-formula-to-use/


Thanks Buck. That's already one of my go-to resources as I start figuring all this out. I appreciate what you're doing over there.

----------------------------------------------------
The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood...
Theodore Roosevelt

The Scary DBA
Author of: SQL Server Query Performance Tuning and SQL Server Execution Plans
Product Evangelist for Red Gate Software
mattmparker
mattmparker
SSC-Enthusiastic
SSC-Enthusiastic (116 reputation)SSC-Enthusiastic (116 reputation)SSC-Enthusiastic (116 reputation)SSC-Enthusiastic (116 reputation)SSC-Enthusiastic (116 reputation)SSC-Enthusiastic (116 reputation)SSC-Enthusiastic (116 reputation)SSC-Enthusiastic (116 reputation)

Group: General Forum Members
Points: 116 Visits: 10
After reading up on cardinality, another couple of questions occurred to me:

- Does the density/distribution of salesperson-region relationships have high cardinality (salespeople strongly associated with one or two regions), or low (salespeople working across many regions)?
- Does the density of those relationships better match the old optimizer assumption, or the new one (exponential)?

In that case, an alternative might be to plot three densities: your empirical density, the previously-assumed density (uniform?), and the new density (exponential?). There are a few examples here, but tons more on the web: https://heuristically.wordpress.com/2012/06/13/comparing-continuous-distributions-with-r/

After that, there are a whole slew of goodness-of-fit tests for checking whether your data fit a given distribution or not; off the top of my head, you could probably start fitdistr() from the MASS package.

No clue how you'd go about generating all of those densities, though!
Grant Fritchey
Grant Fritchey
SSC Guru
SSC Guru (237K reputation)SSC Guru (237K reputation)SSC Guru (237K reputation)SSC Guru (237K reputation)SSC Guru (237K reputation)SSC Guru (237K reputation)SSC Guru (237K reputation)SSC Guru (237K reputation)

Group: General Forum Members
Points: 237377 Visits: 33641
mattmparker (12/16/2015)
So, for example, a region and a salesperson are likely to be related, not a unique relationship, in each sale.


I think you're on the right track, and just getting stuck on the terminology. I think if any of these questions is the question you're asking, a chi-squared test of independence is appropriate:

- Is there a relationship between region and salesperson?
- Given the name of the salesperson, could I make a better-than-chance guess at the region in which a sale happened? (and vice versa)
- Does the salesperson-region relationship in the data reflect the database's previous assumption (uniform distribution) or its new assumption (some kind of non-uniformity)?

I'm way out of my depth on all of this, so if any of this doesn't make sense, don't hesitate to let me know.


Yeah, I'm looking to see if there is a relationship between region and salesperson, so it sounds like the chi-square test of independence is the way to go. I'm working on building a new data set for my tests. I'll post results and code back here when I get them in hand.

----------------------------------------------------
The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood...
Theodore Roosevelt

The Scary DBA
Author of: SQL Server Query Performance Tuning and SQL Server Execution Plans
Product Evangelist for Red Gate Software
Grant Fritchey
Grant Fritchey
SSC Guru
SSC Guru (237K reputation)SSC Guru (237K reputation)SSC Guru (237K reputation)SSC Guru (237K reputation)SSC Guru (237K reputation)SSC Guru (237K reputation)SSC Guru (237K reputation)SSC Guru (237K reputation)

Group: General Forum Members
Points: 237377 Visits: 33641
mattmparker (12/16/2015)
After reading up on cardinality, another couple of questions occurred to me:

- Does the density/distribution of salesperson-region relationships have high cardinality (salespeople strongly associated with one or two regions), or low (salespeople working across many regions)?
- Does the density of those relationships better match the old optimizer assumption, or the new one (exponential)?

In that case, an alternative might be to plot three densities: your empirical density, the previously-assumed density (uniform?), and the new density (exponential?). There are a few examples here, but tons more on the web: https://heuristically.wordpress.com/2012/06/13/comparing-continuous-distributions-with-r/

After that, there are a whole slew of goodness-of-fit tests for checking whether your data fit a given distribution or not; off the top of my head, you could probably start fitdistr() from the MASS package.

No clue how you'd go about generating all of those densities, though!


Excellent. Thanks. I'll explore this as well. Sorry the question mixed technologies so much. I'll see what I can do with that and report back.

----------------------------------------------------
The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood...
Theodore Roosevelt

The Scary DBA
Author of: SQL Server Query Performance Tuning and SQL Server Execution Plans
Product Evangelist for Red Gate Software
Grant Fritchey
Grant Fritchey
SSC Guru
SSC Guru (237K reputation)SSC Guru (237K reputation)SSC Guru (237K reputation)SSC Guru (237K reputation)SSC Guru (237K reputation)SSC Guru (237K reputation)SSC Guru (237K reputation)SSC Guru (237K reputation)

Group: General Forum Members
Points: 237377 Visits: 33641
Thanks for all the help guys. I'm thinking that I'm going to back off and rethink my blog post. I can't get my head around what I need to do to properly prepare the data in order to run it through these algorithms. For the short term I think I'll concentrate on learning the basics of the language instead of immediately trying to apply it.

Thanks again for the assist. Sorry I'm letting you down.

----------------------------------------------------
The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood...
Theodore Roosevelt

The Scary DBA
Author of: SQL Server Query Performance Tuning and SQL Server Execution Plans
Product Evangelist for Red Gate Software
BuckWoody
BuckWoody
Right there with Babe
Right there with Babe (749 reputation)Right there with Babe (749 reputation)Right there with Babe (749 reputation)Right there with Babe (749 reputation)Right there with Babe (749 reputation)Right there with Babe (749 reputation)Right there with Babe (749 reputation)Right there with Babe (749 reputation)

Group: General Forum Members
Points: 749 Visits: 76
Not a problem at all. In fact, maybe I'll blog it - but not in the way you think. Smile

Do keep asking questions - it helps us in our approach on educating people as much as it helps you get the answer you need. We need to learn where people are having issues so that we teach that part first.


Buck Woody
MCDBA, MCSE, Novell and Sun Certified
xsevensinzx
xsevensinzx
SSChampion
SSChampion (11K reputation)SSChampion (11K reputation)SSChampion (11K reputation)SSChampion (11K reputation)SSChampion (11K reputation)SSChampion (11K reputation)SSChampion (11K reputation)SSChampion (11K reputation)

Group: General Forum Members
Points: 11530 Visits: 4359
Grant Fritchey (12/17/2015)
Thanks for all the help guys. I'm thinking that I'm going to back off and rethink my blog post. I can't get my head around what I need to do to properly prepare the data in order to run it through these algorithms. For the short term I think I'll concentrate on learning the basics of the language instead of immediately trying to apply it.

Thanks again for the assist. Sorry I'm letting you down.


Just food for thought.

Why not start with a data-adaptive approach where you start with the data, with little consideration of the model at first, and then when find useful predictors, fit the model to the data you have? Assume you're boss hands you a data set and find out if you can use what you have first, then move onto fine-tuning that data second and modeling third.

I say this because it's what I would do in Python. I would start with the data first, then load it into Python second. This at a very high-level regardless of the algorithms, get's me to a solid foundation of generating the data and streamlining it into the application.

From there, I would explore the data in something like Python Notebook (R has similar) in real-time in my web browser and try to model the data from there. The end result will hopefully be a solid foundation to start processing the data and outputing results that can be usable to answer questions like, "How likely is a salesperson related to a region Mr. AngryDBA?"

I only say that because I think you can visualize the flow of what needs to happen. It's the algorithm that is throwing you off from moving forward. Screw the algorithm, just output some data, load it into R and start exploring what you have. I think once you start doing that, you will start to find ways to fit the model to the data and completing the objective you want to accomplish here.
Go


Permissions

You can't post new topics.
You can't post topic replies.
You can't post new polls.
You can't post replies to polls.
You can't edit your own topics.
You can't delete your own topics.
You can't edit other topics.
You can't delete other topics.
You can't edit your own posts.
You can't edit other posts.
You can't delete your own posts.
You can't delete other posts.
You can't post events.
You can't edit your own events.
You can't edit other events.
You can't delete your own events.
You can't delete other events.
You can't send private messages.
You can't send emails.
You can read topics.
You can't vote in polls.
You can't upload attachments.
You can download attachments.
You can't post HTML code.
You can't edit HTML code.
You can't post IFCode.
You can't post JavaScript.
You can post emoticons.
You can't post or upload images.

Select a forum








































































































































































SQLServerCentral


Search