Log in  ::  Register  ::  Not logged in

## Which Algorithm?

 Author Message Grant Fritchey SSC Guru Group: General Forum Members Points: 376374 Visits: 34643 I recognize that cross posting is not generally considered good form, but this forum didn't exist yesterday. I'm adding my question here.I could use some help from anyone with some data science in their background.I'm trying to learn R. I have a simple idea to prove the usefulness of R to a DBA that I want to turn into a blog post. In 2014, the new cardinality estimator started using exponential backoff in the density for compound column keys. This is because in the most common cases, you're looking at data that has relationships, not data that is completely unique. So, for example, a region and a salesperson are likely to be related, not a unique relationship, in each sale. But, there are going to be exceptional data, so Microsoft has a traceflag you can use to force the optimizer to use the old method of density * density.However, wouldn't it be cool if there was a quick way to feed two columns into a query against an R function to see if the data has the types of relationships we're talking about between a region and a salesperson? Enter RI'm absolutely convinced this is an easy problem and I'm just missing the right formula. I've been looking at the Chi-Square Test of Independence and the Chi-Square Test of Homogeneity, but I don't think they're right. If I understand this (and I'm really stretching a bit), they're for two-way tables and the data I'm looking at is a one-way table.My question, which set of formulas should I be using? Could anyone give me a nudge in the right direction please? ----------------------------------------------------The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood... Theodore RooseveltThe Scary DBAAuthor of: SQL Server 2017 Query Performance Tuning, 5th Edition and SQL Server Execution Plans, 3rd EditionProduct Evangelist for Red Gate Software BuckWoody Ten Centuries Group: General Forum Members Points: 1027 Visits: 76 Not sure I have enough info to answer this, but I have a cheat-sheet referenced on this page that might help, along with some ideas: https://buckwoody.wordpress.com/2015/12/02/knowing-which-statistical-formula-to-use/ Buck WoodyMCDBA, MCSE, Novell and Sun Certified mattmparker SSC-Enthusiastic Group: General Forum Members Points: 170 Visits: 10 So, for example, a region and a salesperson are likely to be related, not a unique relationship, in each sale. I think you're on the right track, and just getting stuck on the terminology. I think if any of these questions is the question you're asking, a chi-squared test of independence is appropriate: - Is there a relationship between region and salesperson? - Given the name of the salesperson, could I make a better-than-chance guess at the region in which a sale happened? (and vice versa) - Does the salesperson-region relationship in the data reflect the database's previous assumption (uniform distribution) or its new assumption (some kind of non-uniformity)?I'm way out of my depth on all of this, so if any of this doesn't make sense, don't hesitate to let me know. Grant Fritchey SSC Guru Group: General Forum Members Points: 376374 Visits: 34643 BuckWoody (12/16/2015)Not sure I have enough info to answer this, but I have a cheat-sheet referenced on this page that might help, along with some ideas: https://buckwoody.wordpress.com/2015/12/02/knowing-which-statistical-formula-to-use/Thanks Buck. That's already one of my go-to resources as I start figuring all this out. I appreciate what you're doing over there. ----------------------------------------------------The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood... Theodore RooseveltThe Scary DBAAuthor of: SQL Server 2017 Query Performance Tuning, 5th Edition and SQL Server Execution Plans, 3rd EditionProduct Evangelist for Red Gate Software mattmparker SSC-Enthusiastic Group: General Forum Members Points: 170 Visits: 10 After reading up on cardinality, another couple of questions occurred to me: - Does the density/distribution of salesperson-region relationships have high cardinality (salespeople strongly associated with one or two regions), or low (salespeople working across many regions)? - Does the density of those relationships better match the old optimizer assumption, or the new one (exponential)?In that case, an alternative might be to plot three densities: your empirical density, the previously-assumed density (uniform?), and the new density (exponential?). There are a few examples here, but tons more on the web: https://heuristically.wordpress.com/2012/06/13/comparing-continuous-distributions-with-r/After that, there are a whole slew of goodness-of-fit tests for checking whether your data fit a given distribution or not; off the top of my head, you could probably start fitdistr() from the MASS package.No clue how you'd go about generating all of those densities, though! Grant Fritchey SSC Guru Group: General Forum Members Points: 376374 Visits: 34643 mattmparker (12/16/2015)So, for example, a region and a salesperson are likely to be related, not a unique relationship, in each sale. I think you're on the right track, and just getting stuck on the terminology. I think if any of these questions is the question you're asking, a chi-squared test of independence is appropriate: - Is there a relationship between region and salesperson? - Given the name of the salesperson, could I make a better-than-chance guess at the region in which a sale happened? (and vice versa) - Does the salesperson-region relationship in the data reflect the database's previous assumption (uniform distribution) or its new assumption (some kind of non-uniformity)?I'm way out of my depth on all of this, so if any of this doesn't make sense, don't hesitate to let me know.Yeah, I'm looking to see if there is a relationship between region and salesperson, so it sounds like the chi-square test of independence is the way to go. I'm working on building a new data set for my tests. I'll post results and code back here when I get them in hand. ----------------------------------------------------The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood... Theodore RooseveltThe Scary DBAAuthor of: SQL Server 2017 Query Performance Tuning, 5th Edition and SQL Server Execution Plans, 3rd EditionProduct Evangelist for Red Gate Software Grant Fritchey SSC Guru Group: General Forum Members Points: 376374 Visits: 34643 mattmparker (12/16/2015)After reading up on cardinality, another couple of questions occurred to me: - Does the density/distribution of salesperson-region relationships have high cardinality (salespeople strongly associated with one or two regions), or low (salespeople working across many regions)? - Does the density of those relationships better match the old optimizer assumption, or the new one (exponential)?In that case, an alternative might be to plot three densities: your empirical density, the previously-assumed density (uniform?), and the new density (exponential?). There are a few examples here, but tons more on the web: https://heuristically.wordpress.com/2012/06/13/comparing-continuous-distributions-with-r/After that, there are a whole slew of goodness-of-fit tests for checking whether your data fit a given distribution or not; off the top of my head, you could probably start fitdistr() from the MASS package.No clue how you'd go about generating all of those densities, though!Excellent. Thanks. I'll explore this as well. Sorry the question mixed technologies so much. I'll see what I can do with that and report back. ----------------------------------------------------The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood... Theodore RooseveltThe Scary DBAAuthor of: SQL Server 2017 Query Performance Tuning, 5th Edition and SQL Server Execution Plans, 3rd EditionProduct Evangelist for Red Gate Software Grant Fritchey SSC Guru Group: General Forum Members Points: 376374 Visits: 34643 Thanks for all the help guys. I'm thinking that I'm going to back off and rethink my blog post. I can't get my head around what I need to do to properly prepare the data in order to run it through these algorithms. For the short term I think I'll concentrate on learning the basics of the language instead of immediately trying to apply it.Thanks again for the assist. Sorry I'm letting you down. ----------------------------------------------------The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood... Theodore RooseveltThe Scary DBAAuthor of: SQL Server 2017 Query Performance Tuning, 5th Edition and SQL Server Execution Plans, 3rd EditionProduct Evangelist for Red Gate Software BuckWoody Ten Centuries Group: General Forum Members Points: 1027 Visits: 76 Not a problem at all. In fact, maybe I'll blog it - but not in the way you think. Do keep asking questions - it helps us in our approach on educating people as much as it helps you get the answer you need. We need to learn where people are having issues so that we teach that part first. Buck WoodyMCDBA, MCSE, Novell and Sun Certified xsevensinzx SSC-Insane Group: General Forum Members Points: 23796 Visits: 6176 Grant Fritchey (12/17/2015)Thanks for all the help guys. I'm thinking that I'm going to back off and rethink my blog post. I can't get my head around what I need to do to properly prepare the data in order to run it through these algorithms. For the short term I think I'll concentrate on learning the basics of the language instead of immediately trying to apply it.Thanks again for the assist. Sorry I'm letting you down.Just food for thought.Why not start with a data-adaptive approach where you start with the data, with little consideration of the model at first, and then when find useful predictors, fit the model to the data you have? Assume you're boss hands you a data set and find out if you can use what you have first, then move onto fine-tuning that data second and modeling third.I say this because it's what I would do in Python. I would start with the data first, then load it into Python second. This at a very high-level regardless of the algorithms, get's me to a solid foundation of generating the data and streamlining it into the application. From there, I would explore the data in something like Python Notebook (R has similar) in real-time in my web browser and try to model the data from there. The end result will hopefully be a solid foundation to start processing the data and outputing results that can be usable to answer questions like, "How likely is a salesperson related to a region Mr. AngryDBA?"I only say that because I think you can visualize the flow of what needs to happen. It's the algorithm that is throwing you off from moving forward. Screw the algorithm, just output some data, load it into R and start exploring what you have. I think once you start doing that, you will start to find ways to fit the model to the data and completing the objective you want to accomplish here.

## Permissions

 You can't post new topics. You can't post topic replies. You can't post new polls. You can't post replies to polls. You can't edit your own topics. You can't delete your own topics. You can't edit other topics. You can't delete other topics. You can't edit your own posts. You can't edit other posts. You can't delete your own posts. You can't delete other posts. You can't post events. You can't edit your own events. You can't edit other events. You can't delete your own events. You can't delete other events. You can't send private messages. You can't send emails. You can read topics. You can't vote in polls. You can't upload attachments. You can download attachments. You can't post HTML code. You can't edit HTML code. You can't post IFCode. You can't post JavaScript. You can post emoticons. You can't post or upload images.

##### Select a forum
 SQL Server 2017      SQL Server 2017 - Administration      SQL Server 2017 - Development SQL Server 2016      SQL Server 2016 - Administration      SQL Server 2016 - Development and T-SQL SQL Server 2014      Administration - SQL Server 2014      Development - SQL Server 2014 SQL Server 2012      SQL 2012 - General      SQL Server 2012 - T-SQL SQL Server vNext      SQL Server 15 - Administration      SQL Server 15 - Development SQL Server 2008      SQL Server 2008 - General      T-SQL (SS2K8)      June 2007 CTP      Working with Oracle      July CTP      SQL Server Newbies      Security (SS2K8)      SQL Server 2008 High Availability      SQL Server 2008 Administration      Data Corruption (SS2K8 / SS2K8 R2)      SQL Server 2008 Performance Tuning Cloud Computing      SQL Azure - Development      SQL Azure - Administration      Amazon AWS and other cloud vendors      General Cloud Computing Questions      CosmosDB      Azure Data Lake      Azure Machine Learning      Azure Data Factory Reporting Services      Reporting Services      Reporting Services 2005 Administration      Reporting Services 2005 Development      Reporting Services 2008/R2 Administration      Reporting Services 2008 Development      SSRS 2012      SSRS 2014      SSRS 2016 Programming      Connecting      General      SMO/RMO/DMO      XML      Service Broker      Powershell      Testing      TFS/Data Dude/DBPro      SSDT      Continuous Integration, Deployment, and Delivery      R Services and R Language Data Warehousing      Integration Services      Strategies and Ideas      Analysis Services      Data Transformation Services (DTS)      Performance Point      Data Mining      PowerPivot      R language      Machine Learning Database Design      Disaster Recovery      Design Ideas and Questions      Relational Theory      Hardware      Virtualization      Security and Auditing SQLServerCentral.com      Anything that is NOT about SQL!      Contests!      Editorials      SQLServerCentral.com Announcements      SQLServerCentral.com Website Issues      Suggestions      Tag Issues with Content      Podcast Feedback      SQLServerCentral.com Test Forum      Articles Requested SQL Server 2005      Administering      Backups      Business Intelligence      CLR Integration and Programming.      Data Corruption      Development      Working with Oracle      SQL Server 2005 Compact Edition      SQL Server 2005 General Discussion      SQL Server 2005 Security      SQL Server 2005 Strategies      SS2K5 Replication      SQL Server Express      SQL Server 2005 Performance Tuning      SQL Server 2005 Integration Services      T-SQL (SS2K5)      SQL Server Newbies SQL Server 7,2000      Administration      Backups      Data Corruption      General      Globalization      In The Enterprise      Working with Oracle      Security      Strategies      SQL Server Newbies      Service Packs      SQL Server CE      Performance Tuning      Replication      Sarbanes-Oxley      T-SQL      SQL Server Agent SQL Server and other platforms      MySQL      Oracle      PostgreSQL      DB2      SQL Server and Sharepoint Older Versions of SQL (v6.5, v6.0, v4.2)      Older Versions of SQL (v6.5, v6.0, v4.2) Career      Certification      Employers and Employees      Events      Job Postings      Resumes and Job Hunters      Presentations and Speaking      Retired Members Testing Center      SQL Server Security Skills      Question of the Day (QOD) Microsoft Access      Microsoft Access Products and Books      Third Party Products         CA         SQLCentric         Extreme Technologies.         Innovartis         Embarcadero         SQL Sentry         Sonasoft         Golden Gate Software         Lumigent         Red Gate Software         Quest Software         ApexSQL         Idera      Discussions about Books         Discuss Programming Books          Discuss XML Books          Discuss T-SQL Books          Discuss Data Warehousing Books          Discuss DTS Books          Discuss SQL Server 7.0 Books         Discuss SQL Server 2000 Books Notification Services      Administration Article Discussions Future Versions      SQL 12

## Search

 Copyright © 2002-2019 Redgate. All Rights Reserved. Privacy Policy. Terms of Use. Report Abuse.