Learning R: Easy Syntax, Horribly Difficult Concepts

  • Grant Fritchey

    SSC Guru

    Points: 395316

    I could use some help from anyone with some data science in their background.

    I'm trying to learn R. I have a simple idea to prove the usefulness of R to a DBA that I want to turn into a blog post. In SQL Server 2014, the new cardinality estimator started using exponential backoff when combining the densities for compound column keys. This is because, in the most common cases, you're looking at data that is related, not data that is completely independent. So, for example, a region and a salesperson are likely to be related to each other, not independent of each other, in each sale.

    But there is going to be exceptional data, so Microsoft has a trace flag you can use to force the optimizer to use the old method of density * density.
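
    To make the difference concrete, here is a minimal sketch of the two ways of combining selectivities (or densities). It assumes the commonly described form of exponential backoff: the most selective value at full weight, then the square root, fourth root and eighth root of the next ones, using at most four values. The function names and sample numbers are made up for illustration, and the real optimizer handles more cases than this.

```python
import math

def combine_independent(selectivities):
    """Old method: assume no correlation and multiply everything."""
    return math.prod(selectivities)

def combine_backoff(selectivities):
    """Exponential backoff: sort most selective first, then raise the
    i-th value to the power 1 / 2**i, using at most four values."""
    ordered = sorted(selectivities)[:4]
    return math.prod(s ** (1 / 2 ** i) for i, s in enumerate(ordered))

# Hypothetical example: region matches 10% of rows, salesperson 5%.
sels = [0.10, 0.05]
legacy = combine_independent(sels)   # 0.10 * 0.05
backoff = combine_backoff(sels)      # 0.05 * sqrt(0.10)
```

    With these made-up numbers the backoff estimate comes out larger than the plain product, which is the point: it assumes the columns overlap rather than filter independently.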

    However, wouldn't it be cool if there were a quick way to feed two columns from a query into an R function to see if the data has the types of relationships we're talking about between a region and a salesperson? Enter R.

    I'm absolutely convinced this is an easy problem and I'm just missing the right formula. I've been looking at the Chi-Square Test of Independence and the Chi-Square Test of Homogeneity, but I don't think they're right. If I understand this (and I'm really stretching a bit), they're for two-way tables and the data I'm looking at is a one-way table.

    My question: which set of formulas should I be using? Could anyone give me a nudge in the right direction, please?
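
    One way to look at it: two columns of paired values become a two-way table once you cross-tabulate them (count every region/salesperson combination), and that is the input the Chi-Square Test of Independence works on. Below is a minimal sketch, assuming made-up region and salesperson values. The function name is hypothetical, and Cramér's V is added as an assumed choice of 0-to-1 strength measure, since the raw statistic grows with row count. In R, the built-in `table()` and `chisq.test()` cover the same ground.

```python
from collections import Counter
import math

def chi_square_independence(pairs):
    """Return (chi-square statistic, degrees of freedom, Cramér's V)
    for a list of (a, b) value pairs."""
    n = len(pairs)
    observed = Counter(pairs)                  # two-way table cell counts
    row_totals = Counter(a for a, _ in pairs)
    col_totals = Counter(b for _, b in pairs)

    chi2 = 0.0
    for a, row_n in row_totals.items():
        for b, col_n in col_totals.items():
            expected = row_n * col_n / n       # count expected if independent
            chi2 += (observed[(a, b)] - expected) ** 2 / expected

    n_rows, n_cols = len(row_totals), len(col_totals)
    dof = (n_rows - 1) * (n_cols - 1)
    k = min(n_rows, n_cols) - 1
    cramers_v = math.sqrt(chi2 / (n * k)) if k > 0 else 0.0
    return chi2, dof, cramers_v

# Hypothetical data: each salesperson sells in exactly one region.
related = [("East", "Ann")] * 50 + [("West", "Bob")] * 50
# Hypothetical data: every salesperson sells equally in every region.
unrelated = [(region, person)
             for region in ("East", "West")
             for person in ("Ann", "Bob")] * 25

chi2_rel, dof_rel, v_rel = chi_square_independence(related)
chi2_unrel, dof_unrel, v_unrel = chi_square_independence(unrelated)
```

    A V close to 1 would suggest the columns are strongly related, which is the situation exponential backoff assumes. A V close to 0 would suggest they are close to independent, where the old density * density method may be the better fit.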

    ----------------------------------------------------
    The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood...
    Theodore Roosevelt

    The Scary DBA
    Author of: SQL Server 2017 Query Performance Tuning, 5th Edition and SQL Server Execution Plans, 3rd Edition
    Product Evangelist for Red Gate Software

  • xsevensinzx

    One Orange Chip

    Points: 25531

    Not to sidetrack, but a question:

    Why prove the usefulness of R to the DBA as opposed to a general-purpose language like Python that can accomplish the same thing? I only ask because R is a bit more complex and has limited uses for the DBA, whereas Python can actually be used by the DBA as a secondary scripting language, both inside and outside of Windows, across platforms, APIs and even database engines.

    I feel this awesome blog article you're writing would be targeted at someone like me who works as a data professional in the data science field, but most of us are using Python on the back end, while the data scientists are using R for the deep dives down the road. Both are awesome and useful, but until most of us start picking up SQL Server 2016, we're still using Python across RDBMS, NoSQL, etc. 😎

  • Orlando Colamatteo

    SSC Guru

    Points: 182268

    Python is losing market share to R every day in data science circles and PowerShell is a more strategic language on Windows for handling those pesky sys admin tasks (I don't mind that PowerShell is not cross-platform because I spend zero cycles on *x platforms).

    It remains to be seen whether R integration inside SQL Server will be widely adopted. I am positioning SQL Server 2016 at my shop as a potential sandbox environment where analysts can run their R script libraries while remaining cozy inside a T-SQL context.

    @Grant, thanks for compelling me to look up the Chi-Square Test of Independence. I, for one, am looking forward to said article. I wish I had something useful to add towards your original question (not there yet, but I am very curious about learning more stats and quant analysis) and hope my post does not further sidetrack the intent of this thread.

    __________________________________________________________________________________________________
    There are no special teachers of virtue, because virtue is taught by the whole community. --Plato

  • xsevensinzx

    One Orange Chip

    Points: 25531

    Orlando Colamatteo (12/14/2015)


    Python is losing market share to R every day in data science circles and PowerShell is a more strategic language on Windows for handling those pesky sys admin tasks (I don't mind that PowerShell is not cross-platform because I spend zero cycles on *x platforms).

    Python is not meant to replace R. That's like saying NoSQL is trying to replace the RDBMS. They both work together. That's why I say Python is the better fit for the DBA: Python works better on the back end with the RDBMS than R does, while R is generally used on the front end by the analyst.

  • Orlando Colamatteo

    SSC Guru

    Points: 182268

    In some cases both are required for complementary functions. I am only commenting that, where they overlap, R has become the more popular choice.


  • Grant Fritchey

    SSC Guru

    Points: 395316

    xsevensinzx (12/14/2015)


    Not to sidetrack, but a question:

    Why prove the usefulness of R to the DBA as opposed to a general-purpose language like Python that can accomplish the same thing? I only ask because R is a bit more complex and has limited uses for the DBA, whereas Python can actually be used by the DBA as a secondary scripting language, both inside and outside of Windows, across platforms, APIs and even database engines.

    I feel this awesome blog article you're writing would be targeted at someone like me who works as a data professional in the data science field, but most of us are using Python on the back end, while the data scientists are using R for the deep dives down the road. Both are awesome and useful, but until most of us start picking up SQL Server 2016, we're still using Python across RDBMS, NoSQL, etc. 😎

    Microsoft is investing pretty heavily in R. You can call R functions directly from SQL Server 2016. I'm just focusing where Microsoft does. It's not a knock against Python.

