Part 1: Predictive Modeling using R and SQL Server Machine Learning Services


To R or not to R?

A few months ago, I asked myself an important question – which language to learn first – R or Python? From my research, I found that R is regarded as more old-school and difficult to learn, but Python is more popular. It made perfect sense to learn R first ??

For the uninitiated, R is a programming language that makes statistical and mathematical computation easy, and is useful for machine learning/predictive analytics/statistics work.

Along the way, I found the following courses useful.

The goal has always been to explore Machine Learning Services in SQL Server, and dive deeper thereafter. This 3-part series is a walk through/review of Microsoft’s tutorial on Predictive Modeling using R and SQL Server. The first part deals with preparing data, training a model and using it for prediction.

What is Predictive Modeling?

Predictive Modeling uses statistics to predict outcomes.

In our example, we will use Machine Learning Services for SQL Server 2017 to predict number of rentals for a future date in a ski rental business. This brings up an important question – why do we want to predict? In this case, prediction will help the business be prepared from a stock, staff and facilities perspective. Prediction has numerous, life-changing applications. Read about application of AI in predicting cardiovascular disease by Microsoft for Apollo Hospitals, India.

For the ski rental prediction, we will use test data provided by MS, SQL Server 2017 with Machine Learning Services, and R Studio IDE. Please check the following URL and follow the simple instructions to set up your environment. If you have any question, please use Comments section to drop me a note.


Steps involved in building the predictive model in SQL Server

  1. Getting data
  2. Preparing data
  3. Training models
  4. Comparing results and choosing a model
  5. Deploying the Machine Learning script to SQL Server

1. Getting Data: Okay, let’s get started. In the first step, we will restore sample database – TutorialDB – using the database backup file provided by MS. After restoring the database using SSMS, we will take a look at rental data we will use for training the model.

All scripts used will be provided in the Resources section at bottom of the page.



You will see that we have 453 rows of rental stats. Data is in place, so we are good to move on to Step 2.

2. Preparing Data: Now that database is restored and data available in SQL Server, we will load data to R and transform it. Open R Studio and execute the rental data load script. After loading data, we will examine a few rows/observations and inspect data types. We will then proceed to change types of a few columns to factor.



3. Training Models: Now that data is prepared, we will chose a model that best describes dependency between variables in our dataset. During training, we provide the variables along with the outcome so that our model can train to predict the outcome. Here, we will compare predictions by two different models and choose the more accurate one as our predictive model.

For clarity, let me state that we are trying to predict RentalCount for the year 2015, given the variables – Month, Day, Weekday, Holiday and Snow.

The challenge in Machine Learning is in knowing what various models mean, and when a particular model might be more suitable. MS recommends this cheat sheet as a guide.




Comparing results: Here, we compare results to figure out which model predicted more accurately. Decision Tree performed better in this case and we will use the model to deploy our Machine Learning Solution to SQL Server in Part 2



Scripts for Part 1 (zip file):
SQL Server R tutorials:
Gitghub repo for rental prediction:
Machine Learning cheat sheet, use with caution ??