Have you ever wished you could run your data science experiments right where your data lives, without the hassle of moving huge files around? That’s exactly what Microsoft set out to solve with SQL Server Machine Learning Services (MLS). Starting with SQL Server 2016 (R only) and really hitting its stride in 2017 (when Python joined), MLS lets you run Python and R code directly inside your database. No more exporting terabytes of data just to build a model; now the analytics come to your data.
Why Bring Machine Learning Into SQL Server?
Let's be honest: moving data is cumbersome. As companies gather more data, transferring it between systems for analysis becomes costly and slow. MLS changes this by enabling data analysis where it already exists. This approach enhances speed, security, and convenience for data engineers and data scientists alike. Moreover, by centralizing everything, teams can collaborate more efficiently. There's no need to exchange files or worry about version discrepancies.
Here's the magic: SQL Server has a service called Launchpad. When you want to run a Python or R script, you use a special stored procedure (sp_execute_external_script). SQL Server handles creating a secure environment, transferring your data, executing your code, and returning the results, all without leaving the database.
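To make that concrete, here’s about the smallest possible round trip, runnable once external scripts are enabled (Step 1 below). The query’s single row arrives in Python as InputDataSet and comes straight back as the result set:

EXEC sp_execute_external_script
    @language = N'Python',
    @script = N'
# InputDataSet arrives as a pandas DataFrame; whatever DataFrame you
# assign to OutputDataSet is returned to SQL Server as a result set
OutputDataSet = InputDataSet',
    @input_data_1 = N'SELECT 1 AS HelloFromSQL';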
And don't worry about missing your favorite libraries. MLS ships with popular packages out of the box: pandas and scikit-learn on the Python side, forecast on the R side. You can install more if needed.
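Not sure what’s actually installed on your instance? One quick way to find out, sketched here with Python’s pkg_resources, is to ask the in-database runtime itself:

EXEC sp_execute_external_script
    @language = N'Python',
    @script = N'
import pkg_resources
import pandas as pd
# Enumerate the packages visible to the in-database Python runtime
OutputDataSet = pd.DataFrame(
    [(d.project_name, d.version) for d in pkg_resources.working_set],
    columns=["package", "version"])'
WITH RESULT SETS ((package NVARCHAR(128), version NVARCHAR(32)));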
Let’s Walk Through an Example
Imagine you’re working with the AdventureWorks sample database, and you want to predict the final bill (TotalDue) for each order based on its subtotal, tax, and freight. Here’s how you might tackle that, step by step.
Step 1: Flip the Switch
First things first, you need to tell SQL Server it’s okay to run external scripts. It’s a quick setup:
EXEC sp_configure 'external scripts enabled', 1;
RECONFIGURE;
-- You might need to restart SQL Server after this
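You can check that the change took; calling sp_configure with just the option name reports both the configured and running values:

-- run_value should be 1 once the setting is live
EXEC sp_configure 'external scripts enabled';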
Step 2: Build and Run Your Model
Now for the fun part. You’ll write a Python script that grabs the first 1,000 orders, cleans up any missing data, splits it into training and test sets, and fits a simple linear regression model. Here’s what that looks like:
DECLARE @py NVARCHAR(MAX) = N'
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics

# Clean up the data
df = InputDataSet.dropna()

# Pick out the features and what we want to predict
X = df[["SubTotal", "TaxAmt", "Freight"]]
y = df["TotalDue"]

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the model
model = LinearRegression().fit(X_train, y_train)

# Make predictions and see how we did
y_pred = model.predict(X_test)
r2 = metrics.r2_score(y_test, y_pred)
# Take the square root ourselves so this works on older scikit-learn builds
rmse = metrics.mean_squared_error(y_test, y_pred) ** 0.5

# Package up the results
out = X_test.copy()
out["Actual"] = y_test
out["Predicted"] = y_pred
out["R2"] = r2
out["RMSE"] = rmse
OutputDataSet = out
';

EXEC sp_execute_external_script
    @language = N'Python',
    @script = @py,
    @input_data_1 = N'
        SELECT TOP (1000) SubTotal, TaxAmt, Freight, TotalDue
        FROM Sales.SalesOrderHeader
        ORDER BY SalesOrderID'
-- Name the output columns so the result set is usable downstream
WITH RESULT SETS ((
    SubTotal FLOAT, TaxAmt FLOAT, Freight FLOAT,
    Actual FLOAT, Predicted FLOAT, R2 FLOAT, RMSE FLOAT));
Let’s break it down. First, SQL Server hands the query results to Python as a pandas DataFrame named InputDataSet. We clean out any incomplete rows, because missing values can throw off our model. Then, we pick which columns to use as inputs (SubTotal, TaxAmt, Freight) and which one we want to predict (TotalDue).
Splitting the data into training and test sets helps us see how well our model might perform on new, unseen data. After training, we check two key metrics: R² (which tells us how much of the variation in TotalDue our model explains) and RMSE (which gives us a typical prediction error in dollars; because errors are squared before averaging, big misses weigh extra).
If you see an R² of 0.87, that means your model explains 87% of the variation in TotalDue. An RMSE of $12.34? Your predictions miss the actual value by about $12.34 on a typical order. Not bad!
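If you want to see exactly what those two numbers measure, here’s the arithmetic on a toy set of three made-up orders, using plain NumPy instead of scikit-learn:

import numpy as np

# Three hypothetical orders: actual vs. predicted TotalDue
y_true = np.array([100.0, 250.0, 80.0])
y_pred = np.array([110.0, 240.0, 90.0])

# RMSE: square the errors, average them, take the root -> stays in dollars
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))      # 10.0

# R^2: 1 - residual sum of squares / total sum of squares
ss_res = np.sum((y_true - y_pred) ** 2)              # 300.0
ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)     # ~17266.7
r2 = 1 - ss_res / ss_tot                             # ~0.983

print(round(rmse, 2), round(r2, 3))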
Taking It Further: From Experiment to Production
Once you’re happy with your model, you can wrap it up in a stored procedure. That way, you (or your colleagues) can reuse it anytime. Want to retrain the model every night? Schedule it with SQL Agent. Need to store the trained model for later? Save it in a table. You can even set up real-time scoring for things like fraud detection or product recommendations, all without moving your data out of SQL Server.
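Here’s a sketch of the “save it in a table” idea, following the common pickle-to-VARBINARY pattern; the dbo.Models table and the model name are illustrative, not anything built in:

-- Illustrative table for persisted models
CREATE TABLE dbo.Models (
    ModelName SYSNAME        NOT NULL PRIMARY KEY,
    Model     VARBINARY(MAX) NOT NULL,
    TrainedAt DATETIME2      NOT NULL DEFAULT SYSUTCDATETIME()
);

DECLARE @model VARBINARY(MAX);

EXEC sp_execute_external_script
    @language = N'Python',
    @script = N'
import pickle
from sklearn.linear_model import LinearRegression

df = InputDataSet.dropna()
model = LinearRegression().fit(df[["SubTotal", "TaxAmt", "Freight"]], df["TotalDue"])

# Serialize the fitted model so SQL Server can store it as VARBINARY
trained_model = pickle.dumps(model)',
    @input_data_1 = N'
        SELECT TOP (1000) SubTotal, TaxAmt, Freight, TotalDue
        FROM Sales.SalesOrderHeader
        ORDER BY SalesOrderID',
    @params = N'@trained_model VARBINARY(MAX) OUTPUT',
    @trained_model = @model OUTPUT;

INSERT INTO dbo.Models (ModelName, Model)
VALUES (N'TotalDueLinReg', @model);

Scoring later is the mirror image: SELECT the VARBINARY back into a variable, pass it in through @params, pickle.loads it inside another sp_execute_external_script call, and run model.predict on fresh rows.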
Tips for Success
- Use SQL Server’s Resource Governor to keep your analytics from hogging all the resources (see the first sketch after this list).
- Keep your scripts in version control, just like any other code.
- Monitor performance and errors with Extended Events or SQL Agent alerts (see the second sketch after this list).
- Rest easy knowing your data never leaves the secure database environment.
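On the Resource Governor tip: external scripts run in their own external resource pool, separate from normal query workloads, so you can cap them independently. A minimal sketch, with arbitrary percentages (Resource Governor requires an edition that supports it):

-- Cap CPU and memory for the default pool used by Python/R scripts
ALTER EXTERNAL RESOURCE POOL [default]
    WITH (MAX_CPU_PERCENT = 50, MAX_MEMORY_PERCENT = 30);
ALTER RESOURCE GOVERNOR RECONFIGURE;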
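And on monitoring: one simple starting point is an Extended Events session that captures errors raised while scripts run. This sketch uses the generic error_reported event rather than anything MLS-specific:

-- Capture high-severity errors to an event file for later review
CREATE EVENT SESSION mls_errors ON SERVER
ADD EVENT sqlserver.error_reported (
    ACTION (sqlserver.sql_text)
    WHERE severity >= 16)
ADD TARGET package0.event_file (SET filename = N'mls_errors');

ALTER EVENT SESSION mls_errors ON SERVER STATE = START;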
Wrapping Up
SQL Server Machine Learning Services brings the power of Python and R right to your data. It’s fast, secure, and makes life easier for everyone involved in analytics. Whether you’re just exploring or building production pipelines, MLS gives you the tools to do it all, without ever leaving the comfort of your database.