The first part of the series reviewed the steps to install and configure Machine Learning Services. In this section we will review the components installed and what features it provides and provide a detailed examination of the SQL call to execute R and Python code.
SQL Server and Machine Learning Architecture
SQL Server 2017 was rewritten so that it now uses a SQL Server Platform Abstraction Layer [SQLPAL]. This was created to make it possible to use the same code for SQL Server to run the Widows or Linux operating system. All of the resources needed for SQL Server execution are contained within this layer. R and Python exist outside of SQLPAL. This means that SQL Server sends the process outside of the SQLPAL where it processes, and then communicates the results back to SQL Server. If you have allocated memory to SQL Server, this memory will not be used to for R or Python code as it operates outside of the abstraction layer.
Running R or Python, from within SQL Server, uses the SQL Server Launchpad service. This must be running on your host as your code uses this service communicate from SQLPAL to the Machine Learning Server components. In Level 1 of this Stairway Series, there is a picture of the 20 local users which are installed when Machine Learning Services are loaded. These users are needed to call the launchpad service. By default, 20 users are created, which means that you can have 20 simultaneous threads that are available to initiate a call to the R or Python executable via dedicated pipeline to call the SQL Server Launchpad service. The calls usually take fractions of a second to run, so often times the same user is employed multiple times in a second, generally MSSQLSERVER01. This is why the Launchpad service must be running for the Machine Learning Server to execute.
Based upon the language specified the SQL Server Launchpad service will call either Rlauncher.dll or PythonLauncher.dll using one of the 20 different user accounts created for Machine Learning Server. These dlls then call the open source language specified. Since they are using one of the 20 user ids created by machine learning service to execute the R or Python code, no user permissions are inherited. For this reason, no methods within the R or Python code should be written to access data as you will have to build the security within the code you write to access any other data source. All of the data needed to run your R or Python code should be passed to it. Once the R or Python code has completed, the BxlServer.exe component will fire a sqlsatellite component to communicate the results back to SQL Server.
Component parts to using sp_execute_external_script
Running the Machine Learning Server code within SQL Server requires using a call to the external services using sp_execute_external_script to execute R or Python in SQL Server. The code needs to contain the components listed in this sample, including the following parameters: Language, Script, input dataset, output dataset, and the result set. I have included a sample in Python and R which takes a value from a sql table then multiplies it by 20 and returns a result of two columns, CRSDepTime and Multiplier which is the value in the column multiplied by 20.
EXEC sp_execute_external_script @language = N'Python', @script = N'OutputDataSet = InputDataSet; OutputDataSet["Col2"] = OutputDataSet["Col1"] * Multiplier', @input_data_1 = N'SELECT TOP 100 CRSDeptime AS Col1 FROM AirlineDemoSmall WHERE CRSDeptime > 1', @params = N'@Multiplier INT', @Multiplier = 20 WITH RESULT SETS( ([CRSDepTime] float not null, [Multiplier] float not null) );
EXEC sp_execute_external_script @language = N'R' , @script = N' x <- as.matrix(AirportDelayInput); y <- array(Multiplier); z <- as.data.frame(x %*% y); OutputDataSet <- data.frame(x, z);' , @input_data_1 = N'SELECT top 100 CRSDeptime AS Col1 from AirlineDemoSmall where CRSDeptime > 1;' , @input_data_1_name = N'AirportDelayInput' , @params = N'@Multiplier int' , @Multiplier = 20 WITH RESULT SETS( ([CRSDepTime] float not null, [Multiplier] float not null));
Here is what the results look like for both sets of code
For both sets of code, sp_execute_external_script is used with these parameters:
@language – Specify the language to use
@script – This contains the R or Python code
@input_data_1 – specifies a dataset to be passed into the stored procedure which is referenced in the code as the variable InputDataSet statement containing the input data. You can only pass one dataset.
@input_data_1_name – optional parameter which allows you to name the input variable to something other than InputDataSet. If you are passing in more than one dataset, you will need set this parameter.
@params – optional command separated list of variables which are passed into the stored procedure and can be referenced by the code.
@origin – the value for each of the parameters defined in the @params step
@output_data_1_name = Optional name for the output. The code can return only one output, and the output must be a dataframe for R and a Panda dataframe in Python.
WITH RESULT SETS – this optionalclause is used to define column names of the code. The column names used in the code are never preserved with the data is returned in R or Python. If this command is not included, the dataset returned will have the column names Unknown1, Unknown2 etc.
In the code listed above, the Python code and the R code do the same thing. The big difference is the R code contains the optional parameter for the input name to rename the input to something other than InputDataSet
Machine Learning Server also includes the ability to run code on the server when connected to it on the client when running R code. This feature is called changing compute context. This means that if your desktop computer only has 8 GB of RAM and you want to analyze 9 GB of data, then you would need to process the data somewhere else with R as you will not have enough memory to process the data in R in memory. If the server as has 128 GB of RAM, it might make sense to analyze the data there. Changing the compute context will allow you to write the code on your PC, but run it on the server which has more memory. This means that when running R or Python code, the resources used may not be the resources on the local machine running the code, but the resources of the server, as within code you can change the code to physically run on the Machine Learning Services server from a laptop connected to it using either the RevoScaleR for R or the revoscalepy package for Python. We will show the code needed to do this in a later level in this stairway.
The version of R that Microsoft installs is a little different than the open source version of R because it contains additional functionality not present in standard R while still supporting all of the standard R functionality. The version of R contains a number of different libraries which are designed to support larger datasets byt proving the capability to chunk the processing of code in R to run data in and out of memory which is called chunking. In the open source version of R, if the data and code consumes more than the available memory, the process will quit. Microsoft has included a number of different libraries, which all start with the letters rx which are specifically designed to improve the performance of R and one of those capabilities is the ability to cache items in and out of memory. If you chose not to use any of the rx functions provided, R will use the same features on SQL Server as it would be running a standalone version of R.
The python implementation within SQL Server includes a license of the very popular data science Anaconda libraries from Continuum Analytics, which includes sci-kit learn and Pandas, providing a rich set of libraries without needing to install any additional libraries. The package that Microsoft wrote for Machine Learning Services is called, microsoftml, and this package contains a collection of Python functions used in machine learning solution. These functions designed to contain libraries needed for training, algorithm selection and scoring. While it is certainly possible to perform this kind of analysis with other packages, these packages can supplement others found included in the SQL Server Machine Learning Server Anaconda distribution, primarily those in sci-kit learn which include things like linear regression and decision trees.
Machine Learning Server, which is installed with part of SQL Server 2017, provides the capability for SQL Server to execute R or Python from within SQL Server to run code on the server using the compute context capabilities and provides specialized libraries created by Microsoft for R and Python to support writing Machine Learning code in both languages. In the next part of the series we will take an in-depth look at R and the extra capabilities provided with Machine Learning Services.