Blog Post

A Rickety Stairway to SQL Server Data Mining, Part 14.6: Custom Mining Functions

,

by Steve Bolton

…………In the last installment of this amateur series of mistutorials on SQL Server Data Mining (SSDM), I explained how to implement the Predict method, which controls how the various out-of-the-box prediction functions included with SSDM are populated in custom algorithms. This limits users to returning the same standard columns returned by calls to prediction functions against the nine algorithms that ship with SSDM. The PredictHistogram function, for example, will return a single prediction column, plus columns labeled $Support, $Probability, $AdjustedProbability, $Variance and $StDev, just as it does with the out-of-the-box algorithms. Although we can control the output of the standard prediction functions to return any values we want, it may be more suitable in some cases to specify our own output format with more flexibility. This is where the ability to define your own custom mining functions comes in handy. Another important use case is when you need to perform custom calculations that wouldn’t normally be implemented in standard prediction functions like PredictStDev or PredictSupport. Plug-in programmers could conceivably perform any complicated calculation they like in the Predict method and return it in the results for PredictVariance, but this would be awkward and illogical if the result is not a measure of variance. Thankfully, the SSDM plug-in architecture we’ve been discussing in the past articles includes a mechanism for defining your own custom mining functions, which addresses all of these needs.
…………Aside from the calculation of cumulative statistics, developing a custom mining function is actually quite simple. The development process begins by declaring a public function in your AlgorithmBase class ordained with a MiningAlgorithmClassAttribute, which looks like the attribute declaration in my sample code: <MiningFunction(“My Custom Mining Function Description”)>. The text within the quotation marks will appear in the DESCRIPTION column of the DMSCHEMA_MINING_FUNCTIONS schema rowset, for the row corresponding to the new function. The parameter list can contain variables of your choosing, like the RecordCount integer I use in my sample code for limiting the rows returned by MyCustomMiningFunction, but special rules apply in a few cases. Parameters decorated with the MiningTableColumnRefAttribute or MiningScalarColumnRefAttribute tags must be AttributeGroup objects, such as the <MiningScalarColumnRefAttribute()> TempAttributeGroup AsAttributeGroup declaration in my sample code. I believe MiningTableColumnRefAttribute tags are used for feeding nested tables to custom mining functions, but the PluginModel mining model we’ve been working with in this segment on custom mining algorithms doesn’t have one, nor do I see any benefit in adding one because of the risk of cluttering the tutorial needlessly. MiningScalarColumnRefAttribute, however, is a more common tag that allows you to pass a reference to a specific model column; in the prediction I’ll execute to test the function, I’ll pass a value of  “InputStat” to identify the first column in the PluginModel. The Flag attribute can also be used to identify optional flags passed as parameters to the function, as I do in the parameter declaration <Flag()> INCLUDE_COUNT AsBoolean. Apparently, these flags must be Booleans, because I received the following warning in the FUNCTION_SIGNATURE column of DMSCHEMA_MINING_FUNCTIONS whenever I use other data types with a flag I called ATTRIBUTE_NUMBER that I later removed: “Invalid Flag type (must be Boolean) in method MyCustomMiningFunction:ATTRIBUTE_NUMBER.” The algorithm will still deploy correctly if these parameters are set wrong but will return error messages when prediction queries are executed against the custom functions. The parameter list may also include a MiningCase object, which is not exposed to end users and causes SSDM to automatically iterate over the case set supplied in a prediction join, OpenQuery, or other such statement. As we shall see, one of the primary challenges with custom mining functions is to calculate cumulative statistics by iterating over this MiningCase object.
…………You can basically return values you want from your function, but if they’re scalar, they must be declared as System.Object. To return more than one value your only choice is to declare the return type as System.Data.DataTable. If your revised AlgorithmBase code compiles and deploys correctly, then the next time the service starts you should see your choice reflected in the RETURNS_TABLE column of DMSCHEMA_MINING_FUNCTIONS. Note how the query in Figure 1 reflects the name of the method we added to AlgorithmBase in the FUNCTION_Name column, the string from the MiningAlgorithmClassAttribute in the DESCRIPTION and the lists of parameters in FUNCTION_SIGNATURE. Some other standard functions enabled for our sample plug-in are also listed in the results (minus the prediction functions I enabled for the last tutorial, which were removed to shorten the list).

Figure 1: The Custom Mining Function Depicted in DMSCHEMA_MINING_FUNCTIONS
DMSCHEMA_MINING_FUNCTIONSResult

…………Note that the MyCustomMiningFunction is listed twice; this is because it is declared in AlgorithmBase and in SupportedFunction.CustomFunction Base, as part of the array returned by AlgorithmMetadataBase. Removing the function from AlgorithmBase without removing it from SupportedFunction.CustomFunction Base from the array returned by AlgorithmMetadataBase.GetSupportedStandardFunctions led to this error on msmdsrv, when the algorithm failed to initialize: “The data mining algorithm provider (ProgID: 06da68d6-4ef0-4cea-b4dd-1a7c62801ed2) for the Internal_Name_For_My_Algorithm algorithm cannot be loaded. COM error: COM error: DMPluginWrapper; Parameter name: in_FunctionOrdinal Actual value was 10000.” Another “gotcha” you may encounter is the warning that “PREDICTION JOIN is not supported” for a particular mining model when you try to use the function in a query. After many comparisons to the original sample code provided in the plug-in software development kit (SDK) by Bogdan Crivat, one of the original developers of SSDM, I was able to narrow this down to the cases in which the GetCaseIdModeled method of AlgorithmMetadataBase returns True. Simply changing the return value for that method ought to fix the problem. Another common mistake you may encounter while testing custom mining functions is leaving off the ON clause in the PREDICTION JOIN, which may result in this warnng: “Parser: The end of the input was reached.” As renowned SQL Server Analysis Services (SSAS) guru Chris Webb says in his helpful post Error messages in MDX SELECT statements and what they mean, this can also be due to missing parentheses, brackets and semi-colon marks. All of these myriad issues were fairly inconsequential in comparison to those that arose when I debugged the AlgorithmNavigationBase class in A Rickety Stairway to SQL Server Data Mining, Part 14.4: Node Navigation, which is far easier than troubleshooting AlgorithmBase and less difficult by several orders of magnitude than dealing with exceptions in AlgorithmMetadataBase. The biggest challenge I encountered while writing the sample code depicted in Figure 2 was calculating cumulative values across iterations over the MiningCase object, which were problematic to persist.

Figure 2: Sample Code for MyCustomMiningFunction
<MiningFunction(“My Custom Mining Function Description”)>
Public Function MyCustomMiningFunction(InputCase As MiningCase, RecordCount As Integer, <MiningScalarColumnRefAttribute()> TempAttributeGroup As AttributeGroup, <Flag()> INCLUDE_COUNT As Boolean) As System.Data.DataTable

            Static CaseCount As UInteger = 0 ‘the 0th case will be the Prepare statement
            for whatever reason, the first InputCase is always Nothing
            we can use that opportunity to set up the data table one time only
            If IsNothing(InputCase) = True Then
                MyCustomMiningFunctionDataTable = New System.Data.DataTable ‘on the first iteration, reset the datatable
                MyCustomMiningFunctionDataTable.Columns.Add(“Skewness”, GetType(Double))
                MyCustomMiningFunctionDataTable.Columns.Add(“Kurtosis”, GetType(Double))
                MyCustomMiningFunctionDataTable.Columns.Add(ExcessKurtosis, GetType(Double))
                MyCustomMiningFunctionDataTable.Columns.Add(JBStatistic, GetType(Double))

                If INCLUDE_COUNT = True Then
                    MyCustomMiningFunctionDataTable.Columns.Add(TotalRows, GetType(UInteger))
                End If

                once the data type is set, we can return it one time, in the event of a Prepare statement
                ‘I will have to double-check later and make sure this statement doesn’t need to execute for each row during Prepared statements
                ‘Bogdan: “•If the execution is intended for a Prepare statement, it returns an empty string. A Prepare execution must return data that has the same type and schema (in case of tables) with the real result.”
                so if it is only for a prepare statement, we’d send it back the empty data table?
                chm: on Context.ExecuteForPrepare: “Determines whether the current query is preparing the execution or it is fully executing. Should only be used in the implementation of custom functions”
                If Context.ExecuteForPrepare = True Then
                    Return MyCustomMiningFunctionDataTable
                End If
            Else
                dealing with actual cases; add them to the cumulatives so we

can perform calculations at the end
                for now we will deal with a single attribute for the sake of simplicity; it would be fairly easy though to extend this to deal with more than one attribute at a time

                Try
                    If CaseCount < RecordCount - 1 Then RecordCount is 1-based, CaseCount is not

                        both variables are 0-based
                        If CaseCount = 0 Then
                            ReDim TempSkewnessKurtosisClassArray(0) ‘initialize the arrays one time; this won’t work during the Preparation phase when InputCase = 0, because other variables will be set to 1
                            ReDim TempDataPointArray(0, RecordCount – 1)
                            TempSkewnessKurtosisClassArray(0) = New SkewnessKurtosisClass
                        End If

                        TempDataPointArray(0, CaseCount) = New DataPointClass
                        TempDataPointArray(0, CaseCount).DataPoint = InputCase.DoubleValue
                        TempSkewnessKurtosisClassArray(0).TotalValue = TempSkewnessKurtosisClassArray(0).TotalValue + InputCase.DoubleValue
                        CaseCount = CaseCount + 1

                    Else

                        add the last set
                        If IsNothing(TempSkewnessKurtosisClassArray(0)) Then TempSkewnessKurtosisClassArray(0) = New SkewnessKurtosisClass
                        TempDataPointArray(0, CaseCount) = New DataPointClass
                        TempDataPointArray(0, CaseCount).DataPoint = InputCase.DoubleValue
                        TempSkewnessKurtosisClassArray(0).TotalValue = TempSkewnessKurtosisClassArray(0).TotalValue + InputCase.DoubleValue
                        CaseCount = CaseCount + 1

                        if we’re on the final case, calculate all of the stats
                        Dim TempRow As System.Data.DataRow
                        TempRow = MyCustomMiningFunctionDataTable.NewRow()
                        TempSkewnessKurtosisClassArray(0).Mean

= SkewnessKurtosisClassArray(0).CalculateAverage(RecordCount, TempSkewnessKurtosisClassArray(0).TotalValue)
                        TempSkewnessKurtosisClassArray(0).StandardDeviation = SkewnessKurtosisClassArray(0).CalculateAverage(RecordCount, TempSkewnessKurtosisClassArray(0).Mean)
                        TempRow.Item(“Skewness”) = TempSkewnessKurtosisClassArray(0).CalculateSkewness(TempDataPointArray, 0, TempSkewnessKurtosisClassArray(0).Mean, TempSkewnessKurtosisClassArray(0).StandardDeviation)
                        TempRow.Item(“Kurtosis”) = TempSkewnessKurtosisClassArray(0).CalculateKurtosis(TempDataPointArray, 0, TempSkewnessKurtosisClassArray(0).Mean, TempSkewnessKurtosisClassArray(0).StandardDeviation)
                        TempRow.Item(ExcessKurtosis) = TempSkewnessKurtosisClassArray(0).CalculateExcessKurtosis(TempSkewnessKurtosisClassArray(0).Kurtosis)
                       TempRow.Item(JBStatistic) = TempSkewnessKurtosisClassArray(0).CalculateJarqueBeraTest(RecordCount, TempSkewnessKurtosisClassArray(0).Skewness, TempSkewnessKurtosisClassArray(0).ExcessKurtosis) 

                        If INCLUDE_COUNT = True Then
                            TempRow.Item(TotalRows) = RecordCount
                        End If

                        MyCustomMiningFunctionDataTable.Rows.Add(TempRow) ‘add the row
                        CaseCount = 0 ‘reset the counter
                        Array.Clear(TempSkewnessKurtosisClassArray, 0, TempSkewnessKurtosisClassArray.Length) ‘clear the arrays so that there is no overhead for using a class-scoped variable

                        Array.Clear(TempDataPointArray, 0, TempDataPointArray.Length)
                        Return MyCustomMiningFunctionDataTable

                        then reset the counters so that the function starts off fresh the next time it is called
                        ‘any code I set after this to reset the DataTable won’t be hit – so hopefully SSDM does garbage collection on its own; the best I can do is reinitialize the values in the event of an error
                        Exit Function ‘so we don’t Return it again
                    End If

                Catch ex As Exception

                    if there’s an exception the static and class-scoped variables for this function must be reset, otherwise they will interfere with the next execution
                    CaseCount = 0 ‘reset the counter
                    Array.Clear(TempSkewnessKurtosisClassArray,0, TempSkewnessKurtosisClassArray.Length) ‘clear the arrays so that there is no overhead for using a class-scoped variable
                    Array.Clear(TempDataPointArray, 0, TempDataPointArray.Length)
                    MyCustomMiningFunctionDataTable.Rows.Clear() ‘clear all the rows
                    MyCustomMiningFunctionDataTable = Nothing

                End Try
            End If

            Return MyCustomMiningFunctionDataTable

        End Function

…………Since there are already a handful of better tutorials out there that demonstrate how to write a custom mining function (most notably Crivat’s sample code in the SDK) I wanted to make a contribution of sorts by pushing the boundaries of the architecture. Since the very basic mining method I developed in A Rickety Stairway to SQL Server Data Mining, Part 14.2: Writing a Bare Bones Plugin Algorithm merely returned some basic global stats for each column that I wanted to learn more about, like skewness, kurtosis and excess kurtosis, I figured it would be interesting to calculate the same figures for the values supplied in a prediction join. For good measure, I added a method to the SkewnessKurtosisClass I used in Post 14.2 to perform Jarque-Bera tests, which use measures of lopsidedness like skewness and kurtosis to determine how well datasets fit a normal distribution. I used the formula available at the Wikipedia page Jarque-Bera Test, which is fairly easy to follow in comparison to more complicated goodness-of-fit tests. I soon discovered, however, that keeping track of the cumulative stats required to calculate these measures is problematic when using custom mining functions. SQL Server will make numerous calls to your function, depending on how many input cases there are, but there isn’t a straightforward solution to keeping track of values across these calls. As a result, I had to create two class-scoped variable arrays that mirrored the AlgorithmBase’s SkewnessKurtosisClassArray and DataPointArray variables, which I explained how to use in Post 14.2. The static CaseCount variable I declared at the beginning of the function is used in conditional logic throughout the outer Else…End If statement to control the population of the TempDataPointArray. The DataPointClass objects that accumulate in this array with each iteration of the function are fed en masse to the routines that calculate skewness, kurtosis, excess kurtosis and the Jarque-Bera test when the CaseCount reaches the final input case. On this final pass, those four stats are also added to the single row that comprises the data table the function returns. This MyCustomMiningFunctionDataTable variable is defined as an instance of System.Data.DataTable in the outer If…Else block, which will only be triggered when InputCase has a value of Nothing. This condition is only met on the first pass through the function, which is apparently designed solely to provide a means of declaring the return variable. The Context.ExecuteForPrepare condition is only true on this first pass. Special exception-handling logic is required to reset the values of the class-scoped variables and the static CaseCount in the event of an error, which is comprised by the Catch…End Try block at the end of the method. The last five lines in the End If immediately above the Catch…End Try block perform pretty much the same function, but only after the final pass executes successfully. If these two blocks of code were not included, the CaseCount would continue to accumulate and the calculations of the TempDataPointArray and TempSkewnessKurtosisClassArray would at best throw exceptions, or worse yet, incorrect results.
…………Ideally, it would be easier to retrieve the count of input rows from the function itself or some variable exposed by SSDM, or at least find a way to determine when the function has exited, but thus far I have not yet found such a means. One alternative I tried was to use a CaseSet as an input parameter, but the Description column of the DMSCHEMA_MINING_FUNCTIONS rowset returned an “Invalid parameter type” error. Defining InputCase as an array of MiningCase objects will produce a different error when you run a Data Mining Extensions (DMX) query, including the text, “Too many arguments or flags were supplied for the function.” Nor can you return multiple rows, one at a time; this is why there are three Return statements in the function, one for the initial ExecuteForPrepare phase, one for returning all of the data en masse one time only in a single table, and another for cases in which an empty table is returned because there are no results to report. Returning a value of Nothing instead of the return type declared in the ExecuteForPrepare phase will raise an error, whereas returning an empty table with no rows is also undesirable, since it adds an extra line to the output. If you return a null value instead of the declared return type, you will also receive an error in SQL Server Management Studio (SSMS) worded like this: “COM error: COM error: DMPluginWrapper; Custom Function MyCustomMiningFunction should return a table, but it returned NULL.” There may still be errors lurking in the conditional logic I implemented in this function, but it at least amounts to the crude beginnings of a workaround of the problem of calculating stats dependent on cumulative values in custom mining functions. Would-be plug-in developers are basically stuck with iterating over the input cases one row at a time, rather than operating on an entire dataset, so some type of workaround involving static or class-scoped variables may be necessary when calculating cumulatives. Displaying the results with this type of hack is also problematic, as Figure 3 demonstrates. It is necessary to add another column to the results, as the first reference to “InputStat” does, otherwise FLATTENED results will only display a single blank row, but this means each row in the input set (taken from an OpenQuery on the PluginModel’s data source) will have a corresponding row in the results. In this case, each of the 20 input cases in our dataset is represented by the skewness value calculated for the training set of the InputStat column. It is possible to omit this superfluous column by removing the FLATTENED statement, but then it returns one empty nested table for each row in the input set. Either way, the only row we’re interested in is the last one, which depicts the skewness, kurtosis, excess kurtosis and Jarque-Bera statistic for the entire dataset. The TotalRows column is also populated with the value the user input for RecordCount when the INCLUDE_COUNT flag is specified; this takes just six lines of code in two separate section of Figure 3, once to add the column to the return table and another to assign the value. To date, this is still impractical because I haven’t found a way to use a TOP, ORDER BY, sub-select other clause to return only the row with the cumulative statistics we calculated, but my ugly hack at least shows that cumulatives can be calculated with custom mining functions. That’s not something that’s ordinarily done, but this comparatively short topic seemed an ideal place to contribute a little bleeding edge experimentation. The “Rickety” disclaimer I affixed to this whole series is sometime apt, because I’m learning as I go; the only reason I’m fit to write amateur tutorials of this kind is that there so little information out there on SSDM, which is an underrated feature of SQL Server that is badly in need of free press of any kind. In the next installment of this series, we’ll delve into other advanced plug-in functionality, like custom mining flags, algorithm parameters and feature selection.

Figure 3: Sample Query Result for the Custom Mining Function (With the Cumulatives Hack)
CustomFunctionQueryResults

Rate

You rated this post out of 5. Change rating

Share

Share

Rate

You rated this post out of 5. Change rating