SQLServerCentral Article

Data Profiling Column Functional Dependency T-SQL

,

A while back I was reading one of Jamie Thomson most excellent posts on Conchango blogs SSIS Junkie SSIS: Data Profiling Task: Part 7 - Functional Dependency and it occurred to me it would be interesting to provide a poor man’s Functional Dependency using on TSQL.

 

First let’s define Functional Dependency.

Functional Dependency

Determines the extent to which the values in one column (the dependent column) depend on the values in another column or set of columns (the determinant column) ; This profile also helps you identify problems in your data, such as values that are not valid. For example, you profile the dependency between a column that contains Country or Region Codes and a column that contains States/Provinces. The same Country or Region Codes should always have the same States/Provinces, but the profile discovers violations of this dependency.

The Problem

When analyzing the capabilities needed to determine one columns dependency on another you would obviously need recursion or the ability to link one row to the next.

 

The Approach

Naturally when facing a difficult problem in T-SQL I turned to the SQL Server Central vast library.(Blatant Suck Up)

A brief search of existing articles SQL Server Central returned Linking to the Previous Row By David McKinney, 2008/03/13 and Common Table Expressions in SQL Server 2005 By Srinivas Sampath, 2007/03/09

 

Armed with these two articles I now had most of the techniques I need to develop a query or stored procedure to provide Column Functional Dependency.

 

The Code for T-SQL Column Functional Dependency

The heart of this effort is to utilizeCommon Table Expressions(CTE) or the “with”clause as defined on MSDN as:

 

“Specifies a temporary named result set, known as a common table expression (CTE). This is derived from a simple query and defined within the execution scope of a single SELECT, INSERT, UPDATE, or DELETE statement. This clause can also be used in a CREATE VIEW statement as part of its defining SELECT statement. A common table expression can include references to itself. This is referred to as a recursive common table expression.”

 

Basically the “with” clause provides you with the ability to logically create CTE’s or result sets ,logically develop your code and employ recursive techniques.

 

A detailed treatment of CTE’s is available at and Common Table Expressions in SQL Server 2005 By Srinivas Sampath, 2007/03/09

 

Setup Test Data

I have used the same test setup that Jamie Thomson provides for testing the SQL Server 2008 Data Profiling Task.

 

select ROW_NUMBER() OVER (partition by sp.Name order by a.PostalCode) RowNum
, a.PostalCode
, sp.Name AS StateProvinceName
, sp.CountryRegionCode
into Addresses
from Person.[Address] a
inner join Person.StateProvince sp
on a.StateProvinceID = sp.StateProvinceID
update Addresses
set CountryRegionCode = 'ZZ'
where RowNum = 1

This will create a test table called Addresses, which contains a Functional Dependency scenario for Province or States and Countries.

 

 

 

 

 

 

 

Example Functional Dependency Stored Procedure

 

 USE [AdventureWorks]
GO /****** Object: Table [dbo].[DependencyColumnTable] Script Date: 01/14/2009 19:07:54 ******/
IF EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[DependencyColumnTable]') AND type in (N'U'))
DROP TABLE [dbo].[DependencyColumnTable]
GO /****** Object: StoredProcedure [dbo].[uspGetBillOfMaterials] Script Date: 01/14/2009 14:26:33 ******/
SET ANSI_NULLS ON
GO SET QUOTED_IDENTIFIER ON
GO USE [AdventureWorks]
GO /****** Object: StoredProcedure [dbo].[ColumnDependency] Script Date: 01/14/2009 17:33:24 ******/
IF EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[ColumnDependency]') AND type in (N'P', N'PC'))
DROP PROCEDURE [dbo].[ColumnDependency]
GO USE [AdventureWorks]
GO /****** Object: StoredProcedure [dbo].[ColumnDependency] Script Date: 01/14/2009 17:33:24 ******/
SET ANSI_NULLS ON
GO SET QUOTED_IDENTIFIER ON
GO CREATE PROCEDURE [dbo].[ColumnDependency]
@TableName VARCHAR(128) ,
@Determinant VARCHAR(128) ,
@Dependent VARCHAR(128)

AS
Declare @SQL VarChar(MAX) SELECT @SQL = ' ;WITH DependencyCompare AS ('
SELECT @SQL = @SQL + ' SELECT '
SELECT @SQL = @SQL + @Determinant + ' ,' + @Dependent + ', '
SELECT @SQL = @SQL + 'ROW_NUMBER() OVER (Partition BY ' + @Dependent + ' ORDER BY ' + @Determinant + ' ) AS rownumcalc FROM '
SELECT @SQL = @SQL + @TableName SELECT @SQL = @SQL + ' ) SELECT
currow.' + @Dependent + ' AS curDet,
prevrow.' + @Dependent + ' AS prevDet,
nextrow.' + @Dependent + ' AS nextDet,
currow.' + @Determinant + ' AS curDep,
prevrow.' + @Determinant + ' AS prevDep, nextrow.' + @Determinant
+ ' AS nextDep,
currow.rownumcalc ,
ViolationCount =
Case
When currow.' + @Determinant + ' = prevrow.' + @Determinant +
' and currow.' + @Dependent + ' = prevrow.' + @Dependent +
' then 0
When prevrow.' + @Dependent + ' is null
then 0
Else 1
End'
+
' INTO DependencyColumnTable
FROM
DependencyCompare currow '
+ ' LEFT JOIN DependencyCompare nextrow ON currow.rownumcalc = nextrow.rownumcalc - 1 AND currow.'
+ @Dependent
+ ' = nextrow.'
+ @Dependent
+ ' LEFT JOIN DependencyCompare prevrow ON currow.rownumcalc = prevrow.rownumcalc + 1 AND currow.'
+ @Dependent
+ ' = prevrow.'
+ @Dependent
+ ' Order by currow.'
+ @Dependent Print @SQL Exec (@SQL) GO

 

Sample Execution

Exec ColumnDependency
@TableName = 'Addresses',
@Dependent = 'StateProvinceName',
@Determinant = 'CountryRegionCode '

 

The code here is basically leveraging the recursive capabilities of CTE. Our objective here is to compare current row to previous row columns to determine when a Dependent(StateProvinceName) column value changes and the Determinant(CountryRegionCode) column valus has NOT changed. This is a Violation.

 

A detailed explanation of the technique of linking to previous records and be reviewed at SQL Server Central returned Linking to the Previous Row By David McKinney, 2008/03/13.

 

 

One issue I had was needing to dynamically populate the From clause or Table.

 

I used the technique presented by Introduction to Dynamic SQL (Part 1) By Damian Maclennen on 20 June 2001. This technique relies in creating a @SQL variable and executing the variable as part of the stored procedure.

 

 

 

Create Procedure GenericTableSelect
@TableName VarChar(100)
AS
Declare @SQL VarChar(1000)
SELECT @SQL = 'SELECT * FROM '
SELECT @SQL = @SQL + @TableName
Exec ( @SQL)
GO

`

 

 

Example Functional Dependency Results

 

 

 

 

 

As you can see here Alabama is our Determinant column which normally has an associated Dependent Values of US. You can see when the Dependent values changes to ZZ it is tagged a Violation Count as ‘1’

 

 

Example Functional Summary

 

Now that we have created a table with the Functional Dependency Violations identified we want to create a summary or OLAP Ready table that can be used for analysis or reporting, containing the same percentages available in the SSIS 2008 Data Profiling Task Functional Option.

 

Again we have relied on the CTE in order to create a reusable query that can create a single table that has the required for analysis such as Totlal Determinint , Dependents categroized by Supporting Dependents and Violating Dependents

 

.

/* Sample Execution
exec ColumnDependencyOLAP
*/
/* This SP will create a Summary Table and calculate the Support and Vio;ation percentages.
exec ColumnDependencyOLAP
*/USE [AdventureWorks]
GO /****** Object: StoredProcedure [dbo].[uspGetBillOfMaterials] Script Date: 01/14/2009 14:26:33 ******/
SET ANSI_NULLS ON
GO SET QUOTED_IDENTIFIER ON
GO USE [AdventureWorks]
GO /****** Object: StoredProcedure [dbo].[ColumnDependencyOLAP] Script Date: 01/14/2009 17:33:24 ******/
IF EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[ColumnDependencyOLAP]') AND type in (N'P', N'PC'))
DROP PROCEDURE [dbo].[ColumnDependencyOLAP]
GO /****** Object: StoredProcedure [dbo].[ColumnDependencyOLAP] Script Date: 01/14/2009 17:33:24 ******/
SET ANSI_NULLS ON
GO SET QUOTED_IDENTIFIER ON
GO CREATE PROCEDURE [dbo].[ColumnDependencyOLAP]


AS
Declare @SQL VarChar(MAX)
SELECT @SQL = '
;WITH DependencyCompareDetail AS
(
Select count(ViolationCount) as Violations , curDet , curDep,
Case when ViolationCount = 0
then ''Support Percentage''
else ''Violation Percentage''
end as Violation ,
ROW_NUMBER() OVER (Partition BY curDet ORDER BY curDet ,curDep) AS rownumcalc
FROM [AdventureWorks].[dbo].[DependencyColumnTable]
--Where ViolationCountNull = ''N''
group by curDet, curDep , ViolationCount
)
,DependencyCompareSummary AS (
Select count(ViolationCount) as total , curDet ,
ROW_NUMBER() OVER (Partition BY curDet ORDER BY curDet ) AS rownumcalc
FROM [AdventureWorks].[dbo].[DependencyColumnTable]
--Where ViolationCountNull = ''N''
group by curDet
)
,DependencyCompareOLAP AS
(
Select
DependencyCompareDetail.curDet ,
DependencyCompareDetail.curDep,
DependencyCompareDetail.Violation,
DependencyCompareDetail.Violations ,
DependencyCompareSummary.total ,
ROW_NUMBER() OVER (Partition BY DependencyCompareDetail.curDet ORDER BY DependencyCompareDetail.curDep) AS rownumcalc
From DependencyCompareDetail
LEFT JOIN DependencyCompareSummary
ON DependencyCompareDetail.curDet =DependencyCompareSummary.curDet
)
Select
curDet as Determinant,
curDep as Dependent,
Violation as ''Dependency Type'',
Violations as ''Dependeny Counts'' ,
total as ''Total Determinant Count'',
cast(cast(Violations AS Decimal ) /cast(total AS Decimal )* 100 as decimal (5, 2)) as PercentViolation
INTO DependencyCompareOLAP
from DependencyCompareOLAP
Order by curDet , curDep ' Print @SQL Exec (@SQL) GO Exec ColumnDependencyOLAP

 

 

Example Functional Summary Results

 

 

 

 

Here are the results from the SSIS 2008 Data Profiling Task for Functional Dependency.

 

 

As you can see the Support Percentages for Alabama are exactly the same.

 

Roll your Own Interactive Scorecard

 

 

I would also add from an analytical perspective you can quickly create a Pivot Table in Microsoft Excel 2007. Simply open Excel 2007, select the Data Tab .

Then select From Other Sources then follow the wizard and create a connection to the DependencyCompareOLAP table to create the Pivot table and Chart below.

 

As you can see here is a quick way to determine the outliers like Alabama.

 

 

Or Manitoba in Canada

 

 

Further Enhancements

 

You also can add additional columns to the table and stored procedures, such as table name, date and time and create a repository that would allow you to track the Violations incrementally or over time, and provide proactive processes to respond to continuing anomalies.

Obviously we can extend the code to support more than one determininat column.

Also in future post we can explore using this query and a few more to develop a Dynamic Metadata Repository.

 

Additional Resources:

Here are several additional links that were helpful.

 

  1. Introduction to Dynamic SQL (Part 1) By Damian Maclennen on 20 June 2001
  2. SQL Server 2005 Recursion and WITH Clause By Robbe D. Morris
  3. The SQL WITH Clause, (Order Your Thoughts, Reuse Your Code) Submitted by Kevin Meade on Tue, 2007-06-19 14:59

 


 

Ira Warren Whiteside
Actuality Business Intelligence

"karo yaa na karo, koshish jaisa kuch nahi hai"
"Do, or do not. There is no try."

 

Resources

Rate

4 (10)

You rated this post out of 5. Change rating

Share

Share

Rate

4 (10)

You rated this post out of 5. Change rating