SQLServerCentral Article

Data Lineage Scripts for Microsoft SQL Server and Azure SQL


Data lineage is the process of understanding the lifecycle of data, from origin to destination. It tracks where data originates, how it flows through an organisation's systems and how it changes along the way.

Why is data lineage important?

The information gained from data lineage is crucial for understanding data management, metadata and data analytics. Lineage helps you understand data and use it effectively. Without an overview of the data flow, it becomes much more tedious for analysts to find data and to recognise its potential for the business.

What are the immediate benefits of data lineage?

  • Better and more accurate analytics. By letting analytics teams and business users know where data comes from and what it means, data lineage improves their ability to find the data they need for BI and data science uses. That leads to better analytics results and makes it more likely that data analysis work will deliver meaningful information to drive business decision-making.
  • Better data security and privacy overview. Organisations can use data lineage information to identify sensitive data that requires particularly strong security and assess potential risks.
  • Stronger data governance. Data lineage also aids in tracking data and carrying out other key parts of the governance process.
  • Improved data management. In addition to data quality improvement, data lineage improves data engineering and IT tasks, like data migration, data consolidation, and detecting potential data-related problems (missing values, skewed data distribution, ...).

What the data lineage script brings

  • uses native T-SQL to analyse a SQL query and collect information about its sources and data flow
  • provides a simplified view of the SQL query
  • helps you better document end-to-end mappings and data flows through your organisation's systems

Data lineage script structure

The Data lineage script consists of three main parts:

  • standalone function for removing unnecessary or irrelevant characters for lineage
  • removing comments
  • extracting the predicates and tables

Removing specific characters

This standalone function is embedded in the Data Lineage script, but you can also use it on its own in any other scenario. It strips the characters that are not needed for the further analysis of the query.

CREATE OR ALTER FUNCTION dbo.fn_removelistChars
/*
Author: Tomaz Kastrun
Created: 06.JUN.2022
Desc: Function for removing list of unwanted characters
Usage:
SELECT dbo.fn_removelistChars('Tol~99""''''j\e.j/e[,t&eks]t,ki')
*/(
@txt AS VARCHAR(max)
)
RETURNS VARCHAR(MAX)
AS
BEGIN
    -- Pattern matches any single character that is NOT in the allowed list
    DECLARE @list VARCHAR(200) = '%[^a-zA-Z0-9+@#\/_?!:.''-]%'
    WHILE PATINDEX(@list,@txt) > 0
        -- Remove every occurrence of the first disallowed character found
        SET @txt = REPLACE(@txt, SUBSTRING(@txt, PATINDEX(@list,@txt), 1), '')
RETURN @txt
END;
GO   
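
As a quick illustration of the stripping loop, here is a minimal sketch that applies the same PATINDEX/REPLACE technique to a local variable instead of calling the function. It assumes the pattern is a negated character class wrapped in % wildcards; the sample string and the variable names are made up for this example.

```sql
-- Hypothetical sample input; anything outside the allowed list is removed.
DECLARE @txt  VARCHAR(MAX) = 'Hello, [World]; (2022)';
DECLARE @list VARCHAR(200) = '%[^a-zA-Z0-9+@#\/_?!:.''-]%';

-- Each pass finds the first disallowed character and removes all of its occurrences.
WHILE PATINDEX(@list, @txt) > 0
    SET @txt = REPLACE(@txt, SUBSTRING(@txt, PATINDEX(@list, @txt), 1), '');

SELECT @txt AS cleaned;  -- HelloWorld2022
```

Note that spaces are not in the allowed list, so they are stripped as well. That is why the procedure applies the function to individual tokens rather than to the whole query.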

Data Lineage script

The Data lineage script has two parts. The first part removes comments of any kind: single-line comments introduced by two hyphens (--) and multi-line comments delimited by slash-star (/* */). Nested comments are also supported and are removed from the script before the data lineage is created.
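
The idea behind removing nested block comments can be sketched with a depth counter. This is a simplified illustration, not the procedure's exact code: the variable names and the sample query are assumptions, and the sketch does not try to recognise comment markers that appear inside string literals.

```sql
-- Hypothetical sample query with a nested block comment.
DECLARE @sql   VARCHAR(MAX) = 'SELECT 1 /* outer /* inner */ still comment */ AS x';
DECLARE @out   VARCHAR(MAX) = '';
DECLARE @depth INT = 0;          -- how many block comments are currently open
DECLARE @i     INT = 1;
DECLARE @len   INT = LEN(@sql);

WHILE @i <= @len
BEGIN
    IF SUBSTRING(@sql, @i, 2) = '/*'
    BEGIN
        SET @depth += 1;         -- entering a (possibly nested) comment
        SET @i += 2;
    END
    ELSE IF SUBSTRING(@sql, @i, 2) = '*/' AND @depth > 0
    BEGIN
        SET @depth -= 1;         -- leaving one comment level
        SET @i += 2;
    END
    ELSE
    BEGIN
        IF @depth = 0            -- keep characters only outside of comments
            SET @out = @out + SUBSTRING(@sql, @i, 1);
        SET @i += 1;
    END
END

SELECT @out AS without_block_comments;  -- 'SELECT 1  AS x'
```

Single-line comments can then be handled line by line, by cutting each line at the first occurrence of two hyphens.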

In the second part, a WHILE loop iterates through the code and analyses the data sources and the corresponding clauses. The script can also analyse the columns relevant for data streamlining and data flow. At the end, the script returns all the relevant information about the data sources of your query.
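
The scanning step can be pictured with the following minimal sketch: walk through the query token by token and record the token that follows a source keyword. The keyword list is shortened, all names are hypothetical, and tokens are read left to right with CHARINDEX so that their order is preserved. The procedure itself tracks more keywords and more state.

```sql
-- Hypothetical single-line sample query.
DECLARE @q VARCHAR(MAX) =
    'SELECT a.id FROM dbo.Orders a JOIN dbo.Customers c ON c.id = a.cid';

DECLARE @found TABLE (order_ INT IDENTITY(1,1), clause_ VARCHAR(20), object_ VARCHAR(200));
DECLARE @rest VARCHAR(MAX) = @q + ' ';   -- trailing space terminates the last token
DECLARE @tok  VARCHAR(200);
DECLARE @prev VARCHAR(200) = '';
DECLARE @pos  INT;

WHILE LEN(@rest) > 0
BEGIN
    SET @pos  = CHARINDEX(' ', @rest);
    SET @tok  = LEFT(@rest, @pos - 1);
    SET @rest = STUFF(@rest, 1, @pos, '');   -- drop the token and its separator

    IF @tok = '' CONTINUE;                   -- skip runs of spaces

    -- The token after FROM / JOIN / INTO is treated as a data source.
    IF LOWER(@prev) IN ('from', 'join', 'into')
        INSERT INTO @found (clause_, object_) VALUES (LOWER(@prev), @tok);

    SET @prev = @tok;
END

SELECT order_, clause_, object_ FROM @found ORDER BY order_;
```

For the sample query this records dbo.Orders for the from clause and dbo.Customers for the join clause. A subquery such as FROM (SELECT ...) would be recorded as an opening parenthesis, which is one of the cases the full procedure has to treat separately.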

CREATE OR ALTER PROCEDURE dbo.TSQL_data_lineage 
/*
Author: Tomaz Kastrun
Date: August 2022
GitHub: github.com/tomaztk
Blogpost: 
Description:
Removing all comments from your T-SQL Query for a given procedure for better code visibility and readability - separate function 
    Remove all unused characters.
    Create data lineage for inputed T-SQL query
Usage:
EXEC dbo.TSQL_data_lineage 
@InputQuery = N'  SELECT * FROM master.dbo.spt_values '
*/(
@InputQuery NVARCHAR(MAX) 
)
AS
BEGIN
/* ******************************
*
* 2. Remove comments characters
*
******************************** */
DROP TABLE IF EXISTS dbo.SQL_query_table
CREATE TABLE dbo.SQL_query_table (
    id INT IDENTITY(1,1) NOT NULL
    ,query_txt NVARCHAR(4000)
)
    -- Breaks the procedure into lines with linebreak
    -- INSERT INTO dbo.SQL_query_table
    -- EXEC sp_helptext  
    --     @objname =  @InputQuery
        -- Breaks the query into lines with linebreak
            DECLARE @MAX_nof_break INT = (select len(@InputQuery) - len(REPLACE(@InputQuery, CHAR(10), '')))
            DECLARE @start_nof_break INT = 1
            declare @iq2 NVARCHAR(max) = @InputQuery
            declare @max_len int = (SELECT len(@InputQuery))
            declare @start_pos int = 0
            declare @br_pos int = 0

            while (@MAX_nof_break >= @start_nof_break)
            BEGIN
                SET @br_pos = (SELECT charindex( char(10), @iq2) )
                INSERT INTO dbo.SQL_query_table(query_txt)
                    SELECT  substring(@InputQuery,@start_pos, @br_pos )
                
                SET @start_pos = @start_pos + @br_pos  
                SET @iq2 = SUBSTRING(@InputQuery, @start_pos, @max_len)
                SET @start_nof_break = @start_nof_break + 1
            END

    --- STart removing comments
    DECLARE @proc_text varchar(MAX) = ''
    DECLARE @proc_text_row varchar(MAX)
    DECLARE @proc_no_comment varchar(MAX) = ''
    DECLARE @comment_count INT = 0

    SELECT @proc_text = @proc_text + CASE 
                                    WHEN LEN(@proc_text) > 0 THEN '\n' 
                                    ELSE '' END + query_txt
    FROM dbo.SQL_query_table

    DECLARE @i INT  = 1
    DECLARE @rowcount INT = (SELECT LEN(@proc_text))
    WHILE (@i <= @rowcount) 
        BEGIN
            IF SUBSTRING(@proc_text,@i,2) = '/*'
                BEGIN
                    SELECT @comment_count = @comment_count + 1
                END
            ELSE IF SUBSTRING(@proc_text,@i,2) = '*/'  
                BEGIN
                    SELECT @comment_count = @comment_count - 1  
                END
            ELSE IF @comment_count = 0
                SELECT @proc_no_comment = @proc_no_comment + SUBSTRING(@proc_text,@i,1)
            IF SUBSTRING(@proc_text,@i,2) = '*/' 
            SELECT @i = @i + 2
            ELSE
            SELECT @i = @i + 1
        END

    WHILE (@i <= @rowcount) 
        BEGIN
            IF SUBSTRING(@proc_text,@i,4) = '/*/*'
                BEGIN
                    SELECT @comment_count = @comment_count + 2
                END
            ELSE IF SUBSTRING(@proc_text,@i,4) = '*/*/'  
                BEGIN
                    SELECT @comment_count = @comment_count - 2 
                END
            ELSE IF @comment_count = 0
                SELECT @proc_no_comment = @proc_no_comment + SUBSTRING(@proc_text,@i,1)
            IF SUBSTRING(@proc_text,@i,4) = '*/*/' 
            SELECT @i = @i + 2
            ELSE
            SELECT @i = @i + 1
        END
    DROP TABLE IF EXISTS  #tbl_sp_no_comments
    CREATE TABLE #tbl_sp_no_comments (
                rn INT IDENTITY(1,1)
                ,sp_text VARCHAR(8000)
                )

    WHILE (LEN(@proc_no_comment) > 0)
        BEGIN
            INSERT INTO  #tbl_sp_no_comments (sp_text)
            SELECT SUBSTRING( @proc_no_comment, 0, CHARINDEX('\n', @proc_no_comment))
            
            SELECT @proc_no_comment = SUBSTRING(@proc_no_comment, CHARINDEX('\n',@proc_no_comment) + 2, LEN(@proc_no_comment))
        END

    DROP TABLE IF EXISTS  #tbl_sp_no_comments_fin
    CREATE TABLE #tbl_sp_no_comments_fin 
                (rn_orig INT IDENTITY(1,1)
                ,rn INT
                ,sp_text_fin VARCHAR(8000))

    DECLARE @nofRows INT =  (SELECT COUNT(*) FROM #tbl_sp_no_comments)
    DECLARE @ii INT = 1
    WHILE (@nofRows >= @ii)
    BEGIN
        DECLARE @LastLB INT = 0
        DECLARE @Com INT = 0 
        SET @Com = (SELECT CHARINDEX('--', sp_text,@com) FROM #tbl_sp_no_comments WHERE rn = @ii)
        SET @LastLB = (SELECT CHARINDEX(CHAR(10), sp_text, @LastLB) FROM #tbl_sp_no_comments WHERE rn = @ii)
        INSERT INTO #tbl_sp_no_comments_fin (rn, sp_text_fin)
        SELECT 
            rn
            ,CASE WHEN @Com = 0 THEN sp_text
                WHEN @Com <> 0 THEN SUBSTRING(sp_text, 0, @Com) END as new_sp_text
        FROM #tbl_sp_no_comments
        WHERE 
            rn = @ii
        SET @ii = @ii + 1
    END
DROP TABLE IF EXISTS  dbo.Query_results_no_comment
SELECT 
    rn
    ,sp_text_fin  
INTO dbo.Query_results_no_comment
FROM #tbl_sp_no_comments_fin
WHERE
    DATALENGTH(sp_text_fin) > 0 
AND LEN(sp_text_fin) > 0

/* ******************************
*
* 3. Create data lineage 
*
******************************** */
DECLARE @orig_q VARCHAR(MAX) 
SELECT @orig_q = COALESCE(@orig_q + ', ', '') + sp_text_fin
FROM dbo.Query_results_no_comment
order by rn asc
DROP TABLE IF EXISTS dbo.LN_Query

DECLARE @stmt2 NVARCHAR(MAX)
SET @stmt2 = REPLACE(REPLACE(@orig_q, CHAR(13), ' '), CHAR(10), ' ')

SELECT 
     TRIM(REPLACE(value, ' ', '')) as val
    ,dbo.fn_removelistChars(value) as val_f
    ,row_number() over (ORDER BY (SELECT 1)) as rn
INTO dbo.LN_Query
from string_split(REPLACE(@stmt2, CHAR(13), ' '), ' ' )
WHERE
    REPLACE(value, ' ', '') <> ''  -- skip empty tokens

DECLARE @table TABLE (command_ VARCHAR(200), location_ VARCHAR(200), order_ INT)
DECLARE @command_i VARCHAR(200) = ''
DECLARE @next_step BIT = 0 -- FALSE (1 = TRUE)
DECLARE @previous VARCHAR(200) = ''
DECLARE @order INT = 1
DECLARE @previous_cmd VARCHAR(200) = ''
DECLARE @previous_step BIT = 0 -- FALSE
DECLARE @ttok VARCHAR(100) = ''

DECLARE @i_row INT = 1
DECLARE @max_row INT = (SELECT MAX(rn) FROM dbo.LN_Query)
DECLARE @row_commands_1 NVARCHAR(1000) = 'select,delete,insert,drop,create,select,truncate,exec,execute'
DECLARE @row_commands_2 NVARCHAR(1000) = 'select,not,if,exists,select'
DECLARE @row_commands_3 NVARCHAR(1000) = 'from,join,into,table,exists,sys.dm_exec_sql_text,sys.dm_exec_cursors,exec,execute'

WHILE (@max_row >= @i_row)
BEGIN
DECLARE @command VARCHAR(1000) = (SELECT val FROM dbo.LN_Query WHERE rn = @i_row)
IF @command IN (SELECT REPLACE(TRIM(LOWER(value)), ' ', '') FROM STRING_SPLIT(@row_commands_1, ','))
BEGIN
IF LOWER(@command) = 'select'
BEGIN
SET @command = 'select'
END
SET @command_i = @command
END
IF (@next_step = 1)
BEGIN
IF @command NOT IN (SELECT REPLACE(TRIM(LOWER(value)), ' ', '') FROM STRING_SPLIT(@row_commands_2,','))
BEGIN
IF (LOWER(@previous) = 'into')
SET @command_i = 'select into'
IF (@command NOT LIKE '' OR @command NOT LIKE '')

SET @ttok = ' ' + @command + ' as ('
 IF (@ttok NOT IN (SELECT @stmt2))
INSERT INTO @table (command_, location_, order_)
SELECT 
 @command_i
,@command
,@order
SET @command_i = @command_i
END
SET @next_step = 0
IF @command  IN ('sys.dm_exec_sql_text','sys.dm_exec_cursors')
BEGIN SET @next_step = 1  END
END
IF (@command IN (SELECT REPLACE(TRIM(LOWER(value)), ' ', '') FROM STRING_SPLIT(@row_commands_3,',')))
BEGIN
SET @next_step = 1
END
SET @previous_cmd  = @command_i
SET @previous = @command
SET @i_row = @i_row + 1
END
DROP TABLE IF EXISTS dbo.final_result
-- Final results
SELECT *
,row_number() over (order by (select 1)) as rn 
INTO dbo.final_result
FROM @table

SELECT 
  [command_] AS Clause_name
 ,[location_] AS Object_Name
 ,rn AS order_DL
 FROM dbo.final_result

END;
GO

 

Running the script

Once you create the procedure for data lineage, have your T-SQL query ready, including all the comments, object names, CTE tables and any other objects.

DECLARE @test_query VARCHAR(MAX) = '
-- This is a sample query to test data lineage
SELECT 
    s.[BusinessEntityID]
    ,p.[Title]
    ,p.[FirstName]
    ,p.[MiddleName]
   -- ,p.[LastName]
    ,p.[Suffix]
    ,e.[JobTitle] as JobName
    ,p.[EmailPromotion]
    ,s.[SalesQuota]
    ,s.[SalesYTD]
    ,s.[SalesLastYear]
,( SELECT GETDATE() ) AS DateNow
,( select count(*)  FROM [AdventureWorks2014].sales.[SalesPerson] ) as totalSales
/*
 Testing some additional comments!
*/FROM [AdventureWorks2014].sales.[SalesPerson] s
    LEFT JOIN [AdventureWorks2014].[HumanResources].[Employee] e 
    ON e.[BusinessEntityID] = s.[BusinessEntityID]
INNER JOIN [AdventureWorks2014].[Person].[Person] AS p
ON p.[BusinessEntityID] = s.[BusinessEntityID]
'

 

And simply execute the procedure with a single input parameter.

EXEC dbo.TSQL_data_lineage 
  @InputQuery = @test_query

 

The data lineage script returns the tables (and columns) used in the query as a result set with the columns Clause_name, Object_Name and order_DL.
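
Because the procedure writes its intermediate and final results into permanent tables in the dbo schema, you can also inspect each stage after the run. The queries below are a small sketch that assumes the procedure has already been executed in the current database; the table and column names are the ones created by the procedure above.

```sql
-- Query text after comment removal, one row per line.
SELECT rn, sp_text_fin
FROM dbo.Query_results_no_comment
ORDER BY rn;

-- Tokens the lineage loop iterates over.
SELECT rn, val, val_f
FROM dbo.LN_Query
ORDER BY rn;

-- Final lineage: clause, object and order of appearance.
SELECT command_, location_, rn
FROM dbo.final_result
ORDER BY rn;
```

Keep in mind that these tables are dropped and recreated on every call, so they always reflect the most recent query only.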

 

About the script

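The script is written entirely in T-SQL and therefore does not need any external scripting language. Because the procedure uses TRIM, which was introduced in SQL Server 2017, you can run the script as written on SQL Server 2017 and later; on SQL Server 2016 you would need to replace TRIM with LTRIM(RTRIM(...)). All editions are supported. Furthermore, you can run the script on Azure SQL Database, Azure SQL Managed Instance and Azure Synapse.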
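
One prerequisite worth checking before you create the procedure is the database compatibility level, because STRING_SPLIT, which the procedure relies on, requires level 130 or higher. The following check is a minimal sketch; the output column aliases are chosen for this example.

```sql
-- Check whether the current database can use STRING_SPLIT.
SELECT
    name AS database_name,
    compatibility_level,
    CASE WHEN compatibility_level >= 130 THEN 'yes' ELSE 'no' END AS string_split_available
FROM sys.databases
WHERE name = DB_NAME();
```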

When you run the script, the query is neither validated nor executed against the server. It is treated as a string and analysed as such.

Conclusion

Enterprises and organisations struggle with the quality of data analysis and with potential security risks. Both usually stem from complex data silos, data engineering and low visibility into data flows. By governing the data flow and understanding data origins, some of these issues become easier to address. You can start today with this script.

You can track future updates on the GitHub repository.

 
