Table-Column Data Profiler

  • Comments posted to this topic are about the item Table-Column Data Profiler

  • Many thanks for this really useful piece of code. Have been doing this with SSIS Profile task, but this is much more handy. JK


    Tks,

    JK

  • Glad someone found it helpful. Here is an update that includes (Distinct) Value Counts:

    DECLARE @Cat_Sch_TableName varchar(128) = 'HRIS.NTC.D20130201'

    -- Initialize fully qualified object name; script can be used like a stored procedure

    DECLARE @sqlstring varchar(8000), @ColumnName varchar(128), @ColumnID smallint,

    @SequenceNumber smallint

    -- cleanup temp tables

    IF OBJECT_ID('Tempdb..#RawColumns') IS NOT NULL DROP TABLE #RawColumns

    IF OBJECT_ID('Tempdb..#ProfiledColumns') IS NOT NULL DROP TABLE #ProfiledColumns

    -- Populate working list of columns

    SELECT ColumnID = IDENTITY(int, 1,1),

    COLUMN_NAME AS ColumnName,

    ORDINAL_POSITION AS SequenceNumber

    INTO #RawColumns

    FROM information_schema.COLUMNS

    WHERE TABLE_CATALOG+'.'+TABLE_SCHEMA+'.'+TABLE_NAME=@Cat_Sch_TableName

    -- Initialize output table

    CREATE TABLE #ProfiledColumns(

    ColumnName varchar(128) NOT NULL,

    SequenceNumber varchar(10),

    NullCount int ,

    ModalValue varchar(255),

    ModalCount int,

    ValueCount int,

    MinValue varchar(255),

    MaxValue varchar(255),

    MaxLength smallint )

    -- Loop through columns

    WHILE (EXISTS(

    SELECT *

    FROM #RawColumns

    WHERE ColumnName IS NOT NULL ))

    BEGIN

    SELECT @ColumnID = ColumnID,

    @ColumnName = ColumnName,

    @SequenceNumber = SequenceNumber

    FROM #RawColumns -- assign column data to variables

    SET @sqlstring=

    'INSERT INTO #ProfiledColumns ' +

    'SELECT '''+@ColumnName+''',' +

    CAST(@SequenceNumber AS varchar(10)) + ' , ' +

    'SUM(CASE WHEN ['+@ColumnName+'] IS NULL THEN 1 ELSE 0 END) ,'+

    '(SELECT TOP 1 ['+@ColumnName+'] FROM '+@Cat_Sch_TableName+' GROUP BY ['+@ColumnName+'] ORDER BY COUNT(*) DESC),' +

    '(SELECT TOP 1 COUNT(*) FROM '+@Cat_Sch_TableName+' GROUP BY ['+@ColumnName+'] ORDER BY COUNT(*) DESC),' +

    'COUNT(DISTINCT ['+@ColumnName+']),' +

    'CAST(MIN(['+@ColumnName+']) AS VARCHAR(255)),'+

    'CAST(MAX(['+@ColumnName+']) AS VARCHAR(255)),'+

    'MAX(LEN(CAST(['+@ColumnName+'] AS VARCHAR(255)))) '+

    'FROM ' + @Cat_Sch_TableName -- build query string

    DELETE FROM #RawColumns WHERE ColumnID=@ColumnID -- one column down, next...

    EXECUTE (@sqlstring) -- execute built query

    END -- End loop block

    -- Display data profile with condensed data type

    SELECT

    ColumnName,

    DATA_TYPE +

    CASE WHEN DATA_TYPE LIKE '%char' THEN '('+LTRIM(CHARACTER_MAXIMUM_LENGTH)+')'

    WHEN COALESCE(NUMERIC_SCALE,0) <> 0 THEN '('+LTRIM(NUMERIC_PRECISION)+','+LTRIM(NUMERIC_SCALE)+')' ELSE ''

    END AS DataTypeName,

    NullCount,

    ModalValue,

    ModalCount,

    ValueCount,

    MinValue,

    MaxValue,

    MaxLength

    FROM #ProfiledColumns

    JOIN information_schema.COLUMNS

    ON SequenceNumber=ORDINAL_POSITION

    WHERE TABLE_CATALOG+'.'+TABLE_SCHEMA+'.'+TABLE_NAME=@Cat_Sch_TableName

    ORDER BY CAST(SequenceNumber AS SMALLINT)

  • Handy, but it throws an error on a bit field because you cannot perform a min or max on them

    Michael L John
    If you assassinate a DBA, would you pull a trigger?
    To properly post on a forum:
    http://www.sqlservercentral.com/articles/61537/

  • Thanks forgot about bit. This should work on bit (other data types?) as well...

    DECLARE @Cat_Sch_TableName varchar(128) = 'HRIS.dbo.E20130201'

    -- Initialize fully qualified object name; script can be used like a stored procedure

    DECLARE @sqlstring varchar(8000), @ColumnName varchar(128), @ColumnID smallint,

    @SequenceNumber smallint

    -- cleanup temp tables

    IF OBJECT_ID('Tempdb..#RawColumns') IS NOT NULL DROP TABLE #RawColumns

    IF OBJECT_ID('Tempdb..#ProfiledColumns') IS NOT NULL DROP TABLE #ProfiledColumns

    -- Populate working list of columns

    SELECT ColumnID = IDENTITY(int, 1,1),

    COLUMN_NAME AS ColumnName,

    ORDINAL_POSITION AS SequenceNumber

    INTO #RawColumns

    FROM information_schema.COLUMNS

    WHERE TABLE_CATALOG+'.'+TABLE_SCHEMA+'.'+TABLE_NAME=@Cat_Sch_TableName

    -- Initialize output table

    CREATE TABLE #ProfiledColumns(

    ColumnName varchar(128) NOT NULL,

    SequenceNumber varchar(10),

    NullCount int ,

    ModalValue varchar(255),

    ModalCount int,

    ValueCount int,

    MinValue varchar(255),

    MaxValue varchar(255),

    MaxLength smallint )

    -- Loop through columns

    WHILE (EXISTS(

    SELECT *

    FROM #RawColumns

    WHERE ColumnName IS NOT NULL ))

    BEGIN

    SELECT @ColumnID = ColumnID,

    @ColumnName = ColumnName,

    @SequenceNumber = SequenceNumber

    FROM #RawColumns -- assign column data to variables

    SET @sqlstring=

    'INSERT INTO #ProfiledColumns ' +

    'SELECT '''+@ColumnName+''',' +

    CAST(@SequenceNumber AS varchar(10)) + ' , ' +

    'SUM(CASE WHEN ['+@ColumnName+'] IS NULL THEN 1 ELSE 0 END) ,'+

    '(SELECT TOP 1 ['+@ColumnName+'] FROM '+@Cat_Sch_TableName+' GROUP BY ['+@ColumnName+'] ORDER BY COUNT(*) DESC),' +

    '(SELECT TOP 1 COUNT(*) FROM '+@Cat_Sch_TableName+' GROUP BY ['+@ColumnName+'] ORDER BY COUNT(*) DESC),' +

    'COUNT(DISTINCT ['+@ColumnName+']),' +

    'CAST(MIN(CAST(['+@ColumnName+'] AS sql_variant)) AS VARCHAR(8000)),'+

    'CAST(MAX(CAST(['+@ColumnName+'] AS sql_variant)) AS VARCHAR(8000)),'+

    'MAX(LEN(CAST(['+@ColumnName+'] AS VARCHAR(255)))) '+

    'FROM ' + @Cat_Sch_TableName -- build query string

    DELETE FROM #RawColumns WHERE ColumnID=@ColumnID -- one column down, next...

    EXECUTE (@sqlstring) -- execute built query

    END -- End loop block

    -- Display data profile with condensed data type

    SELECT

    ColumnName,

    DATA_TYPE +

    CASE WHEN DATA_TYPE LIKE '%char' THEN '('+LTRIM(CHARACTER_MAXIMUM_LENGTH)+')'

    WHEN COALESCE(NUMERIC_SCALE,0) <> 0 THEN '('+LTRIM(NUMERIC_PRECISION)+','+LTRIM(NUMERIC_SCALE)+')' ELSE ''

    END AS DataTypeName,

    NullCount,

    ModalValue,

    ModalCount,

    ValueCount,

    MinValue,

    MaxValue,

    MaxLength

    FROM #ProfiledColumns

    JOIN information_schema.COLUMNS

    ON SequenceNumber=ORDINAL_POSITION

    WHERE TABLE_CATALOG+'.'+TABLE_SCHEMA+'.'+TABLE_NAME=@Cat_Sch_TableName

    ORDER BY CAST(SequenceNumber AS SMALLINT)

  • Handy script, Thanks!! Many for the DB's I run into have nvarchar(max) columns. Any feel for how to address these?

    Steve Pirazzi
    ONEWARE, Inc.
    http://www.ONEWARE.com

  • Hi Todd,

    I just tried your script and I like it!

    A couple of questions:

    1. Can it be modified to loop through all tables in a database automatically and show the results?

    2. For the datatypes that it can't handle (uniqueidentifier, image, text) it generates an exception and stops processing. Is it possible that it could simply skip the columns with these datatypes and produce an output of the remaining columns?

    I ask because I'm not a strong coder and not sure how to get at this.

    Thanks again for an excellent script Todd

    Steve

  • Thanks Steve

    I am not guaranteeing anything because I never allow use of varchar(max) and have no way to test it, BUT

    for Varchar(max) types to work, change this statement:

    CREATE TABLE #ProfiledColumns(

    ColumnName varchar(128) NOT NULL,

    SequenceNumber varchar(10),

    NullCount int ,

    ModalValue varchar(max),

    ModalCount int,

    ValueCount int,

    MinValue varchar(max),

    MaxValue varchar(max),

    MaxLength smallint

    And this statement:

    SET @sqlstring=

    'INSERT INTO #ProfiledColumns ' +

    'SELECT '''+@ColumnName+''',' +

    CAST(@SequenceNumber AS varchar(10)) + ' , ' +

    'SUM(CASE WHEN ['+@ColumnName+'] IS NULL THEN 1 ELSE 0 END) ,'+

    '(SELECT TOP 1 ['+@ColumnName+'] FROM '+@Cat_Sch_TableName+' GROUP BY ['+@ColumnName+'] ORDER BY COUNT(*) DESC),' +

    '(SELECT TOP 1 COUNT(*) FROM '+@Cat_Sch_TableName+' GROUP BY ['+@ColumnName+'] ORDER BY COUNT(*) DESC),' +

    'COUNT(DISTINCT ['+@ColumnName+']),' +

    'CAST(MIN(CAST(['+@ColumnName+'] AS sql_variant)) AS VARCHAR(max)),'+

    'CAST(MAX(CAST(['+@ColumnName+'] AS sql_variant)) AS VARCHAR(max)),'+

    'MAX(LEN(CAST(['+@ColumnName+'] AS VARCHAR(255)))) '+

    'FROM ' + @Cat_Sch_TableName -- build query string

  • SteveBell (2/20/2013)


    Hi Todd,

    I just tried your script and I like it!

    Thanks!

    SteveBell (2/20/2013)


    A couple of questions:

    Uhoh..

    SteveBell (2/20/2013)


    1. Can it be modified to loop through all tables in a database automatically and show the results?

    Yes but not by me yet; you will need to set up an outer loop using a Cursor or a WHILE statement or somesuch. Maybe later...

    SteveBell (2/20/2013)


    2. For the datatypes that it can't handle (uniqueidentifier, image, text) it generates an exception and stops processing. Is it possible that it could simply skip the columns with these datatypes and produce an output of the remaining columns?

    Try adding this bolded line:

    WHERE TABLE_CATALOG+'.'+TABLE_SCHEMA+'.'+TABLE_NAME=@Cat_Sch_TableName

    AND DATA_TYPE NOT IN ('uniqueidentifier','image','text') -- excluded types

  • Thanks again Todd,

    This is very helpful and much appreciated.

  • Todd,

    Thanks for the suggested changes. I’m going to try to get the unsupported data type columns to show up in the output, but not analyze the data in the columns. Hopefully this will allow someone without explicit knowledge of the table to be aware of their presence. I’ll post back if I get it working.

    Steve Pirazzi
    ONEWARE, Inc.
    http://www.ONEWARE.com

  • To suppress Min, Max and Modal Values, a CASE statement or an OUTER JOIN on data_types is needed in the SQL string area. If I had time I would do it myself.

Viewing 12 posts - 1 through 11 (of 11 total)

You must be logged in to reply to this topic. Login to reply