Column stats or density for each table

  • Hi to all sql lovers.
    Is there any way to get for each table and each column for example this:
    SELECT COUNT(DISTINCT ColumnName) AS UniqueValues FROM Table 
    I don't want to do it in some loop/cursor, because this will destroy the server. Is there any sys.stats similar table? Thank you...

  • You could do something like this and then copy and paste the SQL_TEXT column into SSMS for execution, and you could modify the WHERE clause to suit your objectives:
    SELECT C.TABLE_NAME, C.ORDINAL_POSITION, C.COLUMN_NAME, T.SQL_TEXT
    FROM INFORMATION_SCHEMA.COLUMNS AS C
        CROSS APPLY (
                    VALUES ('SELECT COUNT(DISTINCT ' + C.COLUMN_NAME + ' AS ' + C.COLUMN_NAME + ' FROM ' + C.TABLE_NAME + ';')
                    ) AS T (SQL_TEXT)
    WHERE C.TABLE_SCHEMA = 'dbo'
    ORDER BY C.TABLE_NAME, C.ORDINAL_POSITION;

    Steve (aka sgmunson) 🙂 🙂 🙂
    Rent Servers for Income (picks and shovels strategy)

  • I made the same solution.
    1st load all my necessary tables and columns
    WITH
         DimTables AS (
             SELECT
                 t.[name]     AS TableName
                 ,s.[name]     AS SchemaName
                 ,t.[Object_ID] AS TableID
             FROM
                 DWH_Staging.sys.tables t
                 JOIN DWH_Staging.sys.schemas s
                    ON s.[schema_id] = t.[schema_id]
             WHERE
                 t.[schema_id] = 6                     -- Only for dim. now
                 AND t.[temporal_type] = 2             -- Only non-temporal tables
         )
         ,DimColumns AS (
             SELECT
                 t.SchemaName     AS SchemaName
                 ,t.TableName     AS TableName
                 ,c.[name]     AS ColumnName
             FROM
                 DWH_Staging.sys.columns c
                 JOIN DimTables t ON t.TableID = c.[object_id]
             WHERE c.[name] NOT IN ('BeginDate','EndDate','Created','Modified','CountryID','ActiveID','SourceID','ValidFromID','ValidToID','Domain')
         )
      -- Load to the temp table for evaluating
      SELECT
         c.SchemaName  AS SchemaName
         ,c.TableName  AS TableName
         ,c.ColumnName AS ColumnName
      INTO #Columns
      FROM DimColumns c
      ORDER BY c.TableName

    Then simple while and....
            SET @sql = 'SELECT DISTINCT ' + @columnName + ' INTO #DistinctCounts FROM [DWH_Staging].[' + @schemaName + '].[' + @tableName + ']'
             EXEC (@sql)

    But it takes 20second.....works but slow 🙂

  • 20 Seconds to count ALL the table's columns number of distinct values ?   You might not have all that much data, or just not that many overall columns.   I would have been happy if it ran in half an hour.  That could be a LOT of work for the server to do.   That said, however, I'm pretty sure it's not going to go any faster.

    Steve (aka sgmunson) 🙂 🙂 🙂
    Rent Servers for Income (picks and shovels strategy)

  • povamartin - Friday, September 22, 2017 8:54 AM

    Then simple while and....
            SET @sql = 'SELECT DISTINCT ' + @columnName + ' INTO #DistinctCounts FROM [DWH_Staging].[' + @schemaName + '].[' + @tableName + ']'
             EXEC (@sql)

    But it takes 20second.....works but slow 🙂

    Just realized that code will run only for the first column and then every other column it will fail because the temp table already exists so SELECT INTO doesn't work.   You could create the temp table as the first part of the @sql variable and then add the remaining statements, possibly including UNION ALL, as a single insert.

    Steve (aka sgmunson) 🙂 🙂 🙂
    Rent Servers for Income (picks and shovels strategy)

  • How about we try the following:
    WITH DimTables AS (

        SELECT t.[name]        AS TableName
            , s.[name]        AS SchemaName
            , t.[Object_ID]    AS TableID
        FROM DWH_Staging.sys.tables AS t
            INNER JOIN DWH_Staging.sys.schemas AS s
                ON s.[schema_id] = t.[schema_id]
        WHERE t.[schema_id] = 6            -- Only for dim. now
            AND t.[temporal_type] = 2    -- Only non-temporal tables
    ),
        DimColumns AS (

            SELECT t.SchemaName    AS SchemaName
                , t.TableName    AS TableName
                , c.[name]        AS ColumnName
            FROM DWH_Staging.sys.columns AS c
                INNER JOIN DimTables AS t
                    ON t.TableID = c.[object_id]
            WHERE c.[name] NOT IN ('BeginDate','EndDate','Created','Modified','CountryID','ActiveID','SourceID','ValidFromID','ValidToID','Domain')
    )
    -- Load to the temp table for evaluating
    SELECT c.SchemaName    AS SchemaName
        , c.TableName    AS TableName
        , c.ColumnName    AS ColumnName
    INTO #Columns
    FROM DimColumns AS c
    ORDER BY c.TableName;

    DECLARE @SQL AS varchar(max) =
        'CREATE TABLE ##Results (ID int IDENTITY(1,1) NOT NULL PRIMARY KEY CLUSTERED, TableName varchar(128), ColumnName varchar(128), UniqueValues bigint);
    INSERT INTO ##Results (TableName, ColumnName, UniqueValues)
    ';

    SELECT @SQL = @SQL + 'SELECT ''' + C.TableName + ''' AS TableName, ''' + C.ColumnName + ''' COUNT(DISTINCT ' + C.ColumnName + ') AS UniqueValues
    FROM ' + C.SchemaName + '.' + C.TableName +
        CASE WHEN LEAD(C.TableName, 1, NULL) OVER (ORDER BY C.TableName, C.ColumnName) IS NOT NULL THEN ' UNION ALL ' ELSE '; SELECT * FROM ##Results;' END
    FROM #Columns AS C
    ORDER BY C.TableName, C.ColumnName;

    --PRINT @SQL;
    EXEC (@SQL);

    Let me know what results that produces.

    Steve (aka sgmunson) 🙂 🙂 🙂
    Rent Servers for Income (picks and shovels strategy)

  • sgmunson - Friday, September 22, 2017 10:43 AM

    povamartin - Friday, September 22, 2017 8:54 AM

    Then simple while and....
            SET @sql = 'SELECT DISTINCT ' + @columnName + ' INTO #DistinctCounts FROM [DWH_Staging].[' + @schemaName + '].[' + @tableName + ']'
             EXEC (@sql)

    But it takes 20second.....works but slow 🙂

    Just realized that code will run only for the first column and then every other column it will fail because the temp table already exists so SELECT INTO doesn't work.   You could create the temp table as the first part of the @sql variable and then add the remaining statements, possibly including UNION ALL, as a single insert.

    I truncate the temp table at the start of cycle and that UPDATE number of unique values for specific Table and Column in other temp table, so the result work into #DistinctCount table always. We have a few tables now, cca 30 fact tables and 40 dimensions, about milion of rows of data in the fact tables, the whole DB is now about 70GB. But it will grow up lately by adding new countries (now is this DWH made only for Czech Republic).

  • Complete procedure below.
    But I will try to concat the SQL statement a run it in one EXEC, we will see.


    ALTER PROCEDURE [ctr].[ColumnVariabilityCheck]
        @withOutput        INT = 1
        ,@onlyErrors    INT = 1

    AS

    BEGIN

      --------------------------------------------------------------------------
      -- Temp tables cleaning
      --------------------------------------------------------------------------
      IF OBJECT_ID ('tempdb..#Columns') IS NOT NULL
         DROP TABLE #Columns

      --------------------------------------------------------------------------
      -- Variables
      --------------------------------------------------------------------------
      DECLARE @tableName                    VARCHAR(50)
      DECLARE @columnName                    VARCHAR(50)
      DECLARE @schemaName                    VARCHAR(3)
      DECLARE @sql                        VARCHAR(1000)
      DECLARE @MinimalVariabilityLimit    INT = 2            -- (-1) for Unknown and 1 for other value

      --------------------------------------------------------------------------
      -- Load all columns and tables in specific schema
      --------------------------------------------------------------------------
      ;
      WITH
         DimTables AS (
             SELECT
                 t.[name]            AS TableName
                 ,s.[name]            AS SchemaName
                 ,t.[Object_ID]        AS TableID
             FROM
                 DWH_Staging.sys.tables t
                 JOIN DWH_Staging.sys.schemas s
                    ON s.[schema_id] = t.[schema_id]
             WHERE
                 t.[schema_id] = 6                            -- Only for dim. now
                 AND t.[temporal_type] = 2                    -- Only non-temporal tables
         )
         ,DimColumns AS (
             SELECT
                 t.SchemaName        AS SchemaName
                 ,t.TableName        AS TableName
                 ,c.[name]            AS ColumnName
             FROM
                 DWH_Staging.sys.columns c
                 JOIN DimTables t
                    ON t.TableID = c.[object_id]
             WHERE c.[name] NOT IN (
                'BeginDate','EndDate'
                ,'Created','Modified'
                ,'CountryID','ActiveID','SourceID'
                ,'ValidFromID','ValidToID'
                ,'Domain'
             )
         )
      -- Load to the temp table for evaluating
      SELECT
         c.SchemaName        AS SchemaName
         ,c.TableName        AS TableName
         ,c.ColumnName    AS ColumnName
         ,NULL          AS Executed        -- Filled when EXEC is made
         ,NULL          AS UniqueRows    -- COUNT(DISTINCT)
      INTO #Columns
      FROM DimColumns c
      ORDER BY c.TableName

      --------------------------------------------------------------------------
      -- While cycle to check all columns
      --------------------------------------------------------------------------
        -- Duration 20 second, not bad but needs to improve that
      WHILE (SELECT COUNT(*) FROM #Columns WHERE Executed IS NULL) > 0
            BEGIN

                -- Clear previous results
                IF OBJECT_ID ('tempdb..#DistinctCounts') IS NOT NULL
                    DROP TABLE #DistinctCounts

                -- Load column
                SELECT TOP 1
                    @tableName        = TableName
                    ,@columnName    = ColumnName
                    ,@schemaName    = SchemaName
                FROM #Columns
                WHERE Executed IS NULL

                -- SQL statement prepare and execution
                -- It is inserted to temp table because for no results on screen during all control scripts
                SET @sql = 'SELECT DISTINCT ' + @columnName + ' INTO #DistinctCounts FROM [DWH_Staging].[' + @schemaName + '].[' + @tableName + ']'
                EXEC (@sql)

                -- Actualize executions
                UPDATE #Columns
                    SET
                        Executed    = 1    
                        ,UniqueRows = @@ROWCOUNT
                WHERE
                    TableName = @tableName
                    AND ColumnName = @columnName
                    AND SchemaName = @schemaName
            END

      --------------------------------------------------------------------------
      -- Output results
      --------------------------------------------------------------------------
        IF (@withOutput = 1)
            BEGIN
                SELECT
                    c.SchemaName
                    ,c.TableName
                    ,c.ColumnName
                    ,'SELECT ' + c.ColumnName + ' FROM [DWH_Staging].[' + c.SchemaName + '].[' + c.TableName + ']' AS SelectStatement
                    ,c.UniqueRows
                FROM #Columns c
                WHERE
                    ((@onlyErrors = 1) AND (c.UniqueRows < @MinimalVariabilityLimit))
                    OR (@onlyErrors = 0)
            END

      RETURN
    END

  • Did you even try my posted code?   Given what's being done, I'm not sure the presence of the loop vs. my set-based code will have a difference in execution time or not, but it seems unnecessary to invoke a WHILE loop.

    Steve (aka sgmunson) 🙂 🙂 🙂
    Rent Servers for Income (picks and shovels strategy)

  • sgmunson - Monday, September 25, 2017 6:52 AM

    Did you even try my posted code?   Given what's being done, I'm not sure the presence of the loop vs. my set-based code will have a difference in execution time or not, but it seems unnecessary to invoke a WHILE loop.

    I need to rewrite a little bit your code, it ends with error

    Msg 120, Level 15, State 1, Line 2
    The select list for the INSERT statement contains fewer items than the insert list. The number of SELECT values must match the number of INSERT columns.

    But I agree with you, the LOOP is very bad solution.

  • povamartin - Monday, September 25, 2017 7:20 AM

    sgmunson - Monday, September 25, 2017 6:52 AM

    Did you even try my posted code?   Given what's being done, I'm not sure the presence of the loop vs. my set-based code will have a difference in execution time or not, but it seems unnecessary to invoke a WHILE loop.

    I need to rewrite a little bit your code, it ends with error

    Msg 120, Level 15, State 1, Line 2
    The select list for the INSERT statement contains fewer items than the insert list. The number of SELECT values must match the number of INSERT columns.

    But I agree with you, the LOOP is very bad solution.

    Yep, I see the error... one stinkin' lousy comma...
    Here's the fix:
    WITH DimTables AS (

      SELECT t.[name]   AS TableName
       , s.[name]   AS SchemaName
       , t.[Object_ID]  AS TableID
      FROM DWH_Staging.sys.tables AS t
       INNER JOIN DWH_Staging.sys.schemas AS s
        ON s.[schema_id] = t.[schema_id]
      WHERE t.[schema_id] = 6    -- Only for dim. now
       AND t.[temporal_type] = 2  -- Only non-temporal tables
    ),
      DimColumns AS (

       SELECT t.SchemaName  AS SchemaName
        , t.TableName  AS TableName
        , c.[name]   AS ColumnName
       FROM DWH_Staging.sys.columns AS c
        INNER JOIN DimTables AS t
          ON t.TableID = c.[object_id]
       WHERE c.[name] NOT IN ('BeginDate','EndDate','Created','Modified','CountryID','ActiveID','SourceID','ValidFromID','ValidToID','Domain')
    )
    -- Load to the temp table for evaluating
    SELECT c.SchemaName  AS SchemaName
      , c.TableName  AS TableName
      , c.ColumnName  AS ColumnName
    INTO #Columns
    FROM DimColumns AS c
    ORDER BY c.TableName;

    DECLARE @SQL AS varchar(max) =
      'CREATE TABLE ##Results (ID int IDENTITY(1,1) NOT NULL PRIMARY KEY CLUSTERED, TableName varchar(128), ColumnName varchar(128), UniqueValues bigint);
    INSERT INTO ##Results (TableName, ColumnName, UniqueValues)
    ';

    SELECT @SQL = @SQL + 'SELECT ''' + C.TableName + ''' AS TableName, ''' + C.ColumnName + ''' AS ColumnName, COUNT(DISTINCT ' + C.ColumnName + ') AS UniqueValues
    FROM ' + C.SchemaName + '.' + C.TableName +
      CASE WHEN LEAD(C.TableName, 1, NULL) OVER (ORDER BY C.TableName, C.ColumnName) IS NOT NULL THEN ' UNION ALL ' ELSE '; SELECT * FROM ##Results;' END
    FROM #Columns AS C
    ORDER BY C.TableName, C.ColumnName;

    --PRINT @SQL;
    EXEC (@SQL);

    Steve (aka sgmunson) 🙂 🙂 🙂
    Rent Servers for Income (picks and shovels strategy)

  • There was also problem, that some columns were names as USER so I need to do it like this  [ + ColumnName ]
    And the results are: 23s,17s, 16s, 16s, 20s, 17s
    probably your solution is quickier, co I will do it this way. We will see after some time and data increasing. Thank you a lot 🙂
    Wish you a good day from Czech Republic.

  • povamartin - Monday, September 25, 2017 8:51 AM

    There was also problem, that some columns were names as USER so I need to do it like this  [ + ColumnName ]
    And the results are: 23s,17s, 16s, 16s, 20s, 17s
    probably your solution is quickier, co I will do it this way. We will see after some time and data increasing. Thank you a lot 🙂
    Wish you a good day from Czech Republic.

    Okay, sounds good.   Glad I could help.

    Steve (aka sgmunson) 🙂 🙂 🙂
    Rent Servers for Income (picks and shovels strategy)

Viewing 13 posts - 1 through 13 (of 13 total)

You must be logged in to reply to this topic. Login to reply