Efficient way to get all unique values and their counts for a dynamic list of fields using one query statement

  • I am looking for an efficient way of getting a count of unique values for many fields from a table, hopefully in one EFFICIENT query.

    I am successfully getting the data I need right now in my prototype; however, it is very inefficient. I am processing millions of rows, and it takes about 1 minute to run all of the individual queries. I am sure there is a better way to accomplish what I am doing, but I am not sure how best to tackle this.

    Here is what I am doing now.

    I have a table that contains a list of approx. 20 different fields that I need to get counts of unique values for. Let’s just look at 3 of them.

    So my “Driver” table has the list of fields, let’s say they are:

    Region

    State

    City

    I extract those fields one by one from the “Driver table” using a cursor (I know, performance stinks) and then gather the totals, inserting them into a temp table using a dynamic SQL statement within a stored procedure. So essentially it looks like this:

    Select FieldName from DriverTable using cursor

    For each FieldName:

        Select <Field from Driver Table> as 'Field Name', <Field from Driver Table>, count(*)
        From Customer
        Where <Field from Driver Table> is not null
          and status = 'Active'
          and date = curdate()
        Group by <Field from Driver Table>

    Get next FieldName from Driver Table
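
    For concreteness, a rough, runnable T-SQL rendering of that loop might look like the sketch below. DriverTable, FieldName, Customer, status, and date come from the description above; the temp table, the cursor name, and the column sizes are illustrative assumptions, and curdate() is rendered as CAST(GETDATE() AS DATE).

    SET NOCOUNT ON;

    IF OBJECT_ID('tempdb..#Counts', 'U') IS NOT NULL
        DROP TABLE #Counts;

    --===== Assumed receiving table; names and sizes are illustrative.
    CREATE TABLE #Counts (FieldName SYSNAME, FieldValue VARCHAR(100), Total INT);

    DECLARE @FieldName SYSNAME, @sql VARCHAR(8000);

    DECLARE field_cur CURSOR LOCAL FAST_FORWARD FOR
        SELECT FieldName FROM dbo.DriverTable;

    OPEN field_cur;
    FETCH NEXT FROM field_cur INTO @FieldName;

    WHILE @@FETCH_STATUS = 0
    BEGIN
        --===== One dynamic GROUP BY per driver field; this per-field looping is the slow part.
        SET @sql = 'INSERT #Counts (FieldName, FieldValue, Total)
            SELECT ''' + @FieldName + ''', ' + QUOTENAME(@FieldName) + ', COUNT(*)
            FROM dbo.Customer
            WHERE ' + QUOTENAME(@FieldName) + ' IS NOT NULL
              AND status = ''Active''
              AND [date] = CAST(GETDATE() AS DATE) --T-SQL stand-in for curdate()
            GROUP BY ' + QUOTENAME(@FieldName) + ';';
        EXEC (@sql);
        FETCH NEXT FROM field_cur INTO @FieldName;
    END;

    CLOSE field_cur;
    DEALLOCATE field_cur;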

    So I end up with a temp table that has all of the unique values for:

    Region and their counts

    State and their counts

    City and their counts

    Etc.

    As you can imagine, looping through the cursor and running each of these individual queries is slow.

    Is there a better way?

  • Tim Kovacich (12/17/2015)


    I am looking for an efficient way of getting a count of unique values for many fields from a table, hopefully in one EFFICIENT query. [...]

    Does your "driver table" change on a regular basis? If not, you're making it WAY harder than it needs to be. Something as simple as the following will give you what you're after...

    SELECT
        Region = COUNT(DISTINCT c.Region),
        State  = COUNT(DISTINCT c.State),
        City   = COUNT(DISTINCT c.City)
    FROM Customer c;

    If the required columns are changing on a regular basis and this is something that needs to be done on an automated basis, then the above query can be built on the fly using dynamic SQL.

    Just note that you can build the entire query and simply execute it once, rather than one column at a time...

    SET NOCOUNT ON;

    IF OBJECT_ID('tempdb..#DriverTable', 'U') IS NOT NULL
        DROP TABLE #DriverTable;

    CREATE TABLE #DriverTable (
        ID INT,
        SchemaName SYSNAME,
        TableName SYSNAME,
        ColumnName SYSNAME
        );

    INSERT #DriverTable (ID, SchemaName, TableName, ColumnName) VALUES
        (1, 'dbo', 'Customer', 'Region'),
        (2, 'dbo', 'Customer', 'State'),
        (3, 'dbo', 'Customer', 'City');

    DECLARE @sql VARCHAR(8000) = 'SELECT' + CHAR(10);

    SELECT @sql = @sql +
        STUFF((
            SELECT ',' + dt.ColumnName + ' = COUNT(DISTINCT ' + dt.TableName + '.' + dt.ColumnName + ')' + CHAR(10) --note the closing paren on the COUNT
            FROM #DriverTable dt
            ORDER BY dt.ID
            FOR XML PATH ('')
            ), 1, 1, '');

    SELECT @sql = @sql + 'FROM ' + MIN(dt.SchemaName) + '.' + MIN(dt.TableName)
    FROM #DriverTable dt;

    PRINT @sql;
    --EXEC (@sql);

    Printed dynamic SQL...

    SELECT
    Region = COUNT(DISTINCT Customer.Region)
    ,State = COUNT(DISTINCT Customer.State)
    ,City = COUNT(DISTINCT Customer.City)
    FROM dbo.Customer

  • Thank you for the quick turnaround. Sorry, I was not clear.

    The results I get with this suggestion are:

    Region  State  City
    ------  -----  ----
    6       3      6

    What I am after is more like this:

    From the Region Query, the expected result would look something like:

    Field value  Count
    -----------  -----
    SouthWest    8
    North        9
    South        10

    From the State Query, the expected result would look something like:

    Field value  Count
    -----------  -----
    New York     60
    Texas        4
    Arizona      10

    etc.
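
    A static query that produces this shape for the three example fields might look like the sketch below; dbo.Customer, the status filter, and the field names are taken from the posts above, and a dynamic version would simply generate one such branch per driver row:

    SELECT 'Region' AS FieldName, Region AS FieldValue, COUNT(*) AS [Count]
    FROM dbo.Customer
    WHERE Region IS NOT NULL AND status = 'Active'
    GROUP BY Region

    UNION ALL

    SELECT 'State', [State], COUNT(*)
    FROM dbo.Customer
    WHERE [State] IS NOT NULL AND status = 'Active'
    GROUP BY [State]

    UNION ALL

    SELECT 'City', City, COUNT(*)
    FROM dbo.Customer
    WHERE City IS NOT NULL AND status = 'Active'
    GROUP BY City;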

    Also, to answer your question, yes, the field names are dynamic.

    Thanks

  • I'd create indexes on all the columns you need to count that way. Then the COUNT query can use the existing indexes to give you very quick totals.

    If you don't want to do that, I'd probably just cursor once through every row in the table, getting all the counts at once, dynamically based on the control table contents.
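
    One set-based way to get all of the per-column value counts in a single statement (not discussed in this thread, but it pairs naturally with the indexing idea) is GROUPING SETS; here is a minimal sketch, assuming the dbo.Customer table and columns from the original post:

    --===== Each grouping set is computed independently, so a nonclustered index
         -- on each listed column can satisfy its set with an index scan.
    SELECT Region, [State], City, COUNT(*) AS [Count]
    FROM dbo.Customer
    WHERE status = 'Active'
    GROUP BY GROUPING SETS ((Region), ([State]), (City));
    --===== In the output, rows from the (Region) set carry each Region value and its
         -- count, with NULLs in the other columns; use GROUPING() to tell those
         -- placeholder NULLs apart from real NULL data values.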

    SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

  • Something like this is pretty easy if you know about a personal friend of mine... the SQL_Variant datatype. It can be a bit of a fair-weather friend, so read up on it before you use it in other places.

    Before we start, we need a typical million row table to test with. I only made 6 columns but it'll give you a feel for what's possible. Not to worry... it only takes 3 seconds to build on my humble laptop.

    --=======================================================================================
    -- Create and populate a million row test table with random data.
    -- This is NOT a part of the solution. It's just a table to demo with.
    --=======================================================================================
    --===== Do this in a nice, safe place that everyone has.
    USE tempdb;

    --===== If the test table exists, drop it to make reruns in SSMS easier.
    IF OBJECT_ID('tempdb.dbo.JBMTest','U') IS NOT NULL
        DROP TABLE dbo.JBMTest;

    --===== Create and populate the table on-the-fly.
    SELECT TOP 1000000
            SomeID = IDENTITY(INT,1,1)
           ,ColA   = CHAR(ABS(CHECKSUM(NEWID()))% 5+65)+CHAR(ABS(CHECKSUM(NEWID()))%5+65)
           ,ColB   = CHAR(ABS(CHECKSUM(NEWID()))%10+65)+CHAR(ABS(CHECKSUM(NEWID()))%5+65)
           ,ColC   = CHAR(ABS(CHECKSUM(NEWID()))% 7+65)+CHAR(ABS(CHECKSUM(NEWID()))%5+65)
           ,ColD   = ABS(CHECKSUM(NEWID()))% 6+1
           ,ColE   = CAST(ABS(CHECKSUM(NEWID()))%30+10500 AS CHAR(5)) --ZipCode simulation
      INTO dbo.JBMTest
      FROM sys.all_columns ac1
     CROSS JOIN sys.all_columns ac2;

    --===== Let's see the datatypes for the table.
         -- It'll handle a lot more than just CHAR and INT, though.
    EXEC sp_help 'dbo.JBMTest';

    Now comes the fun part. Basically, we'll create a dynamic "unpivot" from the column names of the table (sans those columns with the wrong datatypes and attributes for this), using a CROSS APPLY to form a nice, healthy NVP (Name Value Pair) set, and then simply do a normal count by ColumnName (Attribute) and Value.

    Most of the details are in the comments in the code below. This only takes about 6 seconds on my laptop.

    --=======================================================================================
    -- Now we get to a solution.
    --=======================================================================================
    --===== Declare some obviously named variables.
         -- The first one could be a parameter for a stored procedure.
    DECLARE  @TableName SYSNAME = N'[dbo].[JBMTest]'
            ,@SQL   VARCHAR(MAX)
            ,@SQLCA VARCHAR(MAX);

    --===== Create the CROSS APPLY code from the non-Identity, non-Blob columns.
         -- The CROSS APPLY will be used to "unpivot" the data.
         -- Order is NOT important here.
    SELECT @SQLCA = ISNULL(@SQLCA + ' UNION ALL','')
         + REPLACE(REPLACE('
            SELECT "<<name>>",CAST(<<name>> AS SQL_VARIANT)'
           ,'"','''')
           ,'<<name>>',QUOTENAME(name))
      FROM sys.columns
     WHERE object_id = OBJECT_ID(@TableName)
       AND is_identity = 0                  --No IDENTITY columns allowed. Makes no sense.
       AND is_rowguidcol = 0                --No ROWGUID columns either.
       AND max_length BETWEEN 1 AND 8000    --Rejects blobs and SQL_Variant
       AND system_type_id NOT IN            --=== Reject other incompatible types
           (
            SELECT system_type_id
              FROM sys.types
             WHERE name IN ('hierarchyid','geometry','geography','timestamp')
           );

    --===== Add the CROSS APPLY code to the rest of the code.
    SELECT @SQL = '
    SELECT ca.Attribute, ca.Value, Occurrences = COUNT(*)
      FROM ' + @TableName + '
     CROSS APPLY (' + @SQLCA + '
                 ) ca (Attribute, Value)
     GROUP BY ca.Attribute, ca.Value
     ORDER BY ca.Attribute, ca.Value;';

    --===== Print the code and then execute it.
    PRINT @SQL;
    EXEC (@SQL);

    Let me know if that's even close to what you need.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Hi Jeff,

    I just wanted to let you know I love your personal friend. The idea you sent was an excellent suggestion. I went from my original post, which was taking approx. 1 minute to run, to your version, which takes approx. 9 seconds. In addition, my one-minute version was tested using approx. 10 fields, and your model covered the majority of the table (minus the inappropriate data types), so approx. 150 fields. Just a huge performance increase!! Thank you very much!!

  • Tim Kovacich (1/7/2016)


    I just wanted to let you know I love your personal friend. The idea you sent was an excellent suggestion. [...]

    Very glad I could help, and thank you very much for the feedback. I love it when people post back performance stats, whether they're good or bad, because I ran a test on test data and you ran a test on a real-life situation. Your feedback has helped ME tremendously and that, of course, will be used to help others.

    Thanks again, Tim.

    --Jeff Moden
