December 17, 2015 at 9:18 pm
I am looking for an efficient way of getting a count of unique values for many fields from a table, hopefully in one EFFICIENT query.
I am successfully getting the data I need right now in my prototype; however, it is very inefficient. I am processing millions of rows, and it takes about 1 minute to run all of the individual queries. I am sure there is a better way to accomplish what I am doing, but I am not sure how best to tackle this.
Here is what I am doing now.
I have a table that contains a list of approx. 20 different fields that I need to get counts of unique values. Let’s just look at 3 of them.
So my “Driver” table has the list of fields, let’s say they are:
Region
State
City
I extract those fields one by one from the “Driver” table using a cursor (I know, performance stinks) and then gather the totals, inserting them into a temp table using a dynamic SQL statement within a stored procedure. So essentially it looks like this (pseudocode):
Select FieldName from DriverTable using a cursor, then for each FieldName:
    Select '<Field from Driver Table>' as 'Field Name', <Field from Driver Table>, count(*)
    From Customer
    Where <Field from Driver Table> is not null and status = 'Active'
      and date = curdate()
    Group by <Field from Driver Table>
Get next FieldName from the Driver Table
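A simplified, runnable sketch of that loop (the #Results temp table, QUOTENAME usage, and GETDATE() in place of curdate() are stand-ins here, not the actual procedure):
-- Sketch of the current cursor + dynamic SQL approach (the slow one).
-- Assumes a DriverTable(FieldName) list, a Customer(status, date) table,
-- and a #Results temp table created beforehand.
DECLARE @FieldName SYSNAME,
        @sql       NVARCHAR(MAX);

DECLARE FieldCursor CURSOR LOCAL FAST_FORWARD FOR
    SELECT FieldName FROM dbo.DriverTable;

OPEN FieldCursor;
FETCH NEXT FROM FieldCursor INTO @FieldName;

WHILE @@FETCH_STATUS = 0
BEGIN
    SET @sql = N'
        INSERT INTO #Results (FieldName, FieldValue, ValueCount)
        SELECT ''' + @FieldName + N''', ' + QUOTENAME(@FieldName) + N', COUNT(*)
        FROM dbo.Customer
        WHERE ' + QUOTENAME(@FieldName) + N' IS NOT NULL
          AND status = ''Active''
          AND [date] = CAST(GETDATE() AS DATE)
        GROUP BY ' + QUOTENAME(@FieldName) + N';';

    EXEC (@sql);  -- one full pass over Customer per field

    FETCH NEXT FROM FieldCursor INTO @FieldName;
END;

CLOSE FieldCursor;
DEALLOCATE FieldCursor;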
So I end up with a temp table that has all of the unique values for:
Region and their counts
State and their counts
City and their counts
Etc.
As you can imagine, looping through the cursor and running each of these individual queries is slow.
Is there a better way?
December 17, 2015 at 10:09 pm
Tim Kovacich (12/17/2015)
I am looking for an efficient way of getting a count of unique values for many fields from a table hopefully in one EFFICIENT query. [...] As you can imagine, looping through the cursor and running each of these individual queries is slow. Is there a better way?
Does your "driver table" change on a regular basis? If not, you're making it WAY harder than it need to be. Something as simple as the following will give you what you're after...
SELECT
    Region = COUNT(DISTINCT c.Region),
    State  = COUNT(DISTINCT c.State),
    City   = COUNT(DISTINCT c.City)
FROM
    Customer c
If the required columns are changing on a regular basis and this is something that needs to be done on an automated basis, then the above query can be built on the fly using dynamic SQL.
Just note that you can build the entire query and simply execute it once, rather than one column at a time...
SET NOCOUNT ON;

IF OBJECT_ID('tempdb..#DriverTable', 'U') IS NOT NULL
    DROP TABLE #DriverTable;

CREATE TABLE #DriverTable (
    ID INT,
    SchemaName SYSNAME,
    TableName SYSNAME,
    ColumnName SYSNAME
    );

INSERT #DriverTable (ID, SchemaName, TableName, ColumnName) VALUES
    (1, 'dbo', 'Customer', 'Region'), (2, 'dbo', 'Customer', 'State'), (3, 'dbo', 'Customer', 'City');

DECLARE @sql VARCHAR(8000) = 'SELECT
';

SELECT @sql = @sql +
    STUFF (
        (SELECT
            ',' + dt.ColumnName + ' = COUNT(DISTINCT ' + dt.TableName + '.' + dt.ColumnName + ')' + CHAR(10)
        FROM
            #DriverTable dt
        ORDER BY
            dt.ID
        FOR XML PATH ('')), 1, 1, '');

SELECT @sql = @sql + 'FROM ' + MIN(dt.SchemaName) + '.' + MIN(dt.TableName)
FROM #DriverTable dt;

PRINT @sql;
--EXEC (@sql);
Printed dynamic SQL...
SELECT
Region = COUNT(DISTINCT Customer.Region)
,State = COUNT(DISTINCT Customer.State)
,City = COUNT(DISTINCT Customer.City)
FROM dbo.Customer
December 18, 2015 at 9:23 am
Thank you for the quick turnaround. Sorry, I was not clear.
The results I get with this suggestion are:
Region  State  City
6       3      6
What I am after is more like this:
From the Region query, the expected result would look something like:
Field value    Count
SouthWest      8
North          9
South          10
From the State query, the expected result would look something like:
Field value    Count
New York       60
Texas          4
Arizona        10
etc.
Also, to answer your question: yes, the field names are dynamic.
Thanks
December 21, 2015 at 12:23 pm
I'd create indexes on all the columns you need to count that way. Then the COUNT query can use the existing indexes to give you very quick totals.
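For example, something along these lines (the index names are made up; the table and column names are the ones from the earlier posts):
-- Hypothetical supporting indexes, one per counted column.
-- Adjust, or add INCLUDE/filter clauses for the status/date predicates,
-- to match the real table.
CREATE NONCLUSTERED INDEX IX_Customer_Region ON dbo.Customer (Region);
CREATE NONCLUSTERED INDEX IX_Customer_State  ON dbo.Customer ([State]);
CREATE NONCLUSTERED INDEX IX_Customer_City   ON dbo.Customer (City);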
If you don't want to do that, I'd probably just cursor once through every row in the table, getting all the counts at once dynamically based on the control table contents.
SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".
December 21, 2015 at 10:20 pm
Something like this is pretty easy if you know about a personal friend of mine... the SQL_Variant datatype. It can be a bit of a fair weather friend so read up on it before you use it in other places.
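As a quick aside (not part of the solution below), here's a minimal sketch of why SQL_Variant works for this: one SQL_Variant column can hold values from differently typed columns while still remembering each value's underlying type.
-- Minimal SQL_Variant illustration.
-- A SQL_VARIANT variable can hold an INT, a CHAR, etc., and
-- SQL_VARIANT_PROPERTY reports the underlying type of what's stored.
DECLARE @v SQL_VARIANT;

SET @v = CAST(42 AS INT);
SELECT @v AS StoredValue, SQL_VARIANT_PROPERTY(@v,'BaseType') AS BaseType; -- int

SET @v = CAST('AB' AS CHAR(2));
SELECT @v AS StoredValue, SQL_VARIANT_PROPERTY(@v,'BaseType') AS BaseType; -- char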
Before we start, we need a typical million row table to test with. I only made 6 columns but it'll give you a feel for what's possible. Not to worry... it only takes 3 seconds to build on my humble laptop.
--=======================================================================================
-- Create and populate a million row test table with random data
-- This is NOT a part of the solution. It's just a table to demo with.
--=======================================================================================
--===== Do this in a nice, safe place that everyone has.
USE tempdb
;
--===== If the test table exists, drop it to make reruns in SSMS easier.
IF OBJECT_ID('tempdb.dbo.JBMTest','U') IS NOT NULL
DROP TABLE dbo.JBMTest
;
--===== Create and populate the table on-the-fly.
 SELECT TOP 1000000
         SomeID = IDENTITY(INT,1,1)
        ,ColA = CHAR(ABS(CHECKSUM(NEWID()))%5+65)+CHAR(ABS(CHECKSUM(NEWID()))%5+65)
        ,ColB = CHAR(ABS(CHECKSUM(NEWID()))%10+65)+CHAR(ABS(CHECKSUM(NEWID()))%5+65)
        ,ColC = CHAR(ABS(CHECKSUM(NEWID()))%7+65)+CHAR(ABS(CHECKSUM(NEWID()))%5+65)
        ,ColD = ABS(CHECKSUM(NEWID()))%6+1
        ,ColE = CAST(ABS(CHECKSUM(NEWID()))%30+10500 AS CHAR(5)) --ZipCode Simulation
   INTO dbo.JBMTest
   FROM sys.all_columns ac1
  CROSS JOIN sys.all_columns ac2
;
--===== Let's see the datatypes for the table.
-- It'll handle a lot more than just CHAR and INT, though.
EXEC sp_help 'dbo.JBMTest'
;
Now comes the fun part. Basically, we'll create a dynamic "unpivot" using the column names from the table (sans those columns with the wrong datatypes and attributes for this) using a CROSS APPLY to form a nice, healthy, NVP (Name Value Pair) table and then simply do a normal count by ColumnName (attribute) and value.
Most of the details are in the comments in the code below. This only takes about 6 seconds on my laptop.
--=======================================================================================
-- Now we get to a solution.
--=======================================================================================
--===== Declare some obviously named variables.
-- The first one could be a parameter for a stored procedure.
DECLARE @TableName SYSNAME = N'[dbo].[JBMTest]'
,@SQL VARCHAR(MAX)
,@SQLCA VARCHAR(MAX)
;
--===== Create the CROSS APPLY code from the non-Identity, non-Blob columns.
-- The CROSS APPLY will be used to "unpivot" the data.
-- Order is NOT important here.
SELECT @SQLCA = ISNULL(@SQLCA + ' UNION ALL','')
+ REPLACE(REPLACE('
SELECT "<<name>>",CAST(<<name>> AS SQL_VARIANT)'
,'"','''')
,'<<name>>',QUOTENAME(name))
FROM sys.columns
WHERE object_id = (OBJECT_ID(@TableName))
AND is_identity = 0 --No IDENTITY columns allowed. Makes no sense.
AND is_rowguidcol = 0 --No ROWGUID columns either.
AND max_length BETWEEN 1 AND 8000 --Rejects blobs and SQL_Variant
AND system_type_id NOT IN ( --=== Reject other incompatible types
SELECT system_type_id
FROM sys.types
WHERE name IN ('hierarchyid','geometry','geography','timestamp')
)
;
--===== Add the CROSS APPLY code to the rest of the code
SELECT @SQL = '
SELECT ca.Attribute, ca.Value, Occurrences = COUNT(*)
FROM ' + @TableName + '
CROSS APPLY ('
+ @SQLCA + '
) ca (Attribute, Value)
GROUP BY ca.Attribute,ca.Value
ORDER BY ca.Attribute,ca.Value
;'
;
--===== Print the code and then execute it
PRINT @SQL;
EXEC (@SQL)
;
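For reference, with the JBMTest columns above, the printed dynamic SQL should come out looking roughly like this (SomeID drops out because it's the IDENTITY column; exact whitespace and column order may vary):
SELECT ca.Attribute, ca.Value, Occurrences = COUNT(*)
FROM [dbo].[JBMTest]
CROSS APPLY (
SELECT '[ColA]',CAST([ColA] AS SQL_VARIANT) UNION ALL
SELECT '[ColB]',CAST([ColB] AS SQL_VARIANT) UNION ALL
SELECT '[ColC]',CAST([ColC] AS SQL_VARIANT) UNION ALL
SELECT '[ColD]',CAST([ColD] AS SQL_VARIANT) UNION ALL
SELECT '[ColE]',CAST([ColE] AS SQL_VARIANT)
) ca (Attribute, Value)
GROUP BY ca.Attribute,ca.Value
ORDER BY ca.Attribute,ca.Value
;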
Let me know if that's even close to what you need.
--Jeff Moden
Change is inevitable... Change for the better is not.
January 7, 2016 at 11:23 am
Hi Jeff,
I just wanted to let you know I love your personal friend. The idea you sent was an excellent suggestion. I went from my original post which was taking approx. 1 minute to run, to your post which takes approx. 9 seconds. In addition, my one minute version was tested using approx. 10 fields and your model was for the majority of the table (minus the inappropriate data types) so approx. 150 fields. Just a huge performance increase!! Thank you very much!!
January 8, 2016 at 7:38 pm
Tim Kovacich (1/7/2016)
I just wanted to let you know I love your personal friend. [...] I went from my original post, which was taking approx. 1 minute to run, to your post, which takes approx. 9 seconds. [...] Just a huge performance increase!! Thank you very much!!
Very glad I could help, and thank you very much for the feedback. I love it when people post back performance stats, whether they're good or bad, because I ran a test on test data and you ran a test on a real-life situation. Your feedback has helped ME tremendously and that, of course, will be used to help others.
Thanks again, Tim.
--Jeff Moden
Change is inevitable... Change for the better is not.