Efficient way to count distinct

Question

Efficient way to count distinct

ollyjolly

SSC Eights!

Points: 833
More actions
July 3, 2018 at 6:50 pm

#365252

I am trying to get a query to run which counts distinct values for two columns.
I have a temp table named #Paid that contains over 60 million rows and when I group the columns and perform a distinct count it should return over 2 million rows. The distinct counts are being done on the E_Id column which is varchar (32) and P_Id column which is varchar (80).
The current query takes a very long time to execute. After 5 hours I stopped the query as nothing was returning. Here is the SQL code with count distinct for both columns.

SELECT P.Yr
,P.YrMnth
,P.Market
,P.AType
,P.Groups
,P.Code
,P.Classification
,P.CType
,P.ServCategory
,P.DiaCategory
,count(distinct P.E_Id) As Distinct_E
,count(distinct P.P_Id) As Distinct_P
FROM #Paid P
Group by P.Yr
,P.YrMnth
,P.Market
,P.AType
,P.Groups
,P.Code
,P.Classification
,P.CType
,P.ServCategory
,P.DiaCategory

I also tried using subqueries to count the distinct values and applied a clustered index to the #Paid table using the with the join fields from the distinct count subqueries. Here is the code I am using but this too runs for hours.
SELECT P.Yr
,P.YrMnth
,P.Market
,P.AType
,P.Groups
,P.Code
,P.Classification
,P.CType
,P.ServCategory
,P.DiaCategory
,sum(c.ct) As Distinct_E
,sum(d.ct) As Distinct_P
FROM #Paid P
left join (select d.Yr, d.YrMnth, d.Market, d.AType, d.Groups, d.Code,
   d.Classification, d.CType, d.ServCategory, d.DiaCategory,
   d.E_Id, count(*) as ct

   from (Select P.Yr, P.YrMnth, P.Market, P.AType, P.Groups, P.Code, P.Classification,
   P.CType, P.ServCategory, P.DiaCategory, P.E_Id
     FROM #Paid_Claims PC
     ) as d
   group by d.Yr, d.YrMnth, d.Market, d.AType, d.Groups, d.Code,
   d.Classification, d.CType, d.ServCategory, d.DiaCategory,
   d.E_Id
) as c
on c.Code = P.Code AND c.Classification = P.Classification AND c.ServCategory = P.ServCategory
AND c.DiaCategory = P.DiaCategory AND c.Groups = P.Groups AND c.YrMnth = P.YrMnth AND c.Market = P.Market AND c.CType = P.CType
AND c.AType = P.AType AND c.Yr = P.Yr
left join (select d.Yr, d.YrMnth, d.Market, d.AType, d.Groups, d.Code,
   d.Classification, d.CType, d.ServCategory, d.DiaCategory,
   d.P_Id, count(1) as ct

   from (Select P.Yr, P.YrMnth, P.Market, P.AType, P.Groups, P.Code, P.Classification,
   P.CType, P.ServCategory, P.DiaCategory, P.P_Id
     FROM #Paid_Claims PC
     ) as d
   group by d.Yr, d.YrMnth, d.Market, d.AType, d.Groups, d.Code,
   d.Classification, d.CType, d.ServCategory, d.DiaCategory,
   d.P_Id
) as d
on d.Code = Pd.Code AND d.Classification = Pd.Classification AND d.ServCategory = Pd.ServCategory
AND d.DiaCategory = Pd.DiaCategory AND d.Groups = Pd.Groups AND d.YrMnth = Pd.YrMnth AND d.Market = Pd.Market AND d.CType = Pd.CType
AND d.AType = Pd.AType AND d.Yr = P.Yr
Group by P.Yr
,P.YrMnth
,P.Market
,P.AType
,P.Groups
,P.Code
,P.Classification
,P.CType
,P.ServCategory
,P.DiaCategory

What I don't understand is that I am able to get other portions of the query to run in a timely manner even those that include 40 million and more records, but trying to group by a large record set with distinct counts on a long varchar is not working very well. Either there has to be a better way to achieve this or I have overlooked something in my query.
Has anyone used either method or another method to count distinct values?

Viewing 15 posts - 1 through 15 (of 21 total)

You must be logged in to reply to this topic. Login to reply

Jeff Moden SSC Guru Points: 1004621 More actions · Answer 1

ollyjolly - Tuesday, July 3, 2018 6:50 PM
I am trying to come up with a more efficient method to count distinct values for two columns. My temp table #Paid contains over 60 million rows and when I group the columns I return 2 million rows. The current query takes a very long time (hours) before it returns any data.
Here is the SQL code with count distinct for both columns.
SELECT P.Yr
,P.YrMnth
,P.Market
,P.AType
,P.Groups
,P.Code
,P.Classification
,P.CType
,P.ServCategory
,P.DiaCategory
,count(distinct P.E_Id) As Distinct_E
,count(distinct P.P_Id) As Distinct_P
FROM #Paid P
Group by P.Yr
,P.YrMnth
,P.Market
,P.AType
,P.Groups
,P.Code
,P.Classification
,P.CType
,P.ServCategory
,P.DiaCategory
I have also attempted to apply a clustered index to the #Paid table and then create subqueries to get a distinct count. Here is the code.
SELECT P.Yr
,P.YrMnth
,P.Market
,P.AType
,P.Groups
,P.Code
,P.Classification
,P.CType
,P.ServCategory
,P.DiaCategory
,sum(c.ct) As Distinct_E
,sum(d.ct) As Distinct_P
FROM #Paid P
left join (select d.Yr, d.YrMnth, d.Market, d.AType, d.Groups, d.Code,
  d.Classification, d.CType, d.ServCategory, d.DiaCategory,
  d.E_Id, count(*) as ct  from (Select P.Yr, P.YrMnth, P.Market, P.AType, P.Groups, P.Code, P.Classification,
  P.CType, P.ServCategory, P.DiaCategory, P.E_Id
    FROM #Paid_Claims PC
    ) as d
  group by d.Yr, d.YrMnth, d.Market, d.AType, d.Groups, d.Code,
  d.Classification, d.CType, d.ServCategory, d.DiaCategory,
  d.E_Id
) as c
on c.Code = P.Code AND c.Classification = P.Classification AND c.ServCategory = P.ServCategory
AND c.DiaCategory = P.DiaCategory AND c.Groups = P.Groups AND c.YrMnth = P.YrMnth AND c.Market = P.Market AND c.CType = P.CType
AND c.AType = P.AType AND c.Yr = P.Yr

left join (select d.Yr, d.YrMnth, d.Market, d.AType, d.Groups, d.Code,
  d.Classification, d.CType, d.ServCategory, d.DiaCategory,
  d.P_Id, count(1) as ct  from (Select P.Yr, P.YrMnth, P.Market, P.AType, P.Groups, P.Code, P.Classification,
  P.CType, P.ServCategory, P.DiaCategory, P.P_Id
    FROM #Paid_Claims PC
    ) as d
  group by d.Yr, d.YrMnth, d.Market, d.AType, d.Groups, d.Code,
  d.Classification, d.CType, d.ServCategory, d.DiaCategory,
  d.P_Id
) as d
on d.Code = Pd.Code AND d.Classification = Pd.Classification AND d.ServCategory = Pd.ServCategory
AND d.DiaCategory = Pd.DiaCategory AND d.Groups = Pd.Groups AND d.YrMnth = Pd.YrMnth AND d.Market = Pd.Market AND d.CType = Pd.CType
AND d.AType = Pd.AType AND d.Yr = P.Yr
Group by P.Yr
,P.YrMnth
,P.Market
,P.AType
,P.Groups
,P.Code
,P.Classification
,P.CType
,P.ServCategory
,P.DiaCategory
Is using subqueries the best method? Is there anther way this can be done that will not result in my query running for a long time?

Yes. You can create an "Indexed View", which is sometimes referred to as a "Materialized View". Be advised that they will slow down INSERTs, UPDATEs, and DELETEs to some extent. There are a whole lot of caveats and restrictions to using Indexed Views but, with the understanding that it was a quick look-see on my part, I see nothing in your code to prevent it's usage. The biggest hint I can give you is that your COUNT(DISTINCT somecolumn) would need to be changed to COUNT_BIG(DISTINCT somecolumn).

Here are some reference links:
https://docs.microsoft.com/en-us/sql/relational-databases/views/create-indexed-views?view=sql-server-2017
https://www.google.com/search?q=indexed+views+sql+server&btnK=Google+Search&oq=indexed+views+sql+server

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

ollyjolly SSC Eights! Points: 833 More actions · Answer 2

Thank you Jeff. I will look through the information on indexed views and attempt to implement. I also need to make sure that I have permissions to create views. If so I will attempt the indexed view and use COUNT_BIG. Thank you again.

ollyjolly SSC Eights! Points: 833 More actions · Answer 3

As I was afraid....it looks like I am limited to SELECT permission. Attempted to create a view and got the dreaded....

CREATE VIEW permission denied in database

DesNorton SSC-Insane Points: 24541 More actions · Answer 4

Depending on the distribution of your data, this script by Paul White might be of use
Super-fast DISTINCT using a recursive CTE

How to post data/code on a forum to get the best help.

Jeff Moden SSC Guru Points: 1004621 More actions · Answer 5

ollyjolly - Tuesday, July 3, 2018 9:15 PM
As I was afraid....it looks like I am limited to SELECT permission. Attempted to create a view and got the dreaded....
CREATE VIEW permission denied in database

So write the code for the indexed view and submit it (along with the justification for the need) to the people that can deploy it to production. Heh... this is the age of "DevOps", you know. (Always has been but people are just now figuring it out).

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

ollyjolly SSC Eights! Points: 833 More actions · Answer 6

Jeff I will certainly look info this. I know the process with my IT department to get something through code review and in to production is a very long process.

In the meantime the recursive CTE as DesNorton posted might be an alternative.

In looking at the syntax used for a recursive CTE. As I have not used a recursive CTE before I am a little unclear on how to structure based on my query.

My initial thought is that I would need to add all the fields I am waiting to get a distinct count over. Similar to this?
WITH RecursiveCTE AS ( SELECT Yr = MIN(P.Yr), YrMnth = MIN(P.YrMnth), Additional fields (Market, AType, Groups etc....). FROM #Paid P UNION ALL SELECT R.Yr, R.YrMnth, Additional fields (Market, AType, Groups etc....). FROM ( -- A cunning way to use TOP in the recursive part of a CTE SELECT R.Yr, R.YrMnth, Additional fields (Market, AType, Groups etc....) , rn = ROW_NUMBER() OVER (ORDER BY P.Yr, P.YrMnth, Additional fields (Market, AType, Groups etc....)) FROM #Paid P JOIN RecursiveCTE R ON R.Yr <> P.Yr AND R.YrMnth <> P.YrMnth, Additional fields (Market, AType, Groups etc....) ) R WHERE R.rn = 1 ) SELECT * FROM RecursiveCTE OPTION (MAXRECURSION 0);

Jeff Moden SSC Guru Points: 1004621 More actions · Answer 7

The trouble is that you're grouping by 10 other columns. You also have no filter criteria and so you will be scanning either the entire table or a covering index that contains 60 million rows. The recursive CTE used for the "super fast" distinct is operating on 1 column of a single column clustered index.

For such a thing to work for you, you'll need the perfect index. Considering that you not allowed to even make a view, how long will it take you to push an index through?

Also, have you tried the query without the DISTINCT just to set a baseline expectation? I could be wrong but I'm thinking that it's going to take a bit longer than you'd expect,

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

ollyjolly SSC Eights! Points: 833 More actions · Answer 8

As in a select of the 60 million rows without any DISTINCT or group by from the #Paid temp table?

As I am using a temp table to pull from I created a clustered index on the #paid table.

CREATE CLUSTERED INDEX Indx_Paid ON #Paid (Code, Classification, ServCategory, DiaCategory, Groups, YrMnth, Market, CType, AType)

I ordered the fields in the clustered index based on number of distinct values in each field. The Code column has 253 distinct values followed by the Classification column which has 90 distinct values.....all the way to the AType column which only contains 2 distinct values.

Could changing the order of the fields in the index make an improvement or maybe use a nonclustered index?

Jeff Moden SSC Guru Points: 1004621 More actions · Answer 9

ollyjolly - Wednesday, July 4, 2018 11:51 AM
As in a select of the 60 million rows without any DISTINCT or group by from the #Paid temp table?
As I am using a temp table to pull from I created a clustered index on the #paid table.
CREATE CLUSTERED INDEX Indx_Paid ON #Paid (Code, Classification, ServCategory, DiaCategory, Groups, YrMnth, Market, CType, AType)

I ordered the fields in the clustered index based on number of distinct values in each field. The Code column has 253 distinct values followed by the Classification column which has 90 distinct values.....all the way to the AType column which only contains 2 distinct values.
Could changing the order of the fields in the index make an improvement or maybe use a nonclustered index?

Ummm... I missed that before... you're copying 60 million non-aggregated rows to TempDB... I'm thinking that may be the first part of solving this problem. But, skipping that little nuance...

There are two ways that you may be able to incorporate Paul's good code ... one is to include a full subquery that contains his rCTE for each COUNT(DISTINCT) you have in your SELECT list or do a CROSS APPLY for each one.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

ollyjolly SSC Eights! Points: 833 More actions · Answer 10

Ok thanks Jeff.

With regard to Paul's code or the Cross Apply you have to pardon my ignorance as I have not used either before. So using my prior query as shown below is this what you mean by adding a full subquery? Also, is the example logic I have for the recursive CTE correct?

; WITH RecursiveCTE1 AS ( SELECT Yr = MIN(P.Yr), YrMnth = MIN(P.YrMnth), Additional fields (Market, AType, Groups etc....). FROM #Paid P UNION ALL SELECT R.Yr, R.YrMnth, Additional fields (Market, AType, Groups etc....). FROM ( -- A cunning way to use TOP in the recursive part of a CTE SELECT R.Yr, R.YrMnth, Additional fields (Market, AType, Groups etc....) , rn = ROW_NUMBER() OVER (ORDER BY P.Yr, P.YrMnth, Additional fields (Market, AType, Groups etc....)) FROM #Paid P JOIN RecursiveCTE R ON R.Yr <> P.Yr AND R.YrMnth <> P.YrMnth, Additional fields (Market, AType, Groups etc....) ) R WHERE R.rn = 1)

, RecursiveCTE2 AS ( SELECT Yr = MIN(P.Yr), YrMnth = MIN(P.YrMnth), Additional fields (Market, AType, Groups etc....). FROM #Paid P UNION ALL SELECT R.Yr, R.YrMnth, Additional fields (Market, AType, Groups etc....). FROM ( -- A cunning way to use TOP in the recursive part of a CTE SELECT R.Yr, R.YrMnth, Additional fields (Market, AType, Groups etc....) , rn = ROW_NUMBER() OVER (ORDER BY P.Yr, P.YrMnth, Additional fields (Market, AType, Groups etc....)) FROM #Paid P JOIN RecursiveCTE R ON R.Yr <> P.Yr AND R.YrMnth <> P.YrMnth, Additional fields (Market, AType, Groups etc....) ) R WHERE R.rn = 1 )

SELECT P.Yr ,P.YrMnth ,P.Market ,P.AType ,P.Groups ,P.Code ,P.Classification ,P.CType ,P.ServCategory ,P.DiaCategory ,sum(c.ct) As Distinct_E ,sum(d.ct) As Distinct_P FROM #Paid P --------------INSERT PAUL'S RECURSIVE CTE HERE

left join ( select * FROM RecursiveCTE1 ) as c on c.Code = P.Code AND c.Classification = P.Classification AND c.ServCategory = P.ServCategory AND c.DiaCategory = P.DiaCategory AND c.Groups = P.Groups AND c.YrMnth = P.YrMnth AND c.Market = P.Market AND c.CType = P.CType AND c.AType = P.AType AND c.Yr = P.Yr

--------------INSERT PAUL'S RECURSIVE CTE HERE

left join ( select * FROM RecursiveCTE1 ) as d on d.Code = Pd.Code AND d.Classification = Pd.Classification AND d.ServCategory = Pd.ServCategory AND d.DiaCategory = Pd.DiaCategory AND d.Groups = Pd.Groups AND d.YrMnth = Pd.YrMnth AND d.Market = Pd.Market AND d.CType = Pd.CType AND d.AType = Pd.AType AND d.Yr = P.Yr

Group by P.Yr ,P.YrMnth ,P.Market ,P.AType ,P.Groups ,P.Code ,P.Classification ,P.CType ,P.ServCategory ,P.DiaCategory

ScottPletcher SSC Guru Points: 101192 More actions · Answer 11

Before loading the #Paid table, cluster it on all the GROUP BY columns:

Yr
,YrMnth
,Market
,AType
,Groups
,Code
,Classification
,CType
,ServCategory
,DiaCategory

That will avoid sorting when you run the main queries to get distinct values.

SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

Jeff Moden SSC Guru Points: 1004621 More actions · Answer 12

ScottPletcher - Thursday, July 5, 2018 8:30 AM
Before loading the #Paid table, cluster it on all the GROUP BY columns:
Yr
,YrMnth
,Market
,AType
,Groups
,Code
,Classification
,CType
,ServCategory
,DiaCategory
That will avoid sorting when you run the main queries to get distinct values.

Not a bad idea but folks need to be careful here. The table contains 60+ million rows, according to the OP. That a lot of stuff to have in TempDB (which is where the Op is operating from). If you simply add a clustered index, it will take twice as much TempDB space because the heap won't be dropped until the CI is rebuilt. It would be much better, if possible, to do the initial build and population of this table in a minimally logged fashion with the CI in place. If this causes explosive growth on TempDB, the DBAs may come looking for the OP with a bat.

The real key here is why in the hell is anyone looking for such a multi-year report, who the hell is going to review or use the 2 million row result set, and why are they recalculating history that will not change every time they run it? If this is a recurring task and it's actually an important task, what they really need to do is accelerate the process for either warehousing the historical aggregates or accelerate the process of getting the Indexed View in place.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

ollyjolly SSC Eights! Points: 833 More actions · Answer 13

ScottPletcher - Thursday, July 5, 2018 8:30 AM

Before loading the #Paid table, cluster it on all the GROUP BY columns:

Yr
,YrMnth
,Market
,AType
,Groups
,Code
,Classification
,CType
,ServCategory
,DiaCategory

That will avoid sorting when you run the main queries to get distinct values.

/quote]

I do have the following index after the #Paid insert statement. Here is the index.

CREATE CLUSTERED INDEX Indx_Paid ON #Paid (Code, Classification, ServCategory, DiaCategory, Groups, YrMnth, Yr, Market, CType, AType)

The order of the fields is based on number of distinct values in each field. The Code column has 253 distinct values followed by the Classification column which has 90 distinct values.....all the way to the AType column which only contains 2 distinct values.

ScottPletcher SSC Guru Points: 101192 More actions · Answer 14

ollyjolly - Thursday, July 5, 2018 8:56 AM

ScottPletcher - Thursday, July 5, 2018 8:30 AM
Before loading the #Paid table, cluster it on all the GROUP BY columns:
Yr
,YrMnth
,Market
,AType
,Groups
,Code
,Classification
,CType
,ServCategory
,DiaCategory
That will avoid sorting when you run the main queries to get distinct values.
/quote]

I do have the following index after the #Paid insert statement. Here is the index.

CREATE CLUSTERED INDEX Indx_Paid ON #Paid (Code, Classification, ServCategory, DiaCategory, Groups, YrMnth, Yr, Market, CType, AType)

The order of the fields is based on number of distinct values in each field. The Code column has 253 distinct values followed by the Classification column which has 90 distinct values.....all the way to the AType column which only contains 2 distinct values.

But that's not how you're GROUPing them. You'll want the clus index to exactly match the grouping so that the query avoids a sort (another sort -- the rows will of course have to be sorted once when they are loaded, but that is true for any clus index).

SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

Efficient way to count distinct

Cookies on SQLServerCentral