Efficient way to count distinct

  • I am trying to get a query to run which counts distinct values for two columns. 

     I have a temp table named #Paid that contains over 60 million rows and when I group the columns and perform a distinct count it should return over 2 million rows. The distinct counts are being done on the E_Id column which is varchar (32) and P_Id column which is varchar (80).

    The current query takes a very long time to execute. After 5 hours I stopped the query as nothing was returning.  Here is the SQL code with count distinct for both columns. 

     

    SELECT P.Yr
      ,P.YrMnth
      ,P.Market
      ,P.AType
      ,P.Groups
      ,P.Code
      ,P.Classification
      ,P.CType
      ,P.ServCategory
      ,P.DiaCategory
      ,count(distinct P.E_Id) As Distinct_E
      ,count(distinct P.P_Id) As Distinct_P
      FROM #Paid P
      Group by P.Yr
      ,P.YrMnth
      ,P.Market
      ,P.AType
      ,P.Groups
      ,P.Code
      ,P.Classification
      ,P.CType
      ,P.ServCategory
      ,P.DiaCategory

     
    I also tried using subqueries to count the distinct values and applied a clustered index to the #Paid table using the with the join fields from the distinct count subqueries. Here is the code I am using but this too runs for hours.

    SELECT P.Yr
      ,P.YrMnth
      ,P.Market
      ,P.AType
      ,P.Groups
      ,P.Code
      ,P.Classification
      ,P.CType
      ,P.ServCategory
      ,P.DiaCategory
      ,sum(c.ct) As Distinct_E
      ,sum(d.ct) As Distinct_P
      FROM #Paid P

     left join (select d.Yr, d.YrMnth, d.Market, d.AType, d.Groups, d.Code,
       d.Classification, d.CType, d.ServCategory, d.DiaCategory,
       d.E_Id, count(*) as ct
     
       from (Select P.Yr, P.YrMnth, P.Market, P.AType, P.Groups, P.Code, P.Classification,
       P.CType, P.ServCategory, P.DiaCategory, P.E_Id
         FROM #Paid_Claims PC
         ) as d
       group by d.Yr, d.YrMnth, d.Market, d.AType, d.Groups, d.Code,
       d.Classification, d.CType, d.ServCategory, d.DiaCategory,
       d.E_Id
      ) as c
     on c.Code = P.Code AND  c.Classification = P.Classification AND c.ServCategory = P.ServCategory
      AND c.DiaCategory = P.DiaCategory AND c.Groups = P.Groups AND c.YrMnth = P.YrMnth AND c.Market = P.Market  AND c.CType = P.CType
      AND c.AType = P.AType AND c.Yr = P.Yr

     left join (select d.Yr, d.YrMnth, d.Market, d.AType, d.Groups, d.Code,
       d.Classification, d.CType, d.ServCategory, d.DiaCategory,
       d.P_Id, count(1) as ct
     
       from (Select P.Yr, P.YrMnth, P.Market, P.AType, P.Groups, P.Code, P.Classification,
       P.CType, P.ServCategory, P.DiaCategory, P.P_Id
         FROM #Paid_Claims PC
         ) as d
       group by d.Yr, d.YrMnth, d.Market, d.AType, d.Groups, d.Code,
       d.Classification, d.CType, d.ServCategory, d.DiaCategory,
       d.P_Id
      ) as d
     on d.Code = Pd.Code AND  d.Classification = Pd.Classification AND d.ServCategory = Pd.ServCategory
      AND d.DiaCategory = Pd.DiaCategory AND d.Groups = Pd.Groups AND d.YrMnth = Pd.YrMnth AND d.Market = Pd.Market  AND d.CType = Pd.CType
      AND d.AType = Pd.AType AND d.Yr = P.Yr

     Group by P.Yr
      ,P.YrMnth
      ,P.Market
      ,P.AType
      ,P.Groups
      ,P.Code
      ,P.Classification
      ,P.CType
      ,P.ServCategory
      ,P.DiaCategory

     

    What I don't understand is that I am able to get other portions of the query to run in a timely manner even those that include 40 million and more records, but trying to group by a large record set with distinct counts on a long varchar is not working very well. Either there has to be a better way to achieve this or I have overlooked something in my query.

    Has anyone used either method or another method to count distinct values?

     

  • ollyjolly - Tuesday, July 3, 2018 6:50 PM

    I am trying to come up with a more efficient method to count distinct values for two columns. My temp table #Paid contains over 60 million rows and when I group the columns I return 2 million rows. The current query takes a very long time (hours) before it returns any data.

    Here is the SQL code with count distinct for both columns.

    SELECT P.Yr
     ,P.YrMnth
     ,P.Market
     ,P.AType
     ,P.Groups
     ,P.Code
     ,P.Classification
     ,P.CType
     ,P.ServCategory
     ,P.DiaCategory
     ,count(distinct P.E_Id) As Distinct_E
     ,count(distinct P.P_Id) As Distinct_P
     FROM #Paid P
     Group by P.Yr
     ,P.YrMnth
     ,P.Market
     ,P.AType
     ,P.Groups
     ,P.Code
     ,P.Classification
     ,P.CType
     ,P.ServCategory
     ,P.DiaCategory

    I have also attempted to apply a clustered index to the #Paid table and then create subqueries to get a distinct count. Here is the code.


    SELECT P.Yr
     ,P.YrMnth
     ,P.Market
     ,P.AType
     ,P.Groups
     ,P.Code
     ,P.Classification
     ,P.CType
     ,P.ServCategory
     ,P.DiaCategory
     ,sum(c.ct) As Distinct_E
     ,sum(d.ct) As Distinct_P
     FROM #Paid P

     left join (select d.Yr, d.YrMnth, d.Market, d.AType, d.Groups, d.Code,
      d.Classification, d.CType, d.ServCategory, d.DiaCategory,
      d.E_Id, count(*) as ct  from (Select P.Yr, P.YrMnth, P.Market, P.AType, P.Groups, P.Code, P.Classification,
      P.CType, P.ServCategory, P.DiaCategory, P.E_Id
        FROM #Paid_Claims PC
        ) as d
      group by d.Yr, d.YrMnth, d.Market, d.AType, d.Groups, d.Code,
      d.Classification, d.CType, d.ServCategory, d.DiaCategory,
      d.E_Id
     ) as c
    on c.Code = P.Code AND  c.Classification = P.Classification AND c.ServCategory = P.ServCategory
     AND c.DiaCategory = P.DiaCategory AND c.Groups = P.Groups AND c.YrMnth = P.YrMnth AND c.Market = P.Market  AND c.CType = P.CType
     AND c.AType = P.AType AND c.Yr = P.Yr


    left join (select d.Yr, d.YrMnth, d.Market, d.AType, d.Groups, d.Code,
      d.Classification, d.CType, d.ServCategory, d.DiaCategory,
      d.P_Id, count(1) as ct  from (Select P.Yr, P.YrMnth, P.Market, P.AType, P.Groups, P.Code, P.Classification,
      P.CType, P.ServCategory, P.DiaCategory, P.P_Id
        FROM #Paid_Claims PC
        ) as d
      group by d.Yr, d.YrMnth, d.Market, d.AType, d.Groups, d.Code,
      d.Classification, d.CType, d.ServCategory, d.DiaCategory,
      d.P_Id
     ) as d
    on d.Code = Pd.Code AND  d.Classification = Pd.Classification AND d.ServCategory = Pd.ServCategory
     AND d.DiaCategory = Pd.DiaCategory AND d.Groups = Pd.Groups AND d.YrMnth = Pd.YrMnth AND d.Market = Pd.Market  AND d.CType = Pd.CType
     AND d.AType = Pd.AType AND d.Yr = P.Yr

     Group by P.Yr
     ,P.YrMnth
     ,P.Market
     ,P.AType
     ,P.Groups
     ,P.Code
     ,P.Classification
     ,P.CType
     ,P.ServCategory
     ,P.DiaCategory

    Is using subqueries the best method? Is there anther way this can be done that will not result in my query running for a long time?

    Yes.  You can create an "Indexed View", which is sometimes referred to as a "Materialized View".  Be advised that they will slow down INSERTs, UPDATEs, and DELETEs to some extent.  There are a whole lot of caveats and restrictions to using Indexed Views but, with the understanding that it was a quick look-see on my part, I see nothing in your code to prevent it's usage.  The biggest hint I can give you is that your COUNT(DISTINCT somecolumn) would need to be changed to COUNT_BIG(DISTINCT somecolumn). 

    Here are some reference links:
    https://docs.microsoft.com/en-us/sql/relational-databases/views/create-indexed-views?view=sql-server-2017
    https://www.google.com/search?q=indexed+views+sql+server&btnK=Google+Search&oq=indexed+views+sql+server

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Thank you Jeff. I will look through the information on indexed views and attempt to implement. I also need to make sure that I have permissions to create views. If so I will attempt the indexed view and use COUNT_BIG. Thank you again.

  • As I was afraid....it looks like I am limited to SELECT permission. Attempted to create a view and got the dreaded....

    CREATE VIEW permission denied in database

  • Depending on the distribution of your data, this script by Paul White might be of use
    Super-fast DISTINCT using a recursive CTE

  • ollyjolly - Tuesday, July 3, 2018 9:15 PM

    As I was afraid....it looks like I am limited to SELECT permission. Attempted to create a view and got the dreaded....

    CREATE VIEW permission denied in database

    So write the code for the indexed view and submit it (along with the justification for the need) to the people that can deploy it to production.  Heh... this is the age of "DevOps", you know. 😉 (Always has been but people are just now figuring it out).

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff I will certainly look info this. I know the process with my IT department to get something through code review and in to production is a very long process.

    In the meantime the recursive CTE as DesNorton posted might be an alternative.

    In looking at the syntax used for a recursive CTE. As I have not used a recursive CTE before I am a little unclear on how to structure based on my query.

    My initial thought is that I would need to add all the fields I am waiting to get a distinct count over. Similar to this? 

    WITH RecursiveCTE
    AS (
    SELECT  Yr = MIN(P.Yr), YrMnth = MIN(P.YrMnth), Additional fields (Market, AType, Groups etc....).
            FROM    #Paid P
            UNION ALL
    SELECT R.Yr, R.YrMnth, Additional fields (Market, AType, Groups etc....).
            FROM (
    -- A cunning way to use TOP in the recursive part of a CTE 
                    SELECT  R.Yr, R.YrMnth, Additional fields (Market, AType, Groups etc....)
                            , rn = ROW_NUMBER() OVER (ORDER BY  P.Yr, P.YrMnth, Additional fields (Market, AType, Groups etc....))
    FROM #Paid P
            JOIN RecursiveCTE R
    ON R.Yr <> P.Yr AND R.YrMnth <> P.YrMnth, Additional fields (Market, AType, Groups etc....)
    ) R
    WHERE R.rn = 1
    )
    SELECT *
    FROM RecursiveCTE
    OPTION (MAXRECURSION 0);

  • The trouble is that you're grouping by 10 other columns.  You also have no filter criteria and so you will be scanning either the entire table or a covering index that contains 60 million rows.  The recursive CTE used for the "super fast" distinct is operating on 1 column of a single column clustered index. 

    For such a thing to work for you, you'll need the perfect index.  Considering that you not allowed to even make a view, how long will it take you to push an index through?

    Also, have you tried the query without the DISTINCT just to set a baseline expectation?  I could be wrong but I'm thinking that it's going to take a bit longer than you'd expect,

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • As in a select of the 60 million rows without any DISTINCT or group by from the #Paid temp table?

    As I am using a temp table to pull from I created a clustered index on the #paid table.

    CREATE CLUSTERED INDEX Indx_Paid ON #Paid (Code, Classification, ServCategory, DiaCategory, Groups, YrMnth, Market, CType, AType)

    I ordered the fields in the clustered index based on number of distinct values in each field. The Code column has 253 distinct values followed by the Classification column which has 90 distinct values.....all the way to the AType column which only contains 2 distinct values.

    Could changing the order of the fields in the index make an improvement or maybe use a nonclustered index?

  • ollyjolly - Wednesday, July 4, 2018 11:51 AM

    As in a select of the 60 million rows without any DISTINCT or group by from the #Paid temp table?

    As I am using a temp table to pull from I created a clustered index on the #paid table.

    CREATE CLUSTERED INDEX Indx_Paid ON #Paid (Code, Classification, ServCategory, DiaCategory, Groups, YrMnth, Market, CType, AType)

    I ordered the fields in the clustered index based on number of distinct values in each field. The Code column has 253 distinct values followed by the Classification column which has 90 distinct values.....all the way to the AType column which only contains 2 distinct values.

    Could changing the order of the fields in the index make an improvement or maybe use a nonclustered index?

    Ummm... I missed that before... you're copying 60 million non-aggregated rows to TempDB... I'm thinking that may be the first part of solving this problem.  But, skipping that little nuance...

    There are two ways that you may be able to incorporate Paul's good code ... one is to include a full subquery that contains his rCTE for each COUNT(DISTINCT) you have in your SELECT list or do a CROSS APPLY for each one.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Ok thanks Jeff.

    With regard to Paul's code or the Cross Apply you have to pardon my ignorance as I have not used either before. So using my prior query as shown below is this what you mean by adding a full subquery? Also, is the example logic I have for the recursive CTE correct?  

    ; WITH RecursiveCTE1 AS (
    SELECT Yr = MIN(P.Yr), YrMnth = MIN(P.YrMnth), Additional fields (Market, AType, Groups etc....).
    FROM #Paid P
    UNION ALL
    SELECT R.Yr, R.YrMnth, Additional fields (Market, AType, Groups etc....).
    FROM (
    -- A cunning way to use TOP in the recursive part of a CTE
    SELECT R.Yr, R.YrMnth, Additional fields (Market, AType, Groups etc....)
    , rn = ROW_NUMBER() OVER (ORDER BY P.Yr, P.YrMnth, Additional fields (Market, AType, Groups etc....))
    FROM #Paid P
    JOIN RecursiveCTE R
    ON R.Yr <> P.Yr AND R.YrMnth <> P.YrMnth, Additional fields (Market, AType, Groups etc....)
    ) R
    WHERE R.rn = 1) 

    ,  RecursiveCTE2 AS (
    SELECT Yr = MIN(P.Yr), YrMnth = MIN(P.YrMnth), Additional fields (Market, AType, Groups etc....).
    FROM #Paid P
    UNION ALL
    SELECT R.Yr, R.YrMnth, Additional fields (Market, AType, Groups etc....).
    FROM (
    -- A cunning way to use TOP in the recursive part of a CTE
    SELECT R.Yr, R.YrMnth, Additional fields (Market, AType, Groups etc....)
    , rn = ROW_NUMBER() OVER (ORDER BY P.Yr, P.YrMnth, Additional fields (Market, AType, Groups etc....))
    FROM #Paid P
    JOIN RecursiveCTE R
    ON R.Yr <> P.Yr AND R.YrMnth <> P.YrMnth, Additional fields (Market, AType, Groups etc....)
    ) R
    WHERE R.rn = 1
    ) 

    SELECT P.Yr
    ,P.YrMnth
    ,P.Market
    ,P.AType
    ,P.Groups
    ,P.Code
    ,P.Classification
    ,P.CType
    ,P.ServCategory
    ,P.DiaCategory
    ,sum(c.ct) As Distinct_E
    ,sum(d.ct) As Distinct_P
    FROM #Paid P
    --------------INSERT PAUL'S RECURSIVE CTE HERE

    left join ( select * FROM RecursiveCTE1 ) as c
    on c.Code = P.Code AND c.Classification = P.Classification AND c.ServCategory = P.ServCategory
    AND c.DiaCategory = P.DiaCategory AND c.Groups = P.Groups AND c.YrMnth = P.YrMnth AND c.Market = P.Market AND c.CType = P.CType
    AND c.AType = P.AType AND c.Yr = P.Yr

    --------------INSERT PAUL'S RECURSIVE CTE HERE

    left join ( select * FROM RecursiveCTE1 ) as d
    on d.Code = Pd.Code AND d.Classification = Pd.Classification AND d.ServCategory = Pd.ServCategory
    AND d.DiaCategory = Pd.DiaCategory AND d.Groups = Pd.Groups AND d.YrMnth = Pd.YrMnth AND d.Market = Pd.Market AND d.CType = Pd.CType
    AND d.AType = Pd.AType AND d.Yr = P.Yr

    Group by P.Yr
    ,P.YrMnth
    ,P.Market
    ,P.AType
    ,P.Groups
    ,P.Code
    ,P.Classification
    ,P.CType
    ,P.ServCategory
    ,P.DiaCategory

  • Before loading the #Paid table, cluster it on all the GROUP BY columns:

    Yr
    ,YrMnth
    ,Market
    ,AType
    ,Groups
    ,Code
    ,Classification
    ,CType
    ,ServCategory
    ,DiaCategory

    That will avoid sorting when you run the main queries to get distinct values.

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

  • ScottPletcher - Thursday, July 5, 2018 8:30 AM

    Before loading the #Paid table, cluster it on all the GROUP BY columns:

    Yr
    ,YrMnth
    ,Market
    ,AType
    ,Groups
    ,Code
    ,Classification
    ,CType
    ,ServCategory
    ,DiaCategory

    That will avoid sorting when you run the main queries to get distinct values.

    Not a bad idea but folks need to be careful here.  The table contains 60+ million rows, according to the OP.  That a lot of stuff to have in TempDB (which is where the Op is operating from).  If you simply add a clustered index, it will take twice as much TempDB space because the heap won't be dropped until the CI is rebuilt.  It would be much better, if possible, to do the initial build and population of this table in a minimally logged fashion with the CI in place.  If this causes explosive growth on TempDB, the DBAs may come looking for the OP with a bat. 😉

    The real key here is why in the hell is anyone looking for such a multi-year report, who the hell is going to review or use the 2 million row result set, and why are they recalculating history that will not change every time they run it?  If this is a recurring task and it's actually an important task, what they really need to do is accelerate the process for either warehousing the historical aggregates or accelerate the process of getting the Indexed View in place.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • ScottPletcher - Thursday, July 5, 2018 8:30 AM

    Before loading the #Paid table, cluster it on all the GROUP BY columns:

    Yr
    ,YrMnth
    ,Market
    ,AType
    ,Groups
    ,Code
    ,Classification
    ,CType
    ,ServCategory
    ,DiaCategory

    That will avoid sorting when you run the main queries to get distinct values.

    /quote]

    I do have the following index after the #Paid insert statement. Here is the index.

    CREATE CLUSTERED INDEX Indx_Paid ON #Paid (Code, Classification, ServCategory, DiaCategory, Groups, YrMnth, Yr, Market, CType, AType)

    The order of the fields is based on number of distinct values in each field. The Code column has 253 distinct values followed by the Classification column which has 90 distinct values.....all the way to the AType column which only contains 2 distinct values.

  • ollyjolly - Thursday, July 5, 2018 8:56 AM

    ScottPletcher - Thursday, July 5, 2018 8:30 AM

    Before loading the #Paid table, cluster it on all the GROUP BY columns:

    Yr
    ,YrMnth
    ,Market
    ,AType
    ,Groups
    ,Code
    ,Classification
    ,CType
    ,ServCategory
    ,DiaCategory

    That will avoid sorting when you run the main queries to get distinct values.

    /quote]

    I do have the following index after the #Paid insert statement. Here is the index.

    CREATE CLUSTERED INDEX Indx_Paid ON #Paid (Code, Classification, ServCategory, DiaCategory, Groups, YrMnth, Yr, Market, CType, AType)

    The order of the fields is based on number of distinct values in each field. The Code column has 253 distinct values followed by the Classification column which has 90 distinct values.....all the way to the AType column which only contains 2 distinct values.

    But that's not how you're GROUPing them.  You'll want the clus index to exactly match the grouping so that the query avoids a sort (another sort -- the rows will of course have to be sorted once when they are loaded, but that is true for any clus index).

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

Viewing 15 posts - 1 through 15 (of 21 total)

You must be logged in to reply to this topic. Login to reply