How To DISTINCT COUNT with Windows Functions (i.e. OVER and PARTITION BY)

  • SQL 2012 Reference = http://msdn.microsoft.com/en-us/library/ms175997(v=sql.110).aspx

    I'm trying to get a unique count of rows in a child table in a multi-table join just like the example shown below from the above Microsoft article on COUNT(). I'm using the Windows functions (as opposed to old fashion aggregation with GROUP BY) because I need both aggregate data and non-aggregated data to be included in the results.

    In the below Example (from example C in the above Microsoft Document on T-SQL) it shows that you can do the following

    USE AdventureWorks2012;

    GO

    SELECT

    DISTINCT Name,

    COUNT(edh.BusinessEntityID) OVER (PARTITION BY edh.DepartmentID) AS EmployeesPerDept

    FROM HumanResources.EmployeePayHistory AS eph

    JOIN HumanResources.EmployeeDepartmentHistory AS edh

    ON eph.BusinessEntityID = edh.BusinessEntityID

    JOIN HumanResources.Department AS d

    ON d.DepartmentID = edh.DepartmentID

    WHERE edh.EndDate IS NULL

    ORDER BY Name;

    This is using the DISTINCT keyword so as to get a distinct count where as without it the Count() coould be over stated by the number of rows retruend by the other tables in the join.

    Is there a way to do the equivelant in T-SQL for SQL Server 2008R2? I've tried the above and it errors out under SQL 2008R2 and examplce C above is NOT in the same Microsoft article when you change the platform to SQL 2008R2 so I'm assuming this is a feature nehancement in SQL 2012. How then can you do this under SQL 2008R2?

    Thanks

    Kindest Regards,

    Just say No to Facebook!
  • YSL I think i'd tackle it by changing one of the tables into a simple GROUP BY, that returns the distinct count inside it;

    something like this is syntactically correct, but i didn't have the database aviaalble for testing.

    SELECT

    DISTINCT eph.Name,

    edh.CNT AS EmployeesPerDept

    FROM HumanResources.EmployeePayHistory AS eph

    JOIN (SELECT DepartmentID, COUNT(DISTINCT BusinessEntityID) AS CNT FROM HumanResources.EmployeeDepartmentHistory) AS edh

    ON eph.BusinessEntityID = edh.BusinessEntityID

    JOIN HumanResources.Department AS d

    ON d.DepartmentID = edh.DepartmentID

    WHERE edh.EndDate IS NULL

    ORDER BY eph.Name;

    Lowell


    --help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. with your description of the problem, we can provide a tested, verifiable solution to your question! asking the question the right way gets you a tested answer the fastest way possible!

  • Thanks for the suggesting Lowell. Thats what I've been working on as a work-a-round but I was hoping there might be a better answer. I must admit I was suprised that I could not find any other posts in this on the forums. Since Microsoft added this capability to SQL 2012 I can only imagine it must have been something people have been asking for.

    Thanks again.

    Kindest Regards,

    Just say No to Facebook!
  • What if you used a DENSE_RANK instead of a count (ordering by the columns you want a distinct count on) then wrapping that query with a select that grabs the MAX value of the DENSE_RANK.

    I know this worked for me, but is it a valid solution?

    DECLARE @DumbTable TABLE (cat CHAR(1), id TINYINT, rw TINYINT);

    INSERT INTO @DumbTable SELECT 'A' as cat, 3 as id, 1 as rw

    INSERT INTO @DumbTable SELECT 'A' as cat, 3 as id, 5 as rw

    INSERT INTO @DumbTable SELECT 'A' as cat, 4 as id, 4 as rw

    INSERT INTO @DumbTable SELECT 'B' as cat, 2 as id, 4 as rw

    INSERT INTO @DumbTable SELECT 'C' as cat, 4 as id, 1 as rw

    INSERT INTO @DumbTable SELECT 'C' as cat, 3 as id, 2 as rw

    INSERT INTO @DumbTable SELECT 'C' as cat, 5 as id, 3 as rw

    SELECT DISTINCT cat

    , MAX(catDenseRank) OVER (PARTITION BY cat) AS idCount

    , maxRw

    FROM(

    SELECT cat

    , DENSE_RANK() OVER (PARTITION BY cat ORDER BY id) as catDenseRank

    , MAX(rw) OVER (PARTITION BY cat) as maxRw

    FROM @DumbTable

    ) AS dumbDataWrapper

    On a side note....why can't you post DR*P TABLE or UNION [ALL] SELECT anymore...or is that just me.

  • I'm not really sure what your problem is. The query you posted would run on SQL 2008 R2, if you only have the tables. When I try I get:

    Msg 208, Level 16, State 1, Line 1

    Invalid object name 'HumanResources.EmployeePayHistory'.

    But I have long ago given up trying to keep up with all the AdventureWorks databases. However, the error message tells me that the syntax is good.

    I'm not really sure that the query mirrors what you actually want to achieve. Rather than using AdventureWorks, maybe you could post a CREATE TABLE statement, an INSERT statement with sample data and the desired result to highlight what you are looking for?

    [font="Times New Roman"]Erland Sommarskog, SQL Server MVP, www.sommarskog.se[/font]

  • I don't have adventure works installed either, so will make a guess. Does this work?

    USE AdventureWorks2012;

    GO

    SELECT

    Name

    ,(SELECT DISTINCT

    COUNT(edh.BusinessEntityID) OVER (PARTITION BY edh.DepartmentID) AS TempCount

    FROM

    HumanResources.EmployeeDepartmentHistory)

    AS EmployeesPerDept

    ,*

    FROM HumanResources.EmployeePayHistory AS eph

    JOIN HumanResources.EmployeeDepartmentHistory AS edh

    ON eph.BusinessEntityID = edh.BusinessEntityID

    JOIN HumanResources.Department AS d

    ON d.DepartmentID = edh.DepartmentID

    WHERE edh.EndDate IS NULL

    ORDER BY Name;

    Also, in a worse case scenario, you should be able to use the ROW_NUMBER windowing function in a nested SELECT and then add a WHERE clause to remove any duplicates after the fact.

     

  • Steven Willis (7/31/2013)


    SELECT

    Name

    ,(SELECT DISTINCT

    COUNT(edh.BusinessEntityID) OVER (PARTITION BY edh.DepartmentID) AS TempCount

    FROM

    HumanResources.EmployeeDepartmentHistory)

    AS EmployeesPerDept

     

    Looks like that would die with "subquery returned more than one row"...

    [font="Times New Roman"]Erland Sommarskog, SQL Server MVP, www.sommarskog.se[/font]

  • Bit late 😀

    I was looking for a similar question and found:DECLARE @DumbTable TABLE (cat CHAR(1), id TINYINT, rw TINYINT);

    INSERT INTO @DumbTable SELECT 'A' as cat, 3 as id, 1 as rw

    INSERT INTO @DumbTable SELECT 'A' as cat, 3 as id, 5 as rw

    INSERT INTO @DumbTable SELECT 'A' as cat, 4 as id, 4 as rw

    INSERT INTO @DumbTable SELECT 'B' as cat, 2 as id, 4 as rw

    INSERT INTO @DumbTable SELECT 'C' as cat, 4 as id, 1 as rw

    INSERT INTO @DumbTable SELECT 'C' as cat, 3 as id, 2 as rw

    INSERT INTO @DumbTable SELECT 'C' as cat, 5 as id, 3 as rw

    Code copied from previous anwser.

    Solution:

    --

    -- Get a count of distinct id's partitioned by cat.

    --

    SELECT --DISTINCT -- Distinct can be switched on / off.

    cat

    , dense_rank() over (partition by cat order by id)

    + dense_rank() over (partition by cat order by id desc)

    - 1 as count_distinct_id

    FROM @DumbTable

    dense_rank + dense_rank desc gives the position from the 'start' plus the position from the end, for a given set this gives a 'constant'. This 'constant' is one more than the number of distinct values. (For one value this is 1 + 1 - 1) For three rows with two values this gives (1 + 2 - 1 or 2 + 1 - 1)

    Ben

    Although the anwer is late, others searching and finding this thread might find this solution usefull.

Viewing 8 posts - 1 through 7 (of 7 total)

You must be logged in to reply to this topic. Login to reply