SQL statement for new vs duplicate data

  • I'm looking for a SQL statement that will output 'new vs duplicate' data on a monthly bases, but I can't seem to come up SQL statement that can accomplish what I need. See example below to get a better idea of what I'm trying to do since this is hard to explain.

    Example data:

    id, num, create_date

    // January

    1, 34250, 2012-01-01

    2, 25383, 2012-01-03

    3, 93482, 2012-01-13

    4, 52934, 2012-01-20

    5, 34823, 2012-01-23

    // Febuary

    6, 09943, 2012-02-03

    7, 32845, 2012-02-05

    8, 34250, 2012-02-16 -- same num as id 1

    9, 57325, 2012-02-28

    10, 34823, 2012-02-28 -- same num as id 5

    // March

    11, 95234, 2012-03-05

    12, 25383, 2012-03-03-- same num as id 2

    13, 32845, 2012-03-16 -- same num as id 7

    14, 34250, 2012-03-28-- same num as id 1, 8

    15, 12412, 2012-03-28

    // April

    16, 53535, 2012-04-05

    17, 23232, 2012-04-03

    18, 57325, 2012-04-16 -- same num as id 9

    19, 34250, 2012-04-28-- same num as id 1, 8, 14

    20, 12343, 2012-04-28

    Example results from data above would be something like below (just looking for new vs duplicate numbers):

    January: 5 new and 0 duplicates (no duplicates will ever be found the first month)

    Febuary: 3 new and 2 duplicates (checks for duplicates in the month of Jan only)

    March: 2 new and 3 duplicates (checks for duplicates in the month of Jan & Feb only)

    April: 3 new and 2 duplicates (checks for duplicates in the month of Jan, Feb & March only)

    ...and so on for each month after that

    Hopefully you can see what I'm trying to do. Also, the total amount (new + duplicate) will always be 5 based on the data above. Any real smart person out there that can help?

    Edit: Put in wrong forum (I'm using SQL Server 2005)

  • <deleted>....not working as required

    ________________________________________________________________
    you can lead a user to data....but you cannot make them think
    and remember....every day is a school day

  • avitron2020 (5/7/2013)


    I'm looking for a SQL statement that will output 'new vs duplicate' data on a monthly bases, but I can't seem to come up SQL statement that can accomplish what I need. See example below to get a better idea of what I'm trying to do since this is hard to explain.

    Example data:

    id, num, create_date

    // January

    1, 34250, 2012-01-01

    2, 25383, 2012-01-03

    3, 93482, 2012-01-13

    4, 52934, 2012-01-20

    5, 34823, 2012-01-23

    // Febuary

    6, 09943, 2012-02-03

    7, 32845, 2012-02-05

    8, 34250, 2012-02-16 -- same num as id 1

    9, 57325, 2012-02-28

    10, 34823, 2012-02-28 -- same num as id 5

    // March

    11, 95234, 2012-03-05

    12, 25383, 2012-03-03-- same num as id 2

    13, 32845, 2012-03-16 -- same num as id 7

    14, 34250, 2012-03-28-- same num as id 1, 8

    15, 12412, 2012-03-28

    // April

    16, 53535, 2012-04-05

    17, 23232, 2012-04-03

    18, 57325, 2012-04-16 -- same num as id 9

    19, 34250, 2012-04-28-- same num as id 1, 8, 14

    20, 12343, 2012-04-28

    Example results from data above would be something like below (just looking for new vs duplicate numbers):

    January: 5 new and 0 duplicates (no duplicates will ever be found the first month)

    Febuary: 3 new and 2 duplicates (checks for duplicates in the month of Jan only)

    March: 2 new and 3 duplicates (checks for duplicates in the month of Jan & Feb only)

    April: 3 new and 2 duplicates (checks for duplicates in the month of Jan, Feb & March only)

    ...and so on for each month after that

    Hopefully you can see what I'm trying to do. Also, the total amount (new + duplicate) will always be 5 based on the data above. Any real smart person out there that can help?

    Edit: Put in wrong forum (I'm using SQL Server 2005)

    This works, assuming that your ID column is ever-increasing (i.e., IDs are assigned in the same order as create_date) - otherwise, you'll need to compare create_dates in the CROSS APPLY subquery:

    -- Please remove the extraneous letter "x" from the CxREATE and DxROP keywords - my company filters internet traffic that includes certain T-SQL DDL keywords.

    CxREATE TABLE dbo.testData (ID int PRIMARY KEY, number int, create_date datetime)

    GO

    INSERT INTO dbo.testData (ID, number, create_date)

    SELECT 1, 34250, '2012-01-01' UNION ALL

    SELECT 2, 25383, '2012-01-03' UNION ALL

    SELECT 3, 93482, '2012-01-13' UNION ALL

    SELECT 4, 52934, '2012-01-20' UNION ALL

    SELECT 5, 34823, '2012-01-23' UNION ALL

    SELECT 6, 09943, '2012-02-03' UNION ALL

    SELECT 7, 32845, '2012-02-05' UNION ALL

    SELECT 8, 34250, '2012-02-16' UNION ALL

    SELECT 9, 57325, '2012-02-28' UNION ALL

    SELECT 10, 34823, '2012-02-28' UNION ALL

    SELECT 11, 95234, '2012-03-05' UNION ALL

    SELECT 12, 25383, '2012-03-03' UNION ALL

    SELECT 13, 32845, '2012-03-16' UNION ALL

    SELECT 14, 34250, '2012-03-28' UNION ALL

    SELECT 15, 12412, '2012-03-28' UNION ALL

    SELECT 16, 53535, '2012-04-05' UNION ALL

    SELECT 17, 23232, '2012-04-03' UNION ALL

    SELECT 18, 57325, '2012-04-16' UNION ALL

    SELECT 19, 34250, '2012-04-28' UNION ALL

    SELECT 20, 12343, '2012-04-28'

    GO

    SELECT DATEPART(MONTH, td.create_date) AS monthNumber

    ,DATENAME(MONTH, td.create_date) AS [monthName]

    ,SUM(CASE WHEN xd.duplicate IS NULL THEN 1 ELSE 0 END) AS nNew

    ,COUNT(td.ID) - SUM(CASE WHEN xd.duplicate IS NULL THEN 1 ELSE 0 END) as nDuplicates

    FROM dbo.testData td

    OUTER APPLY (SELECT TOP(1) t.ID AS duplicate FROM dbo.testData t WHERE t.ID < td.ID AND t.number = td.number) xd

    GROUP BY DATEPART(MONTH, td.create_date), DATENAME(MONTH, td.create_date)

    ORDER BY DATEPART(MONTH, td.create_date)

    GO

    DxROP TABLE dbo.testData

    Two notes about this code:

    - It works no matter how many rows you have per month.

    - For a larger data set, proper indexing will yield better performance.

    Also, notice how I provided DDL for the dbo.testData table and converted your sample data into an INSERT . . . SELECT statement to make it "consumable" for others who may want to work on your question. It's considered good form on this forum to provide DDL, *consumable* test data, and expected results when you ask a question about manipulating specific data so that we can easily set up test environments and provide tested code.

    Jason Wolfkill

  • That worked great. Thanks a lot wolfkillj.

Viewing 4 posts - 1 through 3 (of 3 total)

You must be logged in to reply to this topic. Login to reply