List out rows from a duplicate record set that has one or more odd values

  • I have a table with data that looks as below...

    Each CaseID has multiple records in the table. Each of these rows is expected to have same data in all columns except "CreatedDate". However, in certain cases the StudentID may be different from the other row(s) of the same CaseID. I need to report data for all such cases. Additionally, the output needs to be arranged in the order of caseID with highest created date. i.e., in the result set all records belonging to the CaseID which was modified most recently should appear at the top.

    Here is the sample of how the desired result set should look like...

    Following is the code snippet to create sample data

    DECLARE @mySampleTable TABLE (

    CaseIDINT,

    StudentIDVARCHAR (20),

    StudentSocialVARCHAR (11),

    StudentNameVARCHAR (50),

    CreatedDateDateTime2

    )

    INSERT @mySampleTable (CaseID, StudentID, StudentSocial, StudentName, CreatedDate)

    VALUES(1001, 'FTS-2134', '111-22-3333', 'Roger Federer', '5/12/2013'),

    (1001, 'FTS-2314', '111-22-3333', 'Roger Federer', '5/08/2013'),

    (1002, 'PTS-5769', '222-33-4444', 'Maria Sharapova', '5/14/2013'),

    (1002, 'PTS-5769', '222-33-4444', 'Maria Sharapova', '5/2/2013'),

    (1003, 'FTS-8764', '333-44-5555', 'Serena Williams', '5/17/2013'),

    (1003, 'FTS-8764', '333-44-5555', 'Serena Williams', '5/06/2013'),

    (1003, 'FTS-7864', '333-44-5555', 'Serena Williams', '4/30/2013'),

    (1004, 'FTS-9132', '444-55-6666', 'Novak Djokovic', '5/10/2013'),

    (1004, 'FTS-9312', '444-55-6666', 'Novak Djokovic', '4/27/2013'),

    (1005, 'PTS-4722', '555-66-7777', 'Novak Djokovic', '5/19/2013'),

    (1005, 'PTS-4722', '555-66-7777', 'Novak Djokovic', '5/12/2013')

    select * from @mySampleTable

    For now, this is what I did...

    SELECTDISTINCT a.CaseID, a.StudentID, b.StudentID, a.StudentSocial, a.StudentName, a.CreatedDate

    FROM@mySampleTable a

    INNER JOIN @mySampleTable b ON a.CaseID = b.CaseID AND a.StudentSocial = b.StudentSocial AND a.StudentID <> b.StudentID

    It doesn't do the sorting, but seems to do the rest. I am not sure if this is the best way to achieve what I am looking for though. My actual production tables have fairly large amount of data. I would appreciate your help and inputs with this.

    Thanks in advance...

    - Rex

  • My solution avoids joining the table to itself by using a couple of window functions in a CTE named "flagged_data." It adds a "StudentID_Problem" column (1 for a problem, 0 for no problem). Because COUNT(DISTINCT) does not work in a window function, I compared MIN(StudentID) to MAX(StudentID) to see if they are a match. If they are not a match, there is a problem because there are at least two StudentID values for the given CaseID.

    Because this flag needs to be a filter, and because window functions are not allowed in the WHERE clause, there has to be an intermediary step in a CTE. AFter the CTE, though, it is a simple SELECT statement ordered by a LastEntryDate (also calculated with a window function in the CTE), CaseID, and CreatedDate.

    with

    mySampleTable (CaseID, StudentID, StudentSocial, StudentName, CreatedDate) as

    (select 1001, 'FTS-2134', '111-22-3333', 'Roger Federer', '5/12/2013' union all

    select 1001, 'FTS-2314', '111-22-3333', 'Roger Federer', '5/08/2013'union all

    select 1002, 'PTS-5769', '222-33-4444', 'Maria Sharapova', '5/14/2013'union all

    select 1002, 'PTS-5769', '222-33-4444', 'Maria Sharapova', '5/2/2013'union all

    select 1003, 'FTS-8764', '333-44-5555', 'Serena Williams', '5/17/2013'union all

    select 1003, 'FTS-8764', '333-44-5555', 'Serena Williams', '5/06/2013'union all

    select 1003, 'FTS-7864', '333-44-5555', 'Serena Williams', '4/30/2013'union all

    select 1004, 'FTS-9132', '444-55-6666', 'Novak Djokovic', '5/10/2013'union all

    select 1004, 'FTS-9312', '444-55-6666', 'Novak Djokovic', '4/27/2013'union all

    select 1005, 'PTS-4722', '555-66-7777', 'Novak Djokovic', '5/19/2013'union all

    select 1005, 'PTS-4722', '555-66-7777', 'Novak Djokovic', '5/12/2013'),

    flagged_data as

    (SELECT

    CaseID,

    StudentID,

    StudentSocial,

    StudentName,

    CreatedDate,

    StudentID_Problem = Case when MIN(StudentID) over (PARTITION by CaseID) <> MAX(StudentID) over (partition by CaseID) then 1 else 0 end,

    LastEntryDate = MAX(CreatedDate) over (partition by CaseID)

    FROM

    mySampleTable)

    select

    CaseID,

    StudentID,

    StudentSocial,

    StudentName,

    CreatedDate

    from

    flagged_data

    where

    StudentID_Problem = 1

    order by

    LastEntryDate desc, CaseID, CreatedDate desc

  • It can be done without using windowed function (so, will have better performance):

    SELECT *

    FROM @mySampleTable s

    WHERE s.CaseID IN (SELECT CaseID

    FROM @mySampleTable

    GROUP BY CaseID

    HAVING MAX(StudentID) != MIN(StudentID))

    _____________________________________________
    "The only true wisdom is in knowing you know nothing"
    "O skol'ko nam otkrytiy chudnyh prevnosit microsofta duh!":-D
    (So many miracle inventions provided by MS to us...)

    How to post your question to get the best and quick help[/url]

  • The next thing you should do is to stop storing SSN in clear text. Get that stuff encrypted!!!!

    _______________________________________________________________

    Need help? Help us help you.

    Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

    Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

    Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
    Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
    Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
    Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

  • Geoff5,

    Thanks a lot for your suggested solution. I appreciate that.

    Eugene Elutin,

    Your solution looks simpler, but it is not sorting the end result the way I want it. Any suggestions on that?

    Sean Lange,

    Thanks for the encryption suggestion. This was just a sample data set I created to simulate my requirement. Actual data does not have any data as confidential as SSN. I definitely appreciate your suggestion and keep that in mind when I deal with any sensitive data.

    - Rex

  • RexHelios (5/21/2013)


    Your solution looks simpler, but it is not sorting the end result the way I want it. Any suggestions on that?

    His query does not have an order by. If you want it to be sorted in some certain you need to add an order by.

    How do you want it sorted?

    _______________________________________________________________

    Need help? Help us help you.

    Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

    Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

    Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
    Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
    Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
    Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

  • Ahh I see how you want it sorted. You could modify Eugene's solution slightly like this.

    with MySort as

    (

    select CaseID,

    ROW_NUMBER() over(order by Max(CreatedDate) desc) as SortOrder

    from @mySampleTable

    group by CaseID

    )

    SELECT *

    FROM @mySampleTable s

    join MySort ms on ms.CaseID = s.CaseID

    WHERE s.CaseID IN (SELECT CaseID

    FROM @mySampleTable

    GROUP BY CaseID

    HAVING MAX(StudentID) != MIN(StudentID))

    order by SortOrder, CreatedDate

    _______________________________________________________________

    Need help? Help us help you.

    Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

    Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

    Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
    Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
    Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
    Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

  • Sean,

    I'm not an expert at execution plans, so I can't speak definitively on which approach is more efficient, but intuitively it would seem to me that referencing the raw data table only once, as my solution does (in the CTE), is going to be a better choice than referencing it three times, as your sorted solution does based on Eugene's "simpler" approach (once in the CTE, once in the main query, and once in the WHERE clause subquery).

    In addition, Eugene cited the window functions as the problem he solved by putting the StudentID comparison into a subquery in the WHERE clause, but your sorting solution puts a window function back into a CTE. So all things considered, do you have reason to believe that your last recommendation is a more efficient query solution than my earlier suggestion?

  • geoff5 (5/21/2013)


    Sean,

    I'm not an expert at execution plans, so I can't speak definitively on which approach is more efficient, but intuitively it would seem to me that referencing the raw data table only once, as my solution does (in the CTE), is going to be a better choice than referencing it three times, as your sorted solution does based on Eugene's "simpler" approach (once in the CTE, once in the main query, and once in the WHERE clause subquery).

    In addition, Eugene cited the window functions as the problem he solved by putting the StudentID comparison into a subquery in the WHERE clause, but your sorting solution puts a window function back into a CTE. So all things considered, do you have reason to believe that your last recommendation is a more efficient query solution than my earlier suggestion?

    Not at all. I was merely posting it as an option. Once you add the sorting that I put in there it is most likely going to be slower.

    _______________________________________________________________

    Need help? Help us help you.

    Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

    Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

    Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
    Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
    Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
    Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

  • Thank you all!

    - Rex

  • Thank you!

    Wt.

  • I believe Sean's solution has an extra unnecessary step.

    If to use the filtering subquery in the JOIN it may provide sorting criteria as well.

    Then CTE query may be excluded:

    SELECT *

    FROM @mySampleTable s

    INNER JOIN (SELECT CaseID, MAX(CreatedDate) LatestUpdate

    FROM @mySampleTable

    GROUP BY CaseID

    HAVING MAX(StudentID) != MIN(StudentID)) fs ON s.CaseID = fs.CaseID

    ORDER BY LatestUpdate DESC, CreatedDate DESC

    _____________
    Code for TallyGenerator

  • And of course, it must say:

    SELECT s.*

    to be precise. 🙂

    _____________
    Code for TallyGenerator

  • geoff5 (5/21/2013)


    Sean,

    I'm not an expert at execution plans, so I can't speak definitively on which approach is more efficient, but intuitively it would seem to me that referencing the raw data table only once, as my solution does (in the CTE), is going to be a better choice than referencing it three times, as your sorted solution does based on Eugene's "simpler" approach (once in the CTE, once in the main query, and once in the WHERE clause subquery).

    In addition, Eugene cited the window functions as the problem he solved by putting the StudentID comparison into a subquery in the WHERE clause, but your sorting solution puts a window function back into a CTE. So all things considered, do you have reason to believe that your last recommendation is a more efficient query solution than my earlier suggestion?

    1. Number of time the same table is mentioned in the query has no direct impact on performance. As SQL optimizer will decide how many times table/index should be scanned/seeked.

    In case of using windowed function, SQL will need to perform lazy table spool operation, which is not the best on in terms of performance. So, even so you have used table only once in your query, SQL will need to scan it once and perform table spool twice (to calculate two different ROW_NUMBERs). For the query which does simple GROUP BY, SQL will need just to scan table twice. And it's only based on the given example. I guess in OP real table CaseId is most likely to be indexed. In this case, my query will only have one table scan and one index seek. Query with ROW_NUMBER will still do the same: one scan, two lazy table spools.

    2. IN or INNER JOIN?

    Until you wanted to sort by CaseId with highest CreatedDate, you would use IN as it would win slightly over INNER JOIN in this case, as we wouldn't need anything to take from the table (actually sub-query) which we JOIN to. Here is a very good article from Gail Shaw about it: http://sqlinthewild.co.za/index.php/2010/01/12/in-vs-inner-join/.

    However, as soon as you need this particular order, you need to get calculated MAX (CreatedDate). Therefore - INNER JOIN.

    So Sergiy's query does exactly that. However I would also enforce order by CaseIs, just in case there are different CaseId's with the same MAX(CreatedDate):

    SELECT s.*

    FROM @mySampleTable s

    INNER JOIN (SELECT CaseID, MAX(CreatedDate) LatestUpdate

    FROM @mySampleTable

    GROUP BY CaseID

    HAVING MAX(StudentID) != MIN(StudentID)) fs ON s.CaseID = fs.CaseID

    ORDER BY LatestUpdate DESC,

    CaseID,

    CreatedDate DESC -- here it's depends how you want to sort records for the same CaseId

    _____________________________________________
    "The only true wisdom is in knowing you know nothing"
    "O skol'ko nam otkrytiy chudnyh prevnosit microsofta duh!":-D
    (So many miracle inventions provided by MS to us...)

    How to post your question to get the best and quick help[/url]

  • Eugene Elutin (5/22/2013)


    geoff5 (5/21/2013)


    However I would also enforce order by CaseIs, just in case there are different CaseId's with the same MAX(CreatedDate)

    Spot on!

    _____________
    Code for TallyGenerator

Viewing 15 posts - 1 through 15 (of 19 total)

You must be logged in to reply to this topic. Login to reply