Need help with SQL to get duplicate rows and still show unique fields in the rows and filter by NULL field when one or both are NULL

  • I have an issue where rows were duplicated by a certain workflow and I need to identify the rows that were duplicated. I need to know the rows that have the FIN and MRN duplicated and I can get those easy enough, but I also want to see the FirstName, LastName and WhenDischarged fields that may not be unique for the rows with the duplicate FIN(Account Number) and MRN(Medical Record Number).

    This script gets me all of the data that I need, but I only want to see the WhenDischarged field when one or both entries are NULL. I do not want to see the duplicated rows when the WhenDischarged date for both entries is populated with data.

    select WhenDischarged, FirstName, LastName, AccountNumber, MedicalRecordNumber

    from

    [Patient] p

    where

    1 < (select count(*) from [Patient] i where i.AccountNumber = p.AccountNumber and i.MedicalRecordNumber = p.MedicalRecordNumber)

    order by AccountNumber

    This script works a little better, but it only shows me when one of the WhenDischarged fields is NULL for the duplicated records and not when both WhenDischarged fields are NULL.

    select p.MedicalRecordNumber, p.AccountNumber, p.FirstName, p.LastName, p.WhenDischarged

    from Patient p

    inner join

    (

    select MedicalRecordNumber

    from Patient

    group by MedicalRecordNumber

    having sum(case when WhenDischarged is not null then 1 else 0 end) = 1

    and sum(case when WhenDischarged is null then 1 else 0 end) = 1

    ) d

    on p.MedicalRecordNumber = d.MedicalRecordNumber

    where

    1 < (select count(*) from [Patient] i where i.AccountNumber = p.AccountNumber and i.MedicalRecordNumber = p.MedicalRecordNumber)

    order by AccountNumber

    In Summary I want rows where the FIN and MRN are exact matches (There are only two rows for each patient with duplicate data). I want FirstName and LastName even if they do not match. I want WhenDischarged when both are NULL, when one is NULL, but not when both are not NULL

  • Have a play with ROW_NUMBER(). This should be close:

    SELECT MedicalRecordNumber, AccountNumber, FirstName, LastName, WhenDischarged

    FROM (

    SELECT p.MedicalRecordNumber, p.AccountNumber, p.FirstName, p.LastName, p.WhenDischarged,

    rn = ROW_NUMBER() OVER(PARTITION BY p.MedicalRecordNumber, p.AccountNumber ORDER BY p.WhenDischarged)

    FROM Patient p

    ) d

    WHERE d.rn = 1

    AND d.WhenDischarged IS NULL

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • Hi Chris,

    Thanks for the suggestion, but I cannot get that to show what I need.

    It is Pulling

    MedicalRecordNumberAccountNumberFirstNameLastNameWhenDischarged

    12345 0012345 John Smith NULL

    12346 0012346 Jane Doe NULL

    12347 0012347 John Doe NULL

    12348 0012348 Jane Smith NULL

    What I need it to do is display both records with the duplicate MRN/FIN and then display First and Last Names whether duplicates or unique and then WhenDischarged when 1 is NULL or both are NULL, but not when both have a Date populated.

    MedicalRecordNumberAccountNumberFirstNameLastNameWhenDischarged

    12345 0012345 John Smith NULL

    12345 0012345 John Smithe NULL

    12346 0012346 Jane Doe NULL

    12346 0012346 Jane Doe 2016-01-01 23:59:00.000

    12347 0012347 John Doe 2016-01-02 23:59:00.000

    12347 0012347 John Doe NULL

    12348 0012348 Jane Smith NULL

    12348 0012348 Janey Smith NULL

    I would not want to see these rows when both duplicate records have the WhenDischarged field populated.

    MedicalRecordNumberAccountNumberFirstNameLastNameWhenDischarged

    12345 0012345 John Smith 2015-12-23 14:23:00.000

    12345 0012345 John Smithe 2015-12-28 13:56:00.000

    12346 0012346 Jane Doe 2016-01-03 13:56:00.000

    12346 0012346 Jane Doe 2016-01-01 23:59:00.000

  • Sorry for the poor format of the results. I cannot figure out how to get this site to use the spacing I typed out.

  • Experiment with the result set. Take out the rn=1 filter so you can see the full set.

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • wolfgheist (1/14/2016)


    Hi Chris,

    Thanks for the suggestion, but I cannot get that to show what I need.

    It is Pulling

    MedicalRecordNumberAccountNumberFirstNameLastNameWhenDischarged

    12345 0012345 John Smith NULL

    12346 0012346 Jane Doe NULL

    12347 0012347 John Doe NULL

    12348 0012348 Jane Smith NULL

    What I need it to do is display both records with the duplicate MRN/FIN and then display First and Last Names whether duplicates or unique and then WhenDischarged when 1 is NULL or both are NULL, but not when both have a Date populated.

    MedicalRecordNumberAccountNumberFirstNameLastNameWhenDischarged

    12345 0012345 John Smith NULL

    12345 0012345 John Smithe NULL

    12346 0012346 Jane Doe NULL

    12346 0012346 Jane Doe 2016-01-01 23:59:00.000

    12347 0012347 John Doe 2016-01-02 23:59:00.000

    12347 0012347 John Doe NULL

    12348 0012348 Jane Smith NULL

    12348 0012348 Janey Smith NULL

    I would not want to see these rows when both duplicate records have the WhenDischarged field populated.

    MedicalRecordNumberAccountNumberFirstNameLastNameWhenDischarged

    12345 0012345 John Smith 2015-12-23 14:23:00.000

    12345 0012345 John Smithe 2015-12-28 13:56:00.000

    12346 0012346 Jane Doe 2016-01-03 13:56:00.000

    12346 0012346 Jane Doe 2016-01-01 23:59:00.000

    Try this query, which sets up some sample data so we can both see the same thing. Expand the data set if you need to, or change values within it.

    ;WITH SampleData (MedicalRecordNumber, AccountNumber, FirstName, LastName, WhenDischarged) AS (

    SELECT 12345, '0012345','John','Smith', CAST(NULL AS date) UNION ALL

    SELECT 12345, '0012345','John','Smithe', NULL UNION ALL

    SELECT 12346, '0012346','Jane','Doe', NULL UNION ALL

    SELECT 12346, '0012346','Jane','Doe', '2016-01-01 23:59:00.000' UNION ALL

    SELECT 12347, '0012347','John','Doe', '2016-01-02 23:59:00.000' UNION ALL

    SELECT 12347, '0012347','John','Doe',NULL UNION ALL

    SELECT 12348, '0012348','Jane','Smith', '2016-01-01 23:59:00.000' UNION ALL

    SELECT 12348, '0012348','Janey','Smith', '2016-01-01 23:59:00.000')

    SELECT

    MedicalRecordNumber, AccountNumber, FirstName, LastName, WhenDischarged,

    rn, PickMe = CASE WHEN d.rn = 1 AND d.WhenDischarged IS NULL THEN 'YES' ELSE 'NO' END

    FROM (

    SELECT *,

    rn = ROW_NUMBER() OVER(PARTITION BY p.MedicalRecordNumber, p.AccountNumber ORDER BY p.WhenDischarged)

    FROM SampleData p

    ) d

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • I am playing around with it, but it keeps pulling too much data. It is pulling 20k+ rows, when my first script above pulls 4302 rows that contains all of the data including when the discharge data is populated for both records which I do not want and the second pulls 1302 rows, but is missing the Whendischarged when one is Null and one has a value.

  • How are you creating the little window with correct spacing format? Here are the results that I am seeing from your last query.

    MRN FIN First Last Discharge rn PickMe

    123450012345JohnSmitheNULL1YES

    123450012345JohnSmithNULL2NO

    123460012346JaneDoeNULL1YES

    123460012346JaneDoe2016-01-012NO

    123470012347JohnDoeNULL1YES

    123470012347JohnDoe2016-01-022NO

    123480012348JaneySmith2016-01-011NO

    123480012348JaneSmith2016-01-012NO

  • wolfgheist (1/14/2016)


    I am playing around with it, but it keeps pulling too much data. It is pulling 20k+ rows, when my first script above pulls 4302 rows that contains all of the data including when the discharge data is populated for both records which I do not want and the second pulls 1302 rows, but is missing the Whendischarged when one is Null and one has a value.

    Bear in mind that your script doesn't work correctly (or you wouldn't be here).

    wolfgheist (1/14/2016)


    In Summary I want rows where the FIN and MRN are exact matches (There are only two rows for each patient with duplicate data). I want FirstName and LastName even if they do not match. I want WhenDischarged when both are NULL, when one is NULL, but not when both are not NULL

    Since the last script I posted does exactly this (without eliminating the rows from the output), can you add rows to the sample data section to show how the script isn't working?

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • ROW_NUMBER() is overused. Here the OP needs a single value from one record to be available across the entire partition, namely whether a NULL exists on any record in the partition. Since ROW_NUMBER() assigns a different value to every single record in the partition, it cannot fulfill this need. Here, I've used MAX(), precisely because it returns the same value for every record within a partition.

    I've also included COUNT() in case the OP wants to filter only those records where a duplicate exists.

    ;WITH SampleData (MedicalRecordNumber, AccountNumber, FirstName, LastName, WhenDischarged) AS (

    SELECT 12345, '0012345','John','Smith', CAST(NULL AS date) UNION ALL

    SELECT 12345, '0012345','John','Smithe', NULL UNION ALL

    SELECT 12346, '0012346','Jane','Doe', NULL UNION ALL

    SELECT 12346, '0012346','Jane','Doe', '2016-01-01 23:59:00.000' UNION ALL

    SELECT 12347, '0012347','John','Doe', '2016-01-02 23:59:00.000' UNION ALL

    SELECT 12347, '0012347','John','Doe',NULL UNION ALL

    SELECT 12348, '0012348','Jane','Smith', '2016-01-01 23:59:00.000' UNION ALL

    SELECT 12348, '0012348','Janey','Smith', '2016-01-01 23:59:00.000')

    SELECT

    MedicalRecordNumber, AccountNumber, FirstName, LastName, WhenDischarged, d.cnt

    FROM (

    SELECT *,

    last_discharge = MAX(ISNULL(p.WhenDischarged, '9999-12-31')) OVER(PARTITION BY p.MedicalRecordNumber, p.AccountNumber),

    cnt = COUNT(*) OVER(PARTITION BY p.MedicalRecordNumber, p.AccountNumber)

    FROM SampleData p

    ) d

    WHERE d.last_discharge = '9999-12-31'

    ORDER BY d.MedicalRecordNumber, d.AccountNumber, d.WhenDischarged

    Drew

    J. Drew Allen
    Business Intelligence Analyst
    Philadelphia, PA

  • wolfgheist (1/14/2016)


    I am playing around with it, but it keeps pulling too much data. It is pulling 20k+ rows, when my first script above pulls 4302 rows that contains all of the data including when the discharge data is populated for both records which I do not want and the second pulls 1302 rows, but is missing the Whendischarged when one is Null and one has a value.

    Use the IFCode shortcuts menu, just to the left of the window you type into.

    This is the resultset in code = "plain" and font = "Courier New"

    [font="Courier New"]MRN FIN First Last Discharge rn PickMe

    12345 0012345 John Smithe NULL 1 YES

    12345 0012345 John Smith NULL 2 NO

    12346 0012346 Jane Doe NULL 1 YES

    12346 0012346 Jane Doe 2016-01-01 2 NO

    12347 0012347 John Doe NULL 1 YES

    12347 0012347 John Doe 2016-01-02 2 NO

    12348 0012348 Janey Smith 2016-01-01 1 NO

    12348 0012348 Jane Smith 2016-01-01 2 NO [/font]

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • ChrisM@Work (1/14/2016)


    wolfgheist (1/14/2016)


    I am playing around with it, but it keeps pulling too much data. It is pulling 20k+ rows, when my first script above pulls 4302 rows that contains all of the data including when the discharge data is populated for both records which I do not want and the second pulls 1302 rows, but is missing the Whendischarged when one is Null and one has a value.

    Bear in mind that your script doesn't work correctly (or you wouldn't be here).

    I fully agree 🙂

    wolfgheist (1/14/2016)


    In Summary I want rows where the FIN and MRN are exact matches (There are only two rows for each patient with duplicate data). I want FirstName and LastName even if they do not match. I want WhenDischarged when both are NULL, when one is NULL, but not when both are not NULL

    Since the last script I posted does exactly this (without eliminating the rows from the output), can you add rows to the sample data section to show how the script isn't working?

    Well it displayed some things that do not quite match up with what I am looking for. I have spaced them in sets of what I am looking for and what I do not want.

    123450012345JohnSmitheNULL1YES - I want this row

    123450012345JohnSmithNULL2NO - I want this row

    123460012346JaneDoeNULL1YES - I want this row

    123460012346JaneDoe2016-01-012NO - I want this row

    123470012347JohnDoeNULL1YES - I want this row

    123470012347JohnDoe2016-01-022NO - I want this row

    123480012348JaneySmith2016-01-011NO - I do not want this row

    123480012348JaneSmith2016-01-012NO - I do not want this row

  • drew.allen (1/14/2016)


    ROW_NUMBER() is overused. Here the OP needs a single value from one record to be available across the entire partition, namely whether a NULL exists on any record in the partition. Since ROW_NUMBER() assigns a different value to every single record in the partition, it cannot fulfill this need. Here, I've used MAX(), precisely because it returns the same value for every record within a partition.

    I've also included COUNT() in case the OP wants to filter only those records where a duplicate exists.

    ;WITH SampleData (MedicalRecordNumber, AccountNumber, FirstName, LastName, WhenDischarged) AS (

    SELECT 12345, '0012345','John','Smith', CAST(NULL AS date) UNION ALL

    SELECT 12345, '0012345','John','Smithe', NULL UNION ALL

    SELECT 12346, '0012346','Jane','Doe', NULL UNION ALL

    SELECT 12346, '0012346','Jane','Doe', '2016-01-01 23:59:00.000' UNION ALL

    SELECT 12347, '0012347','John','Doe', '2016-01-02 23:59:00.000' UNION ALL

    SELECT 12347, '0012347','John','Doe',NULL UNION ALL

    SELECT 12348, '0012348','Jane','Smith', '2016-01-01 23:59:00.000' UNION ALL

    SELECT 12348, '0012348','Janey','Smith', '2016-01-01 23:59:00.000')

    SELECT

    MedicalRecordNumber, AccountNumber, FirstName, LastName, WhenDischarged, d.cnt

    FROM (

    SELECT *,

    last_discharge = MAX(ISNULL(p.WhenDischarged, '9999-12-31')) OVER(PARTITION BY p.MedicalRecordNumber, p.AccountNumber),

    cnt = COUNT(*) OVER(PARTITION BY p.MedicalRecordNumber, p.AccountNumber)

    FROM SampleData p

    ) d

    WHERE d.last_discharge = '9999-12-31'

    ORDER BY d.MedicalRecordNumber, d.AccountNumber, d.WhenDischarged

    Drew

    From the sample set this pulls what I am looking for.

    123450012345JohnSmitheNULL2

    123450012345JohnSmithNULL2

    123460012346JaneDoeNULL2

    123460012346JaneDoe2016-01-012

    123470012347JohnDoeNULL2

    123470012347JohnDoe2016-01-022

  • wolfgheist (1/14/2016)


    ChrisM@Work (1/14/2016)


    wolfgheist (1/14/2016)


    I am playing around with it, but it keeps pulling too much data. It is pulling 20k+ rows, when my first script above pulls 4302 rows that contains all of the data including when the discharge data is populated for both records which I do not want and the second pulls 1302 rows, but is missing the Whendischarged when one is Null and one has a value.

    Bear in mind that your script doesn't work correctly (or you wouldn't be here).

    I fully agree 🙂

    wolfgheist (1/14/2016)


    In Summary I want rows where the FIN and MRN are exact matches (There are only two rows for each patient with duplicate data). I want FirstName and LastName even if they do not match. I want WhenDischarged when both are NULL, when one is NULL, but not when both are not NULL

    Since the last script I posted does exactly this (without eliminating the rows from the output), can you add rows to the sample data section to show how the script isn't working?

    Well it displayed some things that do not quite match up with what I am looking for. I have spaced them in sets of what I am looking for and what I do not want.

    123450012345JohnSmitheNULL1YES - I want this row

    123450012345JohnSmithNULL2NO - I want this row

    123460012346JaneDoeNULL1YES - I want this row

    123460012346JaneDoe2016-01-012NO - I want this row

    123470012347JohnDoeNULL1YES - I want this row

    123470012347JohnDoe2016-01-022NO - I want this row

    123480012348JaneySmith2016-01-011NO - I do not want this row

    123480012348JaneSmith2016-01-012NO - I do not want this row

    So you want ALL row pairs, except where WhenDischarged for both rows of a pair is not null?

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • This script appears to be doing exactly what I want. Thanks both of you guys. You are awesome!!! 🙂

    [Code="sql']SELECT

    MedicalRecordNumber, AccountNumber, FirstName, LastName, WhenDischarged, d.cnt

    FROM (

    SELECT *,

    last_discharge = MAX(ISNULL(p.WhenDischarged, '9999-12-31')) OVER(PARTITION BY p.MedicalRecordNumber, p.AccountNumber),

    cnt = COUNT(*) OVER(PARTITION BY p.MedicalRecordNumber, p.AccountNumber)

    FROM Patient p

    ) d

    WHERE d.last_discharge = '9999-12-31' and cnt = 2

    ORDER BY d.MedicalRecordNumber, d.AccountNumber, d.WhenDischarged[/code]

Viewing 15 posts - 1 through 15 (of 16 total)

You must be logged in to reply to this topic. Login to reply