Compare Dates in an iterative manner for a particular set of records

  • dfarran (11/13/2008)


    It is also possible to do the query with just left joins,

    SELECT a.PersonID, a.Version , a.dedate,

    [is improper] = CASE WHEN MAX(c.version) IS NOT NULL THEN 1

    ELSE 0 END

    FROM personrecord a

    LEFT JOIN personrecord b ON b.personid = a.personid

    LEFT JOIN personrecord c ON c.personid = a.personid

    AND c.version >= b.version

    AND c.dedate < b.dedate

    GROUP BY a.PersonID, a.Version, a.dedate

    with the primary key on the table (personid, version) this performs slightly better than the cross join with both the small(8) and medium(circa .75m) rowcount.

    I would avoid using the version = version -1 because this does not cater for missing versions (try doubling the version numbers in test data)

    Good point about not relying on version = version-1!

  • Peter M. (11/12/2008)


    I'd be more interested in seeing the execution times where the data size is non-trivial. My suspicion would lead me to believe the row_number() solution might scale better based on my own experience using the related dense_rank() function.

    You could even use two dense rank functions that are partitioned identically but have an order by that is based on the version number in one and the date field in the other, then using that to compare when the resulting ranks are not equal. It would require a derived table joined to the original table, but you could get the results you are looking for.

    and thanks, Jeff Moden, for pointing out the use of the clustered index on PersonId, Version.

    I was interested to see how the different solutions would scale, so I expanded the PersonRecord table to 1 million PersonIDs, and expanded the rows of data to 5 records per PersonId, using Jeff's code. And I added the clustered index.

    And I added dfarran's query.

    Here are the results:

    Original Query 153,733 Milliseconds duration

    James Goodwin -

    Left join on inner join 136,703 Milliseconds duration

    Matt C -

    SQL2005 row_number()

    over (partition 184,280 Milliseconds duration

    Adam Haines -

    inner joins and CASE 138,906 Milliseconds duration

    dfarrans -

    just left joins,

    no version-1 245,016 Milliseconds duration

    Granted, these are total execution times, so include I/O.

    Also, I did not alter the queries that used version-1, which is a weakness if the version sequence is not perfect.

  • Am I missing something ?

    select a.PersonID,a.Version, a.DEDate , 1 AS [Is ImProper]

    from PersonRecord a

    inner join PersonRecord b on a.PersonID = b.PersonID and a.Version>b.Version and a.DEDate<b.DEDate

  • AlexT (11/13/2008)


    Am I missing something ?

    select a.PersonID,a.Version, a.DEDate , 1 AS [Is ImProper]

    from PersonRecord a

    inner join PersonRecord b on a.PersonID = b.PersonID and a.Version>b.Version and a.DEDate<b.DEDate

    Yes, you then need to apply this flag to all the records for that PersonId. (So, if a PersonId has 5 rows and 2 of them are out of order, then all 5 rows will have the ImProper = 1, and those PersonIds with no rows out of order will have ImProper = 0.)

  • yes whatever,

    select distinct a.PersonID,a.Version, a.DEDate , case when b.PersonID is null then 0 else 1 end AS [Is ImProper]

    from PersonRecord a

    left join (

    select a.PersonID

    from PersonRecord a

    inner join PersonRecord b on a.PersonID = b.PersonID and a.Version>b.Version and a.DEDate<b.DEDate

    ) b on a.PersonID = b.PersonID

    but initial tast was just get bad data entry dates

  • AlexT (11/13/2008)


    yes whatever,

    select distinct a.PersonID,a.Version, a.DEDate , case when b.PersonID is null then 0 else 1 end AS [Is ImProper]

    from PersonRecord a

    left join (

    select a.PersonID

    from PersonRecord a

    inner join PersonRecord b on a.PersonID = b.PersonID and a.Version>b.Version and a.DEDate<b.DEDate

    ) b on a.PersonID = b.PersonID

    but initial tast was just get bad data entry dates

    Thanks, this achieves the desired results, and it's a straight-forward, easy-to-read solution.

    When I tested it on my 5 million row table, total execution time was 164580 Milliseconds duration

    Edited: Although I think I would put the "select distinct" in the subselect, for cases where one PersonId has multiple rows that are out of order.

  • that supposed to be faster, if clustered index applied on (PersonID, Version )

    select distinct a.PersonID,a.Version, a.DEDate , case when b.PersonID is null then 0 else 1 end AS [Is ImProper]

    from PersonRecord a

    left join (

    select a.PersonID, a.Version

    from PersonRecord a

    inner join PersonRecord b on a.PersonID = b.PersonID and a.Version>b.Version and a.DEDate<b.DEDate

    ) b on a.PersonID = b.PersonID and a.Version=b.Version

    thanks

  • AlexT (11/13/2008)


    that supposed to be faster, if clustered index applied on (PersonID, Version )

    select distinct a.PersonID,a.Version, a.DEDate , case when b.PersonID is null then 0 else 1 end AS [Is ImProper]

    from PersonRecord a

    left join (

    select a.PersonID, a.Version

    from PersonRecord a

    inner join PersonRecord b on a.PersonID = b.PersonID and a.Version>b.Version and a.DEDate<b.DEDate

    ) b on a.PersonID = b.PersonID and a.Version=b.Version

    thanks

    Good point about the fact that Version is part of the clustered index.

    Unfortunately, incorporating version into the subselct breaks the solution. It no longer returns the desired results.

  • yes sorry, my mistake. if you don't mind you can try this

    select a.PersonID, a.Version, a.DEDate,

    case when b.PersonID is null then 0 else 1 end AS [Is ImProper]

    from PersonRecord a

    left join (

    select distinct a.PersonID

    from PersonRecord a

    inner join PersonRecord b on a.PersonID = b.PersonID and a.Version>b.Version and a.DEDate<b.DEDate

    ) b on a.PersonID = b.PersonID

    thanks

  • AlexT (11/13/2008)


    yes sorry, my mistake. if you don't mind you can try this

    select a.PersonID, a.Version, a.DEDate,

    case when b.PersonID is null then 0 else 1 end AS [Is ImProper]

    from PersonRecord a

    left join (

    select distinct a.PersonID

    from PersonRecord a

    inner join PersonRecord b on a.PersonID = b.PersonID and a.Version>b.Version and a.DEDate<b.DEDate

    ) b on a.PersonID = b.PersonID

    This query does produce the desired effect, and yes, moving the distinct inside the subselect does result in a significant performance lift. Total execution time was 144470 Milliseconds! (vs. previous run time of was 164580 Milliseconds)

  • that means that you have a lot "bad" de dates in table. in reality difference probably will be more significant

  • No, actually, only 10,000 PersonIds in my test table of 5 million rows (1 million unique PersonIDs) have records that are out of order; that's about 1% of the ids. That also explains the nice lift, because the subselect has far fewer rows for the distinct to operate on.

    Note: the subselect finds roughly 20,000 rows, distinct return 10,000 rows.

Viewing 12 posts - 31 through 41 (of 41 total)

You must be logged in to reply to this topic. Login to reply