Self-Joined to Find if Archived Versions of Record Exist w/o Duplicate Records

  • I have a table keeping client notes that can be updated and the previous note archived using an archived field. When a note is updated, the archived field is populated with the date on the old versions of the note.  This allows us to see the current version (the one where archived is null) and also to review past versions (where archived is not null).  For the UI, I want to show an icon for notes that have archived versions so they can access them. I know that the notes belong together because they have the same noteGroupID.  Here is the simplified query:

    SELECT
    cn.noteID,
    cn.noteGroupID,
    cn.notes,
    IIF(cn2.noteID IS NOT NULL, 1, 0) AS olderVersionsAvailable
    FROM
    dbo.clientNotes AS cn
    LEFT JOIN
    dbo.clientNotes AS cn2
    ON cn2.noteGroupID = cn.noteGroupID
    AND cn2.archived IS NOT NULL
    WHERE
    cn.archived IS NULL

    The problem is, with this code, I get multiple records if a there are multiple archived versions of the note. I could do a group to fix it, but that seems wrong for lots of reasons.  How do I include BOTH olderVersionsAvailable = 1 and olderVersionsAvailable = 0 without duplicating records when there are multiple archived notes?

    I am sure there must be a way to do this, but trusty google searches have not turned one up.

    Any help please?

    Sincerely,

    Brandon

  • Without having sample data, it is tricky to say for certain, but as a thought - would a DISTINCT on your SELECT help?

    If not, could you provide some DDL and sample data showing what is happening and what you expect to happen?

  • I would do this with an EXISTS check in the SELECT list:

     

    SELECT
    cn.noteID,
    cn.noteGroupID,
    cn.notes,
    IIF(EXISTS (SELECT *
    FROM clientNotes cn2
    WHERE cn2.noteGroupID = cn.noteGroupID
    AND cn2.archived IS NOT NULL), 1, 0) AS olderVersionsAvailable
    FROM
    dbo.clientNotes AS cn
    WHERE
    cn.archived IS NULL a

    [font="Times New Roman"]Erland Sommarskog, SQL Server MVP, www.sommarskog.se[/font]

  • Thank you both. Both work fine - but is distinct better than group in performance? Also, I thought running a query in the select portion caused multiple queries for each result - is this wrong?  That would obviously be a terrible cost if that was the case.  Thank you for taking the time to answer.

  • In fact, it looks like DISTINCT has a much higher cost than GROUP BY generally. https://sqlperformance.com/2017/01/t-sql-queries/surprises-assumptions-group-by-distinct

    I am skeptical that putting a query in an iif in the select clause is a great performer - though I am not that experienced with that structure. This table will have millions of records but only has 11 right now (so my testing is not very realistic).

    Thanks again!

    Brandon

  • Also, I thought running a query in the select portion caused multiple queries for each result - is this wrong?

    Keep in mind that SQL is a declarative language. You tell what result you want. The optimizer figures out the best way to compute the query. I can't promise that your query will fly with your tables and data, since I don't them. But assuming that there are suitable indexes, I would not be too worried.

    As for GROUP BY vs. DISTINCT, there is no aggregation here, so GROUP BY would only be confusing. But the DISTINCT would be wrong in the first place. The key is that you only care about one hit with the condition archived IS NOT NULL, so why then ask for all?

    [font="Times New Roman"]Erland Sommarskog, SQL Server MVP, www.sommarskog.se[/font]

  • Well, distinct does work - but you are right - hopefully SQL can figure out silliness and make it run like the wind 🙂

  • My opinion with regards to performance - test it out!

    In theory, one method may be faster than the other.  In practice, this can surprise you.  Sometimes the statistics and indexes and optimizer tricks can make the query you THINK will be slower actually perform better.

    If you are ever concerned about the performance, I say test it and see which one performs better with your data.

    Now, one thing about the distinct approach is that the distinct is applied after getting all of the data.  So with large data sets, it is not likely to be the fastest approach.  But if you are looking to get the absolute fastest result, there are tricks that can be done to handle this.  You could keep that bit in the table for the alternate available rather than calculating it.  Then the stored procedure that is used to insert/update data in the table could check if the value is new or if it would result in an archived version being made and flip the bit that way.  Slight performance hit to your INSERT/UPDATE stored procedure with the benefit of having your SELECT not need a distinct or nested SELECT.  Mind you changing table structure when there are already a lot of rows in the table would require you to update existing data and may break any SSIS, SSRS, or automated processes you have that rely on it.

    One thing I do like about the distinct approach vs a nested select is that I find it easier to read.  That bit is just my opinion though.  Some people may find nested selects easier to read than potentially missing the DISTINCT keyword and being confused why some results are filtered out.

Viewing 8 posts - 1 through 8 (of 8 total)

You must be logged in to reply to this topic. Login to reply