Self-Joined to Find if Archived Versions of Record Exist w/o Duplicate Records

Question

Post reply

Self-Joined to Find if Archived Versions of Record Exist w/o Duplicate Records

brandon 94807

SSC Veteran

Points: 280
More actions
May 6, 2021 at 6:57 pm
Go to Answer
#3880249
I have a table keeping client notes that can be updated and the previous note archived using an archived field. When a note is updated, the archived field is populated with the date on the old versions of the note. This allows us to see the current version (the one where archived is null) and also to review past versions (where archived is not null). For the UI, I want to show an icon for notes that have archived versions so they can access them. I know that the notes belong together because they have the same noteGroupID. Here is the simplified query:
```
SELECT
        cn.noteID,
        cn.noteGroupID,
        cn.notes,
        IIF(cn2.noteID IS NOT NULL, 1, 0) AS olderVersionsAvailable
FROM
        dbo.clientNotes AS cn
    LEFT JOIN
        dbo.clientNotes AS cn2
            ON cn2.noteGroupID = cn.noteGroupID
               AND cn2.archived IS NOT NULL
WHERE
        cn.archived IS NULL
```
The problem is, with this code, I get multiple records if a there are multiple archived versions of the note. I could do a group to fix it, but that seems wrong for lots of reasons. How do I include BOTH olderVersionsAvailable = 1 and olderVersionsAvailable = 0 without duplicating records when there are multiple archived notes?
I am sure there must be a way to do this, but trusty google searches have not turned one up.
Any help please?
Sincerely,
Brandon

Erland Sommarskog

SSC-Insane

Points: 24038

May 6, 2021 at 7:59 pm

Answer

#3880260

I would do this with an EXISTS check in the SELECT list:

SELECT
        cn.noteID,
        cn.noteGroupID,
        cn.notes,
        IIF(EXISTS (SELECT *
                    FROM  clientNotes cn2
                    WHERE cn2.noteGroupID = cn.noteGroupID
                      AND cn2.archived IS NOT NULL), 1, 0) AS olderVersionsAvailable
FROM
        dbo.clientNotes AS cn
WHERE
        cn.archived IS NULL    a

[font="Times New Roman"]Erland Sommarskog, SQL Server MVP, www.sommarskog.se[/font]

Viewing 8 posts - 1 through 8 (of 8 total)

You must be logged in to reply to this topic. Login to reply

Mr. Brian Gale One Orange Chip Points: 25026 More actions · Answer 1

Without having sample data, it is tricky to say for certain, but as a thought - would a DISTINCT on your SELECT help?

If not, could you provide some DDL and sample data showing what is happening and what you expect to happen?

The above is all just my opinion on what you should do.
As with all advice you find on a random internet forum - you shouldn't blindly follow it. Always test on a test server to see if there is negative side effects before making changes to live!
I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.

brandon 94807 SSC Veteran Points: 280 More actions · Answer 2

Thank you both. Both work fine - but is distinct better than group in performance? Also, I thought running a query in the select portion caused multiple queries for each result - is this wrong? That would obviously be a terrible cost if that was the case. Thank you for taking the time to answer.

brandon 94807 SSC Veteran Points: 280 More actions · Answer 3

In fact, it looks like DISTINCT has a much higher cost than GROUP BY generally. https://sqlperformance.com/2017/01/t-sql-queries/surprises-assumptions-group-by-distinct

I am skeptical that putting a query in an iif in the select clause is a great performer - though I am not that experienced with that structure. This table will have millions of records but only has 11 right now (so my testing is not very realistic).

Thanks again!

Brandon

Erland Sommarskog SSC-Insane Points: 24038 More actions · Answer 4

Also, I thought running a query in the select portion caused multiple queries for each result - is this wrong?

Keep in mind that SQL is a declarative language. You tell what result you want. The optimizer figures out the best way to compute the query. I can't promise that your query will fly with your tables and data, since I don't them. But assuming that there are suitable indexes, I would not be too worried.

As for GROUP BY vs. DISTINCT, there is no aggregation here, so GROUP BY would only be confusing. But the DISTINCT would be wrong in the first place. The key is that you only care about one hit with the condition archived IS NOT NULL, so why then ask for all?

[font="Times New Roman"]Erland Sommarskog, SQL Server MVP, www.sommarskog.se[/font]

brandon 94807 SSC Veteran Points: 280 More actions · Answer 5

Well, distinct does work - but you are right - hopefully SQL can figure out silliness and make it run like the wind 🙂

Mr. Brian Gale One Orange Chip Points: 25026 More actions · Answer 6

My opinion with regards to performance - test it out!

In theory, one method may be faster than the other. In practice, this can surprise you. Sometimes the statistics and indexes and optimizer tricks can make the query you THINK will be slower actually perform better.

If you are ever concerned about the performance, I say test it and see which one performs better with your data.

Now, one thing about the distinct approach is that the distinct is applied after getting all of the data. So with large data sets, it is not likely to be the fastest approach. But if you are looking to get the absolute fastest result, there are tricks that can be done to handle this. You could keep that bit in the table for the alternate available rather than calculating it. Then the stored procedure that is used to insert/update data in the table could check if the value is new or if it would result in an archived version being made and flip the bit that way. Slight performance hit to your INSERT/UPDATE stored procedure with the benefit of having your SELECT not need a distinct or nested SELECT. Mind you changing table structure when there are already a lot of rows in the table would require you to update existing data and may break any SSIS, SSRS, or automated processes you have that rely on it.

One thing I do like about the distinct approach vs a nested select is that I find it easier to read. That bit is just my opinion though. Some people may find nested selects easier to read than potentially missing the DISTINCT keyword and being confused why some results are filtered out.

The above is all just my opinion on what you should do.
As with all advice you find on a random internet forum - you shouldn't blindly follow it. Always test on a test server to see if there is negative side effects before making changes to live!
I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.