RE: List out rows from a duplicate record set that has one or more odd values

SSC Guru

Points: 59322

May 22, 2013 at 3:29 am

geoff5 (5/21/2013)
Sean,
I'm not an expert at execution plans, so I can't speak definitively on which approach is more efficient, but intuitively it would seem to me that referencing the raw data table only once, as my solution does (in the CTE), is going to be a better choice than referencing it three times, as your sorted solution does based on Eugene's "simpler" approach (once in the CTE, once in the main query, and once in the WHERE clause subquery).
In addition, Eugene cited the window functions as the problem he solved by putting the StudentID comparison into a subquery in the WHERE clause, but your sorting solution puts a window function back into a CTE. So all things considered, do you have reason to believe that your last recommendation is a more efficient query solution than my earlier suggestion?

1. Number of time the same table is mentioned in the query has no direct impact on performance. As SQL optimizer will decide how many times table/index should be scanned/seeked.

In case of using windowed function, SQL will need to perform lazy table spool operation, which is not the best on in terms of performance. So, even so you have used table only once in your query, SQL will need to scan it once and perform table spool twice (to calculate two different ROW_NUMBERs). For the query which does simple GROUP BY, SQL will need just to scan table twice. And it's only based on the given example. I guess in OP real table CaseId is most likely to be indexed. In this case, my query will only have one table scan and one index seek. Query with ROW_NUMBER will still do the same: one scan, two lazy table spools.

2. IN or INNER JOIN?

Until you wanted to sort by CaseId with highest CreatedDate, you would use IN as it would win slightly over INNER JOIN in this case, as we wouldn't need anything to take from the table (actually sub-query) which we JOIN to. Here is a very good article from Gail Shaw about it: http://sqlinthewild.co.za/index.php/2010/01/12/in-vs-inner-join/.

However, as soon as you need this particular order, you need to get calculated MAX (CreatedDate). Therefore - INNER JOIN.

So Sergiy's query does exactly that. However I would also enforce order by CaseIs, just in case there are different CaseId's with the same MAX(CreatedDate):

SELECT s.*

FROM @mySampleTable s

INNER JOIN (SELECT CaseID, MAX(CreatedDate) LatestUpdate

FROM @mySampleTable

GROUP BY CaseID

HAVING MAX(StudentID) != MIN(StudentID)) fs ON s.CaseID = fs.CaseID

ORDER BY LatestUpdate DESC,

CaseID,

CreatedDate DESC -- here it's depends how you want to sort records for the same CaseId

_____________________________________________
"The only true wisdom is in knowing you know nothing"
"O skol'ko nam otkrytiy chudnyh prevnosit microsofta duh!":-D
(So many miracle inventions provided by MS to us...)

How to post your question to get the best and quick help[/url]