May 14, 2007 at 8:53 am
Hey folks, I'm looking at making the following query more efficient potentially using the ranking functions and I'd like some advice from the gurus. The purpose of the following is to have a status for a person, and also have a historical background as to what they've done, status wise. This was the best way I could come up with to do this a few years back, but I'm wondering if there's a better way with Sql 2005.
Here's a toned down structure and my query. Any help/critique would be greatly appreciated.
CREATE TABLE #Status(
StatusID int NOT NULL,
StatusName VARCHAR(50) NOT NULL,
StatusCategoryID int NOT NULL
) ON [PRIMARY]
CREATE TABLE #RegStatus(
[RegistrationID] [uniqueidentifier] NOT NULL,
[StatusID] [int] NOT NULL,
[StatusTimeStamp] [datetime] NOT NULL,
[UniqueRowID] [int] IDENTITY(1,1) NOT NULL
) ON [PRIMARY]
SET NOCOUNT on
INSERT INTO #Status VALUES(200, 'StatusA', 1)
INSERT INTO #Status VALUES(210, 'StatusB', 1)
INSERT INTO #Status VALUES(115, 'StatusC', 1)
INSERT INTO #Status VALUES(112, 'StatusD', 1)
INSERT INTO #Status VALUES(314, 'StatusE', 1)
INSERT INTO #Status VALUES(15, 'StatusF', 1)
INSERT INTO #Status VALUES(22, 'StatusG', 1)
INSERT INTO #Status VALUES(300, 'StatusX', 2)
INSERT INTO #Status VALUES(310, 'StatusY', 2)
INSERT INTO #Status VALUES(320, 'StatusZ', 2)
INSERT INTO #RegStatus VALUES('7A6058D0-06CB-4E83-A8C4-B1AFC74B11F0', 200, GETDATE())
INSERT INTO #RegStatus VALUES('7A6058D0-06CB-4E83-A8C4-B1AFC74B11F0', 210, GETDATE())
INSERT INTO #RegStatus VALUES('7A6058D0-06CB-4E83-A8C4-B1AFC74B11F0', 115, GETDATE())
INSERT INTO #RegStatus VALUES('8B94A666-A3DD-4CB1-89A0-9910047AE7A0', 112, GETDATE())
INSERT INTO #RegStatus VALUES('8B94A666-A3DD-4CB1-89A0-9910047AE7A0', 314, GETDATE())
INSERT INTO #RegStatus VALUES('8B94A666-A3DD-4CB1-89A0-9910047AE7A0', 200, GETDATE())
INSERT INTO #RegStatus VALUES('8B94A666-A3DD-4CB1-89A0-9910047AE7A0', 22, GETDATE())
INSERT INTO #RegStatus VALUES('58AF0FC6-C900-4BD0-B3F7-F9D62701F021', 15, GETDATE())
INSERT INTO #RegStatus VALUES('58AF0FC6-C900-4BD0-B3F7-F9D62701F021', 115, GETDATE())
INSERT INTO #RegStatus VALUES('58AF0FC6-C900-4BD0-B3F7-F9D62701F021', 200, GETDATE())
INSERT INTO #RegStatus VALUES('58AF0FC6-C900-4BD0-B3F7-F9D62701F021', 115, GETDATE())
SET NOCOUNT Off
/*
This is a query from within a function that I use to not only get the latest
status for one registrant, but I can use it to get the latest status for everyone as well.
*/
DECLARE @RegStatusCatID int,
@RegID UNIQUEIDENTIFIER
SET @RegStatusCatID = 1
SET @RegID = null
select LS.*, S.StatusName, S.StatusCategoryID
from #Status S
join(
select RS.RegistrationID, RS.StatusID, RS.StatusTimeStamp
from #RegStatus RS
join
(
SELECT RS.RegistrationID , max(RS.UniqueRowID) UniqueRowID
FROM #RegStatus RS
join #Status S
on RS.StatusID = S.StatusID
and S.StatusCategoryID = @RegStatusCatID
and (
@RegID is null
or (@RegID is not null
and RS.RegistrationID = @RegID)
 
group by RS.RegistrationID
  LS
on RS.UniqueRowID = LS.UniqueRowID
) LS
on S.StatusID = LS.StatusID
--SELECT * FROM #RegStatus
DROP TABLE #RegStatus
DROP TABLE #Status
May 14, 2007 at 9:31 am
Mark
Two things:
(1) Using a Common Table Expression instead of a subquery would make your query more readable, even if it didn't help the performance.
(2) At the beginning you set @RegID to be null, so why does your query test for NULLness further down? Even if that test were necessary, I think you could simplify it to ...@RegID IS NULL OR RS.RegistrationID = @RegID...
John
May 14, 2007 at 12:18 pm
John, I appreciate the prompt reply.
(1) I'm currently looking into CTEs and cannot figure out how they'll help the situation. Currently I have this code in a UDF (not explained very well earlier) and I think if I used a CTE, I'd have to create a temp table to return my values. I would guess this extra step would slow things down a little.
(2)
@RegID is null
or (@RegID is not null
and RS.RegistrationID = @RegID)
this code acts as a case statement. If I pass in a RegID to the UDF, it'll return the status of only one RegistrationID. This one UDF serves 2 purposes. It'd PROBABLY be best of me to leave the RegID business out of this all together and make 2 UDF's, one for a single RegistrationIDs, and one for all RegistrationIDs, but for maintainability, I've put both into one.
Mark
May 15, 2007 at 2:22 am
Mark
CTEs can be used in UDFs without temporary tables. I have rewritten your query as below. I'm not sure about improving performance, but it makes it more readable. You mentioned ranking functions in your first post, but I can't see how those would help... but then I'm obviously not as familiar with your data as you are.
I still think you can shorten the @RegID part without loss of generality. Certainly for the sample data you have provided, the same result set is returned.
By the way, why do you have an identity column and a unique identifier in one of your tables?
I've made the following changes:
(a) Changed one of the aliases (actually the name of the CTE) to LS1 since your use of the same alias twice was confusing
(b) Got rid of the .* construct and specified the columns indivdually
(c) Moved the parts of the join predicate that only refer to one side of the join into the WHERE clause: with an INNER JOIN this doesn't make any difference but I think it's clearer to do it like that and it's safer in case you ever change it to an OUTER JOIN.
John
DECLARE
@RegStatusCatID int,
@RegID UNIQUEIDENTIFIER
SET @RegStatusCatID = 1
SET @RegID = NULL;
WITH LS1 AS
(
SELECT RS.RegistrationID, RS.StatusID, RS.StatusTimeStamp
FROM #RegStatus RS
JOIN (
SELECT RS.RegistrationID , MAX(RS.UniqueRowID) UniqueRowID
FROM #RegStatus RS
JOIN #Status S
ON RS.StatusID = S.StatusID
WHERE S.StatusCategoryID = @RegStatusCatID
AND (@RegID IS NULL
OR RS.RegistrationID = @RegID)
GROUP BY RS.RegistrationID
) LS
ON RS.UniqueRowID = LS.UniqueRowID
)
SELECT
LS1.RegistrationID,
LS1.StatusID,
LS1.StatusTimeStamp,
S.StatusName,
S.StatusCategoryID
FROM #Status S
JOIN LS1
ON S.StatusID = LS1.StatusID
May 15, 2007 at 6:52 am
John, thanks for the detailed explanations. I really appreciate your perspective. I didn't realize that I can still use a CTE in a function without making a temp table. I'll play with that today. I had some suggestions from the same post in comp.databases.ms-sqlserver that lead me to using ROW_NUMBER() in the inner query. Here's the sample:
--Common Table Expression that Ranks the status' for each registrant
--based on the order of the UniqueRowID
WITH CurStatus AS
(
SELECT RegistrationID,
StatusID,
StatusTimeStamp,
[UniqueRowID],
ROW_NUMBER()
OVER(
PARTITION BY RegistrationID
ORDER BY [UniqueRowID] DESC
  AS rn
FROM #RegStatus RS
where (
@RegID is null
or (@RegID is not null
and RS.RegistrationID = @RegID)
 
)
SELECT S.StatusID, S.StatusName, S.StatusCategoryID, CS.RegistrationID, CS.StatusTimeStamp
FROM #Status S
JOIN CurStatus CS
ON S.StatusID = CS.StatusID
AND CS.RN = 1
AND S.StatusCategoryID = @RegStatusCatID
May 15, 2007 at 7:24 am
Mark
I see what you've done there, using the ROW_NUMBER rank instead of the MAX aggregate function. I'd be interested to find out - have you compared the execution plans for this with what you had before? Which is more efficient? I've got an inkling that the original should perform better, especially if you have a large amount of data, since the result set of the CTE should be much smaller using MAX than using ROW_NUMBER.
John
May 15, 2007 at 7:27 am
Currently I've only tested it with that sample data and it performs much better according to the execution plans (from how I know how to read them). Next step is mass amounts of data.
Mark
May 15, 2007 at 7:40 am
Mark
OK, please let us know when you've done that. My inkling may have been wrong, since the query optimizer may not totally materialise the CTE anyway. If you SET STATISTICS IO ON, that'll tell you how much juice each query is using... but don't forget to clear the cache before each test (not on a production server)! What indexes do you have on your tables?
Thanks
John
May 15, 2007 at 8:10 am
John, I'm not sure how to clear my cache, but here's my results with simply 1500 rows in the RegStatus Table. I'd like some more data in that table to replicate what's going to happen in a live environment, but that'll happen in the coming weeks.
Using ROW_Number() and the CTE (Subtree cost 0.0584093)
(32 row(s) affected)
Table 'Status'. Scan count 1, logical reads 66, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'RegStatus'. Scan count 1, logical reads 22, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Previous example, no changes (Subtree cost 0.115096)
(32 row(s) affected)
Table 'Status'. Scan count 1, logical reads 66, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'RegStatus'. Scan count 2, logical reads 44, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
I'm not OVERLY familiar with reading the Execution plans, but in SQLSMS I clicked the Actual Execution Plan button and ran the queries. Relative to the batch the Ranked query was 33% of the batch, and the Old was 66%. The subtree costs of the select statements were listed above.
May 15, 2007 at 8:35 am
Mark
Yes, a similar thing happens with your original sample data as well. Now we just need to see what happens when you get hold of all the extra data. What indexes do you have on the tables?
By the way, you can use DBCC DROPCLEANBUFFERS to clear the data cache.
John
May 16, 2007 at 7:58 am
Assuming a clustered index on the UniqueRowId (or a combination of the other columns used in the aggregate query), in a similar situation we've found that using TOP(1) with an order by on the cluster to perform faster than the MAX. Also, in situations where you're returning only the latest value, using CROSS APPLY with the TOP(1) query ran much faster.
SELECT ...
FROM dbo.BoundBookedPolicy p
CROSS APPLY (SELECT TOP(1) v2.BoundBookedPolicyId,v2.BoundBookedVersionId,v2.EndsmtNum,v2.EndsmtId
FROM dbo.BoundBookedVersion v2
WHERE v2.BoundBookedPolicyId = p.BoundBookedPolicyId
ORDER BY v2.BoundBookedPolicyId DESC, v2.BoundbookedVersionId DESC) AS v
"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt
Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning
Viewing 11 posts - 1 through 11 (of 11 total)
You must be logged in to reply to this topic. Login to reply