December 10, 2010 at 7:47 am
I have a data warehouse with millions of rows in my main table (Coverage) and my versioned table (CoverageVersions). The main table holds "permanent" values that never change. The versioned table adds a new record every time a mutable value changes.
The newest record in the versioned table is always the correct one. So to find the mutable values, I use a subquery to locate the record with the highest (most recent) EffectiveDate, then join the results back to the main table to pick up the non-mutable values.
Here's my code:
SELECT fcv.CoverageKey, fcv.SourceKey, fcv.Effective AS EffectiveDt,
       fcv.CovRefundKey, fcv.TypeKey, fcv.StatusKey
FROM (SELECT CoverageKey, MAX(Effective) AS EffectiveDt
      FROM dbo.CoverageVersions
      GROUP BY CoverageKey) fcvsq
INNER JOIN dbo.CoverageVersions fcv
    ON fcvsq.CoverageKey = fcv.CoverageKey
   AND fcvsq.EffectiveDt = fcv.Effective
INNER JOIN dbo.Coverage fc
    ON fcv.CoverageKey = fc.CoverageKey;
Now, the subquery always does an Index Seek, which is okay. But the JOIN always does an Index Scan. I'm trying to figure out if that's a good thing, a bad thing, or an indifferent thing. I don't have an exact running time because this query is part of a bigger proc (and SSMS keeps running out of memory when I run it), but I'm pretty sure most of the load occurs in this section.
Maybe my query is solid and I can't enhance it. But I thought I'd throw it out to y'all and see if anyone has any suggestions.
Here's my DDL. All dates are integer values that map back to the key (TimeKey) of a Time table, which sorts dates into FullDate, CalendarMonth, CalendarQuarter, etc.
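(For context only: the Time table itself isn't part of the posted DDL, but a plausible shape for it, based purely on the description above, might be something like the following. Every column beyond TimeKey is an assumption.)
-- Hypothetical sketch of the Time dimension; NOT the actual DDL.
-- datetime rather than date, since the thread is on SQL Server 2005.
CREATE TABLE [dbo].[Time](
    [TimeKey] [int] NOT NULL,              -- integer key the fact tables store
    [FullDate] [datetime] NOT NULL,        -- the calendar date the key maps to
    [CalendarMonth] [tinyint] NOT NULL,    -- 1-12
    [CalendarQuarter] [tinyint] NOT NULL,  -- 1-4
    [CalendarYear] [smallint] NOT NULL,
    CONSTRAINT [PK_Time_TimeKey] PRIMARY KEY CLUSTERED ([TimeKey] ASC)
);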
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
SET ANSI_PADDING ON
GO
CREATE TABLE [dbo].[Coverage](
[CoverageKey] [int] IDENTITY(1,1) NOT NULL,
[SourceKey] [int] NOT NULL,
[CertInfoKey] [int] NOT NULL,
[PD2Key] [tinyint] NULL,
[CovCD] [varchar](10) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[TimeKey] [int] NOT NULL,
[Deleted] [int] NULL,
CONSTRAINT [PK_Coverage_CoverageKey] PRIMARY KEY CLUSTERED
(
[CoverageKey] ASC
)WITH (PAD_INDEX = OFF, IGNORE_DUP_KEY = OFF) ON [PRIMARY]
) ON [PRIMARY]
GO
SET ANSI_PADDING OFF
GO
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE TABLE [dbo].[CoverageVersions](
[CoverageKey] [int] NOT NULL,
[Effective] [int] NOT NULL,
[SourceKey] [int] NOT NULL,
[EntryDate] [int] NULL,
[ExpirationDate] [int] NULL,
[CustomerCost] [smallmoney] NULL,
[ExportDate] [int] NULL,
[CovRefundKey] [int] NULL,
[BatchKey] [int] NULL,
[StatusKey] [int] NULL,
[TypeKey] [int] NULL,
[ModifiedOn] [int] NULL,
[RefundAcctYrMo] [int] NULL,
CONSTRAINT [PK_CoverageVersions] PRIMARY KEY CLUSTERED
(
[CoverageKey] ASC,
[Effective] ASC,
[SourceKey] ASC
)WITH (PAD_INDEX = OFF, IGNORE_DUP_KEY = OFF) ON [PRIMARY]
) ON [PRIMARY]
GO
ALTER TABLE [dbo].[CoverageVersions] WITH CHECK ADD CONSTRAINT [FK_CoverageVersions_CoverageKey] FOREIGN KEY([CoverageKey])
REFERENCES [dbo].[Coverage] ([CoverageKey])
ON DELETE CASCADE
GO
ALTER TABLE [dbo].[CoverageVersions] CHECK CONSTRAINT [FK_CoverageVersions_CoverageKey]
GO
SET IDENTITY_INSERT dbo.Coverage ON;
INSERT INTO dbo.Coverage (CoverageKey, SourceKey, CertInfoKey, PD2Key, CovCD,
TimeKey, Deleted)
SELECT 2, 1, 2220882, 0, 'GAP1', 40016, 0 UNION ALL
SELECT 3, 1, 2782106, 0, 'NONE', 40016, 0 UNION ALL
SELECT 4, 1, 2334965, 0, 'NONE', 40016, 0 UNION ALL
SELECT 5, 1, 3171687, 0, 'NONE', 40016, 0 UNION ALL
SELECT 6, 1, 7919390, 0, 'NONE', 40016, 0 UNION ALL
SELECT 7, 1, 3646045, 0, 'NONE', 40016, 0 UNION ALL
SELECT 8, 1, 3667294, 0, 'NONE', 40016, 0 UNION ALL
SELECT 9, 1, 7905175, 0, 'NONE', 40016, 0 UNION ALL
SELECT 10, 1, 8534004, 0, 'NONE', 40016, 0 UNION ALL
SELECT 11, 1, 8158285, 0, 'NONE', 40016, 0 UNION ALL
SELECT 12, 1, 6576787, 0, 'NONE', 40016, 0 UNION ALL
SELECT 13, 1, 1413614, 0, 'NONE', 40016, 0 UNION ALL
SELECT 14, 1, 2252017, 0, 'NONE', 40016, 0 UNION ALL
SELECT 15, 1, 7426578, 0, 'NONE', 40016, 0 UNION ALL
SELECT 16, 1, 4170009, 0, 'NONE', 40016, 0 UNION ALL
SELECT 17, 1, 2101578, 0, 'NONE', 40016, 0 UNION ALL
SELECT 18, 1, 1881784, 0, 'NONE', 40016, 0 UNION ALL
SELECT 19, 1, 3184168, 0, 'NONE', 40016, 0 UNION ALL
SELECT 20, 1, 87996, 0, 'NONE', 40016, 0 UNION ALL
SELECT 21, 1, 5947111, 0, 'NONE', 40016, 0;
SET IDENTITY_INSERT dbo.Coverage OFF;
INSERT INTO dbo.CoverageVersions (CoverageKey, Effective, SourceKey, EntryDate, ExpirationDate,
CustomerCost, ExportDate, CovRefundKey, BatchKey, StatusKey, TypeKey)
SELECT 2, 40016, 1, 38257, 40259, 0.00, 38682, 0, 225330, 1, 0 UNION ALL
SELECT 2, 40260, 1, 38257, 40259, 0.00, 38682, 0, 225330, 8, 0 UNION ALL
SELECT 3, 40016, 1, 36493, 39049, 1.00, 0, 0, 5467, 8, 0 UNION ALL
SELECT 4, 40016, 1, 36733, 39291, 1.00, 0, 0, 11263, 8, 0 UNION ALL
SELECT 5, 40016, 1, 37010, 39566, 1.00, 0, 0, 3829, 8, 0 UNION ALL
SELECT 6, 40016, 1, 36347, 38896, 1.00, 0, 0, 2021, 8, 0 UNION ALL
SELECT 7, 40016, 1, 36502, 23922, 40.00, 0, 0, 461, 8, 0 UNION ALL
SELECT 8, 40016, 1, 36365, 38927, 1.00, 0, 0, 2080, 8, 0 UNION ALL
SELECT 9, 40016, 1, 36499, 23375, 40.00, 0, 0, 464, 8, 0 UNION ALL
SELECT 10, 40016, 1, 37003, 24471, 1.00, 0, 0, 3796, 8, 0 UNION ALL
SELECT 11, 40016, 1, 36611, 39170, 1.00, 0, 0, 34179, 8, 0 UNION ALL
SELECT 12, 40016, 1, 36319, 38866, 1.00, 0, 0, 34695, 8, 0 UNION ALL
SELECT 13, 40016, 1, 36667, 39230, 1.00, 0, 0, 34290, 8, 0 UNION ALL
SELECT 14, 40016, 1, 36607, 39170, 1.00, 0, 0, 14997, 8, 0 UNION ALL
SELECT 15, 40016, 1, 36607, 39170, 1.00, 0, 0, 34171, 8, 0 UNION ALL
SELECT 16, 40016, 1, 36520, 39080, 1.00, 0, 0, 6365, 8, 0 UNION ALL
SELECT 17, 40016, 1, 36578, 39139, 1.00, 0, 0, 4613, 8, 0 UNION ALL
SELECT 18, 40016, 1, 36578, 39139, 1.00, 0, 0, 730, 8, 0 UNION ALL
SELECT 19, 40016, 1, 36885, 39445, 1.00, 0, 0, 612, 8, 0 UNION ALL
SELECT 20, 40016, 1, 36926, 39476, 1.00, 0, 0, 2746, 8, 0 UNION ALL
SELECT 21, 40016, 1, 36926, 39476, 1.00, 0, 0, 4091, 8, 0;
December 10, 2010 at 9:32 am
You get a table scan because your subquery has no limiting WHERE condition. In order to find max(Effective) per CoverageKey you'd need to scan the entire table (or clustered index).
One option would be adding a tinyint OutOfDate column with a default value (let's say zero), updated regularly (daily/weekly, depending on the load and change frequency) by setting the value to 1 for any row that is no longer the most recent. That would limit the rows to be searched. Based on that, I would add a unique nonclustered index on OutOfDate DESC, CoverageKey, Effective. This index would be narrow and sorted in a way that all new rows land at the end. It wouldn't really matter if more than one row is inserted for a CoverageKey between two OutOfDate update cycles, since the majority of rows will still be excluded (assuming there are more outdated than current rows).
Finally, include WHERE OutOfDate = 0 in your subquery.
When applied to your sample data, I got an index seek for the subquery and clustered index seeks on the two main tables.
If you were on 2K8 instead of 2K5, I'd recommend making the index above a filtered one on OutOfDate = 0...
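A rough sketch of those pieces (the flag-update logic is one possible implementation, and the index and constraint names are made up):
ALTER TABLE dbo.CoverageVersions
ADD OutOfDate tinyint NOT NULL CONSTRAINT DF_CoverageVersions_OutOfDate DEFAULT (0);

-- scheduled job: flag every version that is no longer the most recent
UPDATE cv
SET OutOfDate = 1
FROM dbo.CoverageVersions AS cv
WHERE cv.OutOfDate = 0
  AND EXISTS (SELECT 1
              FROM dbo.CoverageVersions AS newer
              WHERE newer.CoverageKey = cv.CoverageKey
                AND newer.Effective > cv.Effective);

-- UNIQUE assumes (CoverageKey, Effective) never repeats across SourceKey;
-- drop the UNIQUE keyword if it can.
CREATE UNIQUE NONCLUSTERED INDEX IX_CoverageVersions_OutOfDate
ON dbo.CoverageVersions (OutOfDate DESC, CoverageKey, Effective);

-- the subquery then becomes:
SELECT CoverageKey, MAX(Effective) AS EffectiveDt
FROM dbo.CoverageVersions
WHERE OutOfDate = 0
GROUP BY CoverageKey;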
Edit: as a side note: Your set of insert statements for the CoverageVersions table fails since there are 12 column names listed but only 11 values per row provided... So I removed the last column name from the list.
December 10, 2010 at 9:46 am
A similar option would be adding another table, CoverageVersionsHistory, with the current CoverageVersions structure, and moving all outdated versions into this history table on a scheduled basis. So, instead of the OutOfDate column with two possible values, you'd have two separate tables. This wouldn't require the additional index, and the clustered index scan on the slimmed-down table wouldn't take that long. It might still be beneficial to add an NC index on CoverageKey, Effective though.
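A sketch of that layout, assuming the history table mirrors CoverageVersions exactly (the DELETE ... OUTPUT move shown here is one way to do the scheduled transfer, not the only one):
-- clone the structure (SELECT INTO does not copy constraints or indexes)
SELECT *
INTO dbo.CoverageVersionsHistory
FROM dbo.CoverageVersions
WHERE 1 = 0;

-- scheduled job: move every version that is no longer the most recent
DELETE cv
OUTPUT deleted.* INTO dbo.CoverageVersionsHistory
FROM dbo.CoverageVersions AS cv
WHERE EXISTS (SELECT 1
              FROM dbo.CoverageVersions AS newer
              WHERE newer.CoverageKey = cv.CoverageKey
                AND newer.Effective > cv.Effective);

-- optional narrow NC index on the slimmed-down current table
CREATE NONCLUSTERED INDEX IX_CoverageVersions_CoverageKey_Effective
ON dbo.CoverageVersions (CoverageKey, Effective);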
December 10, 2010 at 10:05 am
LutzM (12/10/2010)
You get a table scan because your subquery has no limiting WHERE condition. In order to find max(Effective) per CoverageKey you'd need to scan the entire table (or clustered index).
I'm a little confused by this response. The Subquery has an Index Seek on it. Why would that (and the lack of a WHERE clause on it) cause a Clustered Index Scan on the table it is being joined to?
EDIT: Wait. I get why joining the subquery back to the original table might cause the scan. But there's also a Clustered Index Scan on the main Coverage table. That's the part I really don't get.
December 10, 2010 at 10:09 am
LutzM (12/10/2010)
A similar option would be adding another table, CoverageVersionsHistory, with the current CoverageVersions structure, and moving all outdated versions into this history table on a scheduled basis.
Yeah, but unfortunately I can't do that without causing a whole heap of pain. Not even sure I'll be able to add a column to the one table without causing problems (there are SSIS dependencies on this that could make any schema alteration a major project).
Thanks for catching the data issue. I'm editing that post right after this one.
December 10, 2010 at 11:20 am
Seems like we're both confused now... :ermm:
Attached please find snapshots of the execution plans I get when running the queries against your sample data (I didn't attach them as .sqlplan files since it should be easy to reproduce by running the sample data you provided).
I think the different plans are caused by different row numbers and data distribution.
Could you please post the actual execution plan as sqlplan based on your large table join? I'd expect it to be different...
December 10, 2010 at 11:27 am
Brandie Tarvin (12/10/2010)
LutzM (12/10/2010)
A similar option would be adding another table, CoverageVersionsHistory, with the current CoverageVersions structure, and moving all outdated versions into this history table on a scheduled basis.
Yeah, but unfortunately I can't do that without causing a whole heap of pain. Not even sure I'll be able to add a column to the one table without causing problems (there are SSIS dependencies on this that could make any schema alteration a major project).
Thanks for catching the data issue. I'm editing that post right after this one.
Errmmm.... a rather dirty option:
Add an AFTER INSERT trigger to the CoverageVersions table and copy the inserts into a "pseudo-staging table". Use this table in your query instead of your current CoverageVersions, and delete the outdated rows on a scheduled basis. That way you wouldn't have to mess with your original table design, and you'd get a nice and dense table...
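A sketch of that trigger approach (the staging table name CoverageVersionsCurrent is made up for illustration; it would also need a one-time load of the existing current rows):
-- hypothetical "pseudo-staging" table cloned from the original structure
SELECT *
INTO dbo.CoverageVersionsCurrent
FROM dbo.CoverageVersions
WHERE 1 = 0;
GO
CREATE TRIGGER trg_CoverageVersions_AI
ON dbo.CoverageVersions
AFTER INSERT
AS
BEGIN
    SET NOCOUNT ON;
    -- mirror each newly inserted version row into the staging table
    INSERT INTO dbo.CoverageVersionsCurrent
    SELECT * FROM inserted;
END;
GO
-- scheduled cleanup: delete rows that are no longer the most recent version
DELETE cvc
FROM dbo.CoverageVersionsCurrent AS cvc
WHERE EXISTS (SELECT 1
              FROM dbo.CoverageVersionsCurrent AS newer
              WHERE newer.CoverageKey = cvc.CoverageKey
                AND newer.Effective > cvc.Effective);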
December 10, 2010 at 11:34 am
LutzM (12/10/2010)
Brandie Tarvin (12/10/2010)
LutzM (12/10/2010)
A similar option would be adding another table, CoverageVersionsHistory, with the current CoverageVersions structure, and moving all outdated versions into this history table on a scheduled basis.
Yeah, but unfortunately I can't do that without causing a whole heap of pain. Not even sure I'll be able to add a column to the one table without causing problems (there are SSIS dependencies on this that could make any schema alteration a major project).
Thanks for catching the data issue. I'm editing that post right after this one.
Errmmm.... a rather dirty option:
Add an AFTER INSERT trigger to the CoverageVersions table and copy the inserts into a "pseudo-staging table". Use this table in your query instead of your current CoverageVersions, and delete the outdated rows on a scheduled basis. That way you wouldn't have to mess with your original table design, and you'd get a nice and dense table...
Or, off the same trigger, remove the outdated ones... I think I see a few things, Brandie, but I've got to go work at work, sadly. Hopefully I can swing in and really dig at it later. I'd mention them but I'm not entirely sure I'm not barking up the wrong tree. 🙂
Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.
Twitter: @AnyWayDBA
December 10, 2010 at 11:37 am
Craig Farrell (12/10/2010)
... I'd mention them but I'm not entirely sure I'm not barking up the wrong tree. 🙂
Unlikely. The tree is already occupied and I can't spot you. 😀 😛
December 13, 2010 at 4:52 am
LutzM (12/10/2010)
Craig Farrell (12/10/2010)
... I'd mention them but I'm not entirely sure I'm not barking up the wrong tree. 🙂
Unlikely. The tree is already occupied and I can't spot you. 😀 😛
*snicker*
Okay, it must be too early in the morning, because I'm getting different execution plan results this morning than I got last week. For one, I don't have any branches in my plan this morning. So I'm going to take a little time to go over this again and see what changed between my T-SQL today and my T-SQL on Friday.
On the other hand, Lutz, you did give me a good idea for an NC index on SourceKey that doesn't fix the execution plan but speeds up the query considerably. So thanks for that.
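(The exact definition wasn't posted; a plausible minimal shape for that index would be something like the following — the name is made up, and CoverageKey/Effective come along for free as the clustering key:)
CREATE NONCLUSTERED INDEX IX_CoverageVersions_SourceKey
ON dbo.CoverageVersions (SourceKey);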
December 17, 2010 at 12:26 pm
Brandie,
I'm coming into this rather late (traveling the last couple of days). You could try something like this with a CROSS APPLY:
SELECT
FCV.CoverageKey, FCV.SourceKey, FCV.EffectiveDt, FCV.CovRefundKey
, FCV.TypeKey, FCV.StatusKey
FROM dbo.Coverage C
CROSS APPLY
(SELECT TOP 1 CV.CoverageKey, CV.SourceKey, CV.Effective AS EffectiveDt
, CV.CovRefundKey, CV.TypeKey, CV.StatusKey
FROM dbo.CoverageVersions CV
WHERE CV.CoverageKey = C.CoverageKey
ORDER BY CV.Effective DESC
) AS FCV
I ended up trading a clustered index scan on the CoverageVersions table for a clustered index scan on the Coverage table. The Coverage table becomes the anchor of the query. If there is some way to limit the number of rows read from the Coverage table with an index, it may end up performing better.
Todd Fifield
December 22, 2010 at 8:18 am
Brandie Tarvin (12/13/2010)
Okay, it must be too early in the morning, because I'm getting different execution plan results this morning than I got last week. For one, I don't have any branches in my plan this morning.
If by 'no branches' you mean no joins, the answer is because the optimizer is darned clever, and collapses your whole query into a single, much simpler plan.
That plan is very similar to the first one Lutz posted. His had an extra seek because he had either forgotten to add the foreign key between the two tables, or omitted to run the CHECK CONSTRAINT statement straight after the definition.
The optimizer can see that you aren't returning any columns from the Coverage table, and it also knows that the join condition is on the primary key of Coverage, so it all boils down to a check that a matching row in Coverage exists. The foreign key (when checked and enabled) provides that guarantee, so there's no need to do a join at all.
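In other words, once the trusted foreign key guarantees a matching Coverage row, the query is logically equivalent to one with the Coverage join removed entirely. A sketch of that simplification (what the optimizer effectively reasons its way to, not literally the plan it emits):
-- logically equivalent once the trusted FK guarantees a matching Coverage row
SELECT fcv.CoverageKey, fcv.SourceKey, fcv.Effective AS EffectiveDt,
       fcv.CovRefundKey, fcv.TypeKey, fcv.StatusKey
FROM (SELECT CoverageKey, MAX(Effective) AS EffectiveDt
      FROM dbo.CoverageVersions
      GROUP BY CoverageKey) fcvsq
INNER JOIN dbo.CoverageVersions fcv
    ON fcvsq.CoverageKey = fcv.CoverageKey
   AND fcvsq.EffectiveDt = fcv.Effective;
-- the INNER JOIN to dbo.Coverage drops out entirely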
The second clever optimization is transforming the whole of the remaining query into a simple scan of the CoverageVersions table, followed by a Segment, and a Top. This is known as a Segment Top construction, and you can find full details of it here:
http://sqlblog.com/blogs/paul_white/archive/2010/07/28/the-segment-top-query-optimisation.aspx
Paul
December 22, 2010 at 8:25 am
The optimal index for the query given is:
CREATE INDEX [IX dbo.CoverageVersions CoverageKey,Effective (SourceKey,CovRefundKey,TypeKey,StatusKey)]
ON dbo.CoverageVersions (CoverageKey DESC, Effective DESC)
INCLUDE (SourceKey, CovRefundKey, TypeKey, StatusKey)
WITH (FILLFACTOR = 100, SORT_IN_TEMPDB = ON, MAXDOP = 1);
That presents groups (by CoverageKey) in Effective order to the Segment iterator, and the Top just takes the first row for each group created by the Segment. The included columns are just there to ensure the index is covering. The leading columns are descending to allow a forward scan of the index.
December 22, 2010 at 8:39 am
As a regular Top-N-Per-Group problem, we can also write it using a ranking function:
SELECT fcvsq.CoverageKey,
fcvsq.SourceKey,
fcvsq.Effective AS EffectiveDt,
fcvsq.CovRefundKey,
fcvsq.TypeKey,
fcvsq.StatusKey
FROM (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY CoverageKey ORDER BY Effective DESC) AS Sequence
FROM dbo.coverageversions
) AS fcvsq
WHERE fcvsq.Sequence = 1;
The optimal index is the same as previously, except CoverageKey is ASC:
CREATE INDEX [IX dbo.CoverageVersions CoverageKey,Effective (SourceKey,CovRefundKey,TypeKey,StatusKey)]
ON dbo.CoverageVersions (CoverageKey ASC, Effective DESC)
INCLUDE (SourceKey, CovRefundKey, TypeKey, StatusKey)
WITH (FILLFACTOR = 100, SORT_IN_TEMPDB = ON, MAXDOP = 1, DROP_EXISTING = ON);
This gives a plan very similar to the Segment Top one. It's the same basic idea, with a Sequence Project numbering the rows and a Filter returning just row #1 per group.
December 23, 2010 at 8:19 am
Good stuff as always, Paul! The bummer is that the optimal index is essentially another copy of the table, which could be large given that it's a DW fact table, and it could also gum up the data load, index maintenance, backups, etc. But if the query needs to run fastest, sometimes you gotta do what you gotta do!
Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service