August 2, 2012 at 5:04 am
I'm going to start this off rather generically because the code we have does work, but has started running into performance problems.
In a data warehouse, I have the below table:
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE TABLE [dbo].[DimCEHistory](
[CEHistoryKey] [int] IDENTITY(1,1) NOT NULL,
[OtherSystemEID] [int] NOT NULL,
[CIKey] [int] NOT NULL,
[EDate] [int] NOT NULL,
[CycleMonth] [int] NOT NULL,
[AEarned1] [smallmoney] NOT NULL,
[AEarned2] [smallmoney] NOT NULL,
[AEarned3] [smallmoney] NOT NULL,
[AEarned4] [smallmoney] NOT NULL,
[AEarned5] [smallmoney] NOT NULL,
[AEarned6] [smallmoney] NOT NULL,
[AEarned7] [smallmoney] NOT NULL,
[AEarned8] [smallmoney] NOT NULL,
[AEarned9] [smallmoney] NOT NULL,
[AEarned10] [smallmoney] NOT NULL,
[PEarned1] [smallmoney] NOT NULL,
[PEarned2] [smallmoney] NOT NULL,
[PEarned3] [smallmoney] NOT NULL,
[PEarned4] [smallmoney] NOT NULL,
[TotalAEarned] [smallmoney] NOT NULL,
[TotalPEarned] [smallmoney] NOT NULL,
[ModifiedOn] [int] NULL,
[TimeKey] [int] NOT NULL,
CONSTRAINT [PK_DimCEHistory_CEHistoryKey] PRIMARY KEY CLUSTERED
(
[CEHistoryKey] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF,
ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
GO
ALTER TABLE [dbo].[DimCEHistory] WITH NOCHECK ADD CONSTRAINT [FK_DimCEHistory_CIKey] FOREIGN KEY([CIKey])
REFERENCES [dbo].[DimCI] ([CIKey])
GO
ALTER TABLE [dbo].[DimCEHistory] CHECK CONSTRAINT [FK_DimCEHistory_CIKey]
GO
This table is essentially an SCD-type table, but since we need to keep a history of changes rather than just updating the column, it's a manually created history table. It gets updated when records change. So if Record 123A has a change on 07/01/2012, then the record gets shoved into this table so long as the modified date on the last 123A record is less than 07/01/2012.
With me so far?
We pull records from a staging table, join them to this table and the main (non-history) table. Then we do 3 left outer joins to our DimTime table to check the dates, because the only actual dates in this database are in the DimTime table and everything is connected via an integer that refers back to TimeKey (the PK and clustered index of the DimTime table).
In addition to the date comparisons, the WHERE clause checks the equality of each and every earned column (there are 15 of them) to see if the history record needs to be added. This worked fine for several years, but now the table is getting so long that performance is degrading; the past few month-ends have been noticeably worse. We're using a basic UPDATE statement.
UPDATE DimCEHistory
SET AEarned1 = s.Earned1 .... <all other Earned columns here>
FROM Staging s
INNER JOIN DimCEHistory ceh
ON ....
INNER JOIN DimCE ce
ON ....
LEFT OUTER JOIN DimTime dt1
ON ... (last update date column in staging table)
LEFT OUTER JOIN DimTime dt2
ON ... (Another date column in staging table)
LEFT OUTER JOIN DimTime dt3
ON ... (Another date column in staging table)
WHERE <pseudocode follows> 2 staging columns are not NULL
and the TimeKey > history ModifiedOn column and
(the record doesn't previously exist in the history table
or the earningsdate <> the timekey from dt3
or the CycleMonth <> the timekey from dt2
or AEarned1 <> staging column 1
or AEarned2 <> staging column 2 ... etc. etc. etc.)
I'm pretty sure that WHERE clause which compares all 15 money columns is what's killing me. But I don't want to add a table-covering NC index because that's just stupid and I don't think it'll work. So my question to you, in a vague sort of way is, do you have any advice on better options for value matching?
I'm not sure if MERGE will work in this instance or not (having never found an excuse to use MERGE). I'm not sure if a CTE will help me at all either. Thoughts?
FYI: This whole thing happens in a T-SQL Proc in an SSIS DataFlow task (OLEDB to OLEDB, nothing in between).
August 2, 2012 at 6:32 am
Sometimes, just occasionally, when the target table isn't the first table listed in the FROM list, SQL Server gets its knickers in a twist. Try this:
UPDATE ceh
SET AEarned1 = s.Earned1 .... <all other Earned columns here>
FROM DimCEHistory ceh
INNER JOIN Staging s
ON ....
INNER JOIN DimCE ce
ON ....
LEFT OUTER JOIN DimTime dt1
ON ... (last update date column in staging table)
LEFT OUTER JOIN DimTime dt2
ON ... (Another date column in staging table)
LEFT OUTER JOIN DimTime dt3
ON ... (Another date column in staging table)
WHERE <pseudocode follows> 2 staging columns are not NULL
and the TimeKey > history ModifiedOn column and
(the record doesn't previously exist in the history table
or the earningsdate <> the timekey from dt3
or the CycleMonth <> the timekey from dt2
or AEarned1 <> staging column 1
or AEarned2 <> staging column 2 ... etc. etc. etc.)
Failing this, I'd get the lot into a temp table, index it, and use it for the update. The reason for this is straightforward - the longer an update query holds locks for, the more chance there is of interference from other processes.
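As a minimal sketch of that temp-table approach (the elided ON/WHERE bits stand in for the predicates you already have, and the column list is illustrative):
SELECT ceh.CEHistoryKey,
       s.Earned1, s.Earned2 -- <all other Earned columns here>
INTO #ToUpdate
FROM Staging s
INNER JOIN DimCEHistory ceh
ON .... -- same join predicates as the current UPDATE
INNER JOIN DimCE ce
ON ....
-- same DimTime joins and WHERE clause as the current UPDATE
;
CREATE UNIQUE CLUSTERED INDEX IX_ToUpdate ON #ToUpdate (CEHistoryKey);
-- The update itself is now a tight join on the clustered key,
-- so locks on DimCEHistory are held as briefly as possible.
UPDATE ceh
SET AEarned1 = t.Earned1,
    AEarned2 = t.Earned2 -- <all other Earned columns here>
FROM DimCEHistory ceh
INNER JOIN #ToUpdate t
ON t.CEHistoryKey = ceh.CEHistoryKey;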
August 2, 2012 at 11:10 am
Huh. Interesting thoughts, Chris. I'll take a look at that option.
Maybe I should be indexing the staging table after it's loaded too.
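Something like this, perhaps (column names are just placeholders for whatever the joins and comparisons actually use):
-- Hypothetical index on the staging table, built after the load;
-- key on the join columns and cover the compared Earned columns.
CREATE INDEX IX_Staging_JoinKeys
ON dbo.Staging (OtherSystemEID, EDate)
INCLUDE (Earned1, Earned2); -- <plus the other Earned columns>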
August 2, 2012 at 2:08 pm
Brandie Tarvin (8/2/2012)
I'm going to start this off rather generically because the code we have does work, but has started running into performance problems.
Well, I'll try to help, but you know how well generic code usually goes with optimization assistance. But for you, I'll give it more thought than a brief glance. 🙂
This table is essentially an SCD-type table, but since we need to keep a history of changes rather than just updating the column, it's a manually created history table. It gets updated when records change. So if Record 123A has a change on 07/01/2012, then the record gets shoved into this table so long as the modified date on the last 123A record is less than 07/01/2012.
With me so far?
With you... No, I'm afraid I'm confused. Basically each change goes in as its own row, right? Why is this an update and not an insert? Usually you'd just end up updating the 'use until' date on the old record in the sort of design I'd normally expect for what's described above, so obviously I'm missing something very important here.
We pull records from a staging table, join them to this table and the main (non-history) table. Then we do 3 left outer joins to our DimTime table to check the dates, because the only actual dates in this database are in the DimTime table and everything is connected via an integer that refers back to TimeKey (the PK and clustered index of the DimTime table).
Is DimTime constructed in a way that you could do </> checks without having to go to DimTime, because the identity is ordered in parallel with the dates? That's usually how it's done (i.e. 1 = 1/1/1900, 2 = 1/2/1900... etc.) and you might be able to... err... sorry, never mind, that won't work until it's converted. There's a reason I usually key off a datetime directly in my fact tables. One extra byte per row and a whole lot less headache. 🙂 Sorry, scratch all that except for a possible date overhaul to avoid the extra joins.
In addition to the date comparisons, the WHERE clause checks the equality of each and every earned column (there are 15 of them) to see if the history record needs to be added.
<trim>
I'm pretty sure that WHERE clause which compares all 15 money columns is what's killing me. But I don't want to add a table-covering NC index because that's just stupid and I don't think it'll work. So my question to you, in a vague sort of way is, do you have any advice on better options for value matching?
What's the execution plan look like once the predicates get involved? Are they predicates to the table scan or are they a late filter? Any chance we can see the .sqlplan?
For value matching... not really, no. Though you could hash the series, store that in the table, and then simply compare hashes to test for changes. I've seen... occasional gains from doing that. I don't have all the reasons it sometimes helps and sometimes doesn't nailed down, but in all cases I'm 1/2 col key -> 1/2 col key and then start playing similar games that you are above in doing 'Do I NEED to change you' tests. Sometimes a hash helps... sometimes it doesn't matter. I think it has to do with the volume of inbound records per pass, but I'm honestly not sure.
I'm not sure if MERGE will work in this instance or not (having never found an excuse to use MERGE). I'm not sure if a CTE will help me at all either. Thoughts?
Nah, the MERGE won't do anything for you since you're not dual-passing the process (UpSert), and it's actually a little slower than a straight insert/update alone. Also, it won't help at all during the comparison phase; it simply won't make a difference.
Out of curiosity, can the same earned column values be input on different days? Is that the reason for the mass check, to make sure you don't update rows that don't need it with a different 'change date'? Is keeping the change date in line with the first time these values were used important to you? If it's not, to heck with it, give it an updatedDate change and don't compare the 15 columns, and definitely get a matching index on your staging table either way.
FYI: This whole thing happens in a T-SQL Proc in an SSIS DataFlow task (OLEDB to OLEDB, nothing in between).
The above assumes this is a misrepresentation, or I'm a little lost. You don't even need a staging table if you're SCD'ing from SSIS, though it's hell on wheels to your transactions with the umpteen billion updates from that component. Also, pre-transform all your dates via Lookup Components to save you compares later. Restrict them to only the direct fields you need and you should be good to go.
If you want to get fancy for new rows that aren't in DimTime yet (i.e. you don't prefill your calendar), you can dump failed rows to a synchronous transformation script, have that script generate the new date row for the failed lookup row, give it the new ID, and then kick it back into the stream for the next lookup. Just make sure you trap in the asynchronous so that only the first row that causes an issue for a date generates the new row command.
August 2, 2012 at 2:38 pm
In place of this:
or the earningsdate <> the timekey from dt3
or the CycleMonth <> the timekey from dt2
or AEarned1 <> staging column 1
or AEarned2 <> staging column 2 ... etc. etc. etc.)
You might try this:
1 =
CASE
    WHEN earningsdate <> dt3.timekey THEN 1
    WHEN CycleMonth <> dt2.timekey THEN 1
    WHEN AEarned1 <> staging_column_1 THEN 1
    WHEN AEarned2 <> staging_column_2 THEN 1
    ...etc.
    ...etc.
    WHEN AEarned999 <> staging_column_999 THEN 1
    ELSE 0
END
I had a query that was comparing hundreds of columns to see if they were different, and it was taking forever to run until I switched it over to a CASE. If possible, put the most likely to change columns first, so that it doesn't have to do every check.
I can't promise that will help, but I doubt it will hurt.
August 2, 2012 at 3:02 pm
Michael Valentine Jones (8/2/2012)
I had a query that was comparing hundreds of columns to see if they were different, and it was taking forever to run until I switched it over to a CASE. If possible, put the most likely to change columns first, so that it doesn't have to do every check.
I can't promise that will help, but I doubt it will hurt.
Good call, I'd never even thought about using CASE STATEMENT short circuits to shorten up the per-row comparisons.
August 3, 2012 at 2:40 am
Michael Valentine Jones (8/2/2012)
In place of this:
or the earningsdate <> the timekey from dt3
or the CycleMonth <> the timekey from dt2
or AEarned1 <> staging column 1
or AEarned2 <> staging column 2 ... etc. etc. etc.)
You might try this:
1 =
CASE
    WHEN earningsdate <> dt3.timekey THEN 1
    WHEN CycleMonth <> dt2.timekey THEN 1
    WHEN AEarned1 <> staging_column_1 THEN 1
    WHEN AEarned2 <> staging_column_2 THEN 1
    ...etc.
    ...etc.
    WHEN AEarned999 <> staging_column_999 THEN 1
    ELSE 0
END
I had a query that was comparing hundreds of columns to see if they were different, and it was taking forever to run until I switched it over to a CASE. If possible, put the most likely to change columns first, so that it doesn't have to do every check.
I can't promise that will help, but I doubt it will hurt.
Great catch, Michael. I've seen this working really well with complex deduping.
August 3, 2012 at 2:47 am
Brandie Tarvin (8/2/2012)
Huh. Interesting thoughts, Chris. I'll take a look at that option. Maybe I should be indexing the staging table after it's loaded too.
Here's a good reference covering the first suggestion, Brandie.
August 3, 2012 at 11:45 am
Evil Kraig F (8/2/2012)
Michael Valentine Jones (8/2/2012)
I had a query that was comparing hundreds of columns to see if they were different, and it was taking forever to run until I switched it over to a CASE. If possible, put the most likely to change columns first, so that it doesn't have to do every check.
I can't promise that will help, but I doubt it will hurt.
Good call, I'd never even thought about using CASE STATEMENT short circuits to shorten up the per-row comparisons.
I've never thought of it either. And never used it. Hmm. That will take some investigating.
Craig, to answer the biggest question you had, the process updates and does a SELECT every day, but our month end file tends to be bigger for some reason, and that's when we notice the lag. Yeah, I know generic code doesn't work very well, but I'm not looking to have other people rewrite it for me. I just wanted a "start here" bit of advice.
Now that I have suggestions, I'm going to be comparing a few bits of code over the next few weeks to see which ones work better with the execution plan and such. If I can't get this improved by myself, I'll post more concrete details and sample data for the blow-by-blow T-SQL assistance.
August 3, 2012 at 1:38 pm
Two possibilities. Not sure if either is applicable.
First, a RowVersion column. If the data in the source table changes, the RowVersion will be updated. They're sequential (within a database), so you can tell which rows have been created or updated since the last time you ran an ETL process by logging the last RowVersion that was checked by each run of the ETL process. Anything higher than that has been changed or inserted.
If you aren't updating a source table, then that's not going to work. I can't quite tell from what was posted if that's the case or not.
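A minimal sketch of that first option (table, column, and variable names here are hypothetical, not from your schema):
CREATE TABLE dbo.SourceExample -- hypothetical source table
   (ID INT IDENTITY PRIMARY KEY,
    Col1 INT,
    RV ROWVERSION); -- bumped automatically on every insert/update
-- One ETL pass: pick up everything changed since the last logged value.
DECLARE @LastRV BINARY(8) = 0x0; -- in practice, read from an ETL log table
SELECT ID, Col1
FROM dbo.SourceExample
WHERE RV > @LastRV;
-- Afterwards, log MAX(RV) from this batch for the next run to start from.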
Second option, create an indexed checksum or hashbytes column in the table that's generated from the columns you want to verify changes on. If you store the same checksum/hashbytes in your target table, and index that, all you have to do is compare the values in those columns.
USE ProofOfConcept;
GO
IF OBJECT_ID(N'tempdb..#T1') IS NOT NULL
    DROP TABLE #T1;
IF OBJECT_ID(N'tempdb..#T2') IS NOT NULL
    DROP TABLE #T2;
CREATE TABLE #T1 -- source table
   (ID INT IDENTITY PRIMARY KEY,
    Col1 INT,
    Col2 INT,
    Col3 INT,
    Ck AS CHECKSUM(Col1, Col2, Col3));
CREATE INDEX IDX_T1_CK ON #T1 (Ck);
CREATE TABLE #T2 -- destination table
   (ID INT IDENTITY PRIMARY KEY,
    Loaded DATETIME NOT NULL DEFAULT (GETDATE()),
    T1ID INT NOT NULL,
    Col1 INT,
    Col2 INT,
    Col3 INT,
    Ck AS CHECKSUM(Col1, Col2, Col3));
CREATE INDEX IDX_T2_CK ON #T2 (Ck);
INSERT INTO #T1 (Col1, Col2, Col3)
VALUES (1, 2, 5);
-- Load rows that are new or whose checksum has changed. The anti-join
-- (LEFT JOIN on ID + checksum, keep only non-matches) skips rows that
-- already exist in the destination with the same values.
INSERT INTO #T2 (Loaded, T1ID, Col1, Col2, Col3)
SELECT GETDATE(), T1.ID, T1.Col1, T1.Col2, T1.Col3
FROM #T1 AS T1
LEFT OUTER JOIN #T2 AS T2
    ON T1.ID = T2.T1ID
    AND T1.Ck = T2.Ck
WHERE T2.T1ID IS NULL;
UPDATE #T1
SET Col1 = 10
WHERE ID = 1;
-- Second pass: only the changed row qualifies this time.
INSERT INTO #T2 (Loaded, T1ID, Col1, Col2, Col3)
SELECT GETDATE(), T1.ID, T1.Col1, T1.Col2, T1.Col3
FROM #T1 AS T1
LEFT OUTER JOIN #T2 AS T2
    ON T1.ID = T2.T1ID
    AND T1.Ck = T2.Ck
WHERE T2.T1ID IS NULL;
SELECT *
FROM #T2;
Of course, you probably only want to compare the most recent row in #T2 per ID to #T1, but that's a simple CTE or OUTER APPLY operation that can be rolled into the above skeleton.
Because of the way CHECKSUM works, you can end up with collisions that might give a false negative, so HASHBYTES might be better. It's a little more complex to generate, because it takes one input value, but it's not that hard to concat the values together for it.
That usually allows a quick check that rapidly identifies rows with differences via a comparison on two simple indexes (one index per table).
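If you go the HASHBYTES route, here's a minimal sketch of the computed column (names invented for illustration; nullable columns would need ISNULL wrapping before the concat):
IF OBJECT_ID(N'tempdb..#T1h') IS NOT NULL
    DROP TABLE #T1h;
CREATE TABLE #T1h
   (ID INT IDENTITY PRIMARY KEY,
    Col1 INT,
    Col2 INT,
    Col3 INT,
    -- SHA1 over a delimiter-joined string, so (1, 23) and (12, 3) hash
    -- differently; CONVERT to BINARY(20) keeps the column narrow enough
    -- to index.
    Hsh AS CONVERT(BINARY(20),
            HASHBYTES('SHA1',
                CONVERT(VARCHAR(11), Col1) + '|'
                + CONVERT(VARCHAR(11), Col2) + '|'
                + CONVERT(VARCHAR(11), Col3))) PERSISTED);
CREATE INDEX IDX_T1h_Hsh ON #T1h (Hsh);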
August 3, 2012 at 5:00 pm
Brandie Tarvin (8/2/2012)
....
LEFT OUTER JOIN DimTime dt1
ON ... (last update date column in staging table)
LEFT OUTER JOIN DimTime dt2
ON ... (Another date column in staging table)
LEFT OUTER JOIN DimTime dt3
ON ... (Another date column in staging table)
....
How about using a NOT EXISTS clause to replace the above LEFT JOINs? Something like this...
and NOT EXISTS (SELECT 1 FROM DimTime dt1
                WHERE dt1.somecolumn = (last update date column in staging table)
                   OR dt1.somecolumn = (another date column in staging table)
                   OR dt1.somecolumn = (another date column in staging table))
I have seen a performance boost of about 50% with this method, but it can only be used if you are not referring to columns of the outer table.