June 9, 2015 at 11:06 am
Nice piece as always, Gail. It made me giggle a little because when I read it, I had just finished a very similar exercise that brought the execution time of a query down from several hours (the last run was 11+) to 16 seconds. The only difference was that I had some additional statistics from SolarWinds DPA, which made the task even easier. Having said that, the indexing suggestions from DPA were neither the ones I initially came up with nor the ones I ended up using.
There is a problem one constantly faces which I call GAG: generalization, assumptions and guesswork. I must say that the DTA and the missing index suggestions are often less than helpful when fighting it.
😎
June 9, 2015 at 11:28 am
FootyRef (6/9/2015)
GilaMonster (6/9/2015)
FootyRef (6/9/2015)
I only want to do the switching of partitions when running the daily ETL processes.
You've talked about having monthly partitions and switching partitions out of the fact table, into a staging table, running some processing against the data you've switched out, and then switching it back in.
That is NOT what partition switching is for. It's not so you can switch portions of the table out to process them. It is so that you can load data into your fact tables without needing the long-duration locks that an insert would take. If you have partitioning by month, you'll switch data in and/or out at most once a month, load the new month's data in and move the earliest month out to archive or to be deleted if applicable.
So, I can't switch, say, four partitions out into a staging table, do inserts and updates against the data in those partitions, and switch them back into the fact table? For example, we have a fact table that keeps track of benefits issued to and taken away from our participants. We get new data every day. Let's say it is the second of the month and we received data for the last day of last month and for yesterday, the first day of this month. I want to grab the partitions in the fact table that are for last month and this month, switch them into a staging table, and run the ETL processes that will either add the new benefits issued or update the benefits previously issued to participants. Once the processes have completed, the partitions will be switched back from the staging table into the fact table in the data mart.
I have read many, many documents saying that is how it is done, but perhaps I have misunderstood them all. I am just trying to figure out a way to do it dynamically based on the months (based on Benefit Issuance Date) of the data I happen to be processing that day. This way, I only need to switch out those partitions and do not need to stage the whole table and load and index it into the mart.
Piling on a bit here: the normal flow is from the staging tables to the fact tables. Partitions can be switched "under" the facts from staging, but that is a one-way street, if you like. The point being that once it is in the fact table, the data should not need any further "processing". As Gail said, there are many, many problems lurking when you open this up as a two-way street, and the complexity is most certainly going to make the process more costly than doing without it.
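A minimal sketch of that one-way pattern, with made-up object names and boundaries (not the actual schema being discussed):
-- Hypothetical monthly partitioning on the benefit issuance date.
CREATE PARTITION FUNCTION pfMonth (date)
    AS RANGE RIGHT FOR VALUES ('2015-05-01', '2015-06-01', '2015-07-01');
GO
CREATE PARTITION SCHEME psMonth
    AS PARTITION pfMonth ALL TO ([PRIMARY]);
GO
CREATE TABLE dbo.FactBenefit
(
    BenefitIssuanceDate date          NOT NULL,
    ParticipantID       int           NOT NULL,
    Amount              decimal(18,2) NOT NULL
) ON psMonth (BenefitIssuanceDate);
GO
-- Staging table with an identical structure, on the same filegroup, constrained
-- to exactly one month so it can be switched into that month's partition.
CREATE TABLE dbo.FactBenefit_Staging
(
    BenefitIssuanceDate date          NOT NULL,
    ParticipantID       int           NOT NULL,
    Amount              decimal(18,2) NOT NULL,
    CONSTRAINT ck_Staging_June2015
        CHECK (BenefitIssuanceDate >= '2015-06-01' AND BenefitIssuanceDate < '2015-07-01')
) ON [PRIMARY];
GO
-- The ETL loads and indexes dbo.FactBenefit_Staging here, then the switch is a
-- metadata-only operation (the target partition must be empty).
ALTER TABLE dbo.FactBenefit_Staging
    SWITCH TO dbo.FactBenefit PARTITION $PARTITION.pfMonth('2015-06-01');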
😎
June 9, 2015 at 12:20 pm
Eirikur Eiriksson (6/9/2015)
FootyRef (6/9/2015)
GilaMonster (6/9/2015)
FootyRef (6/9/2015)
I only want to do the switching of partitions when running the daily ETL processes.
You've talked about having monthly partitions and switching partitions out of the fact table, into a staging table, running some processing against the data you've switched out, and then switching it back in.
That is NOT what partition switching is for. It's not so you can switch portions of the table out to process them. It is so that you can load data into your fact tables without needing the long-duration locks that an insert would take. If you have partitioning by month, you'll switch data in and/or out at most once a month, load the new month's data in and move the earliest month out to archive or to be deleted if applicable.
So, I can't switch, say, four partitions out into a staging table, do inserts and updates against the data in those partitions, and switch them back into the fact table? For example, we have a fact table that keeps track of benefits issued to and taken away from our participants. We get new data every day. Let's say it is the second of the month and we received data for the last day of last month and for yesterday, the first day of this month. I want to grab the partitions in the fact table that are for last month and this month, switch them into a staging table, and run the ETL processes that will either add the new benefits issued or update the benefits previously issued to participants. Once the processes have completed, the partitions will be switched back from the staging table into the fact table in the data mart.
I have read many, many documents saying that is how it is done, but perhaps I have misunderstood them all. I am just trying to figure out a way to do it dynamically based on the months (based on Benefit Issuance Date) of the data I happen to be processing that day. This way, I only need to switch out those partitions and do not need to stage the whole table and load and index it into the mart.
Piling on a bit here: the normal flow is from the staging tables to the fact tables. Partitions can be switched "under" the facts from staging, but that is a one-way street, if you like. The point being that once it is in the fact table, the data should not need any further "processing". As Gail said, there are many, many problems lurking when you open this up as a two-way street, and the complexity is most certainly going to make the process more costly than doing without it.
😎
New records might need to be added for that partition month and already existing records might need to be updated for that partition month. I want to bring that partition month into its own staging table, do the inserts and updates and then switch it back to the original fact table from the staging table.
June 9, 2015 at 1:18 pm
FootyRef (6/9/2015)
I want to bring that partition month into its own staging table, do the inserts and updates and then switch it back to the original fact table from the staging table.
Why?
What are you going to gain from doing all that work as opposed to just doing the inserts and updates on the fact table?
It's not about preventing locks; you said no one's reading the table.
If the indexes on the fact table support the updates, then it's not about performance (and if they don't, then you'll need extra steps to add indexes to the staging table after the switch out and drop them again before switching back in).
So what do you expect to gain?
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
August 26, 2016 at 12:26 am
Grant Fritchey (6/9/2015)
There are places where partitioning can, and does, improve performance.
Apologies for the year-late response, and I hate to be contrary, but everywhere I've seen someone make that claim by example, I've seen others beat it with a correctly indexed monolithic table and correctly written code for the problem at hand. I've also beaten a couple of examples myself, and it's actually not that difficult.
I think the reason many people believe partitioning can help performance is that, when they partition, they also add the correct index to support the partitioning, which is normally the same index that gets the monolithic table to perform well. And the monolithic table outperforms the partitioned table in such cases because the code only needs to plow through one B-tree instead of one B-tree per partition. Even supposed "partition elimination" isn't as effective as a properly indexed monolithic table.
I will, however, agree that partitioning has some great benefits in reducing index maintenance on certain types of tables as well as offering greatly reduced backup requirements when static legacy partitions are set to READ_ONLY.
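For anyone who hasn't used that backup angle, a rough sketch (the database and filegroup names here are made up):
-- Once a legacy partition's filegroup is made read-only, it only needs to be
-- backed up once; routine backups can then skip it with a partial backup.
ALTER DATABASE SalesDW MODIFY FILEGROUP FG_2014 READ_ONLY;
GO
-- Partial backup: the primary filegroup plus all read-write filegroups only.
BACKUP DATABASE SalesDW READ_WRITE_FILEGROUPS
    TO DISK = N'X:\Backups\SalesDW_ReadWrite.bak';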
I'm also finding that the old Partitioned Views, which are available in Standard Edition as well as the "Expensive Edition", have advantages that Partitioned Tables don't have and give up very little compared to Partitioned Tables. There's even an easy equivalent to "SWITCH" and, even more importantly, static legacy partitions can be indexed differently from the currently active partition(s), for better read performance on the former and better insert performance on the latter, without giving up that SWITCH-like ability the way you would have to with Partitioned Tables.
There's another HUGE benefit to Partitioned Views over Partitioned Tables, but I have to save something "new" for the article. 😀
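For readers who haven't worked with them, a rough sketch of the partitioned view idea (the table and constraint names are my own, not Jeff's):
-- Member tables: one per month, each with a trusted CHECK constraint on the
-- partitioning column, which is also part of the primary key.
CREATE TABLE dbo.Sales_201607
(
    SaleDate date          NOT NULL CHECK (SaleDate >= '2016-07-01' AND SaleDate < '2016-08-01'),
    SaleID   int           NOT NULL,
    Amount   decimal(18,2) NOT NULL,
    CONSTRAINT pk_Sales_201607 PRIMARY KEY (SaleDate, SaleID)
);
CREATE TABLE dbo.Sales_201608
(
    SaleDate date          NOT NULL CHECK (SaleDate >= '2016-08-01' AND SaleDate < '2016-09-01'),
    SaleID   int           NOT NULL,
    Amount   decimal(18,2) NOT NULL,
    CONSTRAINT pk_Sales_201608 PRIMARY KEY (SaleDate, SaleID)
);
GO
-- The "partitioned table" is just a view. A query filtered on SaleDate only
-- touches the member tables whose CHECK constraints can match the predicate,
-- and each member table can be indexed however suits its role.
CREATE VIEW dbo.Sales
AS
SELECT SaleDate, SaleID, Amount FROM dbo.Sales_201607
UNION ALL
SELECT SaleDate, SaleID, Amount FROM dbo.Sales_201608;
GO
-- The rough equivalent of SWITCH is simply an ALTER VIEW that adds the newly
-- loaded month's table to (or removes the oldest one from) the UNION ALL list.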
--Jeff Moden
Change is inevitable... Change for the better is not.
August 26, 2016 at 2:01 am
Jeff Moden (8/26/2016)
Grant Fritchey (6/9/2015)
There are places where partitioning can, and does, improve performance.
Apologies for the year-late response, and I hate to be contrary, but everywhere I've seen someone make that claim by example, I've seen others beat it with a correctly indexed monolithic table and correctly written code for the problem at hand. I've also beaten a couple of examples myself, and it's actually not that difficult.
I've seen it as well (my howlers article on Simple Talk showed it), but in my experience it hasn't been a significant enough improvement that I'm willing to implement something as complex as partitioning to get it.
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
August 26, 2016 at 6:21 am
GilaMonster (8/26/2016)
Jeff Moden (8/26/2016)
Grant Fritchey (6/9/2015)
There are places where partitioning can, and does, improve performance.
Apologies for the year-late response, and I hate to be contrary, but everywhere I've seen someone make that claim by example, I've seen others beat it with a correctly indexed monolithic table and correctly written code for the problem at hand. I've also beaten a couple of examples myself, and it's actually not that difficult.
I've seen it as well (my howlers article on Simple Talk showed it), but in my experience it hasn't been a significant enough improvement that I'm willing to implement something as complex as partitioning to get it.
Oh believe me, I think the cases where it can work are pretty darned rare, but they do exist. Always and forever (until the technology changes anyway), partitioning is about data management. I say that over and over again. I just know that, in some cases, it does improve performance. I suppose even saying that is opening the door a crack which will cause an entire herd of horses to stampede through it, because everyone always believes that they, and they alone, are the single unique exception to the rule, any rule, all rules. However, it's still true. It's just exceedingly rare.
"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt
Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning
August 26, 2016 at 6:40 pm
Gail, your post shows the importance of measuring changes exactly, one by one. You are really good at this!
Another aspect of optimization is using what I call "the task insider info". Maybe you cover it in other articles - I haven't read them, so my apologies 😉
1. The first thing I see in the original query is that we have two entities (threads and posts) that are linked to each other not only by formal IDs, but also by a time relation. I mean that the thread (the "task") should always be older than its posts. Am I right?
If so, then "t.CreatedOn <= p.PostDate" should always be true, and we can add an extra constraint that helps us a lot:
... and p.PostDate >= @StartDate
If PostDate >= CreatedOn and CreatedOn >= @StartDate, then PostDate >= @StartDate as well.
Obviously, we can't add an expression that links PostDate and @EndDate, because @EndDate is about the thread, not the posts in it; a thread created before @EndDate can still contain newer posts.
2. We can add an index (the one the query optimizer advises), and this can help us a lot - on average, we only need to scan half of the index.
3. This is good, but we can achieve the best results if we do some denormalization - add a TaskCreatedOn column (the thread's creation date) to the Posts table. Then we can add an index on it and fully utilize it in the query. If querying the database this way is important, the denormalization is OK; it is only 3 more bytes on every row (if we use the date data type). We add only one more condition to the query (a rough DDL sketch is at the end of this post):
... and p.TaskCreatedOn between @StartDate and @EndDate
I suspect that this one denormalization trick and the extra condition alone will perform better than everything that was done in the original article (can you test it on your data?).
4. After that we can test other indexes on the other tables (listed in the article).
5. If we have a lot of time for such exercises, we can try to avoid the sorting and stream aggregation in the query. We can shift the work onto the proper index in combination with the proper query shape. First of all, there is a GROUP BY operation in the query and one aggregation, COUNT(*):
SELECT
...
,COUNT(*) AS ThreadPostCount
FROM
dbo.Forums AS f
INNER JOIN dbo.Threads AS t ON
f.ForumID = t.ForumID
INNER JOIN dbo.Posts AS p ON
t.ThreadID = p.ThreadID
INNER JOIN dbo.Users AS u ON
p.Poster = u.UserID
...
GROUP BY
f.ForumName
,DATEADD(MONTH, DATEDIFF(MONTH, 0, p.PostDate), 0)
,t.Title
,u.UserName
If we look at these operations, we can see that, for the purpose of counting (and the GROUP BY), we actually need only the Posts table (with some restrictions - the query result can be different, but only a bit):
f.ForumName doesn't matter (the original query doesn't contain an ORDER BY clause, so we can assume so 😉 )
DATEADD(MONTH, DATEDIFF(MONTH, 0, p.PostDate), 0) should be converted into a materialized calculation on the Posts table
t.Title can be replaced with p.ThreadID (if we have no threads with duplicated titles, the results are identical)
u.UserName can be changed to p.Poster (the same as above)
So the query should look something like this:
with postPreFiltered as
(
select
--cast(dateadd(month, datediff(month, 0, PostDate), 0) as date) AS Month
p.Month
,p.ThreadID
,p.Poster
from
dbo.Posts p
where
--p.PostDate >= @StartDate
p.TaskCreatedOn between @StartDate AND @EndDate
)
,
postAggregated as
(
select
*
from
(
select
*
,count(*) over
(
partition by
p.Month
,p.ThreadID
,p.Poster
) as ThreadPostCount
,row_number() over
(
partition by
p.Month
,p.ThreadID
,p.Poster
order by
(select 1)
) as SingleRow
from
postPreFiltered p
) t
where
SingleRow = 1
)
select
f.ForumName
,p.Month
,t.Title
,u.UserName
,p.ThreadPostCount
from
dbo.Forums AS f
inner join dbo.Threads AS t ON
f.ForumID = t.ForumID
inner join postAggregated /*dbo.Posts*/ AS p ON
t.ThreadID = p.ThreadID
inner join dbo.Users AS u ON
p.Poster = u.UserID
where
t.CreatedOn between @StartDate AND @EndDate
and
f.IsPublic = 1
;
So we use window functions on one table, and now we can add the proper index that helps avoid the Sort and Stream Aggregate. This optimization is very dependent on the data distribution, and I don't really think that Gail (the author of the original article) really wants a non-sorted query, because the user will see unsorted data - the index is primarily built on the IDs, not the strings ;).
I downloaded the archive with the database from the previous article and played with it a bit (it's a pity that it's not 25 million rows); anyway, I had a lot of fun trying to optimize the original query! Thank you, Gail!
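For what it's worth, here is a rough DDL sketch of the denormalization from point 3, using the Posts/Threads schema from the article (the index name and column choices are only suggestions):
-- Copy the thread's creation date onto every post, then index it together
-- with the other columns the query needs from Posts.
ALTER TABLE dbo.Posts ADD TaskCreatedOn date NULL;
GO
UPDATE p
SET    p.TaskCreatedOn = CAST(t.CreatedOn AS date)
FROM   dbo.Posts        AS p
INNER JOIN dbo.Threads  AS t ON t.ThreadID = p.ThreadID;
GO
CREATE INDEX idx_Posts_TaskCreatedOn
    ON dbo.Posts (TaskCreatedOn)
    INCLUDE (ThreadID, Poster, PostDate);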
August 26, 2016 at 9:21 pm
I love this thread. Just reading all the advice from experts is awesome. I am no expert but will add my 2 cents.
My company is moving to a SQL 2016 server and we are utilizing column store indexes. With column store indexes you do not specify a key. So no primary key at all. The column store index is a completely different beast.
I will say, and I have proof, that partitioning a column store index will benefit your queries. This assumes that you are partitioning by what you mostly filter on in the WHERE clause; for us it was a time dimension. If you partition on a time dimension and have a SELECT statement that only needs, say, four years of data, then by using partitioning SQL Server only goes to the partitions it needs. If you do not implement partitioning, it will scan them all.
You can try this and see it in the execution plan. Partitioning and clustered column store, to me, go hand in hand.
But that is a very small percentage, and most companies have not implemented or are not using column store yet.
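A minimal sketch of that kind of setup, with made-up names and yearly boundaries (assumes SQL Server 2016):
-- Partition the fact table on the date key; the clustered columnstore index
-- built on it inherits the partitioning, so rowgroups align with partitions.
CREATE PARTITION FUNCTION pfYear (int)
    AS RANGE RIGHT FOR VALUES (20140101, 20150101, 20160101, 20170101);
GO
CREATE PARTITION SCHEME psYear
    AS PARTITION pfYear ALL TO ([PRIMARY]);
GO
CREATE TABLE dbo.FactSales
(
    DateKey     int           NOT NULL,   -- e.g. 20160826
    ProductKey  int           NOT NULL,
    SalesAmount decimal(18,2) NOT NULL
) ON psYear (DateKey);
GO
CREATE CLUSTERED COLUMNSTORE INDEX cci_FactSales ON dbo.FactSales;
GO
-- A query filtered on DateKey only reads the partitions it needs; compare
-- "Actual Partition Count" in the execution plan with and without partitioning.
SELECT SUM(SalesAmount)
FROM dbo.FactSales
WHERE DateKey >= 20130101 AND DateKey < 20170101;  -- roughly four years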
August 29, 2016 at 6:25 am
GilaMonster (8/26/2016)
Jeff Moden (8/26/2016)
Grant Fritchey (6/9/2015)
There are places where partitioning can, and does, improve performance.
Apologies for the year-late response, and I hate to be contrary, but everywhere I've seen someone make that claim by example, I've seen others beat it with a correctly indexed monolithic table and correctly written code for the problem at hand. I've also beaten a couple of examples myself, and it's actually not that difficult.
I've seen it as well (my howlers article on Simple Talk showed it), but in my experience it hasn't been a significant enough improvement that I'm willing to implement something as complex as partitioning to get it.
Grant Fritchey (8/26/2016)
GilaMonster (8/26/2016)
Jeff Moden (8/26/2016)
Grant Fritchey (6/9/2015)
There are places where partitioning can, and does, improve performance.
Apologies for the year-late response, and I hate to be contrary, but everywhere I've seen someone make that claim by example, I've seen others beat it with a correctly indexed monolithic table and correctly written code for the problem at hand. I've also beaten a couple of examples myself, and it's actually not that difficult.
I've seen it as well (my howlers article on Simple Talk showed it), but in my experience it hasn't been a significant enough improvement that I'm willing to implement something as complex as partitioning to get it.
Oh believe me, I think the cases where it can work are pretty darned rare, but they do exist. Always and forever (until the technology changes anyway), partitioning is about data management. I say that over and over again. I just know that, in some cases, it does improve performance. I suppose even saying that is opening the door a crack which will cause an entire herd of horses to stampede through it, because everyone always believes that they, and they alone, are the single unique exception to the rule, any rule, all rules. However, it's still true. It's just exceedingly rare.
Thanks, folks. I seriously appreciate your feedback. Like I said, I've never had the pleasure of seeing partitioning beat a monolithic table for read performance so it's interesting to hear from folks that have. I know with all the things we do, it's impossible to memorize such exceptions. If you come across one in the future, I'd love to hear about it because even an old dog like me can learn something different now and then. 🙂
--Jeff Moden
Change is inevitable... Change for the better is not.
February 10, 2017 at 1:28 pm
Since Steve was kind enough to re-post this in today's daily email, I just now was able to read this posting. Nice job Gail!
While the purpose of the article was tuning without query modifications, I have to wonder if there wouldn't be an advantage to moving the following from the WHERE clause:
t.CreatedOn BETWEEN @StartDate AND @EndDate
to the INNER JOIN so it would look like this:
INNER JOIN dbo.Threads AS t
ON t.ForumID = f.ForumID
AND t.CreatedOn BETWEEN @StartDate AND @EndDate
Would this not have a positive effect on performance by reducing the number of records evaluated in the INNER JOIN?
LinkedIn: https://www.linkedin.com/in/sqlrv
Website: https://www.sqlrv.com
February 10, 2017 at 1:48 pm
Aaron N. Cutshall - Friday, February 10, 2017 1:28 PM
...I have to wonder if there wouldn't be an advantage to moving the following from the WHERE clause:
t.CreatedOn BETWEEN @StartDate AND @EndDate
to the INNER JOIN so it would look like this:
INNER JOIN dbo.Threads AS t
ON t.ForumID = f.ForumID
AND t.CreatedOn BETWEEN @StartDate AND @EndDate
Would this not have a positive effect on performance by reducing the number of records evaluated in the INNER JOIN?
If you look at the query's execution plan, you'll see that the WHERE clause of a query typically gets evaluated as a seek predicate before the JOIN operator is evaluated, so it already minimizes the number of rows the JOIN is handling.
February 10, 2017 at 2:13 pm
Aaron N. Cutshall - Friday, February 10, 2017 1:28 PM
Since Steve was kind enough to re-post this in today's daily email, I just now was able to read this posting. Nice job Gail!
While the purpose of the article was tuning without query modifications, I have to wonder if there wouldn't be an advantage to moving the following from the WHERE clause:
t.CreatedOn BETWEEN @StartDate AND @EndDate
to the INNER JOIN so it would look like this:
INNER JOIN dbo.Threads AS t
ON t.ForumID = f.ForumID
AND t.CreatedOn BETWEEN @StartDate AND @EndDate
Would this not have a positive effect on performance by reducing the number of records evaluated in the INNER JOIN?
No. Those two queries are completely identical in results and performance.
If, however, that was a LEFT OUTER JOIN, your change would have made it a different query with different results.
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
February 10, 2017 at 3:16 pm
GilaMonster - Friday, February 10, 2017 2:13 PM
No. Those two queries are completely identical in results and performance.
If, however, that was a LEFT OUTER JOIN, your change would have made it a different query with different results.
Thanks, Gail. Some time ago I got into the habit of consolidating criteria specific to a table into that table's join, and left joins may be the reason. Even if it doesn't help (or hurt) performance, I like keeping things as consolidated and consistent as I can. To me, it makes modifications so much easier, especially in large, complex queries. It must be the neat freak in me! 🙂
LinkedIn: https://www.linkedin.com/in/sqlrv
Website: https://www.sqlrv.com
February 12, 2017 at 5:35 am
Aaron N. Cutshall - Friday, February 10, 2017 3:16 PM
Thanks, Gail. Some time ago I got into the habit of consolidating criteria specific to a table into that table's join, and left joins may be the reason.
To be blunt, if I got code for a code review that had filters in the FROM clause, I'd send it back.
Filters belong in the WHERE clause (or HAVING if they're on aggregates), not in the FROM clause, with the sole exception of when the filter needs to be done before a LEFT JOIN. Mixing joins and filters results in a query that's harder to read and harder to work out what is being filtered than if the filters were all in the WHERE, and it does not get you any performance benefits.
When dealing with outer joins, filtering in the FROM or the WHERE result in two different queries with two different results, and it's not the case that one is correct and one is not.
eg
Customers LEFT JOIN Orders on <join condition>
WHERE Orders.OrderValue<1000 or OrderID is null
will return just customers who have never made an order, or if they have, only if they've made orders below R1000 (and those orders)
Customers LEFT JOIN Orders on <join condition> and Orders.OrderValue<1000
will return all customers and their orders if under R1000.
Which one of those is correct depends on what the requirement is. And, tbh, for the latter I'll often prefer joining to a subquery over putting the filter in the join. It makes it clearer when things are being executed.
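A sketch of that subquery form, using the same hypothetical Customers/Orders example (the column names are assumed):
-- All customers, and only their orders under 1000 (if any); the filter is
-- applied before the outer join, and the intent is visible at a glance.
SELECT c.CustomerID, o.OrderID, o.OrderValue
FROM dbo.Customers AS c
LEFT OUTER JOIN (
    SELECT OrderID, CustomerID, OrderValue
    FROM dbo.Orders
    WHERE OrderValue < 1000
) AS o ON o.CustomerID = c.CustomerID;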
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability