Complicated performance issue

  • Hi All,

    We have 2 live environments, both with the same database structure and application, but there is a big difference in performance between them.

    The amount of data is a little different (15-20% more in the slow environment), but the performance difference is much bigger than that.

    I have done index tuning for both sites.

    I attached 2 execution plans and client statistics showing a big difference in execution time between them. This is just one example, but most queries behave the same way.

    It also occurred to me that there may be some waits causing the SQL statements to be this slow, so I attached the wait stats from both environments as well.

    Note: mirroring is enabled on the slow environment, but things were slow even before mirroring was configured.

    I hope I provided enough information, but if anything else is needed please let me know. I really hope to get help on this.

    Regards

    Nader

  • I would start by making sure the statistics are up to date on both sides, or maybe just update them with a full scan if possible. Then execute the queries on both sides and share the plans here.
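
    For example, a minimal sketch (the table name is a placeholder; run it for each table the slow query touches):

    -- Placeholder table name; repeat per table used by the slow query.
    UPDATE STATISTICS dbo.YourTable WITH FULLSCAN;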

  • Hi Nader,

    The plans you posted actually have the same plan shape, so they do not explain the performance difference you are seeing.

    However, I do see that the estimated rowcounts are way off in both cases, and I suspect that you are suffering from the "Ascending Key Problem" - an issue where estimates are wrong because the assumption is that added data will be similar to existing data, but in reality all new data uses new values; this frequently occurs with identity columns or with dates. The plans show that a table named EPISODE is queried for data with start_date >= April 1st, 2015; SQL Server expects to find fewer than 30 rows but in reality there are over 65,000. A plan optimized for 30 rows will never perform well for that large number of rows.
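
    A quick way to check for that symptom (a sketch only - the statistics name is a guess; use whichever statistics object covers start_date):

    DBCC SHOW_STATISTICS ('dbo.EPISODE', IX_Episode_Start_Date) WITH HISTOGRAM;
    SELECT MAX(start_date) AS max_start_date FROM dbo.EPISODE;
    -- If the highest RANGE_HI_KEY in the histogram is well below MAX(start_date), the newest
    -- rows are effectively invisible to the optimizer - the classic ascending key symptom.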

    So I agree with curious: update statistics on the tables used in this query (EPISODE, PAT_BILLING, and COMP_BILLING), then try again.

    In the plan I see that you are querying a view (PENDING_GET_LIST), can you share the source code of that view? And the structure (CREATE TABLE statements, including all indexes) of the three tables it uses?

    I did not look at the wait statistics you posted. They are gathered over time and only reset when the server restarts or when you explicitly reset them. If you are able to test these queries when there is no other activity on the server, then I recommend collecting wait statistics, running the query, then collecting wait statistics again and computing the difference. That will show what waits occurred during execution of the query. (But again, try to do this when the server is not otherwise used - and yes, I am aware that this is a state not often possible on a production server).
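
    A minimal sketch of that before/after approach (the temp table name is just an example):

    -- Snapshot the cumulative wait stats before running the query.
    SELECT wait_type, waiting_tasks_count, wait_time_ms, signal_wait_time_ms
    INTO   #waits_before
    FROM   sys.dm_os_wait_stats;

    -- ... run the slow query here ...

    -- The difference shows the waits incurred while the query ran
    -- (plus anything else that ran at the same time).
    SELECT ws.wait_type,
           ws.waiting_tasks_count - b.waiting_tasks_count AS tasks_delta,
           ws.wait_time_ms        - b.wait_time_ms        AS wait_ms_delta
    FROM   sys.dm_os_wait_stats AS ws
           INNER JOIN #waits_before AS b
               ON b.wait_type = ws.wait_type
    WHERE  ws.wait_time_ms - b.wait_time_ms > 0
    ORDER BY wait_ms_delta DESC;

    -- (The counters can also be reset explicitly with:
    --  DBCC SQLPERF ('sys.dm_os_wait_stats', CLEAR);)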


    Hugo Kornelis, SQL Server/Data Platform MVP (2006-2016)
    Visit my SQL Server blog: https://sqlserverfast.com/blog/
    SQL Server Execution Plan Reference: https://sqlserverfast.com/epr/

  • First, I would like to thank you, Curious and Hugo, for the detailed reply.

    I have updated statistics on both databases for those 3 tables.

    The view still takes about 3 seconds on the faster DB and around 9 seconds on the slow one.

    I attached the new execution plans and the table and view designs.

    I also cleared the wait stats and posted them again.

    Can you guys please tell me how you knew from the execution plan that the tables needed a statistics update, so I can apply that in later work?

    Again, thank you very much for your help; I'm waiting for your feedback.

    Nader

  • Hugo Kornelis (3/31/2016)


    Hi Nader,

    and I suspect that you are suffering from the "Ascending Key Problem" - an issue where estimates are wrong because the assumption is that added data will be similar to existing data, but in reality all new data uses new values; this frequently occurs with identity columns or with dates.

    Does that issue occur because statistics were not updated, or does it have another cause?

    Also what about those trace flags?

    DBCC TRACEON (2388)

    DBCC TRACEON (2389)

    I read an article about them, but I'm not sure exactly how they would help.
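
    From what I understood of the article (please correct me if I got this wrong), they would be used something like this:

    -- TF 2388: changes the output of DBCC SHOW_STATISTICS so it shows whether the
    --          leading column is branded Ascending, Stationary or Unknown.
    -- TF 2389: makes the (pre-2014) optimizer compensate for columns branded Ascending
    --          by estimating beyond the last histogram step.
    DBCC TRACEON (2388);
    DBCC SHOW_STATISTICS ('dbo.EPISODE', IX_Episode_Start_Date);   -- statistics name is a guess
    DBCC TRACEOFF (2388);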

    Regards

    Nader

  • On the slower one you are pulling about 20K more records and the amount of data is doubled (2 MB vs 1 MB), so you are also hitting PAT_BILLING 20K more times. I think that is where your problem is.

  • But would that raise the time from 3 seconds to 9 seconds? I think the difference is too big for that.

    Is the estimated row count based on stats correct now?

  • nadersam (3/31/2016)


    But would that raise the time from 3 seconds to 9 seconds? I think the difference is too big for that.

    Is the estimated row count based on stats correct now?

    It depends; that extra 20K will make a difference, and you can also see that the amount of data you are pulling is twice that of the faster one. The effect multiplies: though it is only 20K more rows, you are hitting the other table 20K more times.

  • Hugo Kornelis (3/31/2016)


    Hi Nader,

    The plans you posted actually have the same plan shape, so they do not explain the performance difference you are seeing.

    However, I do see that the estimated rowcounts are way off in both cases, and I suspect that you are suffering from the "Ascending Key Problem" - an issue where estimates are wrong because the assumption is that added data will be similar to existing data, but in reality all new data uses new values; this frequently occurs with identity columns or with dates. The plans show that a table named EPISODE is queried for data with start_date >= April 1st, 2015; SQL Server expects to find fewer than 30 rows but in reality there are over 65,000. A plan optimized for 30 rows will never perform well for that large number of rows.

    So I agree with curious: update statistics on the tables used in this query (EPISODE, PAT_BILLING, and COMP_BILLING), then try again.

    In the plan I see that you are querying a view (PENDING_GET_LIST), can you share the source code of that view? And the structure (CREATE TABLE statements, including all indexes) of the three tables it uses?

    I did not look at the wait statistics you posted. They are gathered over time and only reset when the server restarts or when you explicitly reset them. If you are able to test these queries when there is no other activity on the server, then I recommend collecting wait statistics, running the query, then collecting wait statistics again and computing the difference. That will show what waits occurred during execution of the query. (But again, try to do this when the server is not otherwise used - and yes, I am aware that this is a state not often possible on a production server).

    Hi Hugo,

    You mentioned the "Ascending Key Problem" - can you please give me more details on that issue? After I update stats, I still see the estimated number of rows being very different from the actual number of rows.

    Appreciate your feedback.

    Nader

  • EDIT:

    Sorry, that's what you get for using multiple windows. I posted the text intended for another topic here. Apologies.


    Hugo Kornelis, SQL Server/Data Platform MVP (2006-2016)
    Visit my SQL Server blog: https://sqlserverfast.com/blog/
    SQL Server Execution Plan Reference: https://sqlserverfast.com/epr/

  • Okay, and here is the post I wanted to make in this topic.

    Apologies for not replying sooner. I have been on the road for a few days, traveling to and speaking at conferences in Vienna and Tel Aviv, but I now finally have time to revisit this topic, download the new plans and look at what changed.

    First, to reply to your questions:

    Can you guys please tell me how you knew from the execution plan that the tables needed a statistics update, so I can apply that in later work?

    I opened both execution plans in SSMS, and started looking at the plan as a whole, at the individual operators, and at the properties of all operators. I cannot tell you exactly what I look for, because I look at everything, and focus on things that seem odd. In this case, the thing that struck me most was the huge difference between the estimated and the actual rowcount.

    Such a difference is an indication that either a plan is reused that should not be (e.g. bad parameter sniffing), or estimates are off (which can be caused by a lot of things). I then dug a bit deeper and noticed that the index seek that has the biggest difference (over 83,000 actual rows versus 21.9 estimated on the original slow plan) includes a predicate on a date column. In my experience, when you see a huge under-estimate on a date column, it is in over 90% of all cases caused by the ascending key problem.

    After looking at the new posts, I now know that you are not in that 90% group. More on that below.

    You mentioned the "Ascending Key Problem" - can you please give me more details on that issue?

    Even though it does not appear to apply in your case, it is good to know about. Here is a good article about that issue: https://www.simple-talk.com/sql/database-administration/statistics-on-ascending-columns/. Note that the fixes described in that article apply to SQL Server 2012 and older only; in SQL Server 2014 a lot of the logic for cardinality estimation changed, and this included a new way to (try to) deal with the ascending key problem. See http://www.sqlpassion.at/archive/2014/05/19/statistics-with-ascending-key-column-values/ for a quick description of the changes.

    Anyway, back to your problem. I looked at your code; unfortunately I could not run it because you have not included the "Episode" table. You also did not include the indexes.

    But I could look at the code you did post. The view definition looks quite straightforward - two similar joins, between the "Episode" table and each of the two "Billing" tables, with the same filter conditions and the results combined using UNION ALL. Not uncommon when you have two tables that are essentially subtypes of a single super-type (based on the table names, I guess that you use two different billing plans, perhaps for patients that are invoiced directly and patients that are invoiced through their insurance company). A better database design might have been to have a single table for billing and only have two separate sub-tables for the columns that are actually different, but this is the design you got and you need to work with it. Fair enough.

    I do recommend changing the join syntax - comma-separating tables and specifying the join condition in the WHERE clause is supported, but a lot of people get confused by it and it's easy to forget a join condition. Please use the infixed join syntax ("FROM table1 AS a INNER JOIN table2 AS b ON a.Column = b.Column") instead.

    You should remove the TOP(100) PERCENT and the ORDER BY, because they will be ignored anyway. This used to be an (undocumented, and also very unreliable) trick to force SQL Server to sort a view when you run a SELECT on it without ORDER BY, but in (I believe) SQL Server 2005 the optimizer was improved to understand that TOP(100) PERCENT will not actually remove any rows, and since the ORDER BY in this context is only used to define the meaning of the TOP, the sorting is not done anymore.

    I would personally also prefer to use aliases on all tables, and to prefix all columns with the alias of the table they come from, but that is a readability and maintainability issue.
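
    To illustrate all three points at once, here is a rough sketch of how the view might look when rewritten - note that the column list and the exact Main_Serv/GstStatus test are placeholders based on what is visible in the plans, not your actual definition:

    ALTER VIEW dbo.PENDING_GET_LIST
    AS
    SELECT e.Episode_Key, e.Start_Date, pb.Main_Serv, pb.GstStatus     -- placeholder column list
    FROM   dbo.EPISODE AS e
           INNER JOIN dbo.PAT_BILLING AS pb
               ON pb.Episode_Key = e.Episode_Key
    WHERE  e.Start_Date >= '20150401'
      AND  (pb.Main_Serv = '...' OR pb.GstStatus = '...')              -- placeholder filter
    UNION ALL
    SELECT e.Episode_Key, e.Start_Date, cb.Main_Serv, cb.GstStatus
    FROM   dbo.EPISODE AS e
           INNER JOIN dbo.COMP_BILLING AS cb
               ON cb.Episode_Key = e.Episode_Key
    WHERE  e.Start_Date >= '20150401'
      AND  (cb.Main_Serv = '...' OR cb.GstStatus = '...');
    -- No TOP (100) PERCENT and no ORDER BY: they would be ignored anyway.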

    However, none of the above is related to your performance issues, so let's get back to what I can conclude from the plan.

    You write that you still see the same issue after updating statistics. That alone already suggests that this is not the ascending key problem. And you also get the same plans, with only minor changes in the estimated (and actual) rowcounts. So as I said before, in your case the ascending key problem is not the issue.

    Another possible explanation for your slow query could be a back-firing row goal. The term "row goal" means that SQL Server optimized the query for a specific number of rows. In this case, for 30 - because of the TOP 30 in the "SELECT TOP(30) * FROM View_Name" statement. When the optimizer knows it only has to return 30 rows, it can often produce a much faster plan. An example of a good use of a row goal would be when the result set without TOP 30 would be two million rows and assembling that entire result set in the fastest possible way would take 20 seconds - 10 microseconds per row. Another strategy would find the results one row at a time, at a pace of 500 microseconds per row. If you just need 30 rows, then the method that takes more time per row is faster overall - because 30 times 500 microseconds is 15 milliseconds, considerably faster than the 20 seconds for the entire set.

    In your case the row goal backfires horribly. I will walk you through one of the plans and tell you what happens and why.

    Let's look at the "slownew" plan. It is important to realize that operators work in a "pull" model - when called they will produce one row and then wait until called again. So the Index Seek operator in the upper right corner will not proactively process all rows that match its search criteria; it will return the first row when called, then return the second when called again, and so on. This is important because of the TOP operator at the far left. This TOP operator will stop calling its input operator after receiving the 30th row, so in a simplified case where every row read by the Index Seek at the far right ends up as one row in the results, the Index Seek will return 30 rows and then not be called anymore, ignoring all the other rows that it might have read if it had been called more often.

    The Index Seek at the far right uses one of the indexes on the Episode table to find rows that match the date filter (you can see this by looking at the Seek Predicates property of the operator). The estimated number of rows is 21.84. In this case, that estimate is not based on the number of rows SQL Server thinks match the Seek Predicates, but on the number of times that the operator will be called. So the optimizer expects that it needs to call this operator somewhere between 21 and 22 times, and then the 30th row of the final query will be ready and the work will stop. The actual rowcount is much higher, over 80,000, and I think that these are actually all qualifying rows. In other words, during the actual execution the goal of 30 rows was not even met yet when the index seek had processed all rows available and returned an "end of data" condition.

    The Nested Loops operator in the upper branch then calls its other input for each row it receives. Nested Loops operators are great if the number of rows in the upper input is small and the lower input is cheap. In this case, the lower input is indeed cheap (a single Clustered Index Seek), and the estimated number of rows in the upper input is low, but the actual number of rows is not. I am pretty sure that this is where most of the three- or nine-second execution time is spent. If you look at the properties of the Clustered Index Seek itself, you will see not only a Seek Predicates property that implements the join, but also a Predicate property that implements the rest of the WHERE clause, filtering on Main_Serv and GstStatus. The estimated rowcount (which represents the estimate per execution) is 1.37. Doing the math, you see that with an expected 21.8 rows from the upper input, each matching an expected 1.37 rows from the lower input, you will end up with exactly 30 matches. And that is indeed the estimate for the number of rows produced by the Nested Loops operator. However, the logic used by the optimizer was actually the reverse of this - it started with an estimation that for each row from Episode, there would be on average 1.37 rows in Pat_Billing with the same Episode_Key and the correct values in Main_Serv and GstStatus. We need to produce 30 rows, so that means that we will have to read (30 divided by 1.37 =) 21.84 rows from Episode.

    And here we have identified what I think is the root cause of the slow performance. For some reason, SQL Server thinks that, for rows in Episode with a start_date of April 1, 2015 or later, most of them will have at least one or two rows in Pat_Billing that have the same Episode_Key and the right values for Main_Serv and GstStatus. Because of that, and because it only needs to find 30 rows, the optimizer expects that it will be done after processing 21-22 rows from Episode. Based on that estimation, the choice for a Nested Loops operator is fully justified. But in reality, most rows from Episode have no matches at all. In fact, even after processing all 83,690 rows from Episode, the total number of matches found by the Nested Loops operator is still stuck at just 2! If the optimizer had "known" that it would have to perform the join for all 83,690 rows and still not find the 30 matches it needs, it would almost certainly have used a different plan, probably with a Hash Match or a Merge Join operator instead of the Nested Loops.

    One way to make this query go faster is to find out why SQL Server mis-estimates the number of matching rows in Pat_Billing and try to fix that. But that will be hard, because you need to understand how statistics work and how the cardinality estimator processes joins and filters that combine AND and OR. Very complex stuff. Another way would be to simplify the filter, but that is hard to do without impacting the correctness of the query. If you control the database, you might consider adding a persisted computed column that simply represents the result of the test on Main_Serv and GstStatus used in this query, then rewrite the view to just test whether that column is 'Y' or 'N' (see the sketch below). However, do not do this unless this is a major critical query or unless you have dozens of other queries and views that all use the exact same test. Yet another option is that perhaps the data is somehow skewed - did something change in April 2015? Perhaps the Pat_Billing table has millions of rows that only match episodes with a start_date before that, but hardly any rows matching later episodes? If that is the case, then perhaps you should consider using separate tables for data before and after April 2015 - although that can introduce new issues as well.
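
    As a sketch of the computed-column idea (the CASE test and the column and index names are placeholders for whatever your real Main_Serv/GstStatus test is; COMP_BILLING would need the same treatment):

    ALTER TABLE dbo.PAT_BILLING
        ADD Billing_Pending AS CASE WHEN Main_Serv = '...' OR GstStatus = '...'
                                    THEN 'Y' ELSE 'N' END PERSISTED;

    CREATE INDEX IX_PAT_BILLING_Pending
        ON dbo.PAT_BILLING (Episode_Key, Billing_Pending);
    -- The view would then filter on Billing_Pending = 'Y' instead of repeating the full test.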

    But there is also a much simpler way to - probably - optimize this query. Just remove the TOP(30) from the statement. The query currently produces 2 rows, so the TOP(30) does not affect the results at all; and removing the row goal will change the optimizer's choices entirely. I cannot guarantee that this will speed up the query execution - but in all honesty, I expect that it almost certainly will. On both your servers.

    If you are afraid that you might ever get more than 30 rows, then you can work around that by first running the SELECT (without TOP) into a temporary table and then fetching TOP(30) from that temporary table; or you can add an ORDER BY to your query. That last tip is not guaranteed to work, because it again creates a whole new problem for the optimizer, but it is likely that you will get a plan that first finds all results, then sorts them, and then fetches the first 30. Plus, it also makes the selection of which 30 rows to produce more predictable.
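
    A sketch of the temp-table variant (the ORDER BY column is a guess - use whatever defines which 30 rows you actually want):

    SELECT *
    INTO   #pending
    FROM   dbo.PENDING_GET_LIST;   -- no TOP here, so no row goal in this step

    SELECT TOP (30) *
    FROM   #pending
    ORDER BY Start_Date;           -- guessed ordering column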

    (For completeness' sake: perhaps you wonder why, if the estimated number of rows for the first input to the Concatenation operator is 30, the lower input still has an estimate of 1 row. That is because SQL Server never estimates less than 1 row, except when your query contains something that cannot possibly return any rows; that is the only case where you will ever see an estimate of 0. In this case, even though the estimates tell SQL Server that the second branch will probably never execute, it is still possible, so SQL Server uses the lowest estimate it can, which is 1.)

    Now all of the above is about the slowness of BOTH queries. The slow query has to process more rows than the fast query, but certainly not three times as many, so that would not explain why the run-time goes from 3 to 9 seconds. Unfortunately, there is nothing in the plan that would explain this.

    If (after reading all the above and hopefully getting much better performance without the row goal) you are still interested in trying to explain this, then I recommend first running the same query a few times in a row to make sure that the performance is consistent. If it is, then enable SET STATISTICS IO and look at the number of reads from each table. Often people only look at logical reads, but in this case you should also look at physical reads and read-ahead reads. Perhaps in this specific case, the extra 20,000 rows somehow caused the buffer pool to overflow so that more physical IO was needed. Finally, look at the specifications of the servers. How fast are their disk subsystems? How much memory is available for the buffer pool? How many concurrent queries are running? Etc.
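
    For example (a sketch; compare the output between the two servers):

    SET STATISTICS IO ON;
    SELECT TOP (30) * FROM dbo.PENDING_GET_LIST;   -- or the version without the TOP
    SET STATISTICS IO OFF;
    -- In the Messages tab, compare logical reads, physical reads and read-ahead reads per table.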

    In the wait stats you posted, I see a high amount of CXPACKET waits. These occur when queries are executed in parallel. However, the plans you posted are both serial, so these waits can only come from other workloads that were running at the same time. Concurrent workloads can have a huge impact on overall performance.

    The wait stats for the slower server show wait types related to Database Mirroring and to AlwaysOn Availability Groups. Is this perhaps a readable secondary in an AlwaysOn Availability Group? If that is the case, then you might want to investigate whether readable secondaries are known to be slower - this is something outside my domain of expertise, so I would not know.
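
    If you want to check, I believe a query along these lines (a sketch, not tested against your setup) shows the local replica's role and whether it allows read access:

    SELECT ar.replica_server_name,
           ars.role_desc,
           ar.secondary_role_allow_connections_desc
    FROM   sys.availability_replicas AS ar
           INNER JOIN sys.dm_hadr_availability_replica_states AS ars
               ON ars.replica_id = ar.replica_id
    WHERE  ars.is_local = 1;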


    Hugo Kornelis, SQL Server/Data Platform MVP (2006-2016)
    Visit my SQL Server blog: https://sqlserverfast.com/blog/
    SQL Server Execution Plan Reference: https://sqlserverfast.com/epr/

  • nadersam (3/31/2016)


    First, I would like to thank you, Curious and Hugo, for the detailed reply.

    I have updated statistics on both databases for those 3 tables.

    The view still takes about 3 seconds on the faster DB and around 9 seconds on the slow one.

    I attached the new execution plans and the table and view designs.

    I also cleared the wait stats and posted them again.

    Can you guys please tell me how you knew from the execution plan that the tables needed a statistics update, so I can apply that in later work?

    Again, thank you very much for your help; I'm waiting for your feedback.

    Nader

    Can you please include the EPISODE table as well?

    _____________
    Code for TallyGenerator

  • Thank you very much, Hugo, for the very detailed reply. I think it's great even if I can't fix that query, because I learned many things. I posted this as an example of a complaint we get that most SQL statements on the first environment are slower than on the second one.

    For the sake of testing, I created a new table called testepisode with the same structure as the original, and I deleted the roughly 20K extra rows that it has compared to the fast environment.

    And it was still slow 🙂 but not as slow as before.

    One of the important things I learned was to create a weekly maintenance plan to update statistics with FULLSCAN.

    Once again, I would like to thank you very much for your support.

    Regards

    Nader

  • Sergiy (4/3/2016)


    nadersam (3/31/2016)


    First, I would like to thank you, Curious and Hugo, for the detailed reply.

    I have updated statistics on both databases for those 3 tables.

    The view still takes about 3 seconds on the faster DB and around 9 seconds on the slow one.

    I attached the new execution plans and the table and view designs.

    I also cleared the wait stats and posted them again.

    Can you guys please tell me how you knew from the execution plan that the tables needed a statistics update, so I can apply that in later work?

    Again, thank you very much for your help; I'm waiting for your feedback.

    Nader

    Can you please include the EPISODE table as well?

    Sorry for missing the EPISODE table; I have included it along with its indexes in this post.

    Regards

    Nader

  • Some follow up on what you mentioned Hugo.

    After removing the TOP 30, I took another look at the index seek operation on the EPISODE table.

    The actual number of rows is now almost exactly the same as the estimate, which was one of the things I couldn't understand before, even after updating statistics.

    Of course, you explained that in detail in your post.

    Thanks again.

    Nader
