Hi, and thanks for all the replies. Yesterday was a long day and we finally got the rebuilds complete last night. I'm not happy at all with the solution, as it is unheard of to take a production system offline to rebuild tables in order to get the system working again. I'm also unhappy that we cannot quite pinpoint the underlying problem.
Allow me to answer your questions:
What other solutions have you tried?
We have tried "ALTER INDEX WITH REORGANIZE", "ALTER INDEX WITH REBUILD" without success.
Have you look at locks and wait states to understand why it's doing this? You should be able to at least narrow down the problem.
The excessive wait stats were all SOS_SCHEDULER_YIELD. At the time there were very few requests and very little IO, but the CPU was running between 85% and 100%.
App code or not (and I assume we're not talking 3rd party here), if things are this bad, that you have to take the production system offline, I don't know of many businesses that couldn't find a bit of time for some refactoring. It sounds like you need it.
3rd party (and 4th party) stuff. This software has a data layer that builds SQL on the fly. It creates temporary stored procedures (CREATE PROCEDURE #aabb001 @P1, @P2 AS .....) then it executes these with the necessary parameters. The customer does not want us to create stored procedures as they do not want to be vendor dependant. It is a nightmare to maintain and optimise.
Are those tables heaps or clusters?
What's logical fragmentation?
I couldn't take a reading, but file growth is set at 1Gb per file. They are on a dedicated disk array on a SAN.
forwarding pointers? avg page space used? at the point that you need to do a 'rebuild'?
Not sure. I only checked index fragmentation which was 99% on one of the tables, but an index rebuild did not help much.
What version of SQL are you using? What edition?
SQL Server 2008 R2 (RTM) - 10.50.1600.1
p.s. How experienced are your DBAs?
My boss was working with me on the issue and he knows his SQL very well. Been around the block a few times. My DBA skills are ok but I'm primarily a developer.
I'm thinking when he does a load of data, the statistics go out of date, and "it grinds to a halt"? updating statistics right after the large load seems to be one of the first things i'd look at.
I did not think of statistics as Auto Create Stats and Auto Update Stats are both on.
If you can't change the design or application. then you could change the hardware... I have seen some good performance improvements from using SSDs on badly designed databases, though at over 1TB it will be expensive..
Too expensive for this client, besides the disks were completely underutilised yesterday.
Just a thought, do you drop the indexes prior to a large load of data and then rebuild same indexes afterwards?
Not at all.
Also, there may be a little bit of back-end tweaking that can be done, whilst using views to present the same logical output to the application. Not knowing the structure of the db, this is just a thought aired.
We currently have a partitioning proposal with the client as well as an archiving proposal. The archiving is difficult due to the applications data layer, but we want to implement distributed views with insert and update triggers.
Can you take us step by step through your 'large data load', including existing rowcounts, expected inclusion volume, number of tables, if you use staging tables to cleanse the data if there's any cleanup to be done, etc?
The first time this happened, in December, the client did an annual run that updated each and every member on the system. The update actually created a new version of each so there was a large number of new rows in a large number of tables, although I cannot give you numbers. Monday nights update was very different. We needed to add new properties to all the members. The strange thing was that my script to add the new properties was run two weeks ago with no ill effect. My script added the properties with a default value of NULL. The script added about 250,000 rows to 5 tables, so over a million new rows. On Monday evening their application, via the data layer, performed an update in a loop to set the value to a valid string. They manage to update 91,000 rows in about 10 hours before stopping the process. This is when the system croaked. Don't ask me why they decided to perform an update like this, it is very typical of the incompetence we have to deal with on a daily basis. I have since informed them that a SQL script could perform the update in a set in a matter of minutes, but that is besides the point.
The SQL Guy @ blogspot