Which Database is Causing the Problem?

Pick a bottle, any bottle (Moet & Chandon)

You’ve got one SQL Server hosting a bunch of databases, and performance is bad.

Which database should you focus on first?

Here are a few ways you can pick – but keep in mind they may produce different answers, especially when you’ve got several databases that suck, or when different sort orders surface different culprits.

Option 1: check the plan cache.

As SQL Server executes queries, it tracks which queries it sees most often, plus helpful metrics like how much CPU time they used, how much data they read, how many times they spilled to disk, and so on.

To find out which queries are using the most CPU power, install sp_BlitzCache, and then run it:
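
A minimal sketch, assuming you’ve installed sp_BlitzCache from the open source First Responder Kit:

/* Show the most resource-intensive queries in the plan cache.
   With no parameters, it uses its default sort order. */
EXEC sp_BlitzCache;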

By default, it sorts by CPU, finding the most CPU-intensive queries. You can also use different sort orders like these, and check the documentation for more:
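
A few common ones, as a sketch – the parameter values below come from the sp_BlitzCache documentation:

/* Alternate sort orders - check the documentation for the full list. */
EXEC sp_BlitzCache @SortOrder = 'reads';        /* most logical reads */
EXEC sp_BlitzCache @SortOrder = 'duration';     /* longest total runtime */
EXEC sp_BlitzCache @SortOrder = 'executions';   /* executed most often */
EXEC sp_BlitzCache @SortOrder = 'memory grant'; /* largest memory grants */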

The results show you the most resource-intensive queries, and the very first column shows the database name.

I just know the answer is down here somewhere, give me more time

Drawbacks:

  • The database name is just the context where the query ran. If you’ve got cross-database queries, or people are running queries from tempdb, you won’t really see which objects they’re accessing.
  • The data can be somewhat transient when your server is under heavy memory pressure, like trying to host 1TB of data on a VM with 16GB RAM.
  • If your app is using unparameterized SQL, the results won’t be as helpful.

Option 2: check the file statistics.

SQL Server tracks which files get the most reads & writes, and you can query that with the system function sys.dm_io_virtual_file_stats:
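
A rough sketch – it joins the file stats to sys.master_files so you get readable database and file names:

/* Cumulative I/O per database file since the counters were last reset. */
SELECT  DB_NAME(vfs.database_id) AS database_name,
        mf.name AS logical_file_name,
        mf.type_desc,
        vfs.num_of_bytes_read,
        vfs.num_of_reads,
        vfs.num_of_bytes_written,
        vfs.num_of_writes
FROM    sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
INNER JOIN sys.master_files AS mf
        ON  mf.database_id = vfs.database_id
        AND mf.file_id     = vfs.file_id
ORDER BY vfs.num_of_bytes_read DESC;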

Here, I’m sorting them by which files have had the most data read since the server was restarted. (The counters can also be reset when the database is restored, taken offline/online, etc.)

Drawbacks:

  • These numbers don’t include data retrieved from cache.
  • These DO include system tasks like backups, corruption checks, and index rebuilds.
  • The query above is pretty vanilla – for more details like latency, check out sp_BlitzFirst, as sketched below.
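
For example – assuming the First Responder Kit is installed, something like this returns wait stats, file stats, and Perfmon counters measured since startup, including per-file read and write latency:

/* File stats with latency, plus wait stats and Perfmon counters, since startup. */
EXEC sp_BlitzFirst @SinceStartup = 1;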

Option 3: check which pages are cached.

If one database’s pages are dominating SQL Server’s memory, then queries in that database are probably the ones repeatedly reading that same data over and over.

To find out which database’s data is using up the most cache, use the sample query from Books Online:
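
Roughly like this – a sketch along the lines of the documentation’s example, grouping cached pages by database and converting the 8KB page counts into megabytes:

/* How much of the buffer pool each database's data occupies right now. */
SELECT  CASE database_id
            WHEN 32767 THEN 'ResourceDb'
            ELSE DB_NAME(database_id)
        END AS database_name,
        COUNT(*) AS cached_pages,
        COUNT(*) * 8 / 1024 AS cached_mb
FROM    sys.dm_os_buffer_descriptors
GROUP BY database_id
ORDER BY COUNT(*) DESC;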

dbo.Champagne is both the problem and the solution

Drawbacks:

  • Querying this data is sloooow, and it gets slower the more memory you have – on servers with more than 64GB of RAM, it can take several minutes.
  • This data is extremely transient – it changes all through the day as people run different queries, causing different pages to go in & out of cache.
  • This doesn’t necessarily mean that the queries reading the most data are the ones causing the biggest problems.

8 Comments

  • I use wait data to determine the heavy hitters – a 1-second sampling rate to collect who is doing what 24×7. You can group by time range, user, database, etc. to determine who is using the most of which resource. It doesn’t provide a “total cost” or “total cache size,” but from a statistical view, the sampling rate provides sufficient data points to identify problematic code. Similar to what commercial tools do, but you can build it for free.

    • When you say “wait data” – where are you sampling wait stats from? sys.dm_os_wait_stats doesn’t show which database is involved, and the query-level wait stats are based on the query’s context, so just making sure you’re aware that you have the same drawbacks as option 1 (the plan cache).

  • Hi Brent
    I sometimes use the “Activity – Top Sessions” report in SSMS, under Reports.
    There I see “Top CPU Consuming Sessions”, “Top Memory Consuming Sessions”, “Top Sessions By # Reads”, and “Top Sessions By # Writes”.
    Did you not consider this report because it’s not possible to see applications that don’t use connection pooling?

    • KnoedelDBA – great question. I didn’t consider those because a session can hop around between databases, like running queries in several different databases. I don’t have a problem with that approach, though – it’s no worse than any of the approaches above, since they all have drawbacks. I like the idea!

  • Willem Leenen
    April 18, 2019 12:53 am

    What helped me very much is that I make a clear distinction between performance data (duration) and diagnostic data (CPU consumption, (b)locking, etc.). In your scenario, I often create one XEvent session per day (mondaytrace, tuesdaytrace, etc.), started and stopped by jobs. I create a topSQL overview based on duration, grouped by SQL_hash. I enhance this with the business angle – e.g. SLAs, informal as they may be. If needed, I filter on the time of the slowness. Now I have a clear overview of where the time is spent. “Timewriting for databases,” a manager once dubbed this, when I showed him a telling pie chart. Only then do I go for diagnostic data, depending on the performance profile of the environment (webshop, DWH, or OLTP are the 3 main ones I recognise). That way I know whether I need to check for component stress, execution plan instability, locking, etc.
    Without the above approach, I would be solving problems without knowing how big they are, nor would I know the impact of my tuning.

  • Shaun Austin
    April 18, 2019 4:21 am

    Regarding cached pages, I wrote a very similar script a while ago and have been using it to log buffer pool usage every hour. It takes 22 seconds to complete on a server running SQL 2016 Standard, so it’s limited to 128GB of memory for the buffer pool (although the server has far more). However, the query goes parallel to the tune of 8 cores – running it with MAXDOP 1 actually speeds it up.

    As Brent says, the data is very transient. The more often you capture the data, the more reliable it is in terms of pinpointing memory usage – however, it’s SLOW, so look out for it going parallel!

  • Emanuele Meazzo
    April 18, 2019 6:29 am

    If “the problem” is “queries are slow,” then I usually just set up an XEvent session tracking queries longer than an acceptable threshold, and then see what I caught and what they are up to.
    For a quantitative analysis, a histogram target bucketed by DB is OK too, to get a rough idea of where most of the stuff is running.

