Data compression

  • Hi,

    I've recently upgraded to Enterprise Edition that has data compression.

    Some tables have page compression savings of about 70% or more (from 400 MB down to 120 MB).

    But the PK is clustered and it has over 90,000 writes and only 1,000 reads...

    Should the 75% select percentage recommended by Microsoft be applied to the data (the clustered PK is both the data and an index) or just to the nonclustered indexes?

    Even if I drop the clustered index and turn the table into a heap, it will still have high writes and low reads...

    Thanks,

    Pedro



    If you need to work better, try working less...

    Data compression is mostly about optimizing storage. Page Compression will yield approximately 70% compression in tables and indexes. By reducing the number of page reads and page buffers, it can potentially improve the performance of queries, but the degree to which compression improves query performance depends on the workload. Queries that scan across a large number of pages, something like full or partial table/index scans, will benefit most, but even then it probably won't be a 70% improvement.

    Data compression works best for tables that are mostly inserted and selected. If you update compressed data, especially if you're updating primary keys, then compression can actually degrade performance significantly due to page splitting.
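    For example, here's a minimal sketch of how you might preview the savings and then apply page compression (dbo.MyTable is just a placeholder name):

    -- Preview the estimated savings before committing to the rebuild
    EXEC sp_estimate_data_compression_savings
         @schema_name      = 'dbo',
         @object_name      = 'MyTable',
         @index_id         = NULL,   -- NULL = all indexes
         @partition_number = NULL,
         @data_compression = 'PAGE';

    -- Apply page compression to the table (heap or clustered index)...
    ALTER TABLE dbo.MyTable REBUILD WITH (DATA_COMPRESSION = PAGE);

    -- ...and to all of its nonclustered indexes
    ALTER INDEX ALL ON dbo.MyTable REBUILD WITH (DATA_COMPRESSION = PAGE);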

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • Eric M Russell (9/28/2016)


    Data compression is mostly about optimizing storage. Page Compression will yield approximately 70% compression in tables and indexes. By reducing the number of page reads and page buffers, it can potentially improve the performance of queries, but the degree to which compression improves query performance depends on the workload. Queries that scan across a large number of pages, something like full or partial table/index scans, will benefit most, but even then it probably won't be a 70% improvement.

    Data compression works best for tables that are mostly inserted and selected. If you update compressed data, especially if you're updating primary keys, then compression can actually degrade performance significantly due to page splitting.

    I know that a compression of 70% doesn't imply a performance improvement of 70%...

    Compressed data => fewer reads => less IO (which is almost always the bottleneck...).

    The primary key is never updated (identity seed) but the data is often updated... The data is selected by a reference field entered by the user... That's why the PK has many writes and almost no reads... The auxiliary index (user + reference) has lots of reads and no writes (updates).

    Any table or index whose fields are updated often won't benefit from compression, right?

    Thanks,

    Pedro



    If you need to work better, try working less...

  • PiMané (9/28/2016)


    Eric M Russell (9/28/2016)


    Data compression is mostly about optimizing storage. Page Compression will yield approximately 70% compression in tables and indexes. By reducing the number of page reads and page buffers, it can potentially improve the performance of queries, but the degree to which compression improves query performance depends on the workload. Queries that scan across a large number of pages, something like full or partial table/index scans, will benefit most, but even then it probably won't be a 70% improvement.

    Data compression works best for tables that are mostly inserted and selected. If you update compressed data, especially if you're updating primary keys, then compression can actually degrade performance significantly due to page splitting.

    I know that a compression of 70% doesn't imply a performance improvement of 70%...

    Compressed data => fewer reads => less IO (which is almost always the bottleneck...).

    The primary key is never updated (identity seed) but the data is often updated... The data is selected by a reference field entered by the user... That's why the PK has many writes and almost no reads... The auxiliary index (user + reference) has lots of reads and no writes (updates).

    Any table or index whose fields are updated often won't benefit from compression, right?

    Thanks,

    Pedro

    When the size of a previously inserted row increases in length (like when it's updated), the row must be marked as deleted and moved to another location in the page. If the data within a page is updated in a way that prevents all the rows from fitting back into the same 8 KB page, then SQL Server must allocate a new page and split the rows between them. This is more problematic when pages are compressed, because the compression ratio for a page depends on how the data is populated. For example, if you are inserting rows with NULL columns and then going back later to update those rows with values, the data becomes less sparse than it was when initially inserted, the compression ratio will drop, and you'll see page splits.

    It's basically the same page splitting issue we've always had with non-compressed data, only compression complicates it. But it can be mitigated by specifying a FILLFACTOR somewhere between 70 and 90 and setting PAD_INDEX = ON.

    https://blogs.msdn.microsoft.com/sqlserverstorageengine/2011/05/12/data-compression-and-fill-factor/
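    As a rough sketch (the index and table names below are placeholders), a rebuild that combines page compression with a lower fill factor and padded intermediate pages might look like:

    -- Rebuild with page compression, leaving ~20% free space per leaf page;
    -- PAD_INDEX = ON applies the fill factor to the intermediate pages as well
    ALTER INDEX PK_MyTable ON dbo.MyTable
    REBUILD WITH (DATA_COMPRESSION = PAGE, FILLFACTOR = 80, PAD_INDEX = ON);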

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • Hi,

    I used the script on this page: http://sqlmag.com/database-performance-tuning/practical-data-compression-sql-server

    But it doesn't recommend the table with the 70% compression ratio, since the PK (clustered = data) has only writes and no reads (the selection ratio is 0...).

    It might also report more writes than records due to page splits, since it uses sys.dm_db_index_operational_stats (leaf and range stats)...

    I used compression on that table and noticed that performance improved, but the read stalls (IO) went from 5ms to 30ms... Since the data is compressed it should have fewer read stalls and more CPU (to decompress the data)...
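    For reference, this is roughly how the per-file read stall average can be checked (a sketch using sys.dm_io_virtual_file_stats; 'MyDatabase' is just a placeholder name):

    -- Average read stall (ms) per database file since the last restart
    SELECT DB_NAME(vfs.database_id) AS database_name,
           mf.physical_name,
           vfs.num_of_reads,
           vfs.io_stall_read_ms / NULLIF(vfs.num_of_reads, 0) AS avg_read_stall_ms
    FROM   sys.dm_io_virtual_file_stats(DB_ID('MyDatabase'), NULL) AS vfs
    JOIN   sys.master_files AS mf
           ON  mf.database_id = vfs.database_id
           AND mf.file_id     = vfs.file_id;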

    Thanks,

    Pedro



    If you need to work better, try working less...

  • Going to a heap in the hopes that you get improved compression is not the way to tune queries. If that clustered index is in use, it needs to stay in place, regardless of whether you get optimal compression on it or not. Compression is an added benefit, not a purpose unto itself.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • Grant Fritchey (9/28/2016)


    Going to a heap in the hopes that you get improved compression is not the way to tune queries. If that clustered index is in use, it needs to stay in place, regardless of whether you get optimal compression on it or not. Compression is an added benefit, not a purpose unto itself.

    The clustered index, also the PK, is "only" there because it's a small, sequential data type...

    If it were the table's natural key, the PK would be a huge index (more than 4 fields), which would mean all the FKs carry the same fields (instead of just one), all the table's "extra" indexes would also carry it (a huge key instead of 8 bytes - bigint), and fragmentation would also be a bigger problem...

    But if it has lots of writes as an index, it would also have lots of writes as "only data" (a heap), so I'll just keep it as it is...

    The storage, right now, has good response times... average 5ms...

    @Grant: if I may ask another thing....

    About performance tuning.

    Every week I analyse the workload and check which indexes might be missing and which indexes have lots of writes and few reads.

    I always get an index recommendation for a worker process that executes every 15 minutes; according to Glenn Berry's DMV queries, after two days it shows an advantage of 400,000.

    I create the index and 24 hours later I have a "recommendation" to drop it... 150,000 writes and 200 reads...

    What's the best approach in these cases?!

    Thanks,

    Pedro



    If you need to work better, try working less...

  • On the PK:

    Don't confuse the primary key and the clustered index. While the default is for the PK to be clustered, you don't have to make the clustered index on a table the PK. If a different column, or columns, make sense as the clustered index because that column(s) is more likely to be used to retrieve the data, then use that instead of the PK. With very rare exceptions, every table should have a PK. With very rare exceptions, every table should have a clustered index. However, they don't have to be the same.
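    For illustration only (the table and column names below are made up), separating the two might look like this:

    -- PK enforced with a nonclustered index; the clustered index goes on the
    -- column(s) most queries actually use to retrieve the data
    CREATE TABLE dbo.Orders
    (
        OrderID     BIGINT IDENTITY(1,1) NOT NULL
            CONSTRAINT PK_Orders PRIMARY KEY NONCLUSTERED,
        CustomerRef NVARCHAR(30) NOT NULL,
        OrderDate   DATETIME2    NOT NULL
    );

    CREATE CLUSTERED INDEX CIX_Orders_CustomerRef
        ON dbo.Orders (CustomerRef, OrderID);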

    On the Missing Indexes:

    I don't trust them. They are, at best, guideposts. I have a very simple query that I use to teach with because it produces so many interesting effects:

    CREATE PROC dbo.AddressByCity @City NVARCHAR(30)
    AS
    SELECT  a.AddressID,
            a.AddressLine1,
            a.AddressLine2,
            a.City,
            sp.Name AS StateProvinceName,
            a.PostalCode
    FROM    Person.Address AS a
    JOIN    Person.StateProvince AS sp
            ON a.StateProvinceID = sp.StateProvinceID
    WHERE   a.City = @City;
    GO

    If you run this, it will suggest a missing index. If you create the missing index suggested, you know what happens? Nothing. It's not used. That's because it's not a good index for this query. Instead, for this query, what's needed is an index on the City column with an INCLUDE of AddressLine1, AddressLine2 & PostalCode. The Missing Index suggests just an index on City. It's wrong.
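    A sketch of the index described above, against the Person.Address table the procedure uses (the index name is made up):

    -- Covering index: City as the key, the selected columns as INCLUDEs
    CREATE NONCLUSTERED INDEX IX_Address_City
        ON Person.Address (City)
        INCLUDE (AddressLine1, AddressLine2, PostalCode);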

    So, I'm not surprised when a Missing Index doesn't help a query. They're nothing but suggestions and guideposts and can't be relied on implicitly as a definitive answer. They must be looked at with a large amount of suspicion. They must be tested to confirm that they will in fact help.

    Further, depending on how you're retrieving the Missing Index information, you may not have a correlation to specific queries and query performance. The missing index system views in your database don't have a tie back to any given query. Therefore, you might have an ad hoc query that gets run once, and never again, but showed a missing index. Creating that index is going to be a serious waste of time and resources on your system because the query that needed it (assuming it's even a good index) will never be run again.

    Only use those things as guides. Don't automatically apply them.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • Grant Fritchey (9/29/2016)


    On the PK:

    Don't confuse the primary key and the clustered index. While the default is for the PK to be clustered, you don't have to make the clustered index on a table the PK.

    I know... I just make the PK clustered because an identity bigint is sequential and small and is also used as the FK... clustered because it's sequential, and PK because it's small (one column) and used for FKs... The "natural key" is a nonclustered index...

    Grant Fritchey (9/29/2016)


    Further, depending on how you're retrieving the Missing Index information, you may not have a correlation to specific queries and query performance. The missing index system views in your database don't have a tie back to any given query. Therefore, you might have an ad hoc query that gets run once, and never again, but showed a missing index. Creating that index is going to be a serious waste of time and resources on your system because the query that needed it (assuming it's even a good index) will never be run again.

    Only use those things as guides. Don't automatically apply them.

    I always check the number of times the query was executed and the last execution... If there's a missing index on a query used by a worker once a day (usually at night) I ignore it...

    It's best to take the reads vs. writes into consideration rather than the missing indexes then, right?

    For the reads vs. writes, is it best to use the "simple" sys.dm_db_index_usage_stats with reads, writes, seeks, and scans, or sys.dm_db_index_operational_stats with leaf-level information and other details?

    Thanks,

    Pedro



    If you need to work better, try working less...

  • PiMané (9/29/2016)


    It's best to take the reads vs. writes into consideration rather than the missing indexes then, right?

    Yes & no. It really comes down to pain points. Not every query needs to be tuned. For example, we had a threshold of 3 seconds. If a query ran faster than 3 seconds, we just stopped worrying about it... unless there was actual pain: locks, blocks, deadlocks, complaints from the users. Then, we'd tune the query. If tuning the query required indexes (and usually it's code first, not indexes), yes, we'd compare reads & writes to get an idea of the impact of the index. However, most systems, most of the time, in most circumstances are read heavy, so the writes only entered into the situation if we saw clear and present issues that made us take them into consideration.

    For the reads vs. writes, is it best to use the "simple" sys.dm_db_index_usage_stats with reads, writes, seeks, and scans, or sys.dm_db_index_operational_stats with leaf-level information and other details?

    For the majority of situations the simple way is the best way. You only need to drill down into the grotty details of specific behaviors for special issues, and those should be rare on most systems.
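    As a rough sketch of the simple approach (reads vs. writes per index from sys.dm_db_index_usage_stats, current database only):

    -- Reads vs. writes per index since the last restart (current database)
    SELECT OBJECT_NAME(us.object_id)                        AS table_name,
           i.name                                           AS index_name,
           us.user_seeks + us.user_scans + us.user_lookups  AS reads,
           us.user_updates                                  AS writes
    FROM   sys.dm_db_index_usage_stats AS us
    JOIN   sys.indexes AS i
           ON  i.object_id = us.object_id
           AND i.index_id  = us.index_id
    WHERE  us.database_id = DB_ID()
    ORDER BY writes DESC;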

    Too many people focus in on the types of work that Bob Ward, Adam Machanic, & Paul Randal do. Not that we shouldn't focus on their work, it's amazing, and WAY beyond what I do. It's just that these guys are on the bleeding edge, the extreme tip of the spear, for most of their work. The level of detail that they get into doesn't always translate into what most of us are dealing with day to day. I absolutely believe in keeping it simple since most of us have pretty fundamental issues to solve. If you're not dealing with multi-terabyte databases and/or 10s of thousands of transactions a minute, you're in the mid-range with the rest of us and the mid-range solutions are all we generally need.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • Thanks for all the help Grant, that's probably the best answer I ever got 🙂

    I guess it really comes down to which edge we're on...

    My biggest database has 100 GB and 1,000 transactions/s... The server has 2,000 transactions/s...

    First check the code then check the database 🙂

    The main problem with our system is that we use dynamic SQL since the interface has over 20 fields to query by and they can all have values or none... (we use dynamic SQL so we don't use "@param IS NULL OR col = @param" or "col = ISNULL(@param, col)").
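    Something like this minimal sketch of the pattern (the table and parameter names are invented for illustration), appending only the predicates that actually have values while still parameterizing through sp_executesql:

    -- Build only the filters the user supplied, keep everything parameterized
    DECLARE @TenantID      INT          = 1,
            @CustomerRef   NVARCHAR(30) = NULL,
            @OrderDateFrom DATETIME2    = NULL;

    DECLARE @sql NVARCHAR(MAX) =
        N'SELECT OrderID, CustomerRef, OrderDate
          FROM dbo.Orders
          WHERE TenantID = @TenantID';

    IF @CustomerRef IS NOT NULL
        SET @sql += N' AND CustomerRef = @CustomerRef';
    IF @OrderDateFrom IS NOT NULL
        SET @sql += N' AND OrderDate >= @OrderDateFrom';

    EXEC sys.sp_executesql
         @sql,
         N'@TenantID int, @CustomerRef nvarchar(30), @OrderDateFrom datetime2',
         @TenantID      = @TenantID,
         @CustomerRef   = @CustomerRef,
         @OrderDateFrom = @OrderDateFrom;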

    When I look at the index recommendations I almost get a heart attack... Since every column can have a value or not, there are more than 40 indexes recommended...

    What I did, since the data is "restricted" by tenant (this is a multi-tenant database), is create a main index with the tenant columns and include the 20 columns that can be used to filter...

    Thanks for all your help and advice...

    Pedro



    If you need to work better, try working less...

  • Sounds like you're doing a good job. Keep it up. Focus on the metrics to drive your decisions. Take a look at Gail Shaw's blog post on "catch-all" queries. That will help in your situation.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • Grant Fritchey (9/29/2016)


    Sounds like you're doing a good job. Keep it up. Focus on the metrics to drive your decisions. Take a look at Gail Shaw's blog post on "catch-all" queries. That will help in your situation.

    Thanks.. it means a lot coming from you 🙂

    I already took a look at Gail's post a few years ago and keep checking for an alternative to dynamic SQL, but so far nothing can actually replace it...

    There is the problem of parameter sniffing, and I've had my share of problems... A query that took 3 seconds to run took only 10ms with OPTION (RECOMPILE)... I turned on forced parameterization, but the memory the plans took was just too much for the server... so I went back to OPTION (RECOMPILE) on the dynamic SQL...
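    Just for context, the hint simply goes on the end of the statement built by the dynamic SQL (continuing the made-up sketch from above):

    -- Recompile per execution so the plan matches the actual parameter values
    DECLARE @sql NVARCHAR(MAX) =
        N'SELECT OrderID, CustomerRef, OrderDate
          FROM dbo.Orders
          WHERE TenantID = @TenantID
          OPTION (RECOMPILE)';

    EXEC sys.sp_executesql @sql, N'@TenantID int', @TenantID = 1;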

    I'm hoping SQL 2016 with the parameter sniffing database-scoped configuration can solve this problem... or maybe, in the end, all SQL does is add OPTION (RECOMPILE) at the end of every query execution 😀 (just being mean here 🙂 ).

    Regards,

    Pedro



    If you need to work better, try working less...
