Question on data compression

  • I have a table with with several XML columns stored as VARCHAR(MAX) because they some of the XMLs may not be valid XMLs. The table is quite large and I am compressing the table with DATA_COMPRESSION=PAGE on the clustered primary key on an identity column.

    There are other columns in the table besides the identity column and the XML columns, which I need to create non-clustered indexes. These columns are all of data type INT. Is there a benefit to compressing the indexes as well, or just the table/clustered PK should be compressed?

    I understand there's a performance penalty in writing to tables with data compression and a significant benefit for retrieving data from compressed tables.

    Any experience from experts is greatly appreciated.

  • Best answer I can give you is to test it. In all likelihood, I'd say, yes, you'll see a performance benefit since the vast majority of systems have more reads than writes. But every system is unique, so I'd say test it. Also, keep an eye on your CPU. If you're already stressed there, or starting to be stressed there, compression could make that worse.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • If you're using VARCHAR(MAX) data types and storing XML data you might not get much benefit from data compression as only IN_ROW_DATA pages can be compressed. If you haven't used it already it could be worth running the 'sp_estimate_data_compression_savings' stored proc and see what the potential space saving is and then, as mentioned above, you really need to test and see how it behaves in your environment.

    here's a link to Microsoft white paper regarding data compression in case it's of any use

  • It depends, many are satisfied with compression, but you should go to see that in practice on your environment.

    You can use sp_estimate_data_compression_savings to estimate savings - http://msdn.microsoft.com/en-us/library/cc280574.aspx ...and, the CPU is most important to consider. If you have high CPU then don't compress.

    Igor Micev,My blog: www.igormicev.com

  • N_Muller (7/23/2014)


    I have a table with with several XML columns stored as VARCHAR(MAX) because they some of the XMLs may not be valid XMLs. The table is quite large and I am compressing the table with DATA_COMPRESSION=PAGE on the clustered primary key on an identity column.

    There are other columns in the table besides the identity column and the XML columns, which I need to create non-clustered indexes. These columns are all of data type INT. Is there a benefit to compressing the indexes as well, or just the table/clustered PK should be compressed?

    I understand there's a performance penalty in writing to tables with data compression and a significant benefit for retrieving data from compressed tables.

    Any experience from experts is greatly appreciated.

    Quick question, what is the reason for storing the XML as VARCHAR(MAX), the XML datatype can store snippets which are not well formed as long as there are matching opening and closing tags?

    Example of a valid but not well formed XML

    DECLARE @XML XML = '<a><some_stuff/></a><next_stuff/>'

    select @XMLA better option would be using the XML data type, no need to do conversion for querying the data and the binary XML format is quite efficient in terms of space used.

    (Edit:added) In terms of using row/page compression, you will not gain much if anything. Alternative would be implementing a compression algorithm and store the data as varbinary, gives about 20-30% better space utilization than the XML data type but the drawback is the decompression overhead if one has to use the data.

    😎

  • I suspect there's a better clustering key, rather than the "default" identity, that might clear up a lot of your performance issues. Getting the best clustering key(s) is the first thing you should do, either confirming the current one or determining a better one.

    Then, if the varchar(max) columns aren't heavily used, you could force them out of the main table space, the compress the main table (unfortunately, then the "overflow" would not be compressed).

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

  • Thanks everyone for the replies.

    In regards to the XML storage as VARCHAR(MAX), we can receive XML that does not go through XSD validation and may not even be a valid XML. By storing the data as VARCHAR(MAX), at least I can see the data. If I store it as XML it will fail.

    In regards to compression of in-row compression, VARCHAR(MAX) is not stored in row. I thought neither XML was stored in row, and I completely forgot about this important detail. The data won't be used much once stored; it's mostly for data archival and audit trail and after-the-fact validation. Are there any suggestions on how to compress this data?

  • N_Muller (7/23/2014)


    Thanks everyone for the replies.

    In regards to the XML storage as VARCHAR(MAX), we can receive XML that does not go through XSD validation and may not even be a valid XML. By storing the data as VARCHAR(MAX), at least I can see the data. If I store it as XML it will fail.

    In regards to compression of in-row compression, VARCHAR(MAX) is not stored in row. I thought neither XML was stored in row, and I completely forgot about this important detail. The data won't be used much once stored; it's mostly for data archival and audit trail and after-the-fact validation. Are there any suggestions on how to compress this data?

    Quick suggestion then since you are not regularely querying this data, store the data as FileTable, possibly on a cheaper/slower storage system. It can easily be "zipped", there are many CLR examples of compression routines available.

    😎

  • If you're willing to push back against the myth that identity should be default clustering key, and base the clustering key(s) choice on data instead, you can change the bolded code as needed below and then run these commands to review core index info about the table:

    USE [<your_db_name_here>]

    SET DEADLOCK_PRIORITY LOW --probably irrelevant, but just in case

    DECLARE @list_missing_indexes bit

    DECLARE @table_name_pattern sysname

    --NOTE: showing missing indexes can take some time; set to 0 if you don't want to wait.

    SET @list_missing_indexes = 1

    SET @table_name_pattern = '<your_table_name_here>' --'%'=all tables.

    --SET @table_name_pattern = '%'

    PRINT 'Started @ ' + CONVERT(varchar(30), GETDATE(), 120)

    IF @list_missing_indexes = 1

    BEGIN

    SELECT

    GETDATE() AS capture_date,

    DB_NAME(mid.database_id) AS Db_Name,

    dps.row_count,

    OBJECT_NAME(mid.object_id /*, mid.database_id*/) AS Table_Name,

    mid.equality_columns, mid.inequality_columns, mid.included_columns,

    user_seeks, user_scans, ca1.max_days_active, unique_compiles,

    last_user_seek, last_user_scan, avg_total_user_cost, avg_user_impact,

    system_seeks, system_scans, last_system_seek, last_system_scan, avg_total_system_cost, avg_system_impact,

    mid.statement, mid.object_id, mid.index_handle

    FROM sys.dm_db_missing_index_details mid WITH (NOLOCK)

    CROSS APPLY (

    SELECT DATEDIFF(DAY, create_date, GETDATE()) AS max_days_active FROM sys.databases WHERE name = 'tempdb'

    ) AS ca1

    LEFT OUTER JOIN sys.dm_db_missing_index_groups mig WITH (NOLOCK) ON

    mig.index_handle = mid.index_handle

    LEFT OUTER JOIN sys.dm_db_missing_index_group_stats migs WITH (NOLOCK) ON

    migs.group_handle = mig.index_group_handle

    LEFT OUTER JOIN sys.dm_db_partition_stats dps WITH (NOLOCK) ON

    dps.object_id = mid.object_id AND

    dps.index_id IN (0, 1)

    --order by

    --DB_NAME, Table_Name, equality_columns

    WHERE

    1 = 1

    AND mid.database_id = DB_ID() --only current db

    AND OBJECT_NAME(mid.object_id) LIKE @table_name_pattern

    --AND mid.object_id IN (OBJECT_ID('<table_name_1>'), OBJECT_ID('<table_name_2>'))

    ORDER BY

    --avg_total_user_cost * (user_seeks + user_scans) DESC,

    Db_Name, Table_Name, equality_columns, inequality_columns

    END --IF

    -- list index usage stats (seeks, scans, etc.)

    SELECT

    ius2.row_num, DB_NAME() AS db_name,

    CASE WHEN i.name LIKE ca2.table_name + '%'

    THEN '~' + SUBSTRING(i.name, LEN(ca2.table_name) + 1 +

    CASE WHEN SUBSTRING(i.name, LEN(ca2.table_name) + 1, 1) = '_' THEN

    CASE WHEN SUBSTRING(i.name, LEN(ca2.table_name) + 2, 1) = '_' THEN 2 ELSE 1 END

    ELSE 0 END, 200)

    ELSE i.name END AS index_name,

    CASE WHEN i.is_unique = 0 THEN 'N' ELSE 'Y' END + '.' +

    CASE WHEN i.is_primary_key = 0 AND i.is_unique_constraint = 0 THEN 'N' ELSE 'Y' END AS [uniq?],

    ca2.table_name,

    i.index_id, --ius.user_seeks + ius.user_scans AS total_reads,

    dps.row_count,

    SUBSTRING(key_cols, 3, 8000) AS key_cols, SUBSTRING(nonkey_cols, 3, 8000) AS nonkey_cols,

    ius.user_seeks, ius.user_scans, ius.user_lookups, ius.user_updates,

    ius.last_user_seek, ius.last_user_scan, ius.last_user_lookup, ius.last_user_update,

    fk.Reference_Count AS fk_ref_count,

    DATEDIFF(DAY, CASE WHEN o.create_date > ca1.sql_startup_date THEN o.create_date

    ELSE ca1.sql_startup_date END, GETDATE()) AS max_days_active,

    FILEGROUP_NAME(i.data_space_id) AS filegroup_name,

    ius.system_seeks, ius.system_scans, ius.system_lookups, ius.system_updates,

    ius.last_system_seek, ius.last_system_scan, ius.last_system_lookup, ius.last_system_update

    FROM sys.indexes i WITH (NOLOCK)

    INNER JOIN sys.objects o WITH (NOLOCK) ON

    o.object_id = i.object_id

    CROSS JOIN (

    SELECT create_date AS sql_startup_date FROM sys.databases WHERE name = 'tempdb'

    ) AS ca1

    CROSS APPLY (

    SELECT OBJECT_NAME(i.object_id/*, DB_ID()*/) AS table_name

    ) AS ca2

    OUTER APPLY (

    SELECT

    ', ' + COL_NAME(object_id, ic.column_id)

    FROM sys.index_columns ic

    WHERE

    ic.key_ordinal > 0 AND

    ic.object_id = i.object_id AND

    ic.index_id = i.index_id

    ORDER BY

    ic.key_ordinal

    FOR XML PATH('')

    ) AS key_cols (key_cols)

    OUTER APPLY (

    SELECT

    ', ' + COL_NAME(object_id, ic.column_id)

    FROM sys.index_columns ic

    WHERE

    ic.key_ordinal = 0 AND

    ic.object_id = i.object_id AND

    ic.index_id = i.index_id

    ORDER BY

    COL_NAME(object_id, ic.column_id)

    FOR XML PATH('')

    ) AS nonkey_cols (nonkey_cols)

    LEFT OUTER JOIN sys.dm_db_partition_stats dps WITH (NOLOCK) ON

    dps.object_id = i.object_id AND

    dps.index_id = i.index_id

    LEFT OUTER JOIN sys.dm_db_index_usage_stats ius WITH (NOLOCK) ON

    ius.database_id = DB_ID() AND

    ius.object_id = i.object_id AND

    ius.index_id = i.index_id

    LEFT OUTER JOIN (

    SELECT

    database_id, object_id, MAX(user_scans) AS user_scans,

    ROW_NUMBER() OVER (ORDER BY MAX(user_scans) DESC) AS row_num --user_scans|user_seeks+user_scans

    FROM sys.dm_db_index_usage_stats WITH (NOLOCK)

    WHERE

    database_id = DB_ID()

    --AND index_id > 0

    GROUP BY

    database_id, object_id

    ) AS ius2 ON

    ius2.database_id = DB_ID() AND

    ius2.object_id = i.object_id

    LEFT OUTER JOIN (

    SELECT

    referenced_object_id, COUNT(*) AS Reference_Count

    FROM sys.foreign_keys

    WHERE

    is_disabled = 0

    GROUP BY

    referenced_object_id

    ) AS fk ON

    fk.referenced_object_id = i.object_id

    WHERE

    i.object_id > 100 AND

    i.is_hypothetical = 0 AND

    i.type IN (0, 1, 2) AND

    o.type NOT IN ( 'IF', 'IT', 'TF', 'TT' ) AND

    (

    o.name LIKE @table_name_pattern AND

    o.name NOT LIKE 'dtprop%' AND

    o.name NOT LIKE 'filestream[_]' AND

    o.name NOT LIKE 'MSpeer%' AND

    o.name NOT LIKE 'MSpub%' AND

    --o.name NOT LIKE 'queue[_]%' AND

    o.name NOT LIKE 'sys%'

    )

    --AND OBJECT_NAME(i.object_id /*, DB_ID()*/) IN ('tbl1', 'tbl2', 'tbl3')

    ORDER BY

    --row_count DESC,

    --ius.user_scans DESC,

    --ius2.row_num, --user_scans&|user_seeks

    db_name, table_name,

    -- list clustered index first, if any, then other index(es)

    CASE WHEN i.index_id IN (0, 1) THEN 1 ELSE 2 END,

    key_cols

    PRINT 'Ended @ ' + CONVERT(varchar(30), GETDATE(), 120)

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

  • ScottPletcher (7/23/2014)


    If you're willing to push back against the myth that identity should be default clustering key, and base the clustering key(s) choice on data instead, you can change the bolded code as needed below and then run these commands to review core index info about the table:

    USE [<your_db_name_here>]

    SET DEADLOCK_PRIORITY LOW --probably irrelevant, but just in case

    DECLARE @list_missing_indexes bit

    DECLARE @table_name_pattern sysname

    --NOTE: showing missing indexes can take some time; set to 0 if you don't want to wait.

    SET @list_missing_indexes = 1

    SET @table_name_pattern = '<your_table_name_here>' --'%'=all tables.

    --SET @table_name_pattern = '%'

    PRINT 'Started @ ' + CONVERT(varchar(30), GETDATE(), 120)

    IF @list_missing_indexes = 1

    BEGIN

    SELECT

    GETDATE() AS capture_date,

    DB_NAME(mid.database_id) AS Db_Name,

    dps.row_count,

    OBJECT_NAME(mid.object_id /*, mid.database_id*/) AS Table_Name,

    mid.equality_columns, mid.inequality_columns, mid.included_columns,

    user_seeks, user_scans, ca1.max_days_active, unique_compiles,

    last_user_seek, last_user_scan, avg_total_user_cost, avg_user_impact,

    system_seeks, system_scans, last_system_seek, last_system_scan, avg_total_system_cost, avg_system_impact,

    mid.statement, mid.object_id, mid.index_handle

    FROM sys.dm_db_missing_index_details mid WITH (NOLOCK)

    CROSS APPLY (

    SELECT DATEDIFF(DAY, create_date, GETDATE()) AS max_days_active FROM sys.databases WHERE name = 'tempdb'

    ) AS ca1

    LEFT OUTER JOIN sys.dm_db_missing_index_groups mig WITH (NOLOCK) ON

    mig.index_handle = mid.index_handle

    LEFT OUTER JOIN sys.dm_db_missing_index_group_stats migs WITH (NOLOCK) ON

    migs.group_handle = mig.index_group_handle

    LEFT OUTER JOIN sys.dm_db_partition_stats dps WITH (NOLOCK) ON

    dps.object_id = mid.object_id AND

    dps.index_id IN (0, 1)

    --order by

    --DB_NAME, Table_Name, equality_columns

    WHERE

    1 = 1

    AND mid.database_id = DB_ID() --only current db

    AND OBJECT_NAME(mid.object_id) LIKE @table_name_pattern

    --AND mid.object_id IN (OBJECT_ID('<table_name_1>'), OBJECT_ID('<table_name_2>'))

    ORDER BY

    --avg_total_user_cost * (user_seeks + user_scans) DESC,

    Db_Name, Table_Name, equality_columns, inequality_columns

    END --IF

    -- list index usage stats (seeks, scans, etc.)

    SELECT

    ius2.row_num, DB_NAME() AS db_name,

    CASE WHEN i.name LIKE ca2.table_name + '%'

    THEN '~' + SUBSTRING(i.name, LEN(ca2.table_name) + 1 +

    CASE WHEN SUBSTRING(i.name, LEN(ca2.table_name) + 1, 1) = '_' THEN

    CASE WHEN SUBSTRING(i.name, LEN(ca2.table_name) + 2, 1) = '_' THEN 2 ELSE 1 END

    ELSE 0 END, 200)

    ELSE i.name END AS index_name,

    CASE WHEN i.is_unique = 0 THEN 'N' ELSE 'Y' END + '.' +

    CASE WHEN i.is_primary_key = 0 AND i.is_unique_constraint = 0 THEN 'N' ELSE 'Y' END AS [uniq?],

    ca2.table_name,

    i.index_id, --ius.user_seeks + ius.user_scans AS total_reads,

    dps.row_count,

    SUBSTRING(key_cols, 3, 8000) AS key_cols, SUBSTRING(nonkey_cols, 3, 8000) AS nonkey_cols,

    ius.user_seeks, ius.user_scans, ius.user_lookups, ius.user_updates,

    ius.last_user_seek, ius.last_user_scan, ius.last_user_lookup, ius.last_user_update,

    fk.Reference_Count AS fk_ref_count,

    DATEDIFF(DAY, CASE WHEN o.create_date > ca1.sql_startup_date THEN o.create_date

    ELSE ca1.sql_startup_date END, GETDATE()) AS max_days_active,

    FILEGROUP_NAME(i.data_space_id) AS filegroup_name,

    ius.system_seeks, ius.system_scans, ius.system_lookups, ius.system_updates,

    ius.last_system_seek, ius.last_system_scan, ius.last_system_lookup, ius.last_system_update

    FROM sys.indexes i WITH (NOLOCK)

    INNER JOIN sys.objects o WITH (NOLOCK) ON

    o.object_id = i.object_id

    CROSS JOIN (

    SELECT create_date AS sql_startup_date FROM sys.databases WHERE name = 'tempdb'

    ) AS ca1

    CROSS APPLY (

    SELECT OBJECT_NAME(i.object_id/*, DB_ID()*/) AS table_name

    ) AS ca2

    OUTER APPLY (

    SELECT

    ', ' + COL_NAME(object_id, ic.column_id)

    FROM sys.index_columns ic

    WHERE

    ic.key_ordinal > 0 AND

    ic.object_id = i.object_id AND

    ic.index_id = i.index_id

    ORDER BY

    ic.key_ordinal

    FOR XML PATH('')

    ) AS key_cols (key_cols)

    OUTER APPLY (

    SELECT

    ', ' + COL_NAME(object_id, ic.column_id)

    FROM sys.index_columns ic

    WHERE

    ic.key_ordinal = 0 AND

    ic.object_id = i.object_id AND

    ic.index_id = i.index_id

    ORDER BY

    COL_NAME(object_id, ic.column_id)

    FOR XML PATH('')

    ) AS nonkey_cols (nonkey_cols)

    LEFT OUTER JOIN sys.dm_db_partition_stats dps WITH (NOLOCK) ON

    dps.object_id = i.object_id AND

    dps.index_id = i.index_id

    LEFT OUTER JOIN sys.dm_db_index_usage_stats ius WITH (NOLOCK) ON

    ius.database_id = DB_ID() AND

    ius.object_id = i.object_id AND

    ius.index_id = i.index_id

    LEFT OUTER JOIN (

    SELECT

    database_id, object_id, MAX(user_scans) AS user_scans,

    ROW_NUMBER() OVER (ORDER BY MAX(user_scans) DESC) AS row_num --user_scans|user_seeks+user_scans

    FROM sys.dm_db_index_usage_stats WITH (NOLOCK)

    WHERE

    database_id = DB_ID()

    --AND index_id > 0

    GROUP BY

    database_id, object_id

    ) AS ius2 ON

    ius2.database_id = DB_ID() AND

    ius2.object_id = i.object_id

    LEFT OUTER JOIN (

    SELECT

    referenced_object_id, COUNT(*) AS Reference_Count

    FROM sys.foreign_keys

    WHERE

    is_disabled = 0

    GROUP BY

    referenced_object_id

    ) AS fk ON

    fk.referenced_object_id = i.object_id

    WHERE

    i.object_id > 100 AND

    i.is_hypothetical = 0 AND

    i.type IN (0, 1, 2) AND

    o.type NOT IN ( 'IF', 'IT', 'TF', 'TT' ) AND

    (

    o.name LIKE @table_name_pattern AND

    o.name NOT LIKE 'dtprop%' AND

    o.name NOT LIKE 'filestream[_]' AND

    o.name NOT LIKE 'MSpeer%' AND

    o.name NOT LIKE 'MSpub%' AND

    --o.name NOT LIKE 'queue[_]%' AND

    o.name NOT LIKE 'sys%'

    )

    --AND OBJECT_NAME(i.object_id /*, DB_ID()*/) IN ('tbl1', 'tbl2', 'tbl3')

    ORDER BY

    --row_count DESC,

    --ius.user_scans DESC,

    --ius2.row_num, --user_scans&|user_seeks

    db_name, table_name,

    -- list clustered index first, if any, then other index(es)

    CASE WHEN i.index_id IN (0, 1) THEN 1 ELSE 2 END,

    key_cols

    PRINT 'Ended @ ' + CONVERT(varchar(30), GETDATE(), 120)

    Nice!! this one goes in my snippets;-)

    😎

  • Identity as a clustered index was not my first choice. The problem is that this is a high volume, high concurrency OLTP with thousands of data inserts per minute. The table is actually partitioned to better manage the data size. The "transaction ID" is used throughout the system for querying. No data is updated, but continuously inserted, even when there are changes to the contents of a transaction. This continuous inserts makes it impossible to create a clustered index on the transaction ID because of data movement and performance crawls. I had to create the identity as the primary key to make sure new inserts are always stored at the end of the table and don't cause data movement. I have a non-clustered index on the transaction ID. BTW, once I made the change data insert performance improved by at least two orders of magnitude.

  • N_Muller (7/23/2014)


    Identity as a clustered index was not my first choice. The problem is that this is a high volume, high concurrency OLTP with thousands of data inserts per minute. The table is actually partitioned to better manage the data size. The "transaction ID" is used throughout the system for querying. No data is updated, but continuously inserted, even when there are changes to the contents of a transaction. This continuous inserts makes it impossible to create a clustered index on the transaction ID because of data movement and performance crawls. I had to create the identity as the primary key to make sure new inserts are always stored at the end of the table and don't cause data movement. I have a non-clustered index on the transaction ID. BTW, once I made the change data insert performance improved by at least two orders of magnitude.

    So transactionid is not naturally increasing? Yeah, that could cause problems, esp. if it's almost truly random. [Transid is not a, yikes, uniqueidentifier is it? You had mentioned several other 'ints' in the table, so I figured the other candidate keys were ints.]

    Wow, the insert performance got 100+ times better from clustering on identity -- really?? Interesting, though, because in some cases I've got 100+ times reduction in I/O from properly clustering on something other than an identity.

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

  • Transaction ID was not the best term - I guess I should have used Event ID :-). Event ID is naturally increasing, but there are many changes to an event across the life of the event, and since every change is a new insert, there would be continuous data movement. An event can last a few hours to as long as several months, so there can be an insert that can cause millions of rows of data movement. That's why using Event ID as the clustered index was a really bad choice.

  • N_Muller (7/23/2014)


    Transaction ID was not the best term - I guess I should have used Event ID :-). Event ID is naturally increasing, but there are many changes to an event across the life of the event, and since every change is a new insert, there would be continuous data movement. An event can last a few hours to as long as several months, so there can be an insert that can cause millions of rows of data movement. That's why using Event ID as the clustered index was a really bad choice.

    So EventID is not unique in this table, and it changes every time an event is modified? Yeah, what a pita! OK, as long as the clustering key choice was logically made and not based on just a "rule"/nursery-rhyme-like saying.

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

  • ScottPletcher (7/23/2014)


    I suspect there's a better clustering key, rather than the "default" identity, that might clear up a lot of your performance issues. Getting the best clustering key(s) is the first thing you should do, either confirming the current one or determining a better one.

    This is really a good think you could do, to move the varchar(max) columns out of the main table into another. I've been working on NoSQL systems and that approach is pretty much used. That will bring benefits.

    Then, if the varchar(max) columns aren't heavily used, you could force them out of the main table space, the compress the main table (unfortunately, then the "overflow" would not be compressed).

    But if you separate the table by columns, would you need to compress the tables? Maybe no big benefits as like the benefits from the separating.

    However, it's good to see feedback from your work and experience.

    Igor Micev,My blog: www.igormicev.com

Viewing 15 posts - 1 through 15 (of 16 total)

You must be logged in to reply to this topic. Login to reply