Improving Database Design and Performance Using Compression

Introduction

Designing databases is a common activity for most of us, and assigning proper datatypes to attributes within entities is a part of that task. Once promoted to production when thousands of users pound these tables, the DBA usually keeps track of the growth trends of databases and the tables within them. For a DBA, maintaining these statistics (apart from numerous others) is important to see how the application is doing, how the user-response is varying and how the hardware is being utilized.

When an application slows down and the database response is poor, there could be several contributing factors like a bad database layout, sub-optimal database design, ill-designed application code (no SP's / all ad hoc T-SQL), ill-written stored procedures, improper allocation of CPU/memory, and perhaps several combinations of all of these factors. A common problem that I as a DBA come across is usage of inappropriate dataypes for fields. These are created during the design phase but problems with them are not detected until a few months after implementation. This is rampant, especially when a DBA is not part of the design or is hired after the fact. I would like to share one such case with you today.

In this article, the first task is to easily identify suspicious tables where choice of column datatypes could have been suboptimal during design, by using the concept of Row Density (rows per page per table); then the focus is on demonstrating how allocation of inappropriate datatypes for columns can increase disk space usage apart from reducing the overall query performance. Finally, it is being shown how using appropriate dataypes as well as appropriate compression method helps reduce disk space utilization and augments query performance. In the process, it would also be shown how to determine the most appropriate compression method for a given scenario.

This article focussed specifically on Microsoft SQL Server 2008 features. Wherever possible, I have attempted to provide the corresponding feature in Microsoft SQL Server 2005 and 2000.

Building up a scenario

The first step for me is to get a list of a few of the largest tables (in terms of row counts) in a user database. Then I need to get the size of these individual tables (on disk) as well. Having done that, I can figure out the largest tables in terms of rowcount and size on disk. I would also like to see how many rows (on an average) reside on each data page allocated to the table. Let us call it "row density" for our discussion. To clarify, Row Density equals Number of rows per Data Page for a given table.

So, our first task is to get the largest tables in terms of rows. We would use a temporary table, #TEMP_Find_Large_tables and the undocumented system stored procedure sp_MSforEachTable to help. Sp_MSforEachTable runs the same operation/instruction on all user tables in a given database where the ? represents the table name. The code below would provide us the names of user tables and their row counts in descending order. The code is applicable to Microsoft SQL Server 2008, 2005 and 2000. Let's execute it:

/*
 NAME : Find Large Tables.sql
 Author: Suresh Kumar Maganti
*/
if OBJECT_ID('tempdb..#TEMP_Find_Large_Tables') is not null
drop table #TEMP_Find_Large_Tables
go
create table #TEMP_Find_Large_Tables
(Table_Name varchar(100) null
,Row_Count bigint null)
go
insert into #TEMP_Find_Large_Tables (Table_Name, Row_Count)
exec sp_msforeachtable 'select ''?'', count(1) from ? with (nolock)'
go
select * from #TEMP_Find_Large_Tables
order by Row_Count desc
go
if OBJECT_ID('tempdb..#TEMP_Find_Large_Tables') is not null
drop table #TEMP_Find_Large_Tables
go

Here is the result. I am displaying the top five rows only here:

 Table_Name                 Row_Count
--------------------------- ------------
[dbo].[My_Demo_Table]       6378870
[dbo].[Parts_Details]       4365166
[dbo].[Procurement_Details] 4365166
[dbo].[Order_History]       2502062
[dbo].[Annual_reports]      1299454
...

Now that we have the top tables by row-count, I would like to find out how the rows are distributed across pages for my top table.

To get this data, I used different methods in SQL Server 2000 and 2005. In SQL Server 2000, I used to use the DBCC SHOWCONTIG as follows:

dbcc showcontig (1598628738) with All_Indexes, All_Levels, TableResults

The number 1598628738 in the code above represents the Object_ID of the table, My_Demo_Table. I used the option TABLERESULTS so that I could send the results to a temporary table for further calculations. The result set is displayed below partially for space constraints:

ObjectName    ObjectId   IndexName IndexId  Level  Pages   Rows 
------------- ---------- --------- -------- ------ ------- ----------
My_Demo_Table 1598628738 PK_Demo   1        0      202771  6378870 My_Demo_Table 1598628738 PK_Demo   1        1      351     202771 
My_Demo_Table 1598628738 PK_Demo   1        2      1       351

To calculate "Row Density", all I used to do in Microsoft SQL Server 2000 was calculate (Rows / Page) from the above result set where Level = 0. In this case, it would be 6378870 / 202771 = 31 rows per page. So the table, My_Demo_Table had 31 rows per data page.

When SQL Server 2005 came out, I had to change my programming for Row Density by utilizing the dynamic management view, sys.dm_db_index_physical_stats. The code looked like the following:

select * from sys.dm_db_index_physical_stats(392, 1598628738, null, Null, 'DETAILED')

To explain in brief, 392 is my DatabaseID, 1598628738 is the ObjectID of the table, My_Demo_Table, the first NULL represents all indexes on this table, the second NULL represents all Partitions on which this table resides and 'DETAILED' represents the scan level used to obtain statistics after scanning all pages associated with this table. Please see the result set below displayed here partially for the table, My_Demo_Table:

database_id object_id   index_depth index_level page_count  record_count 
----------- ----------- ----------- ----------- ----------- ------------
392         1598628738  3           0           202771      6378870 
392         1598628738  3           1           351         202771 
392         1598628738  3           2           1           351

Next, from the above result set I had to just calculate the 'Row Density' as Record_Count / Page_Count where Index_Level = 0. The code is given below:

select record_count / page_count
 from sys.dm_db_index_physical_stats(392, 1598628738, null, Null, 'DETAILED')
 where index_level = 0

And the result is 31. So my "Row Density" was 31, which means on an average there were 31 rows per page for this table called My_Demo_Table in my Microsoft SQL Server 2005 database. You may note that even though we have moved from SQL Server 2000 to SQL Server 2005, the value of "Row Density" has remained static at 31 although the methods of calculating the value have changed. Having explained the methods briefly for Microsoft SQL Server 2000 and 2005, now let's move on to Microsoft SQL Server 2008.

In SQL Server 2008, I use the undocumented system function, sys.fn_PhysLocFormatter to find out how the rows and pages have been allocated for a given table. However this function alone is insufficient to calculate our row density. So I would use other SQL Server built-in functions and a system stored procedure along with sys.fn_PhysLocFormatter to calculate the row density for each table. To make the flow simple, I will for a step-by-step explanation. The T-SQL code below displays the syntax for using this function, sys.fn_PhysLocFormatter. Let us execute it. Please note that we have to be in the database where this table, My_Demo_Table in this case, resides.:

select sys.fn_PhysLocFormatter (%%physloc%%) Rows_per_Page 
 from dbo.My_Demo_Table

The result is displayed below. Please note that this result is for the table My_Demo_Table only since this is the table that I am using in the FROM clause in the T-SQL code above. This result set is explained below.:

File_Page_Row
--------------
(1:205496:0)
(1:205496:1)
(1:205496:2)
(1:205496:3)
(1:205496:4)
(1:205496:5)
(1:205496:6)
(1:205496:7)
(1:205496:8)
(1:205496:9)
(1:205496:10)
(1:205496:11)
(1:205496:12)
(1:205496:13)
(1:205496:14)
(1:205496:15)
(1:205496:16)
(1:205496:17)
(1:205496:18)
(1:205496:19)
(1:205496:20)
(1:205496:21)
(1:205496:22)
(1:205496:23)
(1:205496:24)
(1:205496:25)
(1:205496:26)
(1:205496:27)
(1:205496:28)
(1:205496:29)
(1:205496:30)
(1:205497:0)
(1:205497:1)
(1:205497:2)
(1:205497:3)
(1:205497:4)
(1:205497:5)
(1:205497:6)
(1:205497:7)
(1:205497:8)
(1:205497:9)
(1:205497:10)
(1:205497:11)
(1:205497:12)
(1:205497:13)
(1:205497:14)
(1:205497:15)
(1:205497:16)
(1:205497:17)
(1:205497:18)
(1:205497:19)
(1:205497:20)
(1:205497:21)
(1:205497:22)
(1:205497:23)
(1:205497:24)
(1:205497:25)
(1:205497:26)
(1:205497:27)
(1:205497:28)
(1:205497:29)
(1:205498:0)
(1:205498:1)
...

Each row in the result set represents one row in the table, My_Demo_Table. So there would be as many rows in this result set as there are rows in the table, My_Demo_Table.

Let's describe the three portions in the result above by studying the first record in the result set: 1:205496:0. First portion, 1 is the id of the data file on which the page (205496) containing this row (first row = Row 0) resides. On scrolling down the result set, it may be observed that rows numbered 0 through 30 in the result set above also reside on the same page, 205496. In other words, there are 31 rows in the page 205496 for the table, My_Demo_Table starting from Row# 0 and ending at Row# 30.

Similarly in the next page 205497 in the result set above, there are rows from 0 through 29. So there are 30 rows in the page 205497. Please note the first row contained in a page is the 0th row in the result set above as the counting starts from 0 onwards instead of from 1. To have as much clarity as possible in this explanation, I have showed two full pages (205496 and 205497) and one partial page (205498) in the result set above. Next, to calculate Row Density, we need to calculate the average number of rows per page for the table, My_Demo_Table.

Using sys.fn_PhysLocFormatter for all User Tables

The function, sys.fn_PhysLocFormatter operates on one table at a time. The execution of this function for a given table generates a large result set even for smaller tables. This is because the result contains one record per row in the table being probed. One more complication is that for a given user table, a page may contain 31 rows and the next page may contain 30 rows and a third page may contain 40 rows and so on depending on the actual data residing in there. For Row Density, the average number of rows per page needs to be calculated for the table as a whole.

For example, a million row table may be residing on, say fifty thousand pages. For Row Density, we need the Total Rows / Total count of pages for the table. Above all, the objective is to have the Row Density for all tables in one execution rather than calculating it individually for each table. To achieve all these objectives together for an efficient analysis, it would be nice if we could generate one row per user table, with each row containing the number of pages associated with the table, the total number of rows in the table, the Row Density and the table size. This way we can have a look at the larger tables with smaller Row Density so that they can be analyzed and their design corrected if possible.

So, a result set like the one below is what we are aiming for:

Table_Name       Pages   Rows_Per_Page  Rows_in_Table   Table_Size_MB
---------------- ------- -------------- --------------- ------------- Order_Details    100021  35.0380        3511193         783
Order_Master     3345    83.1629        278180          26
Customer_Details 2376    526.0118       1249804         19
Customer_Master  358     404.1006       144668          3 
...

To accomplish this requirement, the following programming elements would be used:

sys.fn_PhysLocFormatter - To obtain the row distribution per table
CHARINDEX, SUBSTRING, COUNT, AVG functions - To separate out required elements in the result set from sys.fn_PhysLocFormatter, and calculate Pages for a given table and Rows per Page (Row Density)
sp_MSforEachTable - To perform the above tasks for every user table in the database without resorting to looping with cursors or WHILE loops, and to keep the code short.
#TEMP temporary table - To host and display the final results as Table Name, Page count, Row Count, Row Density and Table Size with one row per user table in the result set.

The Actual Program

Let me explain the flow of the program below from the bottom upwards. The derived table a, just provides us the result set from the execution of the function, sys.fn_PhysLocFormatter for a given user table.

Using the CHARINDEX, SUBSTRING and COUNT functions in the derived table b, we next obtain the Page Number and calculate the number of occurrences of each Page Number in the result set from the derived table, a. For example, for the table My_Demo_Table, the page number 205496 occurs 31 times i.e. it has 31 rows of data, page number 205497 occurs 30 time i.e. it has 30 rows, and so on.

Then in the final SELECT statement we use the COUNT function to calculate the number of pages in the table and the AVG aggregate function to find the average number of Rows per Page. At this stage we have a single row of data for the table that includes Table_Name, Pages and Rows_Per_Page.

Next we encapsulate all this code into SP_MSforEachTable. Most of us are very familiar with this system stored procedure. It performs the same set of prescribed activities on all user tables in a given database. This would enable us to perform the above operation for each user table in the database. The placeholder, '?' actually represents a fresh table name for each instance of execution of the code.

Next, we insert all this data into the temporary table, #TEMP. This table has two calculated columns namely, Rows_in_Table and Table_Size_MB so that we can obtain their values on the fly.

Finally, we just display the contents of the table, #TEMP and then drop this temporary table for clean-up.

Now let's execute the program below to obtain user-friendly results that we seek for analysis and action:

/*******************************
 NAME : Table Size script.sql
 Author : Suresh K Maganti
 Purpose: Get Pages, pages per row, Table size for each table. Find large tables with low row-density.
********************************/
if OBJECT_ID('tempdb..#TEMP') is not null
drop table #TEMP 
create table #TEMP (Table_Name varchar(100) null
 ,Pages int null
 ,Rows_Per_Page decimal(28,4) null
 ,Rows_in_Table as convert(decimal(28, 0), Pages * Rows_Per_Page)
 ,Table_Size_MB as convert(decimal(28, 0), (convert(decimal(28,2), Pages) * 8192.00) / (1024.00 * 1024.00)))
insert into #TEMP (Table_Name, Pages, Rows_Per_Page)
exec sp_msforeachtable 'select ''?'', count (Page_Number) Numer_of_Pages_in_Table, isnull(Avg (isnull(convert(decimal(28, 4), Rows_Per_Page), 0.0000)), 0.0000) Avg_Rows_Per_Page
from
(select substring(SUBSTRING(File_Page_Row, CHARINDEX('':'', File_Page_Row, 1) + 1, 100), 1, charindex('':'', SUBSTRING(File_Page_Row, CHARINDEX('':'', File_Page_Row, 1) + 1, 100), 1)-1) Page_Number, COUNT(1) Rows_Per_Page
from
(select sys.fn_PhysLocFormatter (%%physloc%%) File_Page_Row from ?) a
group by substring(SUBSTRING(File_Page_Row, CHARINDEX('':'', File_Page_Row, 1) + 1, 100), 1, charindex('':'', SUBSTRING(File_Page_Row, CHARINDEX('':'', File_Page_Row, 1) + 1, 100), 1)-1)) b'
select * from #TEMP
order by 2 desc, 3 asc
if OBJECT_ID('tempdb..#TEMP') is not null
drop table #TEMP

Here is the result set:

Table_Name          Pages   Rows_Per_Page  Rows_in_Table   Table_Size_MB
------------------- ------- -------------- --------------- -------------
My_Demo_Table       202771  31.4585        6378872         1584
Parts_Details       22393   58.0295        1299455         175
Procurement_Details 14823   294.4860       4365166         116
Order_History       14506   300.9214       4365166         113
Annual_Reports      12661   102.6344       1299454         99
...

I have shown the partial result set above for our explanation. Please note that due to rounding off, the column Rows_in_Table is within 99.9999% accuracy and not exactly 100% in some cases.

From a glance of the result set above, the two tables, My_Demo_Table and Parts_History could be objects of my investigation as they are the largest tables with smallest "Row Density" values. In this article, we would be looking into design issues with the table, My_Demo_table only.

From past designs and analyses specifically in OLTP environments, I have generally observed that having less number of rows/page per table makes joins with other tables slower. In other words, if we have more number of rows per page for a table, the SQL Server can read more data in less scans. I generally mark my top ten largest tables with the least "Row Density" as candidates for investigation. I always try to find ways to get the "Row Density" up as much as possible in consultation with the application development teams. Sometimes this process requires further normalization as well. In our example, My_Demo_Table table is my largest table with ~6.3 million rows) occupying ~1584 MB of disk space. However, the row density is just around ~31. If we could increase its row density somehow, we could automatically bring down the table size on disk.

Let's see the table structure of the table, My_Demo_Table now:

Sp_help [My_Demo_Table]
Column_name     Type          Computed Length  Prec Scale   Nullable
--------------- ------------- -------- ------- ---- ------- ---------
Order_ID        int           no       4       10   0       no
Product_ID      int           no       4       10   0       no
Specifications  nvarchar      no       400                  yes
Status          varchar       no       255                  yes

Observations and verifications

The column, [Specifications] has the datatype nvarchar(400). In our case, we do not store Unicode values in this column as per the application team.
There is a composite clustered index on the columns (Order_ID, Product_ID) with fillfactor = 100.
There is no compression on this table or its index.
This is a pretty normalized table with just 4 columns.
This table is not partitioned.

Recommendations:

If we could change the datatype for the column, [Specifications] to varchar(400) from nvarchar(400), we could reduce the storage space. This is because this column is not and would not be used to store Unicode data as confirmed by the application development team.
If we could compress the lone index and the table data, we could reduce the table size further.

Evaluation of recommendations

In SQL Server 2000 and 2005, implementation of only the first recommendation above would have been possible as index and table compression were not part of those products. In Microsoft SQL Server 2008, we could implement both the recommendations based on the evaluation below. Let us first estimate our space gains through page compression feature in Microsoft SQL Server 2008 by executing the T-SQL code below:

sp_estimate_data_compression_savings @schema_name = 'dbo', @object_name = 'My_Demo_Table', @index_id = 1, @partition_number = 1, @data_compression = 'Page'

Here is the result:

object_name   schema_name index_id partition_number 
------------- ----------- -------- ----------------
My_Demo_Table dbo         1        1
size_with_current_compression_setting(KB) 
-----------------------------------------
2072184
size_with_requested_compression_setting(KB)
-------------------------------------------
1630112
sample_size_with_current_compression_setting(KB)
------------------------------------------------
40800
sample_size_with_requested_compression_setting(KB)
--------------------------------------------------
32096

That's almost 21% less space being used with page compression as per this estimate. Let us estimate our gains through row compression now by executing the T-SQL code below:

sp_estimate_data_compression_savings 
  @schema_name = 'dbo', 
  @object_name = 'My_Demo_Table', 
  @index_id = 1, 
  @partition_number = 1, 
  @data_compression = 'Row'

Here is the result:

object_name   schema_name index_id partition_number 
------------- ----------- -------- ----------------
My_Demo_Table dbo         1        1
size_with_current_compression_setting(KB) 
-----------------------------------------
2072184
size_with_requested_compression_setting(KB)
-------------------------------------------
2044464
sample_size_with_current_compression_setting(KB)
------------------------------------------------
40672
sample_size_with_requested_compression_setting(KB)
--------------------------------------------------
40128

With Row compression, our gain would be around 1.3% only as per this estimate which is way below 21% with Page compression. So we could definitely benefit from Page compression for this table. It is nice that Microsoft SQL Server 2008 provides us the ability to ascertain the kind of compression that would be most beneficial from a space usage perspective as demonstrated above.

Implementation

In this phase, we would first change the datatype of the column dbo.My_Demo_Table.[Specifications] from nvarchar(400) to varchar(400). Next, we would compress the lone index on this table using PAGE compression. Finally, we would compress the table My_Demo_Table using PAGE compression. Please note that this compression feature is available in Microsoft SQL Server 2008 and not in earlier versions.

So here we go with the implementation using the following code:

--Change the datatype from nvarchar(400) to varchar(400)
alter table [dbo].[My_Demo_Table] alter column [Specifications] varchar(400) null
go
--Compress the clustered index using PAGE compression:
alter index all on [dbo].[My_Demo_Table] rebuild with (fillfactor = 95, Data_Compression = PAGE, maxdop = 1)
go
--Compress the table using PAGE compression (Not necessary if the above step is done):
alter table [dbo].[My_Demo_Table] rebuild Partition = All with (Data_Compression = PAGE, maxdop = 1)
go

Testing

Now let's run our script, 'Table Size Script.sql' again to see the row density and table sizes. please note that we have till now analyzed and compressed only the table, My_Demo_Table. Here is the result set:

Table_Name          Pages    Rows_Per_Page Rows_in_Table Table_Size_MB
------------------- -------- ------------- ------------- -------------
My_Demo_Table       107706   59.2248       6378866       841
Parts_Details       22393    58.0295       1299455       175
Procurement_Details 14823    294.4860      4365166       116
Order_History       14506    300.9214      4365166       113
Annual_Reports      12661    102.6344      1299454       99
...

Please note that for the table, [dbo].[My_Demo_Table] the row density(rows per page) has gone up from 31 to 59 now. At the same time, the table size itself has come down from 1584MB to 841MB, a saving of almost 46.91% disk space. But that is not all. Let's try to compare the statistics before and after our changes to the My_Demo_Table table by simply getting its rowcount using the following code. Please note that memory was cleared before each of the executions below:

set statistics io on
set statistics time on
set statistics profile on
select COUNT(1) from dbo.My_Demo_Table with (nolock)
set statistics io off
set statistics time off
set statistics profile off

Before changes
Table 'My_Demo_Table'. Scan count 1, logical reads 259021, physical reads 3890, read-ahead reads 259021, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times: CPU time = 500 ms, elapsed time = 510 ms.

After changes

Table 'My_Demo_Table'. Scan count 1, logical reads 104185, physical reads 1665, read-ahead reads 104185, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times: CPU time = 515 ms, elapsed time = 523 ms.

You may note that the logical reads have come down by almost 58%. Physical reads have also reduced by 57% as there are more number of pages per row in the table, My_Demo_Table after datatype changes and compression. CPU time has increased marginally (3%)because there is a storage overhead due to compression. Reduction in logical reads enhanced performance i.e. made queries execute faster when this table was joined with other tables in my tests.

Conclusion

Design flaws can be probed, analyzed and ascertained by our scripts. Implementation of optimal datatypes and appropriate compression level could reduce the disk space usage by around 46% and logical reads by around 58%. Please note that these results could vary by tables depending on the number of columns, their datatypes and characteristics of the columns within the tables. Different studies show an increase in CPU utilization by 1% to 2% due to compression (in our specific case though it was 3%. So please do test and verify your improvements in a test environment before implementation.

Improving Database Design and Performance Using Compression

Introduction

Building up a scenario

Using sys.fn_PhysLocFormatter for all User Tables

The Actual Program

Observations and verifications

Evaluation of recommendations

Implementation

Testing

Conclusion

Rate

Share

Categories

Share

Rate

Improving Database Design and Performance Using Compression

Introduction

Building up a scenario

Using sys.fn_PhysLocFormatter for all User Tables

The Actual Program

Observations and verifications

Evaluation of recommendations

Implementation

Testing

Conclusion

Rate

Share

Categories

Share

Rate

Related content

Time Bomb Design - A Longer Fuse

Name Value Pair - Part II

theSystem

Toward Integrity Part 5

Beginning Database Design - Spot the Flaws