SQL Server: The danger of large data types

Databases are platforms that are designed to securely store and retrieve your data. Perhaps that’s why they’re called a data “base”? So if your data is in a base, you’d want to lay it out in some logic way.

Watch towers go on the corners, barracks go somewhere over there, make doors at each end of the barracks, mess hall goes there, put the drapes up on that window, place the latrine over there, ect, ect…

We do the same with our data. We create tables to store the data, views to mask some data, stored procedures to retrieve the data, ect, ect…

Data is defined by the business. How your store it is planned by you; or, probably someone before you that you really, REALLY dislike because of the choices they made with the DDL/schema.

When you buy a house, you place your furniture in rooms visually. Sometimes it’s really hard to look around the things that the owner has in the room. You think, WHY?!?! This is too big for this room. Why would they do this.

If you’ve worked with data for some time you know where this is going. If you haven’t let me move on. Let’s say you have a table with customers and for whatever reason you place ProductID in the customers table. Why would you do this? It’s like putting a sink in the living room. You’re going to question everything after seeing that.

It’s not limited to things in wrong places though. I’ve seen in many cases where the data type is just entirely wrong.

DATA TYPE & SIZE

In this post I’m going to focus on numeric data; though, it’s worth noting that unicode (nvarchar) is twice the size of non-unicode (varchar) data.

In SQL Server date and numeric types have different sizes depending on what you choose. With that said you’d be wise to choose a type that matches whatever you’re storing. Don’t buy size 14 shoes if you have a size 10 foot, right?

INT types

Bigint: 8 bytes
(-9,223,372,036,854,775,808 to 9,223,372,036,854,775,808)
Int: 4 bytes
(-2,147,483,648 to 2,147,483,647)
smallint: 2 bytes
(-32,768 to 32,767)
tinyint: 1 byte
(0 to 255)

Numeric types

Decimal & Numeric:
Precision:
1-9       |   5 bytes
10-19   |   9 bytes
20-28   | 13 bytes
29-38   | 17 bytes

Float (4 or 8 bytes), real (4 bytes), money (8 bytes), smallmoney (4 bytes)

SCENARIO

Let’s say that you have a student loan database. You want to track students, their loan amount, how much they’ve paid, and how much they still owe.

Money may be a logical data type choice; though, I’ve seen many developers complain about rounding issues and they chose to use numeric or decimal instead.

Smallmoney is only 4 bytes and can support payments of up to $214,748.36. To store this in decimal you’re looking at 5 bytes and storing up to $999,999.99. That’s a really large payment!

What if I want to SUM small money values that would equal more than $214,748.36? It won’t fail. Don’t believe me? Give it a try:

DECLARE @test TABLE (a int, b smallmoney)
INSERT INTO @test (a, b) VALUES (1, 214748), (2, 214748)
SELECT SUM(b) FROM @test

It worked!

Now let’s say that the original developer chose DECIMAL(38,2). What would happen? First, instead of 4 or 5 bytes the data would be 17 bytes! We’d also be storing up to $99.9 Decillion. That’s a real word, I didn’t make it up!! What’s that number look like? It’s nearly too wide for this blog page. HAHA!

$99,999,999,999,999,999,999,999,999,999,999,999.99

So here is the question. Why would you make a data type that is so large? It’s going to store payments and that is more money than exists in the world. I’ve seen this on a few occasions in production databases.

Let’s examine how this affects the table.

I created a simple schema and added PKs after.

CREATE TABLE loans.Student
(    StudentID bigint identity(1,1),
FirstName nvarchar(50),
LastName nvarchar(50)
)
CREATE TABLE loans.Loan
(    LoanID bigint identity(1,1),
StudentID bigint,
LoanDate datetime,
LoanAmount decimal(38, 2),
LoanBalance decimal(38, 2)
)
CREATE TABLE loans.Payments
(    PaymentID bigint identity(1,1),
StudentID bigint,
LoanID bigint,
PaymentAmount decimal(38, 2)
)

Next, I used dbForge Data Generator to populate my tables with enough data to express my point. 200,000 loans, 148,000 students, and 5,000,00 payments. I use this tool because it’s easy and it has a lot of realistic data that can be populated to make my demos feel like production data.

And now the testing:

I have had a script for quite some time that I use to look at tables and their size. Please feel free to use it.

DECLARE @DBID int = DB_ID(N’EpicFail’),
@TableSchema nvarchar(50) = ‘Loans’,
@TableName nvarchar(50) = ‘Loan’,
@TableTwoPartName nvarchar(100)
SET @TableTwoPartName = CONCAT(@TableSchema, ‘.’, @TableName)
–View the indexes on a table
SELECT isc.COLUMN_NAME, isc.DATA_TYPE, ic.*
FROM sys.index_columns ic
INNER JOIN INFORMATION_SCHEMA.COLUMNS isc on (ic.column_id = isc.ORDINAL_POSITION)
and (TABLE_SCHEMA = @TableSchema)
and (TABLE_NAME = @TableName)
WHERE object_id = object_id(@TableTwoPartName)
ORDER BY INDEX_ID
–Display the index levels and their size
SELECT index_depth AS D
, index_level AS L
, record_count AS ‘Count’
, page_count AS PgCnt
, avg_page_space_used_in_percent AS ‘PgPercentFull’
, min_record_size_in_bytes AS ‘MinLen’
, max_record_size_in_bytes AS ‘MaxLen’
, avg_record_size_in_bytes AS ‘AvgLen’
FROM sys.dm_db_index_physical_stats(@DBID, object_id(@TableTwoPartName), 1, 1, ‘DETAILED’)
–Display data about the pages used in an index
SELECT p.index_id, au.*, p.rows
FROM sys.allocation_units au
INNER JOIN sys.partitions p on (p.hobt_id = au.container_id)
WHERE p.object_id = object_id(@TableTwoPartName)

Looking at the table, we can see that the data is stored on 1667 pages. That’s 8k per page so 1667*8k would be 13,336 bytes for the 200,000 rows.

What if we ALTER TABLE and change the datatype to decimal(9,2)?

Now we only have 1064 data pages! We just cut our overall size of the table by a third and didn’t affect the functionality at all!!

How would this affect query performance? Well thinking about it you could say that the less pages there are the better performance you’ll have not only for some things like scans but also for maintenance like statistics or index rebuilds and possibly even backups (depending on how much wasted space is removed).

Here’s a really simple example of a scan:

Note that with the larger data we have more pages. Logical reads are from the buffer (RAM); but, these pages may not always be in RAM and would have to come from storage.

The cost is higher due to the extra pages as well.

Here we have the smaller datatype results:

SUMMARY

Whenever possible, be sure to size your data types appropriately. This practice will ensure the best performance for your production loads.

Need help with a query plan? Check out my website: www.howsmyplan.com

SQL Server: The danger of large data types

DATA TYPE & SIZE

SCENARIO

SUMMARY

Rate

Share

Share

Rate

SQL Server: The danger of large data types

DATA TYPE & SIZE

SCENARIO

SUMMARY

Rate

Share

Share

Rate

Related content

Book Review: Big Red - Voyage of a Trident Submarine

Database Mirroring FAQ: Can a 2008 SQL instance be used as the witness for a 2005 database mirroring setup?

Inserting Markup into a String with SQL

Networking - Part 4

Speaking at Community Events - More Thoughts