We’ve been using the wrong data types for all the wrong reasons. DBAs, developers, data architects, etc. have all been told to keep data as narrow as possible, using the smallest data type to get the job done. We sacrificed the integrity of our data for the performance of our data, and I’d like to thank Karen López (b|t) for pointing out that performance is NOT our #1 Job.
Disclaimer – Enterprise-ONLY
I’m talking about compression again, so I’m talking about my disclaimer again. Compression is an Enterprise-ONLY feature that affects the structure of your data. That means you can’t even restore a backup of your database to anything other than Enterprise or Developer editions unless you remove compression before the backup.
Also, this post is about maximizing the benefits of compression to reduce the costs of proper data architecture, not implementing compression blindly. Test it on non-prod, test it again on non-prod, then only implement a documented change with a rollback plan.
Karen López Made Me Do It!
Wait a minute, it’s Lopez, not López, isn’t it? Nope, she actually spells her name differently because there are so many programs out there that think we can get by with 26 letters in the alphabet. To keep Karen happy (it’s in your best interest), we need to change our LastName column from VarChar(50) to NVarChar(50). I know full well that I’m doubling the size of my data for every row, not just Karen’s, to make sure people don’t have to change their names to make me happy.
I’m wrong about half of that last sentence… Yes, Unicode data is 2 bytes per character uncompressed while non-Unicode data is only 1 byte per character. The key word is uncompressed, because that’s not your only option starting in SQL Server 2008 R2 (as opposed to 2008, when we got compression in general). Look at what BOL has to say about Unicode Compression Implementation for more on that, with the most important part being that you have to implement at least row-level compression to gain this advantage.
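Here’s a rough sketch of what that looks like. The table and column names are made up for illustration; the point is that widening the column to NVarChar buys you nothing back until you also rebuild with at least row compression (SQL Server 2008 R2 or later).

```sql
-- Hypothetical table: widen the column to Unicode so names like
-- López survive intact.
ALTER TABLE dbo.Person
    ALTER COLUMN LastName NVARCHAR(50) NOT NULL;

-- Rebuild with row compression. This is what turns on Unicode
-- compression, so mostly-Latin data stores close to 1 byte per
-- character again instead of a flat 2.
ALTER TABLE dbo.Person
    REBUILD WITH (DATA_COMPRESSION = ROW);
```

Remember the disclaimer above: test this on non-prod first, and rebuild your nonclustered indexes with compression too if you want the savings everywhere the column appears.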
Now we can accommodate Karen without complaints. After all, why should we complain? She’s a great person to have around.
Running Out Of Numbers
Keeping up with my Karen López theme, she made a great presentation with Tom LaRock (b|t) called The Ticking Timebombs in Your Database that focused a lot on picking the right data type for identity columns. Spoiler alert: know your data well enough to choose either BigInt or Int for an identity column, then start with the maximum negative value for that type.
Look at your existing databases to see if you’re running out of numbers using Tom’s post SQL Server Identity Values Check. You might find that you already have a case where there’s a 4-byte column about to hit the maximum number for an Int. Maybe it wasn’t seeded with a big negative number, or maybe it’s because the database was just more active than it was designed for. The important part is that this timebomb is quietly ticking down to the 3 AM phone call Tom just warned us about.
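If you want a quick look before digging into Tom’s post, something along these lines will do it. This is my own hedged sketch, not Tom’s script (his is more thorough), and it leans on the sys.identity_columns catalog view:

```sql
-- How close is each identity column to its ceiling?
-- Rough sketch; Tom's SQL Server Identity Values Check is more complete.
SELECT  OBJECT_SCHEMA_NAME(ic.object_id) AS SchemaName,
        OBJECT_NAME(ic.object_id)        AS TableName,
        ic.name                          AS ColumnName,
        t.name                           AS TypeName,
        ic.last_value,
        CASE t.name
            WHEN 'int'    THEN CONVERT(FLOAT, ic.last_value) / 2147483647
            WHEN 'bigint' THEN CONVERT(FLOAT, ic.last_value) / 9223372036854775807
        END AS FractionOfMaxUsed
FROM sys.identity_columns ic
JOIN sys.types t ON t.user_type_id = ic.system_type_id
WHERE ic.last_value IS NOT NULL
ORDER BY FractionOfMaxUsed DESC;
```

Anything creeping toward 1.0 in that last column is the 3 AM phone call waiting to happen.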
Now I’d like to stand in opposition to Karen and Tom a bit by suggesting starting a BigInt at the maximum negative value for Int. Say you’re pretty sure an Int is big enough, but you’re not willing to bet your career on it. However, you really like the idea of a 4-byte column as opposed to an 8-byte column. Here’s what you do… Make the column a BigInt (you’re not betting your career here), start it at -2,147,483,648 (maximum negative value for Int), increment by 1, and use compression.
Since the value is in the range of a 4-byte signed Int, the compressed stored value of this column will be 4 bytes. If you happen to go over 2,147,483,647 then you keep your job (probably a good thing) and compression just loses a little of its magic as values in this column start needing more than 4 bytes.
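Put together, the trick looks something like this. The table is hypothetical; the two pieces that matter are the IDENTITY seed and the compression setting:

```sql
-- Hypothetical table: a BIGINT identity seeded at the minimum Int value.
-- With row compression, values that still fit in an Int store in 4 bytes,
-- so you only pay for the wider type if you actually outgrow an Int.
CREATE TABLE dbo.Orders
(
    OrderID   BIGINT IDENTITY(-2147483648, 1) NOT NULL,
    OrderDate DATETIME2 NOT NULL,
    CONSTRAINT PK_Orders PRIMARY KEY CLUSTERED (OrderID)
        WITH (DATA_COMPRESSION = ROW)
);
```

You get BigInt’s safety net with Int’s storage cost for the first ~4.3 billion rows.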
By the time the values outgrow 4 bytes and can’t be compressed going forward, row-level compression on this single column, considering only the leaf level of a single index, has saved you about 16 GB. Because it’s likely to be part of the clustered index key, it’s going to be part of all of your nonclustered indexes. Say you have 4 nonclustered indexes on this table, each compressed to give you another 16 GB of savings apiece. So you saved 80 GB of disk space, which also means less index space fighting for room in memory on your server.
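The back-of-the-envelope math behind those numbers, if you want to check me: roughly 4.29 billion rows (the full Int range) times 4 bytes saved per row, once for the clustered index and once for each of the 4 nonclustered indexes.

```sql
-- 2^32 rows * 4 bytes saved, converted to GB, for one index and for
-- the clustered index plus 4 nonclustered indexes.
SELECT  CAST(4294967296 AS BIGINT) * 4 / 1073741824     AS GBSavedPerIndex, -- 16
        CAST(4294967296 AS BIGINT) * 4 / 1073741824 * 5 AS GBSavedTotal;    -- 80
```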
You lose about half of your potential values when you use this method, but how much does that matter when the maximum is a number so large I’d have to look up its name: 9,223,372,036,854,775,807. The important part is you avoided the extra cost of a BigInt while not failing miserably if you exceeded the maximum value of an Int.
I have kids and don’t get enough sleep as it is. A compressed BigInt lets me sleep knowing I have both great performance and a large pool of numbers.
Why the Obsession with Karen?
Karen happens to be the perfect example. She’s a great data architect who spells her name wrong to accommodate not-so-great data architects. Follow that up with her making presentations on using larger numeric data types to avoid running out of numbers. Then I stumbled across her post about my #1 Job when I had compression on my mind, which caused me to make the connection between compressing the data to get the performance along with the data quality. Add to all this that she’s one of the most outspoken people (end of sentence).
It’s worth giving her (b|t) links again, with a little flair this time. Her blog can be found at DataModel.com, and it’s a must-read for anyone creating or altering a column. If you’re creating or altering lots of columns, then she’s a must-hire consultant. She tweets like a bot with ADHD as @DataChick. Unlike a bot, her tweets are pertinent and interesting…sometimes too interesting.
Just don’t ask her about her dad’s middle name, Canadian “zip codes”, the joys of air travel, or shoes.