• Coincidentally, this editorial came one day after I was feeling strongly that we are missing a few important types in SQL Server.

    Let's start with something very close to the topic (in reverse).

    The type varchar(N) is stored as 2 bytes for length, followed by one byte for each character (2 + N in total). Check your tables and you will see that for the majority of columns this is rather wasteful, as most column data is shorter than 256 characters (or even 64). Codes, logical names, and most labels are typically quite short!
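    To put a number on that waste, here is a minimal Python sketch that models the 2 + N layout described above (a simplified model of just the column bytes, ignoring the rest of the row overhead):

        # Model: varchar stores a 2-byte length prefix plus one byte per character.
        def varchar_storage_bytes(value: str) -> int:
            return 2 + len(value)

        for code in ("NL", "ACTIVE", "ORDER_CONFIRMED"):
            used = varchar_storage_bytes(code)
            print(f"{code!r}: {used} bytes stored, {2 / used:.0%} of it is the length prefix")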

    My thought was that I miss smallvarchar, smallnvarchar and smallvarbinary types; the guaranteed small size of these types could work well with some special optimizations during query plan generation.

    Another thought was to make the number of bytes used to store the length dynamic.

    Using two bytes to store the length of a 6-byte "code" is insane, any way you look at it.

    A scheme like the one used in UTF-8 could be applied, where the most common lengths (< 128) are stored in just one byte (even more efficient encodings are possible); see the sketch below.

    The UTF-8 description gives a nice idea of how this works: http://en.wikipedia.org/wiki/UTF-8#Description
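    As a concrete illustration, here is a minimal Python sketch of such a length prefix (my own LEB128-style variant, not anything SQL Server implements): lengths under 128 take a single byte, and a set high bit signals that more bytes follow, so the prefix grows only when the data does:

        def encode_length(n: int) -> bytes:
            # 7 payload bits per byte; the high bit means "continue".
            out = bytearray()
            while True:
                byte = n & 0x7F
                n >>= 7
                if n:
                    out.append(byte | 0x80)
                else:
                    out.append(byte)
                    return bytes(out)

        for n in (6, 127, 128, 70_000):
            print(f"length {n}: {len(encode_length(n))} prefix byte(s)")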

    Such a scheme would also be compatible with page sizes larger than 8K.

    The problem I worked on

    I have to store IP addresses in a log table, and always storing 16 bytes (for IPv6) is out of the question. For IPv4 I would need just 4 bytes, and it can be stored as an integer with some simple conversion logic. Future-proofing a system for IPv6 means either separate nullable fields for IPv4 and IPv6, which can complicate code, or a mapping table of sorts; both are undesirable to me (for this).

    In the end I settled on varbinary(16) for this purpose, as it converts simply to int (just cast it), which is enough for the IPv4 addresses we start with. Besides the length difference (4 bytes for IPv4, 16 for IPv6), the stored length also records whether an IPv4 or IPv6 address was supplied before logging. Still, using two bytes for that length pains me; I would rather have used just one.
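    Outside of SQL, the idea is easy to demonstrate with Python's standard ipaddress module (just an illustration of the scheme; the column itself stays varbinary(16)): the packed form is 4 bytes for IPv4 and 16 for IPv6, so the stored length alone records the address family:

        import ipaddress

        for text in ("192.0.2.1", "2001:db8::1"):
            addr = ipaddress.ip_address(text)
            packed = addr.packed            # the bytes you would log into varbinary(16)
            family = "IPv4" if len(packed) == 4 else "IPv6"
            print(f"{text}: {len(packed)} bytes -> {family}, as integer: {int(addr)}")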

    What we are really missing is an IP type that handles this sort of common thing and does so very efficiently!

    Other frequent problems with data type growth

    Deciding when to use tinyint, smallint, int or bigint often brings competing requirements into conflict.

    Do we know for certain and in advance how much "number space" a table is going to require?

    Often the answer is NO, and we allocate too much space just to be on the safe side.

    Just as text is variable in size by nature, with extreme cases being the exception rather than the norm, so are integers. I would love a variable-length encoded bigint (let's call it varbigint) that uses no more than a byte for very small numbers, yet is treated in all regards (except storage) as a true bigint. Such a type would make an ideal log table primary key, and the same goes for other ever-growing tables. It would also serve well for numeric columns that almost never get updated with a value requiring a different amount of storage than the one previously stored, especially when they are nullable too! There are lots of candidates here, as much data is only ever recorded, and small values are far more frequent than large ones!
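    To make the idea concrete, here is a sketch of how such a varbigint could be encoded (my own illustration, borrowing the zigzag-plus-continuation-byte scheme that protocol buffers use for signed varints; not a Microsoft design): small magnitudes, positive or negative, fit in one byte, while the full 64-bit range never needs more than ten:

        def encode_varbigint(value: int) -> bytes:
            # Assumes value fits in a signed 64-bit bigint.
            # Zigzag-map the signed value so small magnitudes get small codes:
            # 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...
            n = (value << 1) ^ (value >> 63)
            out = bytearray()
            while True:
                byte = n & 0x7F
                n >>= 7
                if n:
                    out.append(byte | 0x80)   # high bit set: more bytes follow
                else:
                    out.append(byte)
                    return bytes(out)

        for v in (0, 42, -7, 1_000_000, 2**62):
            print(f"{v}: {len(encode_varbigint(v))} byte(s)")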

    While this would save a lot of space and make modeling choices easier (very valuable and cost-reducing), I do not see this as compression, since no patterns between values are exploited. The deprecation of vardecimal in SQL Server 2014 in favor of compression makes this a rather unlikely scenario to ever see the light of day. It seems Microsoft is hell-bent on making every "bad call" they can possibly make regarding SQL Server, especially given that compression is an Enterprise-only feature!

    I better stop ranting now 🙁