Are these good and appropriate data types?

  • Hi all,

    I inherited a table that I am thinking to change some columns data types. Table has 100+ mln records and 96 columns and serves as a fact table in our data warehouse.

    1. Key column with numeric(15,0) type. All values are actually integers. Would replacing it with int/bigint be better?
    2. as_of_date char(8) - values look like '20180430'. Will replacing it with date data-type will be better?
    3. as_of_month char(6) - values look like '201804'. With what I can replace it? Or leave it as is? It does not have separate columns for year and month. What is best practice for such a case?
    4. Several flag columns char(1). Will replacing them to bit make sense?
    5. It has 22 columns for currencies values like dollars and cents, even with fractions of cents, but they all of float data types. If I script out this table DDL, it does not show any size for these floats. But from sys.columns it has max_length = 8, precision = 53, scale = 0. Would it make a sense to replace them with just numeric(20,4)?

     

    Thanks

    1. bigint would be better.
    2. date would be much better.
    3. don't need to store.  Use code on the display side to show only year and month from the as_of_date column.
    4. yes, absolutely, if there are only 2 values in the char column.
    5. use decimal, with either 2 or 4 decimal places (depending on usage, either 2 or 4 decimals are standard usage).  You could also use the money data type if you prefer, which has 4 decimal places.

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

  • Thanks, Scott.

    For #3: They use as_of_month not just for display but a lot in where clauses and join conditions, like a.as_of_month = b.as_of_month

    And about money data type, aren't it deprecated?

  • Don't know if it's officially deprecated, but it has lots of issues, so, yeah, probably better to stick to decimal.

    As to as_of_month, you'd be better off converting that to go against as_of_date, but, if you can't do that, then you might indeed have to store it separately so that you can index it separately.

     

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

  • You first need to find out what the code that uses the table looks like.  DW tables are NOT usually "proper normalized tables".  For example, removing the "month" column or converting it into a date will surely gum up a shedload of reporting code.  Even something as seemingly innocuous as changing multiple "flag" columns to the BIT datatype could really put the screws to reporting code if they're doing aggregates on those columns because you can't do things like SUM() on a BIT column.

    You also need to be really careful when changing things like FLOAT to something else.  For example, if you change them to Decimal(p,s) without know how they are used, you could end up with some really unfortunate silent rounding problems that can be as bad as the imprecision of the FLOAT datatype and are sometimes worse.  Please see the following MS documentation on that.

    https://docs.microsoft.com/en-us/sql/t-sql/data-types/precision-scale-and-length-transact-sql?view=sql-server-2017

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • You can also use a computed column for as_of_month, there's no need to physically store it again.

    as_of_date date NOT NULL,

    as_of_month AS CONVERT(varchar(6), as_of_date, 112),

    That column is fully usable by all queries and other code, it's just not physically stored, it's generated on the fly whenever code actually needs the value.

     

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

  • Money is deprecated, but never going away. I wouldn't worry about it.

  • ScottPletcher wrote:

    You can also use a computed column for as_of_month, there's no need to physically store it again. as_of_date date NOT NULL, as_of_month AS CONVERT(varchar(6), as_of_date, 112), That column is fully usable by all queries and other code, it's just not physically stored, it's generated on the fly whenever code actually needs the value.  

    You can, however, increase performance a fair bit if it is materialized as a PERSISTED computed column and use CHAR(6) instead of VARCHAR(6).  It can also be indexed quite nicely.

    Of course, I probably not do either.  I'd use the tried and true >= and < method to do the lookups  and do the grouping in the code.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden wrote:

    ScottPletcher wrote:

    You can also use a computed column for as_of_month, there's no need to physically store it again. as_of_date date NOT NULL, as_of_month AS CONVERT(varchar(6), as_of_date, 112), That column is fully usable by all queries and other code, it's just not physically stored, it's generated on the fly whenever code actually needs the value.  

    You can, however, increase performance a fair bit if it is materialized as a PERSISTED computed column and use CHAR(6) instead of VARCHAR(6).  It can also be indexed quite nicely. Of course, I probably not do either.  I'd use the tried and true >= and < method to do the lookups  and do the grouping in the code.

    The easiest of them all...

    SELECT CONVERT(int, (YEAR(GETDATE()) * 100) + MONTH(GETDATE()));
  • Jason A. Long wrote:

    Jeff Moden wrote:

    ScottPletcher wrote:

    You can also use a computed column for as_of_month, there's no need to physically store it again. as_of_date date NOT NULL, as_of_month AS CONVERT(varchar(6), as_of_date, 112), That column is fully usable by all queries and other code, it's just not physically stored, it's generated on the fly whenever code actually needs the value.  

    You can, however, increase performance a fair bit if it is materialized as a PERSISTED computed column and use CHAR(6) instead of VARCHAR(6).  It can also be indexed quite nicely. Of course, I probably not do either.  I'd use the tried and true >= and < method to do the lookups  and do the grouping in the code.

    The easiest of them all...

    SELECT CONVERT(int, (YEAR(GETDATE()) * 100) + MONTH(GETDATE()));

    Fewest number of bytes so far, as well.  If you want to go real short, convert to the first of the month and store it as a DATE.

    Just to be sure, though... I know it's just an example but have to say it outloud... GETDATE() will make an indeterminate formula that cannot be persisted.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • As regards the original "1. Key column ...", it should at least be demoted to a secondary key (to make the clustering key unique).

    This table almost certainly should be clustered first on as_of_date, particularly if it's (almost) always specified in WHERE conditions.  That's one reason having as_of_month in the table is such a terrible idea.

    Now, yes, the user should be able to conveniently enter whole months, such as 2019[-]04 and 2019[-]05.  But the app should produce a query like this:

    WHERE ((as_of_date >= '20190401' AND as_of_date < '201900501') OR (as_of_date >= '20190501' AND as_of_date < '20190601')

    rather than this:

    WHERE as_of_month IN ('201904', '201905')

    The query doesn't have to directly reflect the user input.

    With as_of_month present in the table, it naturally gets used, and thus it gets indexed, as noted above.  Then, you query a month and it is way too slow (since it hit the "tipping point" and caused a table scan).  So, to avoid that, you end up having to build multiple covering indexes headed by as_of_month, to cover all queries that use as_of_month.  Lots of overhead that overall causes query delays and wastes resources vs clustering by as_of_date and querying by it.

    Identity keys have their uses, but not for every table, or even the majority of them.  Review the situation carefully before making a final decision on the oh-so-critical clustering key.

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

  • Jeff Moden wrote:

    Fewest number of bytes so far, as well.  If you want to go real short, convert to the first of the month and store it as a DATE. Just to be sure, though... I know it's just an example but have to say it outloud... GETDATE() will make an indeterminate formula that cannot be persisted.

    I do know that the old school DATETIME & SMALLDATETIME are fundamentally different in their byte patterns than the newer DATE, TIME & DATETIME2 datatypes... So it does stand to reason that some sort of implicit conversion would need to take place if you using the original date/time functions like GETDATE(). That said, it seems they have that mapping figured out fairly well... To the point that converting an predicate column from datetime to date doesn't prevent SARGability.

    Out of curiosity, do you see the same behavior if you swap out GETDATE() for SYSDATETIME?

    Also, if you want to get really crafty any really shave bytes, just copy the datetime paradigm... but use "months from zero" instead of days and save it as a smallint.  That'll take you all the way out to 4630-08... or... if you don't need to go negative... use a binary(2) like an unsigned small int and go all the way to 7361-04

Viewing 12 posts - 1 through 11 (of 11 total)

You must be logged in to reply to this topic. Login to reply