Field Sizes in Staging database : all varchar(2000)?

  • NEWBIE to DW : the IT Manager wants to use varchar2(2000) for all fields in all tables.

    I thought we should be more realistic. Sure, use varchar2, but if it is a state, use 2 or even 4.

    Any norms in this area? Anything "bad" about using varchar2(2000) for everything?

    Manager's idea is to make sure we capture the data in staging exactly like it is in the flat file.

    Thanks!

    Joe

  • I think it depends on a number of factors, such as disk space and ETL processing time. You might also face problems converting those values back to their original types later. Converting all data to varchar(2000) is not a good idea.

  • varchar2 ? Is this an Oracle question? This is a Microsoft SQL Server Forum.

    In general it is proper practice to choose the appropriate data type from the outset. It is fundamental to a database's performance and longevity, and it feeds into maintenance costs as well.

    I am not sure about Oracle internals, but in SQL Server there is a penalty for using a wider column than necessary: memory is allocated for space the data will never occupy, because the declared width is far greater than the longest value expected.

    Again, in general, anyone proposing to use a wider datatype than is necessary "just in case we receive wider data later" should not be influencing data modeling decisions.

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

  • devereauxj (4/13/2012)


    NEWBIE to DW : the IT Manager wants to use varchar2(2000) for all fields in all tables.

    In general, DWH staging tables are modeled after the OLTP tables from which they source the data.

    Is it wrong to use varchar2(2000) for all non-numeric columns on staging tables? Well, it is not elegant, but it will work. You may want to let the IT Manager know - very politely - that the max size for varchar2() moved from 2,000 to 4,000 around Oracle 8i.

    _____________________________________
    Pablo (Paul) Berzukov

    Author of Understanding Database Administration available at Amazon and other bookstores.

    Disclaimer: Advice is provided to the best of my knowledge but no implicit or explicit warranties are provided. Since the advisor explicitly encourages testing any and all suggestions on a test non-production environment, the advisor should not be held liable or responsible for any actions taken based on the given advice.
  • Here's how I see it.

    If you use varchar(2k) in all fields, it's usually easier to import the data, especially if you have quality issues. Once it's in there, you have to rely on your conversion to get it set for the final tables, but you do have it in a database, and you can manipulate it there. I like this if I have flat files, web services, etc, where I might have some flaky connection or a loss of the file after some time.

    If I use proper fields, then I need a solid import process that can handle problem data and clean it before it's staged. The movement to the OLTP database (or warehouse) is then easier.

    Which is better? It depends on where in the process you want to spend your time.

    I like importing into generic tables, then moving and cleaning to a 2nd staging table with valid datatypes (if I have space) and then moving with some MERGE process to the final tables.
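
    A minimal sketch of that two-step idea, written in Python rather than SQL so it can run anywhere. The column names, the date format and the converter functions are all hypothetical; the point is only that all-text rows are loaded as-is, and the move to the typed table separates clean rows from rejects instead of failing the whole load.

```python
from datetime import date, datetime
from decimal import Decimal, InvalidOperation


# Hypothetical converters, one per column of the typed (second) staging table.
def to_state(value: str) -> str:
    value = value.strip().upper()
    if len(value) != 2 or not value.isalpha():
        raise ValueError(f"not a 2-letter state code: {value!r}")
    return value


def to_amount(value: str) -> Decimal:
    return Decimal(value.strip())


def to_date(value: str) -> date:
    # Assumes the flat file uses ISO dates (YYYY-MM-DD).
    return datetime.strptime(value.strip(), "%Y-%m-%d").date()


CONVERTERS = {"state": to_state, "amount": to_amount, "order_date": to_date}


def promote(raw_rows):
    """Split raw all-text rows into typed rows and (row, reason) rejects."""
    typed, rejects = [], []
    for row in raw_rows:
        try:
            typed.append({col: conv(row[col]) for col, conv in CONVERTERS.items()})
        except (ValueError, InvalidOperation) as exc:
            rejects.append((row, str(exc)))
    return typed, rejects


# Rows as they would sit in the generic all-varchar staging table.
raw = [
    {"state": "ny", "amount": "19.99", "order_date": "2012-04-13"},
    {"state": "New York", "amount": "5.00", "order_date": "2012-04-13"},
    {"state": "CA", "amount": "n/a", "order_date": "2012-07-19"},
]
typed, rejects = promote(raw)
```

    Here only the first row reaches the typed table; the other two are kept with a reason, which is the data-quality feedback the generic table makes possible.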

  • Short and simple answer:

    If your flat file is fixed width (delimited) use CHAR data type else use VARCHAR

    ~ Lokesh Vij



  • lokeshvij (7/19/2012)


    Short and simple answer:

    If your flat file is fixed width (delimited) use CHAR data type else use VARCHAR

    I am not sure I agree with that. What if your fixed-width data file contains lines more than 8000 bytes wide? Second, it could be considered a waste of space to have a staging table use CHAR columns when most of the data values have trailing blank spaces. I am all for using the right data type for the data when discussing destination tables but in staging tables all bets are off. I am not sure we need to match CHAR to fixed-width files. VARCHAR would be my default choice for a staging table.
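
    To illustrate the trailing-blanks point, here is a small Python sketch. The layout (column names, offsets, widths) is made up for the example; it slices one fixed-width record and compares the padded width, which a CHAR column would store in full, with the length of the trimmed values.

```python
# Hypothetical fixed-width layout: (column name, start offset, width).
LAYOUT = [("name", 0, 10), ("state", 10, 2), ("city", 12, 15)]


def parse_fixed_width(line: str) -> dict:
    """Slice one fixed-width record and drop the trailing pad spaces."""
    return {
        name: line[start:start + width].rstrip(" ")
        for name, start, width in LAYOUT
    }


# Build a sample record padded to the declared widths.
record = "Joe".ljust(10) + "NY" + "Albany".ljust(15)
row = parse_fixed_width(record)

# Characters stored if every column keeps its full declared width (CHAR-style).
padded = sum(width for _, _, width in LAYOUT)
# Characters of actual data once the padding is trimmed (VARCHAR-style,
# ignoring the small per-value length overhead).
trimmed = sum(len(value) for value in row.values())
```

    For this record the padded width is 27 characters against 11 characters of real data, which is the waste being described.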


  • devereauxj (4/13/2012)


    the IT Manager wants to use varchar2(2000) for all fields in all tables...

    ...Manager's idea is to make sure we capture the data in staging exactly like it is in the flat file.

    Ask the manager if he/she plans to use varchar2(2000) on the core FACT and DIM tables as well - if not, how is he/she planning to ensure that staging varchar2(2000) values fit into properly defined columns on FACT and DIM?

    Again... Staging column data types and lengths should be modeled after the source system and never ever be larger than the definitions in the core FACT/DIM tables.
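
    If staging columns do end up wider than the targets, something has to check the fit before the load. A rough Python sketch of such a check, where the target widths and the sample rows are invented for illustration:

```python
# Hypothetical declared widths of the target DIM columns.
TARGET_WIDTHS = {"state": 2, "city": 30}


def find_overflows(rows, widths):
    """Return (row index, column, length) for each value wider than its target column."""
    return [
        (i, col, len(row[col]))
        for i, row in enumerate(rows)
        for col, limit in widths.items()
        if len(row[col]) > limit
    ]


# Rows as staged in oversized text columns.
staged = [
    {"state": "NY", "city": "Albany"},
    {"state": "New York", "city": "Buffalo"},
]
overflows = find_overflows(staged, TARGET_WIDTHS)
```

    The second row is flagged because "New York" is 8 characters and the target column holds 2. With staging columns sized like the target, that row would have been caught at import time instead.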

  • I agree with PaulB.

    When a flat file is presented, the layout/structure is typically well known. That is part of the initial discovery.

    If there are freeform notes or something, I can totally see a reason to make that column in the staging table larger.

    In general, best practice is to start bringing the data in using a table definition as close to the final destination table as possible. There will be some tradeoffs depending upon how well you can perform an initial scrub/cleanse from the flat file. But never grab numeric or date data and place it in a varchar without good reason.

    I typically use SSIS to import flat files. I always run through the flat file and define it with the proper datatypes up front. This not only makes life easier in subsequent stages, but it lets me know up front if the data structure I was given for the flat file is correct, or even if there is errant data in any of the file's fields.
