RE: Data warehouse concept regarding large flat files with a large amount of dimensions

  • I'm by no means an expert, as I'm still fairly new to the DW world myself, but I did design a system similar to yours, just with fewer fields. I approached my design following Kimball's methodology and kept things very, very simple.

    I regularly import flat files that each hold millions of records, sometimes more. It's very easy for me to load a single file with over 10 million records into a single staging table using BULK INSERT. However, I'm not dealing with the number of fields you're facing either.

    One method you could try, and one I have put into practice, is transforming the data down at the hard disk level. This is a suggestion Kimball mentions a few times as a way to take the pressure off other systems, such as your RDBMS, rather than having it do all the work for you. Using SQL Server (SSIS) to parse the flat files into smaller chunks on disk could lessen the pain for you, much like it has for me with rather large flat files. Processing smaller chunks is going to be a lot easier than processing one big chunk.
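The chunking idea above isn't tied to SSIS; outside of it, the same pre-load split can be sketched in plain Python. This is a minimal illustration, not the poster's actual SSIS package: file names and the chunk size are made up, and the header row is repeated in each chunk so every piece can be bulk-loaded independently.

```python
import csv
import itertools
import os

def split_flat_file(src_path, out_dir, rows_per_chunk=1_000_000):
    """Split a delimited flat file into smaller chunk files on disk.

    The header row is copied into every chunk, so each chunk can be
    staged/bulk-loaded on its own. Returns the list of chunk paths.
    """
    chunk_paths = []
    with open(src_path, newline="") as src:
        reader = csv.reader(src)
        header = next(reader)  # keep the header for every chunk
        for i in itertools.count():
            # Pull at most rows_per_chunk rows off the iterator.
            rows = list(itertools.islice(reader, rows_per_chunk))
            if not rows:
                break  # source exhausted
            path = os.path.join(out_dir, f"chunk_{i:04d}.csv")
            with open(path, "w", newline="") as out:
                writer = csv.writer(out)
                writer.writerow(header)
                writer.writerows(rows)
            chunk_paths.append(path)
    return chunk_paths
```

Because the source is read as a stream, memory use is bounded by one chunk at a time, which is the whole point of doing this before the RDBMS ever sees the file.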

    After that, if the number of fields is still the bottleneck and not necessarily the number of rows, then you might want to consider splitting those fields across multiple staging tables and/or fact tables. It's really the only way to get around the issue, unless you do all of your auditing, transformation, and so forth at the disk level before SQL Server even loads the data into the DW.
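That vertical split can also be done at the disk level before loading. The sketch below is a hypothetical illustration of the idea, not a prescribed design: each output file keeps a key column plus one group of fields, so the narrower files map onto separate staging tables and can be rejoined on the key later. Column names and groupings here are invented.

```python
import csv

def vertical_split(src_path, column_groups, key_column, out_paths):
    """Split a wide delimited file into several narrower files.

    Each output file contains the key column plus one group of
    columns, matching one narrower staging table per file.
    """
    with open(src_path, newline="") as src:
        reader = csv.DictReader(src)
        files, writers = [], []
        for cols, path in zip(column_groups, out_paths):
            f = open(path, "w", newline="")
            files.append(f)
            w = csv.DictWriter(f, fieldnames=[key_column] + cols)
            w.writeheader()
            writers.append((cols, w))
        for row in reader:
            for cols, w in writers:
                # Carry the key into every split so rows rejoin later.
                w.writerow({key_column: row[key_column],
                            **{c: row[c] for c in cols}})
        for f in files:
            f.close()
```

The same pass over the source feeds all the output files, so a 100+-field file is read once and lands as several loadable pieces.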

    Another possibility, one we are considering for non-relational, chaotic data in the future, is a NoSQL solution. The Hadoop ecosystem in particular has a couple of engines that excel with wide data, which covers your scenario of 100+ fields. I cannot for the life of me remember the engine name, but I do know they exist and have been said to perform well regardless of how many fields your data has.

    But that's an entirely different ballgame and a much larger solution than you likely need. I only mention it as food for thought.