Flexible file loading

  • A big part of the data warehouse I'm currently working on is loading files from various vendors. One of them changes its files for various reasons about once a month, releases new files, or sends new datasets that we request. They send us data definitions for these files, but they aren't always accurate. We currently have 55 packages in the job, and I just added a 56th. We want to get away from this approach of having a package for each file. The packages are parameterized and standardized as much as possible, but they're still a pain to maintain.

    I'm looking into building a script task that will read file definition data from tables: filename, column names, sizes, etc. The task would then use this data to compare the files against their definitions (but only in test/QA, since once a change is implemented the files are pretty stable), then load them into the warehouse.

    So basically all the work would be done in this one script task: read the file, validate it when told to (probably controlled by an environment variable), then load it into the database. And the master package, instead of running 50+ child packages, would just run this script task.
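
    Roughly what I'm picturing, sketched in Python just to show the flow (the real task would be in .NET, and every table and column name below is a placeholder):

        import csv
        import pyodbc

        def load_file(conn, file_path, file_name, validate=False):
            cur = conn.cursor()
            # Pull the expected layout for this file from the definition table
            cur.execute(
                "SELECT column_name FROM file_definitions "
                "WHERE file_name = ? ORDER BY ordinal_position",
                file_name,
            )
            expected = [row.column_name for row in cur.fetchall()]

            with open(file_path, newline="") as f:
                reader = csv.reader(f)
                header = next(reader)
                # Only compare against the definition when told to (test/QA)
                if validate and header != expected:
                    raise ValueError(f"{file_name}: header does not match definition")
                # Batch the inserts rather than sending one row at a time
                cur.fast_executemany = True
                placeholders = ", ".join(["?"] * len(expected))
                cur.executemany(
                    f"INSERT INTO staging.{file_name} VALUES ({placeholders})",
                    list(reader),
                )
            conn.commit()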

    Any thoughts on performance issues or other approaches?

    Thanks!

  • Your script task will of course work, but I'm not sure you're really making maintenance any easier... you're just moving it to a different place. Arguably, the code in that script task would be even more difficult to maintain and troubleshoot. Once a script task is used, you're also going to process row by row instead of in batches of data... performance will suffer.

    As much of a pain as it is, having a separate package for each file is the best way to maintain atomic control. I'd suggest looking at Biml as a way to simplify the generation and maintenance of these packages. I'm currently working on a personal project in which I use Biml to read file formats and configuration from a source-to-target mapping document (in Excel) and then generate the staging packages. Although not perfect (yet), it's already saved me an incredible amount of time when I need to add or change packages.

    My two cents, hope it helps.

  • That is very helpful. Thanks!

  • I would consider using Python for your scripted task and look into the Python module called Pandas. You can read pretty much anything into Pandas with a couple of lines of code without having to define a schema. That means that if the files change, whether it's adding or removing fields or changing names or data types, Pandas will still dynamically read the data as long as the file extension isn't constantly changing.

    The idea of Pandas is basically to create a data frame in memory. You can do pretty much anything in Pandas that you would normally do in SQL, so it can act as your dynamic staging table before you even load the data into your database. For example, instead of aggregating data in your database, you can aggregate it in Pandas first and then load the result into staging tables.
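
    Here's a minimal sketch of that aggregate-then-load idea (the connection string, file, and column names are all made up):

        import pandas as pd
        from sqlalchemy import create_engine

        engine = create_engine("mssql+pyodbc://my_dsn")  # placeholder connection

        # Pandas infers the layout from the file itself; no schema needed
        df = pd.read_csv("vendor_file.csv")

        # Aggregate in the data frame before the database ever sees the rows
        summary = df.groupby("region", as_index=False)["amount"].sum()

        # Land the result in a staging table
        summary.to_sql("stg_region_totals", engine, if_exists="replace", index=False)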

    For you, the idea would be to let Pandas dynamically read these files. You can extract the header from Pandas, load it into your database, and then compare it to a table you maintain of approved headers. That way, if the headers change, you can react, for example by holding the data in staging until you can update the header definitions. The same applies to data types and so forth. Pandas dynamically loads the data for you into a staging environment that may be a catch-all in terms of schema, but it's your definition tables that validate and conform the data based on what is and isn't approved.
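
    As a rough sketch of that flow (the approved_headers table and every name here are hypothetical):

        import pandas as pd
        from sqlalchemy import create_engine

        engine = create_engine("mssql+pyodbc://my_dsn")  # placeholder connection

        # Read with no assumptions about the layout; everything lands as text
        df = pd.read_csv("vendor_file.csv", dtype=str)

        # Compare the incoming header to the approved definitions
        approved = pd.read_sql(
            "SELECT column_name FROM approved_headers "
            "WHERE file_name = 'vendor_file' ORDER BY ordinal_position",
            engine,
        )
        if list(df.columns) != list(approved["column_name"]):
            # Hold the data in staging until the definitions are updated
            df.to_sql("stg_quarantine_vendor_file", engine, if_exists="replace", index=False)
        else:
            df.to_sql("stg_vendor_file", engine, if_exists="append", index=False)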

    You can do something similar with straight SQL too: rip the metadata out of the files themselves, compare it to the definitions you have stored in the database, and act on the result. PowerShell, or the dreaded cmd executions, can do that for you. File names, headers, data types, and so forth are what you're after, and comparing them is what drives the decisions in your data flow. I'd just suggest Python with Pandas because it makes dynamic loading easy without complex scripting or dynamic SQL.

  • Thanks! I'll look into it.
