Using SSIS to perform data validation in an ETL app

  • I load data into a data warehouse (SQL Server 2000, but migrating to 2005 soon) from several data sources, mostly flat files, on a daily basis. Right now everything is still DTS, but I am in the process of converting to SSIS. I have now been tasked with providing error reports on the incoming data. The validation rules are legion, but at least I can break them down by the data source I am loading from (this is for NYS data warehouse data compliance). I'm going to start off with just a handful of rules, but I need to design a scalable system whereby I can add rules quickly and simply.

    I have some ideas on how to structure the database to keep track of the errors for each record and facilitate reporting, but I am struggling with how to implement the rules. Should I put them in a stored procedure and pass it each record as it is read by the data pump task for processing? I like that idea as opposed to loading the file then running the validations as a batch. Is using the data pump task going to be too slow and cumbersome? I load about 100k records every day. This number would be greatly reduced if I could find an easy way to load just delta changes by comparing to the previous file. What do you think about using a checksum for this? I'd have to include every field.
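
    For what it's worth, the checksum idea could be sketched in T-SQL roughly like the block below, comparing today's staged file against yesterday's. The table names (Staging_Today, Staging_Yesterday), the Student_ID business key, and the field list are all made up for illustration:

      -- Hypothetical delta detection: keep only rows that are new or changed.
      SELECT  t.*
      FROM    Staging_Today AS t
              LEFT JOIN Staging_Yesterday AS y
                  ON y.Student_ID = t.Student_ID              -- assumed business key
      WHERE   y.Student_ID IS NULL                            -- brand-new row
         OR   BINARY_CHECKSUM(t.Field1, t.Field2, t.Field3)   -- list every field, as noted above
           <> BINARY_CHECKSUM(y.Field1, y.Field2, y.Field3);

    As later posts in this thread note, checksums can collide, so a small fraction of changes could slip through.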

    Thanks in advance for any replies, or for even having read this far in the post!

    Jean

  • I'm surprised that no replies have been generated by this yet. I guess, if nothing else, my reply will bump your thread so that it is active again.

    Anyhow, I would recommend not using a stored procedure in your data flow to do this. The only way you can execute an SP in a data flow is by using the OLE DB Command task, and that task is executed row-by-row. So in your example, the SP would need to be called 100K times to get your file processed.

    I also don't think you need to push the data into a table before doing your validations. One of the advantages of using SSIS is that it allows you to process large amounts of data in memory and do your validations there before you write the data to the DB.

    Can you give me an example of what kind of validations you are trying to do and how many of your 100K rows you expect to pass/fail the validations?

    John Rowan

  • Thanks so much for responding, John. Here are the answers to your questions:

    Can you give me an example of what kind of validations you are trying to do?

    1) Most of the validations are fairly simple and scalar, like code look-ups; some are a bit more complex, like making sure that if a certain field is populated, then a dependent field must also be populated (e.g., if Exit_Code is populated, Exit_Date must also be populated). There are some date range checks too, e.g. Entry_Date must precede Exit_Date. Some of the rules are very arcane (a specialty of NYS SED), but I don't plan to implement those for a while, though I'd like my system to be able to handle them. Some of those involve checking data against previously loaded data.
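
    Purely for illustration, checks like those could be written as set-based queries; Staging, Record_ID, Exit_Codes, and the error codes are all hypothetical names:

      -- Dependent-field rule: Exit_Code populated but Exit_Date missing
      SELECT Record_ID, 'E001' AS Error_Code
      FROM   Staging
      WHERE  Exit_Code IS NOT NULL AND Exit_Date IS NULL;

      -- Date-range rule: Entry_Date must precede Exit_Date
      SELECT Record_ID, 'E002' AS Error_Code
      FROM   Staging
      WHERE  Entry_Date >= Exit_Date;

      -- Code look-up rule: Exit_Code must exist in the reference table
      SELECT s.Record_ID, 'E003' AS Error_Code
      FROM   Staging AS s
             LEFT JOIN Exit_Codes AS c ON c.Exit_Code = s.Exit_Code
      WHERE  s.Exit_Code IS NOT NULL AND c.Exit_Code IS NULL;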

    how many of your 100K rows you expect to pass/fail the validations?

    2) I expect an error rate of < 5%, except for one certain time of year when it will be considerably greater.

    Jean

  • From what I'm hearing, doing all of your validations within the SSIS pipeline is what I would do. I am a bit confused, though, as to your concept of 'Rules'. Are you saying that you want to make your validations data-driven, based on rows in a Rules table of sorts? Can you elaborate on this?

    John Rowan

  • No, by "rules" I mean the criteria used to determine if data is valid or not, so I guess they are code-driven? I worked with a Rules Validation Engine a few years back and wish I had access to it now. In a nutshell, this is the project: 1. Extract data from various data sources via flat file. Depending on the type of file ( I have about 200 files right now representing 5 types) This part is almost complete.

    2. Validate rows - check each field for compliance with the "rules"; good rows get loaded to one table in the DB, bad rows get routed to an error table along with some extra data like an error code and timestamp (a rough sketch of such a table follows below). Each row in the source may result in several rows in the error table, since > 1 field may be in error. There is very little actual transformation that needs to be done to the source data.

    3. Error rows become the source for web-based error reports (SSRS of course!) that end users can refer to in order to correct the errors in the original source system. The process is repeated daily, so corrections in the source bubble up to the error reports in at most 24 hours.
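
    For reference, the error table described in step 2 might be as simple as the sketch below; the names and data types are placeholders only:

      CREATE TABLE Load_Errors (
          Error_ID    int IDENTITY(1,1) PRIMARY KEY,
          Source_File varchar(100) NOT NULL,   -- which of the ~200 files the row came from
          Record_ID   varchar(50)  NOT NULL,   -- key of the offending source row
          Field_Name  varchar(50)  NOT NULL,   -- one error row per field in error
          Error_Code  varchar(10)  NOT NULL,   -- which rule was violated
          Load_Date   datetime     NOT NULL DEFAULT (GETDATE())   -- timestamp for the daily cycle
      );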

    tx!

  • OK, that's what I was hoping! You've got 2 options here and I'll go through them both separately.

    Option #1: SSIS import to staging table w/ SP-driven validations

    -- Use data flow task to load file into a staging table.

    -- Create an SP (or group of SPs) to house your validation logic. If you feel the need to break the validations into multiple SPs, I would create a 'master' SP that calls each of the validation SPs. The validation SP(s) will be responsible for identifying conditions that create Error Rows, creating error rows, and marking the staged row as a non-candidate to move to the final production table. (A rough sketch follows after the PROS/CONS below.)

    -- From the Control Flow, use an Execute SQL Task to run the validation (or master) SP.

    -- Use data flow task to move candidate rows into production table.

    PROS

    -- Validation logic contained within the SP allows for easier maintenance of validation code. Changes to logic happen within the SP, and the SSIS package does not need to be edited.

    CONS

    -- Database overhead - Using the staging table approach pushes the work and therefore overhead onto your database instance. Your import and validations are now competing for CPU, memory, and disk with your production database users. There is also additional log file activity associated with inserting data into your staging table and then updating those rows.
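
    To make Option #1 more concrete, here is a minimal sketch of the SP-driven approach, reusing the hypothetical Load_Errors table sketched earlier in the thread; the Staging table, its Is_Candidate flag, and all procedure names are likewise invented:

      -- Each validation SP records why a row failed and flags it as a non-candidate.
      CREATE PROCEDURE dbo.Validate_ExitFields
      AS
      BEGIN
          -- Rule: if Exit_Code is populated, Exit_Date must also be populated
          INSERT INTO Load_Errors (Source_File, Record_ID, Field_Name, Error_Code)
          SELECT Source_File, Record_ID, 'Exit_Date', 'E001'
          FROM   Staging
          WHERE  Exit_Code IS NOT NULL AND Exit_Date IS NULL;

          UPDATE Staging
          SET    Is_Candidate = 0
          WHERE  Exit_Code IS NOT NULL AND Exit_Date IS NULL;
      END
      GO

      -- The 'master' SP just calls each validation SP in turn; the Execute SQL Task runs this one.
      CREATE PROCEDURE dbo.Validate_All
      AS
      BEGIN
          EXEC dbo.Validate_ExitFields;
          -- EXEC dbo.Validate_DateRanges;    -- additional rules slot in here
          -- EXEC dbo.Validate_CodeLookups;
      END
      GO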

    Option #2: Validations in SSIS dataflow

    -- Use data flow task to load file into production table.

    -- Add validations as data flow tasks between the Source (flat file) and destination (production table) connections. Since 1 row in the file can create multiple error conditions, you'll have to allow the 'failed' rows to continue through the data flow and just 'mark' them as non-candidates (just as you would w/ the staging table approach). At some point just before your data flow gets to the destination connection for the production table, use a conditional split to only allow candidate rows into your production table. The split-off error rows would go to a second destination connection that is tied to your error table.

    PROS

    -- Less Database Contention - Validations are executed within the SSIS memory buffers, so there's no contention w/ the database instance. Rows are only inserted into the DB after we know which table they go into. Fewer writes to the DB = less log activity.

    CONS

    -- Validation logic lives in the SSIS package - this means changes or additions to validation logic need to be made in the SSIS package and then deployed.

    Which is better? Just like anything else, you'd probably get 10 answers from 10 different professionals that you ask. Option #1 allows for easier control over validation and option #2 is more of a traditional ETL process. You'll have to decide which is best for you. I personally would go w/ #2 unless I found that validation logic/rules would be changing frequently.

    John Rowan

  • John,

    First of all, I really do appreciate you taking the time to help me, a perfect stranger (well, maybe I'm not that perfect 🙂).

    Anyhow - you have successfully answered my question - I know now that it is possible to do what I need to do in SSIS, and how to get started. I tend to side with option #2 as well, but my motives are suspect. #2 is an opportunity to try something new; #1 - been there done that, as far as using those tools.

    BTW, I had mentioned trying to filter the source flat files and send through only delta changes. I found an SSIS add-on tool called "CheckSum Transformation" that looks like it will enable me to do this in SSIS. Any experience with that? I'm sure I'll have more questions as I get further along - hope to bump into you here again!

    Jean

  • I have no experience w/ the Checksum Transformation, but you could accomplish the same thing with the SCD transformation or by using lookups. That is assuming that you could identify a business key from the data in the flat files.

    Performance-wise, I'm not sure how that would scale, but it would accomplish the same function in that you'd only apply changes to the data.
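
    Assuming a business key can be identified, the lookup idea boils down to joining the staged rows to the existing table and keeping only new or changed ones; everything below (dbo.Students, Student_ID, the compared columns) is illustrative:

      -- Lookup-style delta detection (NULL-safe comparisons omitted for brevity).
      SELECT  s.*
      FROM    Staging AS s
              LEFT JOIN dbo.Students AS d
                  ON d.Student_ID = s.Student_ID   -- assumed business key
      WHERE   d.Student_ID IS NULL                 -- new row
         OR   d.Exit_Code  <> s.Exit_Code          -- changed row: repeat per tracked column
         OR   d.Exit_Date  <> s.Exit_Date;

    The SCD transformation wraps this same lookup-and-compare pattern up for you inside the data flow.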

    John Rowan

  • Hello John Rowan,

    Your validation rules design is really great. I am planning to design the validation rules using a rule-engine concept.

    By the way, deviating a bit from the topic: in your last post you mentioned that SCD can be handled by using the SCD transformation or the Lookup transformation. Can you elaborate on this, please?

    I am confused about using the Lookup and other transformations in SSIS when the table consists of both Type 1 and Type 2 SCD columns.

    How to design this in SSIS is really confusing for me. Kindly advise.

    Thanks

    Chandu

  • The Slowly Changing Dimension data flow transformation is a wizard that walks you through defining your SCD. In it, you tell it what your business key is and what your attribute types are (historical, changing, static) and it builds out the SSIS data flow tasks to implement your SCD data flow. I've used it many times and it works pretty slick and performs well too. Once it builds out your data flow objects, you can walk through the logic and see exactly what it is doing to get the job done. Keep in mind that if you need to add columns to the dataflow such as LoadDate or any other default columns that you might use, you'll need to add a Derived Column transformation manually.
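
    Purely as a conceptual illustration of what the generated data flow ends up doing for Chandu's Type 1 vs. Type 2 question (this is not the wizard's actual output), the two attribute types boil down to something like the T-SQL below; the dimension table, its columns, and the Staging source are invented:

      -- Type 1 (changing attribute): overwrite in place, keep no history
      UPDATE d
      SET    d.Phone_Number = s.Phone_Number
      FROM   dbo.Dim_Student AS d
             JOIN Staging AS s ON s.Student_ID = d.Student_ID
      WHERE  d.Phone_Number <> s.Phone_Number;

      -- Type 2 (historical attribute): expire the current row...
      UPDATE d
      SET    d.Is_Current = 0, d.Row_End_Date = GETDATE()
      FROM   dbo.Dim_Student AS d
             JOIN Staging AS s ON s.Student_ID = d.Student_ID
      WHERE  d.Is_Current = 1 AND d.School_District <> s.School_District;

      -- ...then insert a fresh current row for changed and brand-new members
      INSERT INTO dbo.Dim_Student (Student_ID, School_District, Is_Current, Row_Start_Date)
      SELECT s.Student_ID, s.School_District, 1, GETDATE()
      FROM   Staging AS s
      WHERE  NOT EXISTS (SELECT 1 FROM dbo.Dim_Student AS d
                         WHERE d.Student_ID = s.Student_ID AND d.Is_Current = 1);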

    John Rowan

  • I didn't know about the SCD Transformation. I've got to get out more. I was, however, able to get the CheckSum Transformation to accomplish the same thing (check out the link below for some easy-to-follow instructions), but apparently there are issues with the CheckSum hashing algorithms (warning - there are unresolved bugs in the default, CRC32). Does anyone have an opinion as to which method to use for detecting and processing only new and changed data in an ETL process?

    Perhaps I should start another thread?

    http://www.ssistalk.com/2007/03/09/ssis-using-a-checksum-to-determine-if-a-row-has-changed/

  • I don't know if they are really bugs. The CHECKSUM function does not guarantee uniqueness, so two different rows can produce the same value and a changed row may not be identified as a valid candidate. Personally, I'd use the SCD transformation.
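
    If a hash-based approach still appeals once the migration to SQL Server 2005 is done, HASHBYTES gives a much stronger hash than CHECKSUM or CRC32. A rough sketch, with an illustrative table and column list:

      -- SQL Server 2005+: compute a stronger row hash than CHECKSUM.
      -- HASHBYTES takes a single string input, so the columns must be
      -- concatenated (and NULLs handled) by hand.
      SELECT Record_ID,
             HASHBYTES('SHA1',
                       ISNULL(Exit_Code, '') + '|' +
                       ISNULL(CONVERT(varchar(8), Entry_Date, 112), '') + '|' +
                       ISNULL(CONVERT(varchar(8), Exit_Date, 112), '')) AS Row_Hash
      FROM   Staging;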

    John Rowan

  • Hi,

    If you are looking to handle more complex SCDs, you can use the Kimball Method SSIS Slowly Changing Dimension component

    (http://www.codeplex.com/kimballscd).

  • Yeah, CRC32 was identifying every record following a changed record as having a new checksum value, which caused lots of duplicate records to be inserted. Switching to the "Original" algorithm fixed this, but I've heard, as you concur, that it does not guarantee uniqueness, although it is very rare for it to miss a change.

  • John,

    I tried the SCD transformation and found that it is very similar to CheckSum, except for the wizard and the fact that no extra hash column is added to the DB rows to check incoming source data against, so it must all be done dynamically. This might make SCD less efficient. The wizard was great, but for some reason it defaults to mapping the fields in the flat file to columns in the DB in alphabetical order, so I ended up having to configure it manually, pretty much identically to the way I configured CheckSum. I am going to stick with the SCD transformation on the assumption that it is more reliable at detecting changes. Thanks for letting me know about SCD.

    I am now going to add a few derived columns - I thought "Create_Date" and "Change_Date" might be very useful for tracking updates.
