How to Validate Data Coming into SSIS from Excel Spreadsheet?

  • I'm a newbie here with SSIS. I've worked extensively with multiple versions of SQL Server for about 20 years, but am working on a project that's using SSIS extensively.

    The project involves polling a specific folder once per day to determine if clients have uploaded files into this folder. If so, I've got coding in place thus far in SSIS that loops through each of these files in the folder. Next on my list is the required coding to process each of these incoming Excel files and ultimately get the structure of the files adjusted properly to where the meet the required schema for import into an underlying SQL Server database.

    Before I get into any of the data insertions from the Excel spreadsheets to the underlying SQL Server database, I'm researching exactly how to apply various "data validation" criteria to each Excel file to verify that the client has indeed sent me a "valid" Excel file that can legitimately be worked with in making any necessary transformations to the file to get it in the proper structure for import into a SQL Server database. I'm having problems determining exactly how/where this data validation will need to take place. I'm not even sure if "data validation" is even the right thing to call what I'm trying to do.

    I'm trying to figure out how to look at a few important columns within each Excel file and determine if the data within those columns is of a valid/expected format to even warrant dealing with the file, or if the file needs to be "failed" so we can send by to the customer for adjustment and resubmission to us.

    For example, if SSIS finds that the "SSN" (social security #) field unexpectedly has "Hello" in that column for one of the records instead of the expected 9-digit format for an individual's social security #, that's a big enough deal to warrant failing the file to not be dealt with in the import process.

    I'd like to have multiple data validation events occur simultaneously in addition to the SSN field example shown above. I might want to validate that all SSN entries contain 9 digits, as well as making sure that all entries in the "State" column contain a 2-letter state code, and also might want to verify that the format of the entries in the "ZipCode" field follow the standard 5-digit zip code format. If I find only 4-digits in a row's "ZipCode" field or I find "Chevrolet" in the "ZipCode" field for a record, or I find an issue with the "SSN" or "State" fields as described above, SSIS needs to somehow locate those issues to where I know to stop the file import process and flag things to send the file back to the client to correct those issues.

    I've researched Derived Columns and Script Tasks and other items and just can't seem to land upon a clear example / understanding of how to implement the initial data checks / validation efforts such as what I've described above.

    If anyone can clearly explain how to accomplish what I've described above or can point me in the direction of any online articles / tutorials that directly address how to accomplish this sort of thing, I'd be greatly appreciative.

    Thanks in advance!

  • The problem with loading data from Excel is that the providers "guess" the data types according to the contents of the column.

    So if the first time a column contains only numbers, SSIS will take a numeric data type. If the second time the column suddenly contains text, the data type will change to string and SSIS will crash because the metadata has changed.

    So your first check should be to verify if the data types are the expected ones.

    You can simply do this by trying to load the Excel data to a staging table. If it succeeds, all is well. If it fails for whatever reason, you can discard the Excel file. There are constructs in SSIS to deal with these types of failures: you can connect the data flow to a File System Task using an OnFailure precedence constraint (the red arrow) and move the Excel file to an error directory for example.

    If the Excel file has loaded to the staging table, you can do validation of the data itself. You can use TSQL for the easiest tasks, but if it gets too complex (for example regular expressions), you might want to use a .NET script component in the SSIS data flow.

    Need an answer? No, you need a question
    My blog at https://sqlkover.com.
    MCSE Business Intelligence - Microsoft Data Platform MVP

  • Thanks for the info. Will post back here moving forward if I get stumped. Thanks again.

Viewing 3 posts - 1 through 2 (of 2 total)

You must be logged in to reply to this topic. Login to reply