© 2009 - John Burnette
Often, DBAs are kept busy with requests to assist users and developers get information from Point A to Point B in a hurry. Even with the vast array of Microsoft and third-party development tools available for SQL server, it sometimes is necessary to think outside of the box to get some data imported into a relational database, even if the process is less than efficient (and sometimes downright ugly)! Here is one function that I think many database administrators should have in their arsenal of adhoc scripts when called upon to assist with ETL activities and researching data transformation issues.
Those who have worked with DTS, SSIS, and other technologies often come across situations where columns between source flat (text) files and a destination SQL table fail to line up correctly -- whether due to "dirty" data, having delimiters within the data fields, mismatched datatypes/datalengths or unanticipated changes in either the file layout or the DDL (schema) of the destination SQL table. To find the offending piece of the puzzle, it can help to use the SQL datatype VARCHAR(MAX) , released starting with SQL 2005, to Bulk Import or BCP directly to a single column for every row in the file. Although this type of action might initially seem odd to some hardcore DBAs, it can be very helpful in an ETL situation to locate and correct data truncation issues which provide an ETL process to complete successfully.
When a source file is padded (i.e. importing a fixed-width file of "n" characters per line), it is easy to parse through the data using a typical SUBSTRING function to validate expected locations of the data fields. However, when using a delimited file layout as a source, the SUBSTRING function method can be replaced using this simple T-SQL function, ufn_parsefind.
This function takes a complete string (or in this example, a complete row of the source file's data) along with a delimiter string (space, comma, semicolon, pipe, or any weird combination of characters that you think is unique to the source file) along with the "N"th occurrence that you would like to return. So if a line of data in a flat file is loaded into SQL physical table as a complete single VARCHAR(MAX) column of text,
The format of COLUMN1 = RowID|SourceDataFile|Product|Price|Qty|RecvDate|Status
Notice that there is an improper character in the 5th field on Rows 0185 and 0187. Using the ufn_parsefind function, it is possible to write queries that perform a validation against every row in the SQL table as such:
SELECT * FROM [SampleTable] WHERE isnumeric( dbo.ufn_parsefind( COLUMN1,'|',5 ) ) = 0
which would return only the rows having invalid numerical data in the 5th field. Passing a parameter for occurrences that are invalid (i.e. the 9th column in the above example) will simply return a NULL value which can be handled by the calling code statement.
Even with a text editor capable of handling very large files, having the speed and coding flexibility of Transact-SQL to zero in on potential errors in source files having millions of records can flag many problems related to incorrect or invalid data being transformed to the destination location. In the example above, a DBA would notice a potential data quality issue in the GAPProductSet2 data file in the upstream processes which may need to be researched and corrected.
DISCLAIMER: As a production DBA with years of experience in very large databases (VLDBs), I know that some of my colleagues will immediately bring forth feedback that one should never place functions in the WHERE clause as I have done above...in a production environment, I definitely agree that it might cause serious performance issues. However, in an ETL and data warehouse environment (i.e. staging), the ability to quickly find and repair such data quality issues is a much higher priority and provides very little risk to other applications, even with larger data sets over 10M rows on an enterprise class staging server.
Finally, I must admit that I was hesitant to rewrite and submit for publication a script with functionality that many older ETL gurus might already have in their toolboxes - however, many of the scripts I have viewed to date involve heavy logic with substring functions that require an advanced degree to alter...I hope that the readers will enjoy the simple and concise code this solution provides and the ease of modification to other situations without having a PHD in computer science!