Using a function to parse/return the Nth field of a delimited column

Anandanvijay-804030, 2009-11-06 (first published: 2009-10-15)

DBAs are often kept busy with requests to help users and developers get information from Point A to Point B in a hurry. Even with the vast array of Microsoft and third-party development tools available for SQL Server, it is sometimes necessary to think outside the box to get data imported into a relational database, even if the process is less than efficient (and sometimes downright ugly)! Here is one function that I think many database administrators should have in their arsenal of ad hoc scripts when called upon to assist with ETL activities and to research data transformation issues.

Those who have worked with DTS, SSIS, and other technologies often come across situations where the columns of a source flat (text) file and a destination SQL table fail to line up correctly, whether due to "dirty" data, delimiters within the data fields, mismatched data types or lengths, or unanticipated changes in either the file layout or the DDL (schema) of the destination table. To find the offending piece of the puzzle, it can help to use the VARCHAR(MAX) data type, introduced in SQL Server 2005, and to Bulk Import or BCP every row of the file directly into a single column. Although this might initially seem odd to some hardcore DBAs, it can be very helpful in an ETL situation for locating and correcting the data truncation issues that prevent an ETL process from completing successfully.
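As a rough sketch of that single-column load, something like the following could be used. The table name dbo.SampleTable and the column COLUMN1 match the validation query used later in this article; the file path is a placeholder, and the row terminator is an assumption that should be adjusted to the file's actual line endings.

```sql
-- Staging table with one wide column: each source line lands here unparsed
CREATE TABLE dbo.SampleTable
(
    COLUMN1 varchar(max) NULL
)
GO

-- Load every line of the file into that single column.
-- The path is hypothetical; the row terminator is an assumption
-- (for example, '0x0a' may be needed for files with LF-only line endings).
BULK INSERT dbo.SampleTable
FROM 'C:\staging\products.txt'
WITH (ROWTERMINATOR = '\n')
```

Because the table has only one column, each line up to the row terminator should be loaded as a single value, delimiters included.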

When a source file is padded (i.e. a fixed-width file of "n" characters per line), it is easy to parse the data with a typical SUBSTRING function to validate the expected locations of the data fields. However, when the source is a delimited file, the SUBSTRING method can be replaced by this simple T-SQL function, ufn_parsefind.
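For comparison, the fixed-width case might look like the sketch below. The layout is purely hypothetical (item number in characters 1 to 4, quantity in characters 41 to 45); real positions would come from the file specification.

```sql
-- Hypothetical fixed-width layout: the positions below are assumptions
SELECT SUBSTRING(COLUMN1, 1, 4)  AS ItemNumber,   -- characters 1-4
       SUBSTRING(COLUMN1, 41, 5) AS Quantity      -- characters 41-45
FROM   dbo.SampleTable
WHERE  ISNUMERIC(SUBSTRING(COLUMN1, 41, 5)) = 0   -- quantity is not numeric
```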

This function takes three arguments: a complete string (in this example, a complete row of the source file's data), a delimiter string (space, comma, semicolon, pipe, or any weird combination of characters that you think is unique to the source file), and the "N"th occurrence that you would like to return. Suppose each line of data in a flat file is loaded into a physical SQL table as a single VARCHAR(MAX) column of text:

0183|ColumbiaDataSet|Winter Jacket|54.99|400|2009-08-10|InStock
0184|ColumbiaDataSet|Winter Jacket|87.50|300|2009-08-14|InStock
0185|GapProductSet2|Winter Jacket|45.99|2x0|2009-08-09|InStock
0186|GapProductSet1|Winter Jacket|52.99|800|2009-08-11|InStock
0187|GapProductSet2|Winter Jacket|52.99|a00|2009-08-11|InStock

Notice that there is an improper character in the 5th field of rows 0185 and 0187. Using the ufn_parsefind function, it is possible to write queries that validate every row in the SQL table, such as:

SELECT * FROM [SampleTable] WHERE isnumeric( dbo.ufn_parsefind( COLUMN1,'|',5 ) ) = 0

which would return only the rows having invalid numerical data in the 5th field. Passing an occurrence that does not exist (e.g. the 9th column in the above example) simply returns a NULL value, which the calling statement can handle. Blank fields are also returned as NULL.
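The validation can be tried against the sample rows with a sketch like the one below. The table variable @Sample and the column aliases are illustrative names, not part of the original script, and the function is assumed to have been created already. Because ISNUMERIC accepts some strings that are not plain integers (currency symbols and scientific notation, for example), the sketch uses a stricter digits-only pattern as a variation on the query above.

```sql
-- Table variable standing in for the staging table (hypothetical name)
DECLARE @Sample TABLE (COLUMN1 varchar(max))

INSERT INTO @Sample (COLUMN1) VALUES ('0183|ColumbiaDataSet|Winter Jacket|54.99|400|2009-08-10|InStock')
INSERT INTO @Sample (COLUMN1) VALUES ('0184|ColumbiaDataSet|Winter Jacket|87.50|300|2009-08-14|InStock')
INSERT INTO @Sample (COLUMN1) VALUES ('0185|GapProductSet2|Winter Jacket|45.99|2x0|2009-08-09|InStock')
INSERT INTO @Sample (COLUMN1) VALUES ('0186|GapProductSet1|Winter Jacket|52.99|800|2009-08-11|InStock')
INSERT INTO @Sample (COLUMN1) VALUES ('0187|GapProductSet2|Winter Jacket|52.99|a00|2009-08-11|InStock')

-- Flag rows whose 5th field contains any character other than a digit
SELECT dbo.ufn_parsefind(COLUMN1, '|', 1) AS ItemNumber,
       dbo.ufn_parsefind(COLUMN1, '|', 2) AS DataSet,
       dbo.ufn_parsefind(COLUMN1, '|', 5) AS Quantity
FROM   @Sample
WHERE  dbo.ufn_parsefind(COLUMN1, '|', 5) LIKE '%[^0-9]%'
```

With these five rows, only items 0185 and 0187 match the pattern. A row whose 5th field is NULL would not match the LIKE test, so it would need a separate IS NULL check.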

Even when a text editor capable of handling very large files is available, the speed and coding flexibility of Transact-SQL make it easier to zero in on potential errors in source files with millions of records, flagging many problems related to incorrect or invalid data before it is transformed and loaded into the destination. In the example above, a DBA would notice a potential data quality issue in the GapProductSet2 data file, which may need to be researched and corrected in the upstream processes.
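One way to surface that kind of pattern is to count the invalid rows per data set. The sketch below assumes, as in the sample rows, that the 2nd field names the data set and the 5th field holds the quantity; the column aliases are illustrative.

```sql
-- Count rows with a non-numeric 5th field, grouped by the data set in the 2nd field
SELECT   dbo.ufn_parsefind(COLUMN1, '|', 2) AS DataSet,
         COUNT(*)                           AS InvalidRows
FROM     dbo.SampleTable
WHERE    ISNUMERIC(dbo.ufn_parsefind(COLUMN1, '|', 5)) = 0
GROUP BY dbo.ufn_parsefind(COLUMN1, '|', 2)
ORDER BY InvalidRows DESC
```

Against the five sample rows, both invalid rows belong to GapProductSet2, which points at that upstream file.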

DISCLAIMER: As a production DBA with years of experience in very large databases (VLDBs), I know that some of my colleagues will immediately point out that one should never place functions in the WHERE clause as I have done above. In a production environment, I definitely agree that it might cause serious performance issues. However, in an ETL and data warehouse environment (i.e. staging), the ability to quickly find and repair such data quality issues is a much higher priority and poses very little risk to other applications, even with larger data sets of over 10M rows on an enterprise-class staging server.

Finally, I must admit that I was hesitant to rewrite and submit for publication a script with functionality that many older ETL gurus might already have in their toolboxes. However, many of the scripts I have viewed to date involve heavy logic with substring functions that require an advanced degree to alter. I hope that readers will enjoy the simple and concise code this solution provides, and the ease of modifying it for other situations without having a PhD in computer science!

/* Copyright © 2009 - John Burnette -- All Rights Reserved */
CREATE FUNCTION dbo.ufn_parsefind
(
    @EntString  varchar(max),
    @Delimiter  varchar(10),
    @Occurrence bigint
)
RETURNS varchar(max)
AS
BEGIN
    DECLARE @CurString varchar(max)
    DECLARE @Pos       bigint
    DECLARE @Loop      bigint
    DECLARE @DelimLen  bigint

    -- DATALENGTH COUNTS TRAILING SPACES (LEN DOES NOT), SO A SPACE DELIMITER WORKS
    SET @DelimLen = datalength(@Delimiter)

    -- REQUIRE DELIMITER AT END OF STRING
    IF right(@EntString, @DelimLen) <> @Delimiter
        SET @EntString = @EntString + @Delimiter

    -- CHARINDEX MATCHES THE DELIMITER LITERALLY, SO CHARACTERS SUCH AS % OR _ ARE SAFE
    SET @Loop = 1
    SET @Pos  = charindex(@Delimiter, @EntString)

    -- LOOP THROUGH WHILE DELIMITERS ARE FOUND
    WHILE @Loop <= @Occurrence AND @Pos <> 0
    BEGIN
        -- TAKE THE FIELD IN FRONT OF THE DELIMITER
        SET @CurString = left(@EntString, @Pos - 1)
        -- DROP THAT FIELD AND ITS DELIMITER FROM THE FRONT OF THE STRING
        SET @EntString = substring(@EntString, @Pos + @DelimLen, datalength(@EntString))
        SET @Loop = @Loop + 1
        SET @Pos  = charindex(@Delimiter, @EntString)
    END

    -- RETURN NULL WHEN THE STRING HAS FEWER FIELDS THAN @Occurrence
    IF @Loop <= @Occurrence
        SET @CurString = NULL

    -- DEFAULT A NULL FOR BLANK VALUES
    IF isnull(@CurString, '') = ''
        SET @CurString = NULL

    -- RETURN VALUE
    RETURN @CurString
END
GO

