Simple Method for Importing Ragged files

There may come a time when you are not in control of the input file format and it is in a decidedly non-standard format. I recently had to process such a feed file which had a metadata row both at the start and the end of the file. The file could be represented something like this, with only the “x,y,z” records being of importance (along with the column headers if we can get this info).

Col1	Col2	Col3
xxx start
x	y	z
x	y	z
x	y	z
xxx end

The existence of a problematic first row is not in itself a huge problem, as we can choose to skip "n" rows from the start of a flat file when defining the connection manager; it's the last row that causes the problem.

Datawise, the issue is that the metadata rows ("xxx start" and "xxx end") contain only one column and no comma separators, while the general data row is defined as "x,y,z".

This is not an easy problem to describe but I tried various keyword combinations on Google to see how others were coping with this issue.

Some people advise using the script task: http://www.sql-server-performance.com/article_print.aspx?id=1056&type=art

Some use the Conditional Split: http://www.sqlis.com/54.aspx

I've also seen packages which call VBScript code to open the text file, read all the lines and then write them back into the text file minus the problematic rows.

All of these are interesting technical solutions; however, I decided against them for various reasons:

(1) I didn't want to write any code. I really wanted the package to be as simple as possible, so that it would be easily maintainable for other people who have to support it later on. Also, I have found that adding scripts slows down processes, so I generally try to avoid them.

(2) In my case there are dozens of columns (maybe 70 or so), so I wanted the designer to define the column names for me by looking at the text file. The problem with the conditional split is that the column names are all manually defined one-by-one and many substring functions need coding.

(3) Finally, I didn't want to use a script outside the package. I don't like deploying anything more than a dtsx file when moving between environments, as extra files would again make maintenance more difficult.

In short, my aim was to have the entire process encapsulated in a single package with no calls being made to outside processes, and have it as simple as possible to build and maintain.

So, how to do this? The way I do it now is quite simple (below):

[Chart of the package]

Initially I import the entire file into a single staging table. The Flat File source is defined as "Ragged right" and has one column which is 8000 chars long. The name of this column is meaningless as it is simply a placeholder. This is imported into a staging table which has one column - defined as varchar(8000). This way the entire file is always imported successfully. We now have the table populated with the correct rows and additionally the 2 problem rows.
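
The staging idea can be sketched outside SSIS as follows. This is only an illustration of the logic, not the package itself: it assumes Python with an in-memory SQLite database standing in for the SQL Server staging table, and the names `staging`, `line`, `SAMPLE_LINES` and `load_staging` are hypothetical. The comma delimiter is taken from the "x,y,z" description above.

```python
import sqlite3

# Hypothetical sample mirroring the file layout described above.
SAMPLE_LINES = [
    "Col1,Col2,Col3",
    "xxx start",
    "x,y,z",
    "x,y,z",
    "x,y,z",
    "xxx end",
]


def load_staging(conn, lines):
    """Load every line, unparsed, into a one-column staging table."""
    conn.execute("CREATE TABLE staging (line VARCHAR(8000))")
    conn.executemany(
        "INSERT INTO staging (line) VALUES (?)",
        [(line,) for line in lines],
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
load_staging(conn, SAMPLE_LINES)

# Nothing is parsed at this stage, so the ragged rows cannot fail the load.
row_count = conn.execute("SELECT COUNT(*) FROM staging").fetchone()[0]
print(row_count)  # 6: the header, three data rows and the two problem rows
```

Because each line is stored as a single string, the load step never has to know how many columns a row contains.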

Next we remove the extra rows. In my case this is a delete statement with a simple where clause, as the length of the problem rows is constant and significantly shorter than that of a normal data row. You'll need some way of distinguishing these rows from the others, and if you're lucky it'll be as simple as my case. If not, you might have to filter using the column delimiters. You might even try to use an identity column on the table and use min and max functions to find the problem rows (actually min + 1 because of the column headings). I didn't try the identity method, as I'm always concerned (probably needlessly) that the order of inserts might not be guaranteed to match that of the text file, so I prefer to use separate logic. If the source file is generated from another process, the format of the problem lines should be constant, so the logic should be straightforward.
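
As a rough sketch of the delimiter-based filter mentioned above (not the length-based delete used in my package), the following assumes Python with SQLite standing in for the staging table. The table name `staging`, the column `line`, the comma delimiter and the three-column layout are all assumptions made for illustration.

```python
import sqlite3

DELIM = ","           # assumed column delimiter
EXPECTED_COLUMNS = 3  # assumed number of columns in a normal row

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staging (line VARCHAR(8000))")
conn.executemany(
    "INSERT INTO staging (line) VALUES (?)",
    [("Col1,Col2,Col3",), ("xxx start",), ("x,y,z",), ("x,y,z",), ("xxx end",)],
)

# A row with N columns contains N - 1 delimiters. The delimiter count is the
# length of the line minus its length once the delimiters are stripped out.
cursor = conn.execute(
    "DELETE FROM staging "
    "WHERE LENGTH(line) - LENGTH(REPLACE(line, ?, '')) <> ?",
    (DELIM, EXPECTED_COLUMNS - 1),
)
deleted = cursor.rowcount
conn.commit()

remaining = [r[0] for r in conn.execute("SELECT line FROM staging ORDER BY rowid")]
print(deleted, remaining)
```

Counting delimiters only works when the delimiter never appears inside a data value; if it can, a length test or a match on the known text of the problem rows is the safer choice.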

Next I export the remaining rows into a temporary file. This is a Flat File defined as "Ragged right", again with one column. The staging table above is the source and data is exported from it to the staging Flat File.

Finally, this staging file is treated as a standard Flat File and imported as per usual in another data flow task. This means that the column names can now be determined from the source file by the designer.
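
The export and re-import steps can be sketched in the same spirit. This is a minimal illustration in Python rather than the SSIS data flows themselves: `cleaned_lines` is a hypothetical stand-in for the rows left in the staging table, and the comma delimiter is assumed.

```python
import csv
import os
import tempfile

# Hypothetical rows remaining in the staging table after the delete.
cleaned_lines = ["Col1,Col2,Col3", "x,y,z", "x,y,z", "x,y,z"]

# Export: write one line per staging row to a temporary staging file.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, newline=""
) as out_file:
    out_file.write("\n".join(cleaned_lines) + "\n")
    staging_path = out_file.name

# Re-import: the file is now a standard delimited file, so the column
# names can be read from its first row instead of being typed in by hand.
with open(staging_path, newline="") as in_file:
    reader = csv.DictReader(in_file, delimiter=",")
    columns = reader.fieldnames
    rows = list(reader)

os.remove(staging_path)  # the staging file is only needed temporarily

print(columns)
print(len(rows))
```

The point of the round trip is that the second read sees a regular file, which is what lets the column names be discovered automatically.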

We have added a couple of data-flow tasks, which complicate the package a little, but it's all pretty straightforward and transparent. If and when a recordset source can be used, we can do away with the persisted staging Flat File to make it even neater in the future.
