November 6, 2014 at 2:57 pm
hi.
I have a lot of txt files with data that I need to import. They are "|" separated data files.
Some of them have the columns like this:
ID|name|address|score|email|city
-------------------------------
some of them have the columns like this:
ID|name|score|city
-------------------------------
and some of the files may contain other columns, but all of the files have the ID, name, and score columns, and those 3 columns are all I need. The rest of the columns are useless, and I don't need to deal with them.
Can SSIS do the work? Thanks
November 6, 2014 at 10:17 pm
qinzhen (11/6/2014)
hi. I have a lot of txt files with data that I need to import. They are "|" separated data files.
Some of them have the columns like this:
ID|name|address|score|email|city
-------------------------------
some of them have the columns like this:
ID|name|score|city
-------------------------------
and some of the files may contain other columns, but all of the files have the ID, name, and score columns, and those 3 columns are all I need. The rest of the columns are useless, and I don't need to deal with them.
Can SSIS do the work? Thanks
There are several ways of doing this; which fits best depends somewhat on the circumstances.
Quick questions:
1) Does each file have a single format (number of columns consistent within the files)?
2) Are there many file formats or only a handful?
3) Can you distinguish between formats before loading the files?
😎
November 7, 2014 at 7:53 am
Eirikur Eiriksson (11/6/2014)
There are several ways of doing this; which fits best depends somewhat on the circumstances.
Quick questions:
1) Does each file have a single format (number of columns consistent within the files)?
2) Are there many file formats or only a handful?
3) Can you distinguish between formats before loading the files?
😎
1) yes.
2) There are hundreds of files, and more may be added in the future.
3) No. I can't know the exact format before loading the files. However, there is a "date" in each file's name (for example, info_2012_3_21_15_30.txt). A file with an earlier "date" has fewer columns, and a file with a more recent "date" has more columns (the person who generates the daily files keeps adding columns as the business requirements change). However, I don't know when the columns were added, and it would be impossible for me to open each file and try to figure out when the additional columns were added. In any case, the file with the most recent "date" in its filename has all the columns that can possibly occur in the other files. But in the future more columns may be added, and I'd like my SSIS package to handle future files without much effort. :-D
thanks,
November 7, 2014 at 8:51 pm
qinzhen (11/7/2014)
Eirikur Eiriksson (11/6/2014)
There are several ways of doing this; which fits best depends somewhat on the circumstances.
Quick questions:
1) Does each file have a single format (number of columns consistent within the files)?
2) Are there many file formats or only a handful?
3) Can you distinguish between formats before loading the files?
😎
1) yes.
2) There are hundreds of files, and more may be added in the future.
3) No. I can't know the exact format before loading the files. However, there is a "date" in each file's name (for example, info_2012_3_21_15_30.txt). A file with an earlier "date" has fewer columns, and a file with a more recent "date" has more columns (the person who generates the daily files keeps adding columns as the business requirements change). However, I don't know when the columns were added, and it would be impossible for me to open each file and try to figure out when the additional columns were added. In any case, the file with the most recent "date" in its filename has all the columns that can possibly occur in the other files. But in the future more columns may be added, and I'd like my SSIS package to handle future files without much effort. :-D
thanks,
I wish I had the time to do this for you, but I don't, so you'll have to settle for an explanation (a rough T-SQL sketch of the first couple of steps follows the list).
1. I use BULK INSERT to read the header row and the first 3 rows of data (stop at row 4) into a single-column blob.
2. I use DelimitedSplit8K (you'll need to use something else if the header is wider than 8K bytes) to split the header into column names and figure out where the columns that I want are.
3. I check to make sure the first 3 data rows have the same number and type of delimiters as the header row.
4. I use the information I got in step 2 to build a BCP format file to control the input, ignoring columns that I don't actually want. Yeah, I use xp_CmdShell for this.
5. I do the import using the BCP format file to a staging table.
6. I validate the data in the table and, if required, check against the target table for "upsert" possibilities and mark each valid row for insert or update.
7. I do the "upsert".
8. Since it's all automated, I go have a smoke and a beer. 😛
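Just to make that a bit more concrete, here is a minimal T-SQL sketch of steps 1 and 2. It assumes the community DelimitedSplit8K splitter is already installed; the file path and table names are illustrative only.

-- Step 1 (sketch): read the header line of the file as a single-column blob.
-- LASTROW = 1 grabs just the header; use LASTROW = 4 to also pull in the first 3 data rows.
-- (The whole line lands in one column because the default field terminator, a tab, never appears in the file.)
IF OBJECT_ID('tempdb..#HeaderLine') IS NOT NULL DROP TABLE #HeaderLine;
CREATE TABLE #HeaderLine (HeaderRow VARCHAR(8000));

BULK INSERT #HeaderLine
FROM 'C:\Import\info_2012_3_21_15_30.txt'   -- illustrative path
WITH (ROWTERMINATOR = '\n', LASTROW = 1);

-- Step 2 (sketch): split the header on the pipe to find the ordinal positions
-- of the columns we actually want; those positions drive the format file in step 4.
SELECT s.ItemNumber AS ColumnPosition,
       s.Item       AS ColumnName
  FROM #HeaderLine h
 CROSS APPLY dbo.DelimitedSplit8K(h.HeaderRow, '|') s
 WHERE s.Item IN ('ID', 'name', 'score');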
--Jeff Moden
Change is inevitable... Change for the better is not.
November 8, 2014 at 12:52 am
Jeff Moden (11/7/2014)
qinzhen (11/7/2014)
Eirikur Eiriksson (11/6/2014)
There are several ways of doing this; which fits best depends somewhat on the circumstances.
Quick questions:
1) Does each file have a single format (number of columns consistent within the files)?
2) Are there many file formats or only a handful?
3) Can you distinguish between formats before loading the files?
😎
1) yes.
2) There are hundreds of files, and more may be added in the future.
3) No. I can't know the exact format before loading the files. However, there is a "date" in each file's name (for example, info_2012_3_21_15_30.txt). A file with an earlier "date" has fewer columns, and a file with a more recent "date" has more columns (the person who generates the daily files keeps adding columns as the business requirements change). However, I don't know when the columns were added, and it would be impossible for me to open each file and try to figure out when the additional columns were added. In any case, the file with the most recent "date" in its filename has all the columns that can possibly occur in the other files. But in the future more columns may be added, and I'd like my SSIS package to handle future files without much effort. :-D
thanks,
I wish I had the time to do this for you, but I don't, so you'll have to settle for an explanation.
1. I use BULK INSERT to read the header row and the first 3 rows of data (stop at row 4) into a single-column blob.
2. I use DelimitedSplit8K (you'll need to use something else if the header is wider than 8K bytes) to split the header into column names and figure out where the columns that I want are.
3. I check to make sure the first 3 data rows have the same number and type of delimiters as the header row.
4. I use the information I got in step 2 to build a BCP format file to control the input, ignoring columns that I don't actually want. Yeah, I use xp_CmdShell for this.
5. I do the import using the BCP format file to a staging table.
6. I validate the data in the table and, if required, check against the target table for "upsert" possibilities and mark each valid row for insert or update.
7. I do the "upsert".
8. Since it's all automated, I go have a smoke and a beer. 😛
Jeff has in essence answered the question; I'll just throw in a few more bullets:
1) There are four common methods of dealing with different and unknown file layouts/formats.
a. The process opens the file, counts the delimiters, detects the column placement and either routes the file to the appropriate preconfigured import or constructs one on the fly. This is typically done in an SSIS Script Task.
b. The process imports the file line by line, splits it on the fly and loads the preferred columns to the destination table. This is normally done in an SSIS Script Transformation.
c. The process splits the file and loads it into a staging table with enough columns, then constructs an SQL statement to load the file into the destination (a rough T-SQL sketch of this option follows below).
d. The process loads the file, line by line into a staging table, detects the format/layout and splits it from there.
The difference between these methods is simply where in the process the detection and filtering work is done, which normally coincides with the comfort zone of the developer.
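To put a little flesh on option (c), here is a minimal sketch. It assumes a generic wide staging table (dbo.StageWide with columns Col01 through Col10) has already been loaded from the file, and that the header has been inspected to find which positions hold ID, name, and score; all object names and positions below are assumptions for illustration only.

-- Option (c) sketch: the file has been split into a staging table with "enough" columns;
-- a dynamic statement then picks out only the columns we care about.
DECLARE @IdCol    sysname = N'Col01',   -- positions found by inspecting the header row
        @NameCol  sysname = N'Col02',
        @ScoreCol sysname = N'Col04',
        @sql      nvarchar(MAX);

SET @sql = N'INSERT INTO dbo.Scores (ID, [name], score) '
         + N'SELECT ' + QUOTENAME(@IdCol) + N', ' + QUOTENAME(@NameCol) + N', ' + QUOTENAME(@ScoreCol)
         + N' FROM dbo.StageWide;';

EXEC sys.sp_executesql @sql;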
😎
November 8, 2014 at 8:16 pm
There's one other difference depending on how it's coded: performance. RBAR inserts tend to be incredibly slow compared to multi-row, set-based inputs.
Which brings me to a subject...
Lots of folks prefer row-by-row imports because, if a row is bad, they can easily isolate it. Some wrongly curse BULK INSERT and BCP for not being able to do such a thing, but both can sequester bad rows if you set up the error handling available in each. That will begin to allow you to avoid RBAR imports.
If you combine the native error handling of BULK INSERT and BCP along with the idea that you probably should never import directly to a final table, you can get some rather incredible validation and import rates. Always use a staging table. Think about it... would you put something in your mouth if you didn't know what it was?
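To illustrate the built-in error handling mentioned above, here is a minimal sketch of a BULK INSERT into a staging table that sequesters bad rows instead of failing the whole load; the path, file name, and table name are illustrative only.

-- Sketch only: rows that fail to parse are written to the ERRORFILE
-- (plus a companion .Error.Txt file) instead of killing the import, up to MAXERRORS rows.
BULK INSERT dbo.ScoreStaging                         -- illustrative staging table
FROM 'C:\Import\info_2012_3_21_15_30.txt'            -- illustrative path
WITH (
      FIELDTERMINATOR = '|',
      ROWTERMINATOR   = '\n',
      FIRSTROW        = 2,                            -- skip the header row
      ERRORFILE       = 'C:\Import\Errors\info_2012_3_21_15_30.err',
      MAXERRORS       = 1000,
      TABLOCK
     );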
The other thing that will help you avoid RBAR during imports is to remember the golden rule for imports: the data should be in a predictable, stable, and proper format. If it's not, then you really need to talk with the folks providing the data and get them to get their act together.
--Jeff Moden
Change is inevitable... Change for the better is not.
November 14, 2014 at 6:04 am
If I understand the OP's requirement, then the proposed solutions are correct but overly complicated.
The OP wants to always take the first three columns and doesn't care about the rest.
When you open the DFT source task and supply it with the sample file, it will identify and derive column names and types for you. You then need to manually adjust these so that everything after the third pipe "|" is treated as a single column with a width of 4000 and the record terminator is CRLF (or LF).
When you map the output of the DFT source to the next stage, simply ignore the 4th column.
November 14, 2014 at 12:25 pm
Actually, the columns I am interested in are not necessarily the first three columns in the files.
I will try Jeff's and Eirikur's suggestions. Thanks!
November 17, 2014 at 3:44 am
Awesome response. You learn something new every day.
E.:cool: