January 12, 2016 at 4:19 pm
Just looking to get some confirmation or maybe another train of thought.
We are loading in flat files from a file source to a staging table to queue up records for processing.
SSIS 2012
MSSQL 2012 Update 1
I am looking for the best way to guarantee that either the entire flat file is loaded or zero rows are loaded. If the flat file source is used without transactions and the load fails halfway through, some of those rows may already be committed.
My basic pattern is:
Execute SQL Task ("BEGIN TRANSACTION")
Data Flow Task importing the flat file source into an OLE DB Destination, with the OLE DB connection manager's RetainSameConnection property set to True
Execute SQL Task ("COMMIT TRANSACTION") on Success
Execute SQL Task ("ROLLBACK TRANSACTION") on Failure
Are there any gotchas I may encounter with this approach? In my testing, the only issue I've found is that if developers query the table with a WITH (NOLOCK) hint during the load, they may see the uncommitted rows.
January 12, 2016 at 10:09 pm
Quick suggestion: load the file into a separate table (i.e., bulk load), check whether everything is there (row count, sum, etc.) and simply truncate the table if unsuccessful.
😎
January 13, 2016 at 6:31 am
This is definitely possible, and the table I load into is a "staging" table of sorts (with interface status columns, date loaded, filename loaded, etc.).
The issue is that I don't know how many rows are in the flat file to do a row count against. We don't have a "file ender" record or any guarantees.
January 13, 2016 at 7:37 am
cycle90210 (1/13/2016)
This is definitely possible, and the table I load into is a "staging" table of sorts (with interface status columns, date loaded, filename loaded, etc.). The issue is that I don't know how many rows are in the flat file to do a row count against. We don't have a "file ender" record or any guarantees.
Counting the rows is a simple scripting task. I don't have an example at hand, but it should be easy to find on the big G.
😎
January 13, 2016 at 8:12 am
You could also use a Row Count transformation in the Data Flow and compare that with the rows loaded, but that will only give you an accurate count if all the rows get read.
For best practices on asking questions, please read the following article: Forum Etiquette: How to post data/code on a forum to get the best help
January 13, 2016 at 8:45 am
Eirikur Eiriksson (1/13/2016)
cycle90210 (1/13/2016)
This is definitely possible, and the table I load into is a "staging" table of sorts (with interface status columns, date loaded, filename loaded, etc.). The issue is that I don't know how many rows are in the flat file to do a row count against. We don't have a "file ender" record or any guarantees.
Counting the rows is a simple scripting task. I don't have an example at hand, but it should be easy to find on the big G.
😎
In my previous job, we used a separate file with the row count that should be contained in the flat data file, as well as one or two additional aggregates which would work as check numbers. This separate file was generated by the same process generating the flat file, so it would catch any errors, including corrupt files containing fewer rows than expected.
January 13, 2016 at 8:57 am
Appreciate the replies. I'll have to keep these in mind when we get to design a new system (hopefully I won't need to if I can convince them database > flat file 🙂).
Unfortunately, this is coming from a legacy system into which I am just "injecting" some SSIS logic to help with manual routing. I won't be able to add anything to the existing files or make program changes, but maybe I can figure out some form of a checksum process.
January 13, 2016 at 9:10 am
How big are these files? One option is to load the file into a variable using T-SQL and OPENROWSET, then count the lines by comparing the length against the length after replacing all linefeed characters with an empty string.
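The length-difference trick is language-agnostic; a minimal Python sketch of the same idea (it assumes every record, including the last, ends with a linefeed):

```python
def count_lines(text: str) -> int:
    """Count lines by removing every linefeed and comparing lengths.
    Each "\n" removed accounts for exactly one line terminator."""
    return len(text) - len(text.replace("\n", ""))

data = "row 1\nrow 2\nrow 3\n"
print(count_lines(data))  # → 3
```

Note the hedge: if the producer omits the final linefeed, the count comes up one short, so the assumption needs checking against the actual feed.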
😎
January 13, 2016 at 9:16 am
I think you'd need a separate process that counts rows, knowing what a "row" is in the file format as a delimiter.
You can look for error rows in your flow, but that would mean you're assuming SSIS reads the rows correctly and there are no errors there.
If you do this, we'd love an article on this. I'm sure some people would like a technique that shows they can determine if an entire file loads.
January 13, 2016 at 9:25 am
Steve Jones - SSC Editor (1/13/2016)
I think you'd need a separate process that counts rows, knowing what a "row" is in the file format as a delimiter.
You can look for error rows in your flow, but that would mean you're assuming the rows are correctly read somehow by SSIS and there aren't errors there.
If you do this, we'd love an article on this. I'm sure some people would like a technique that shows they can determine if an entire file loads.
I've used both methods, SSIS script and T-SQL, many times; works like a charm every time. Better still is to create a hash before and after; I always use that for payment files etc.
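The hash-before-and-after idea can be sketched with Python's hashlib: the producer computes a digest when the file is written, ships it alongside, and the consumer recomputes it before loading (the file path and contents below are made up for illustration):

```python
import hashlib
import os
import tempfile

def file_sha256(path: str) -> str:
    """Hash a file's bytes in chunks so large feeds don't need to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

# Stand-in for the flat file the producer dropped off.
path = os.path.join(tempfile.gettempdir(), "example_feed.txt")
with open(path, "wb") as f:
    f.write(b"row 1\nrow 2\n")

digest = file_sha256(path)
```

In practice you'd compare `digest` against the value the producer shipped; any truncation or corruption in transit changes the digest.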
😎
January 13, 2016 at 9:27 am
Steve Jones - SSC Editor (1/13/2016)
I think you'd need a separate process that counts rows, knowing what a "row" is in the file format as a delimiter.
You can look for error rows in your flow, but that would mean you're assuming the rows are correctly read somehow by SSIS and there aren't errors there.
If you do this, we'd love an article on this. I'm sure some people would like a technique that shows they can determine if an entire file loads.
Even with this, you still have a bit of the "mouse guarding the cheese". The only true method of knowing how many rows should be in a file is to receive two files... one with the data that you want to load and one that tells you the number of rows that should be in the file (plus a checksum to help determine that the row data itself is correct even when the number of rows is).
I say "only true method" but that's not quite right. Another way is to have a header and footer in the file with the row count, etc., but that can really cause processes to go haywire because things like BULK INSERT, BCP, and other import methods frequently count delimiters rather than actual EOL markers, and they frequently have a difficult time handling multiple record layouts within the same file.
--Jeff Moden
Change is inevitable... Change for the better is not.
January 13, 2016 at 9:46 am
Jeff Moden (1/13/2016)
Even with this, you still have a bit of the "mouse guarding the cheese". The only true method of knowing how many rows should be in a file is to receive two files... one with the data that you want to load and one that tells you the number of rows that should be in the file (plus a checksum to help determine that the row data itself is correct even when the number of rows is).
I say "only true method" but that's not quite right. Another way is to have a header and footer in the file with the row count, etc., but that can really cause processes to go haywire because things like BULK INSERT, BCP, and other import methods frequently count delimiters rather than actual EOL markers, and they frequently have a difficult time handling multiple record layouts within the same file.
More common is to include a summary header, preferably with a seeded checksum for tamper-proofing. ;-)
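One way to read "seeded checksum" is a keyed digest, which can be sketched with Python's hmac module: both sides share a secret key, so a third party who alters the file can't regenerate a matching checksum (the key and row data below are made up):

```python
import hashlib
import hmac

# Made-up shared secret; in practice this lives with producer and consumer only.
SECRET_KEY = b"shared-secret"

def seeded_checksum(data: bytes) -> str:
    """Keyed (seeded) checksum: only holders of the key can produce a match."""
    return hmac.new(SECRET_KEY, data, hashlib.sha256).hexdigest()

original = b"row 1\nrow 2\n"
tampered = b"row 1\nrow X\n"
print(seeded_checksum(original) == seeded_checksum(tampered))  # → False
```

An unkeyed hash only detects accidental corruption; the seed/key is what adds the tamper-proofing property the post mentions.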
😎
January 13, 2016 at 2:00 pm
My only thought here is that if you know you have Windows-formatted files with a \r\n at the end of each line, you can count those with a separate, non-SSIS process. Then you can compare to lines read.
Perhaps some mouse guarding the cheese, but it depends on whether you're trying to ensure SSIS works correctly or to ensure something else.
Personally I'd trust SSIS if this is working and handle error rows inside the ETL. While you might have issues with some files, I'd deal with those at the time rather than coding to prevent them.
January 13, 2016 at 4:46 pm
Counting the number of lines in a flat-file in a Script Task in SSIS is trivial and wicked fast. You can then take that count and compare it to the number of rows in the table after your Data Flow Task completes.
// Count the lines in a flat file (requires using System.IO;)
string fileName = @"C:\Document1.txt";
int lineCount = 0;

using (StreamReader sr = new StreamReader(fileName))
{
    // ReadLine returns null once the end of the file is reached
    while (sr.ReadLine() != null)
    {
        lineCount++;
    }
}
Knowing that the number of lines in the file matches the number of rows in the table only confirms you loaded all lines in the file, however. It tells you nothing about what the data provider intended or was supposed to send you. For that you need a header or trailer line (or a separate file, as some mentioned) with a line count you can validate against the line/row count.
There are no special teachers of virtue, because virtue is taught by the whole community.
--Plato
January 14, 2016 at 5:52 am
I may do a write-up on this, as it has generated several good remarks and approaches!
I think there are two real ways to address this:
1. If you are able to customize the input/output from the source system, a checksum file and a header/footer line can be a sure way to guarantee things.
2. If you are not able to customize the source system, things get to be a "best guess" situation. This may involve setting up transactions, doing a checksum of sorts, and reading the rows separately, then comparing row counts as a best guess.
I will need to experiment, but how does a StreamReader determine a "line"? If I have a CRLF in the middle of a text field, will StreamReader count it as a new line even though it is inside a text field?
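A quick Python sketch of the distinction being asked about: a raw line counter (which is how ReadLine-style loops behave) splits on every CRLF, while a CSV-aware parser keeps an embedded line break inside a quoted field as part of one record. The sample data is made up:

```python
import csv
import io

# One header plus two records; the first record's quoted field contains an embedded CRLF.
data = 'id,comment\r\n1,"line one\r\nline two"\r\n2,"plain"\r\n'

# Naive line counting: splits on every CR/LF, including the one inside quotes.
raw_lines = data.splitlines()

# CSV-aware parsing: the quoted field stays together as one record.
records = list(csv.reader(io.StringIO(data, newline="")))

print(len(raw_lines), len(records))  # → 4 3
```

So a ReadLine-based count will overcount whenever text fields can contain embedded line breaks; if that's possible in the feed, the counting process needs to be format-aware.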