January 12, 2016 at 4:19 pm
Just looking to get some confirmation or maybe another train of thought.
We are loading in flat files from a file source to a staging table to queue up records for processing.
SSIS 2012
MSSQL 2012 Update 1
I am looking for the best way to guarantee that either the entire flat file is loaded or zero rows are loaded. If the flat file source is used without transactions and the load fails halfway through, some of those rows may already be committed.
My basic pattern is:
Execute SQL Task ("BEGIN TRANSACTION")
Data Flow Task importing the flat file source into an OLE DB Destination, with the OLE DB connection manager's RetainSameConnection property set to True
Execute SQL Task ("COMMIT TRANSACTION") on Success
Execute SQL Task ("ROLLBACK TRANSACTION") on Failure
Are there any gotchas I may encounter with this approach? In my testing, the only issue I've found is that if developers query the table with a WITH (NOLOCK) hint during the load, they may see the uncommitted rows.
January 12, 2016 at 10:09 pm
Quick suggestion: load the file into a separate table (i.e., bulk load), check whether everything is there (row count, sum, etc.) and simply truncate the table if unsuccessful.
😎
January 13, 2016 at 6:31 am
This is definitely possible, and the table I load into is a "staging" table of sorts (with interface status columns, date loaded, filename loaded, etc.).
The issue is that I don't know how many rows are in the flat file to do a row count against. We don't have a "file ender" record or any guarantees.
January 13, 2016 at 7:37 am
cycle90210 (1/13/2016)
This is definitely possible, and the table I load into is a "staging" table of sorts (with interface status columns, date loaded, filename loaded, etc.). The issue is that I don't know how many rows are in the flat file to do a row count against. We don't have a "file ender" record or any guarantees.
Counting the rows is a simple scripting task. I don't have an example at hand, but it should be easy to find on the big G.
😎
January 13, 2016 at 8:12 am
You could also use a Row Count transformation in the Data Flow and compare that with the rows loaded, but that will only give you an accurate count if all the rows get read.
For best practices on asking questions, please read the following article: Forum Etiquette: How to post data/code on a forum to get the best help
January 13, 2016 at 8:45 am
Eirikur Eiriksson (1/13/2016)
cycle90210 (1/13/2016)
This is definitely possible, and the table I load into is a "staging" table of sorts (with interface status columns, date loaded, filename loaded, etc.). The issue is that I don't know how many rows are in the flat file to do a row count against. We don't have a "file ender" record or any guarantees.
Counting the rows is a simple scripting task. I don't have an example at hand, but it should be easy to find on the big G.
😎
In my previous job, we used a separate file with the row count that should be contained in the flat data file, as well as one or two additional aggregates which would work as check numbers. This separate file was generated by the same process generating the flat file, so it would catch any errors, including corrupt files containing fewer rows than expected.
January 13, 2016 at 8:57 am
Appreciate the replies. I'll have to keep these in mind when we get to design a new system (hopefully I won't need to if I can convince them database > flat file 🙂).
Unfortunately, this is coming from a legacy system into which I am just "injecting" some SSIS logic to help with manual routing. I won't be able to add anything to the existing files or make program changes, but maybe I can figure out some form of a checksum process.
January 13, 2016 at 9:10 am
How big are these files? One option is to load the file into a variable using T-SQL and OPENROWSET, then count the lines by comparing the length against the length after replacing all linefeed characters with an empty string.
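The length-difference trick is language-agnostic; a minimal Python sketch of the same idea (it assumes every record, including the last, ends with a linefeed):

```python
def count_lines(text: str) -> int:
    """Count lines by removing every linefeed and comparing lengths.
    Each "\n" removed accounts for exactly one line terminator."""
    return len(text) - len(text.replace("\n", ""))

data = "row 1\nrow 2\nrow 3\n"
print(count_lines(data))  # → 3
```

Note the hedge: if the producer omits the final linefeed, the count comes up one short, so the assumption needs checking against the actual feed.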
😎
January 13, 2016 at 9:16 am
I think you'd need a separate process that counts rows, knowing what a "row" is in the file format as a delimiter.
You can look for error rows in your flow, but that would mean you're assuming SSIS reads the rows correctly and there are no errors there.
If you do this, we'd love an article on this. I'm sure some people would like a technique that shows they can determine if an entire file loads.
January 13, 2016 at 9:25 am
Steve Jones - SSC Editor (1/13/2016)
I think you'd need a separate process that counts rows, knowing what a "row" is in the file format as a delimiter.
You can look for error rows in your flow, but that would mean you're assuming the rows are correctly read somehow by SSIS and there aren't errors there.
If you do this, we'd love an article on this. I'm sure some people would like a technique that shows they can determine if an entire file loads.
I've used both methods, SSIS script and T-SQL, many times; works like a charm every time. Better still is to create a hash before and after; I always use that for payment files etc.
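The hash-before-and-after idea can be sketched with Python's hashlib: the producer computes a digest when the file is written, ships it alongside, and the consumer recomputes it before loading (the file path and contents below are made up for illustration):

```python
import hashlib
import os
import tempfile

def file_sha256(path: str) -> str:
    """Hash a file's bytes in chunks so large feeds don't need to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

# Stand-in for the flat file the producer dropped off.
path = os.path.join(tempfile.gettempdir(), "example_feed.txt")
with open(path, "wb") as f:
    f.write(b"row 1\nrow 2\n")

digest = file_sha256(path)
```

In practice you'd compare `digest` against the value the producer shipped; any truncation or corruption in transit changes the digest.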
😎
January 13, 2016 at 9:27 am
Steve Jones - SSC Editor (1/13/2016)
I think you'd need a separate process that counts rows, knowing what a "row" is in the file format as a delimiter.
You can look for error rows in your flow, but that would mean you're assuming the rows are correctly read somehow by SSIS and there aren't errors there.
If you do this, we'd love an article on this. I'm sure some people would like a technique that shows they can determine if an entire file loads.
Even with this, you still have a bit of the "mouse guarding the cheese". The only true method of knowing how many rows should be in a file is to receive two files... one with the data that you want to load and one that tells you the number of rows that should be in the file (plus a checksum to help determine that the row data itself is correct even when the number of rows is).
I say "only true method" but that's not quite right. Another way is to have a header and footer in the file with the row count, etc., but that can really cause processes to go haywire because things like BULK INSERT, BCP, and other import methods frequently count delimiters rather than actual EOL markers, and they frequently have a difficult time handling multiple record layouts within the same file.
--Jeff Moden
Change is inevitable... Change for the better is not.
January 13, 2016 at 9:46 am
Jeff Moden (1/13/2016)
Even with this, you still have a bit of the "mouse guarding the cheese". The only true method of knowing how many rows should be in a file is to receive two files... one with the data that you want to load and one that tells you the number of rows that should be in the file (plus a checksum to help determine that the row data itself is correct even when the number of rows is).
I say "only true method" but that's not quite right. Another way is to have a header and footer in the file with the row count, etc., but that can really cause processes to go haywire because things like BULK INSERT, BCP, and other import methods frequently count delimiters rather than actual EOL markers, and they frequently have a difficult time handling multiple record layouts within the same file.
More common is to include a summary header, preferably with a seeded checksum for tamper-proofing. ;-)
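One way to read "seeded checksum" is a keyed digest, which can be sketched with Python's hmac module: both sides share a secret key, so a third party who alters the file can't regenerate a matching checksum (the key and row data below are made up):

```python
import hashlib
import hmac

# Made-up shared secret; in practice this lives with producer and consumer only.
SECRET_KEY = b"shared-secret"

def seeded_checksum(data: bytes) -> str:
    """Keyed (seeded) checksum: only holders of the key can produce a match."""
    return hmac.new(SECRET_KEY, data, hashlib.sha256).hexdigest()

original = b"row 1\nrow 2\n"
tampered = b"row 1\nrow X\n"
print(seeded_checksum(original) == seeded_checksum(tampered))  # → False
```

An unkeyed hash only detects accidental corruption; the seed/key is what adds the tamper-proofing property the post mentions.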
😎
January 13, 2016 at 2:00 pm
My only thought here is that if you know you have Windows-formatted files with a \r\n at the end of each line, you can count those with a separate, non-SSIS process. Then you can compare to lines read.
Perhaps some mouse guarding the cheese, but it depends on whether you're trying to ensure SSIS works correctly or to ensure something else.
Personally I'd trust SSIS if this is working and handle error rows inside the ETL. While you might have issues with some files, I'd deal with those at the time rather than coding to prevent them.
January 13, 2016 at 4:46 pm
Counting the number of lines in a flat-file in a Script Task in SSIS is trivial and wicked fast. You can then take that count and compare it to the number of rows in the table after your Data Flow Task completes.
// Count the lines in a flat file (requires using System.IO;)
string fileName = @"C:\Document1.txt";
int lineCount = 0;

using (StreamReader sr = new StreamReader(fileName))
{
    // ReadLine returns null once the end of the file is reached
    while (sr.ReadLine() != null)
    {
        lineCount++;
    }
}
Knowing that the number of lines in the file matches the number of rows in the table only confirms you loaded all lines in the file, however. It tells you nothing about what the data provider intended or was supposed to send you. For that you need a header or trailer line (or a separate file, as some mentioned) with a line count you can validate against the line/row count.
There are no special teachers of virtue, because virtue is taught by the whole community.
--Plato
January 14, 2016 at 5:52 am
I may do a write-up on this, as it has generated several good remarks and approaches!
I think there are two real ways to address this:
1. If you are able to customize the input/output from the source system, a checksum file and a header/footer line can be a sure way to guarantee things.
2. If you are not able to customize the source system, things get to be a "best guess" situation. This may involve setting up transactions, doing a checksum of sorts, and reading the rows separately, then comparing row counts as a best guess.
I will need to experiment, but how does a StreamReader determine a "line"? If I have a CRLF in the middle of a text field, will StreamReader count it as a new line even though it is inside a text field?
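A quick Python sketch of the distinction being asked about: a raw line counter (which is how ReadLine-style loops behave) splits on every CRLF, while a CSV-aware parser keeps an embedded line break inside a quoted field as part of one record. The sample data is made up:

```python
import csv
import io

# One header plus two records; the first record's quoted field contains an embedded CRLF.
data = 'id,comment\r\n1,"line one\r\nline two"\r\n2,"plain"\r\n'

# Naive line counting: splits on every CR/LF, including the one inside quotes.
raw_lines = data.splitlines()

# CSV-aware parsing: the quoted field stays together as one record.
records = list(csv.reader(io.StringIO(data, newline="")))

print(len(raw_lines), len(records))  # → 4 3
```

So a ReadLine-based count will overcount whenever text fields can contain embedded line breaks; if that's possible in the feed, the counting process needs to be format-aware.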