Advice on importing multiple txt files with different file structures

  • Cross-posted from The Thread:

    Looking for advice on the best way to import multiple fixed-width txt files (50+) with different file structures into SQL Server (separate table for each file).

    We receive all the files at once, although we don't necessarily receive every file out of the 50+ every time. They each have a unique reportID as *part* of the filename, but the filenames change.

    Was thinking of putting all these into one package, since we need to load, compare and respond to our regulatory agency on a very tight turnaround, and hoping to automate the busy-work.

    Current plan:

    Build 50+ ForEachLoop containers, each with a dataFlow task, looking for the existence of a file matching '*<reportID>*.txt', with hard-coded columns in each dataflow that correspond to the file structure, dumping into the correct tables.

    Alternate idea from previous conversation: use one ForEachLoop/DataFlow using XML templates/flat file source/XML source to determine the file structure, and import into the appropriate table. (Still need to research this one; it's Stefan's idea, and I'm just thinking out loud online.)

    Alternate alternate idea: build each piece as a sub-package, and only run the ones that need to run via "programmatic determination", as I think Stefan put it.

    Any suggestions/advice welcomed and requested.

    Thanks,

    Jon

    ---------------------------------------------------------
    How best to post your question
    How to post performance problems
    Tally Table: What it is and how it replaces a loop

    "stewsterl 80804 (10/16/2009): I guess when you stop and try to understand the solution provided you not only learn, but save yourself some headaches when you need to make any slight changes."

  • jcrawf02 (9/15/2010)


    Build 50+ ForEachLoop containers, each with a dataFlow task, looking for the existence of a file matching '*<reportID>*.txt', with hard-coded columns in each dataflow that correspond to the file structure, dumping into the correct tables.

    Jon, this specific statement above worries me in general. Are these unique identifiers or numeric codes? The reason I'm curious (and it may have an impact on some solutioning) is that if a report is "0801"... your files with datetimes on them for August 1st are going to be very interesting each year.
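
    To make that concrete, a quick sketch of the collision (the filename below is hypothetical):

    -- A purely numeric reportID can match the date stamp in the filename.
    DECLARE @FileName varchar(100) = 'WC1_Report_20100801.txt'; -- hypothetical name
    SELECT CHARINDEX('0801', @FileName);        -- returns 16: it hit the date stamp, not a reportID
    SELECT CHARINDEX('_WC1_', '_' + @FileName); -- returns 1: a delimiter-anchored match is safer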

    Personally I'd go with the single foreach container with a detection algorithm to move to different dataflows in the same loop for the different files. Much luck. πŸ™‚


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    For better assistance in answering your questions | Forum Netiquette
    For index/tuning help, follow these directions. | Tally Tables

    Twitter: @AnyWayDBA

  • Craig Farrell (9/15/2010)


    jcrawf02 (9/15/2010)


    Build 50+ ForEachLoop containers, each with a dataFlow task, looking for the existence of a file matching '*<reportID>*.txt', with hard-coded columns in each dataflow that correspond to the file structure, dumping into the correct tables.

    Jon, this specific statement above worries me in general. Are these unique identifiers or numeric codes? The reason I'm curious (and it may have an impact on some solutioning) is that if a report is "0801"... your files with datetimes on them for August 1st are going to be very interesting each year.

    Personally I'd go with the single foreach container with a detection algorithm to move to different dataflows in the same loop for the different files. Much luck. πŸ™‚

    Thanks Craig; the ID has to do with the actual report definition, not the datetime or anything. For example, WC1 vs. WC2 vs. WC3 indicates which version of the 'Well Child' report is being submitted. The date/time stuff is also part of the name, but I'll figure that out separately with a load date or some such.
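
    If it helps for the load date, here's one way to peel a date stamp out of a name, assuming a _YYYYMMDD pattern like the hypothetical name below (adjust to your actual format):

    -- The [_] escapes the underscore so PATINDEX doesn't treat it as a
    -- single-character wildcard; assumes names like 'WC1_Report_20100801.txt'.
    DECLARE @FileName varchar(100) = 'WC1_Report_20100801.txt';
    SELECT CONVERT(date,
           SUBSTRING(@FileName,
                     PATINDEX('%[_]20[0-9][0-9][0-1][0-9][0-3][0-9]%', @FileName) + 1,
                     8)) AS LoadDate; -- 2010-08-01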

    ---------------------------------------------------------
    How best to post your question
    How to post performance problems
    Tally Table: What it is and how it replaces a loop

    "stewsterl 80804 (10/16/2009): I guess when you stop and try to understand the solution provided you not only learn, but save yourself some headaches when you need to make any slight changes."

  • I'd suggest something along the line of:

    DECLARE @Files TABLE (FileName varchar(100));

    DECLARE @SQL      varchar(max),
            @FilePath varchar(100),
            @FileName varchar(100);

    SET @FilePath = 'C:\';
    SET @SQL = 'dir /b ' + @FilePath;

    -- Capture the directory listing; xp_cmdshell returns a trailing NULL row, so filter it out.
    INSERT INTO @Files EXECUTE xp_cmdshell @SQL;
    DELETE FROM @Files WHERE FileName IS NULL;

    -- One wide column holds each raw fixed-width line.
    CREATE TABLE #temp (FileData varchar(max));

    DECLARE cFiles CURSOR LOCAL FAST_FORWARD FOR
        SELECT FileName FROM @Files;

    OPEN cFiles;
    FETCH NEXT FROM cFiles INTO @FileName;

    WHILE @@FETCH_STATUS = 0
    BEGIN
        SET @FileName = @FilePath + @FileName;
        SET @SQL = 'BULK INSERT #temp FROM ''' + @FileName + '''';
        EXEC (@SQL);

        IF CHARINDEX('filename1 match criteria', @FileName) > 0
            INSERT INTO <base table1> (column list)
            SELECT SUBSTRING(FileData, startpos, NumberOfChars) Field1,
                   SUBSTRING(FileData, startpos, NumberOfChars) Field2 -- repeat as necessary
            FROM #temp;
        ELSE IF CHARINDEX('filename2 match criteria', @FileName) > 0
            INSERT INTO <base table2> (column list)
            SELECT SUBSTRING(FileData, startpos, NumberOfChars) Field1,
                   SUBSTRING(FileData, startpos, NumberOfChars) Field2 -- repeat as necessary
            FROM #temp;
        -- ELSE IF ... repeat for all 50+ types of files

        TRUNCATE TABLE #temp; -- clear the staging table so the next file starts clean

        FETCH NEXT FROM cFiles INTO @FileName;
    END;

    CLOSE cFiles;
    DEALLOCATE cFiles;

    This way, you don't have to worry about derived columns, transformations, etc.

    If you desire, you could put each of those inserts into a separate procedure, and just execute that.
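
    For example, here's a minimal sketch of one such procedure (the WC1 names, columns, and positions are hypothetical, borrowed from Jon's report IDs; a temp table created by the calling script is visible inside a procedure it calls):

    -- Hypothetical per-report procedure; table name, columns, and positions are placeholders.
    CREATE PROCEDURE dbo.Import_WC1
    AS
    BEGIN
        -- #temp is created by the calling loop, so it is in scope here.
        INSERT INTO dbo.WellChild_v1 (Field1, Field2)
        SELECT SUBSTRING(FileData, 1, 10),
               SUBSTRING(FileData, 11, 25)
        FROM #temp;
    END;

    Then each branch of the loop shrinks to something like IF CHARINDEX('WC1', @FileName) > 0 EXEC dbo.Import_WC1; and so on.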

    πŸ˜‰

    Wayne
    Microsoft Certified Master: SQL Server 2008
    Author - SQL Server T-SQL Recipes


    If you can't explain to another person how the code that you're copying from the internet works, then DON'T USE IT on a production system! After all, you will be the one supporting it!
    Links:
    For better assistance in answering your questions
    Performance Problems
    Common date/time routines
    Understanding and Using APPLY Part 1 & Part 2

  • jcrawf02 (9/15/2010)


    Thanks Craig; the ID has to do with the actual report definition, not the datetime or anything. For example, WC1 vs. WC2 vs. WC3 indicates which version of the 'Well Child' report is being submitted. The date/time stuff is also part of the name, but I'll figure that out separately with a load date or some such.

    Pleasure. Just wanted to make sure your report names didn't match things that could end up in the date portion of the filename (a peer of mine had report 100101 that was busting due to a similar mechanic, until they figured out the 12 files were <reportname>_20100101.csv...).

    One advantage to that method is that you can simply drop in, or disable, reports as needed without building more than the exact data flow each one needs, plus the entry in the detection algorithm that sends files to the new object.


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    For better assistance in answering your questions | Forum Netiquette
    For index/tuning help, follow these directions. | Tally Tables

    Twitter: @AnyWayDBA

  • Hey Wayne, does substring still work if the data for the VARCHAR(MAX) is no longer stored in row, and goes out to a separate LOB page?

    I thought I remembered that dropping file data like that into anything but an XML field and then trying to treat it as XML caused problems if the row ended up too wide, but, um, yeah, I think I'll just ask for the clarification. πŸ™‚


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    For better assistance in answering your questions | Forum Netiquette
    For index/tuning help, follow these directions. | Tally Tables

    Twitter: @AnyWayDBA

  • Craig Farrell (9/15/2010)


    Hey Wayne, does substring still work if the data for the VARCHAR(MAX) is no longer stored in row, and goes out to a separate LOB page?

    I thought I remembered that dropping file data like that into anything but an XML field and then trying to treat it as XML caused problems if the row ended up too wide, but, um, yeah, I think I'll just ask for the clarification. πŸ™‚

    I believe so, but I've never tried it. How about running a test and letting us know?
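
    Something like this would do it (a minimal sketch; the CONVERT to varchar(max) matters because REPLICATE of a plain varchar caps out at 8,000 characters):

    -- Build a 20,000-character value so it must be pushed off-row to LOB pages,
    -- then read past the 8,000-byte in-row boundary with SUBSTRING.
    CREATE TABLE #LobTest (FileData varchar(max));
    INSERT INTO #LobTest (FileData)
    SELECT REPLICATE(CONVERT(varchar(max), 'X'), 20000);

    SELECT LEN(FileData) AS TotalLen,             -- 20000
           SUBSTRING(FileData, 9001, 10) AS Slice -- 'XXXXXXXXXX' if LOB reads work
    FROM #LobTest;

    DROP TABLE #LobTest;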

    Wayne
    Microsoft Certified Master: SQL Server 2008
    Author - SQL Server T-SQL Recipes


    If you can't explain to another person how the code that you're copying from the internet works, then DON'T USE IT on a production system! After all, you will be the one supporting it!
    Links:
    For better assistance in answering your questions
    Performance Problems
    Common date/time routines
    Understanding and Using APPLY Part 1 & Part 2

  • WayneS (9/15/2010)


    I believe so, but I've never tried it. How about running a test and letting us know?

    Why was I afraid you'd say that? New thread created here. Didn't want to derail entirely. πŸ™‚

    [EDIT:] Figured out my oops in the test, it works fine. [/EDIT]


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    For better assistance in answering your questions | Forum Netiquette
    For index/tuning help, follow these directions. | Tally Tables

    Twitter: @AnyWayDBA

  • jcrawf02 (9/15/2010)


    Cross-posted from The Thread:

    Looking for advice on the best way to import multiple fixed-width txt files (50+) with different file structures into SQL Server (separate table for each file).

    We receive all the files at once, although we don't necessarily receive every file out of the 50+ every time. They each have a unique reportID as *part* of the filename, but the filenames change.

    Was thinking of putting all these into one package, since we need to load, compare and respond to our regulatory agency on a very tight turnaround, and hoping to automate the busy-work.

    Current plan:

    Build 50+ ForEachLoop containers, each with a dataFlow task, looking for the existence of a file matching '*<reportID>*.txt', with hard-coded columns in each dataflow that correspond to the file structure, dumping into the correct tables.

    Alternate idea from previous conversation: use one ForEachLoop/DataFlow using XML templates/flat file source/XML source to determine the file structure, and import into the appropriate table. (Still need to research this one; it's Stefan's idea, and I'm just thinking out loud online.)

    Alternate alternate idea: build each piece as a sub-package, and only run the ones that need to run via "programmatic determination", as I think Stefan put it.

    Any suggestions/advice welcomed and requested.

    Thanks,

    Jon

    In the past, I've done more like what Wayne did in his code, and I avoided SSIS/DTS altogether in the process. I also went a little bit further... I'd read the delimited column headers from the first row of the file and use that information to dynamically determine what I was going to do with the file and which columns were going where based on the names of the columns (their position varied from file to file in a most unpredictable manner, and the number of columns varied as well). I don't know what the structures of all of your files are nor what headers (and, therefore, what data) are present, but this seems like it should be an easy thing to do with BULK INSERT and maybe the use of the undocumented xp_Dirtree 'C:\',1,1 (as an example) extended stored procedure (or use the DOS DIR command like Wayne did).
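
    A minimal sketch of that xp_DirTree call (it's undocumented, so verify the result-set shape on your build; the path is just an example):

    -- xp_DirTree 'path', depth, include-files: with the third parameter = 1 it
    -- returns (subdirectory, depth, file) rows, where file = 1 marks actual files.
    DECLARE @Files TABLE (SubDirectory varchar(255), Depth int, IsFile bit);

    INSERT INTO @Files (SubDirectory, Depth, IsFile)
    EXEC master.sys.xp_dirtree 'C:\', 1, 1;

    SELECT SubDirectory AS FileName
    FROM @Files
    WHERE IsFile = 1
      AND SubDirectory LIKE '%.txt'; -- keep only the report files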

    Heh... yeah. I know... not a very helpful post but I did want to let you know that I kind of like Wayne's approach on this...

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (9/15/2010)


    ... and maybe the use of the undocumented xp_Dirtree 'C:\',1,1 (as an example) extended stored procedure (or use the DOS DIR command like Wayne did).

    Ahh, that would also mean not needing to open up xp_cmdshell at all. Excellent idea.

    Heh... yeah. I know... not a very helpful post but I did want to let you know that I kind of like Wayne's approach on this...

    And I did that half seriously... and half just joking, Jon, since it came off The Thread.

    Wayne
    Microsoft Certified Master: SQL Server 2008
    Author - SQL Server T-SQL Recipes


    If you can't explain to another person how the code that you're copying from the internet works, then DON'T USE IT on a production system! After all, you will be the one supporting it!
    Links:
    For better assistance in answering your questions
    Performance Problems
    Common date/time routines
    Understanding and Using APPLY Part 1 & Part 2

    Heh. I built a whole import system for unpredictable inputs (basically, spreadsheet CSV files; can't reveal the source... NDA hasn't expired yet) using methods very basically similar to yours, but mine were 100% dynamic and 100% self-configuring, from 11 columns (9 were "fixed") to more than 800 columns, all with varying names from file to file.

    They previously had a Perl script that got one 30,000-row, 800-column file ready for import in about 40 minutes. When I was done, I was importing (not just getting ready for import) up to 6 files in a little over two minutes.

    Ah... sorry... didn't mean to make it sound like a brag. I just want people to know that it's not only possible to do this in T-SQL; it was easier to do and performed better than what they tried to do in both DTS and SSIS.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

    Just to make it interesting πŸ˜› my DBAs have deigned not to grant BULK INSERT, so, swell though your solution seems to be, Wayne, it no workey.

    In light of that roadblock, I'm still working on it in SSIS.

    ---------------------------------------------------------
    How best to post your question
    How to post performance problems
    Tally Table: What it is and how it replaces a loop

    "stewsterl 80804 (10/16/2009): I guess when you stop and try to understand the solution provided you not only learn, but save yourself some headaches when you need to make any slight changes."

  • jcrawf02 (9/16/2010)


    Just to make it interesting πŸ˜› my DBAs have deigned not to grant BULK INSERT, so, swell though your solution seems to be, Wayne, it no workey.

    In light of that roadblock, I'm still working on it in SSIS.

    Oof, that's a shame since that's the way to do it with XML templates. The only thing I needed the CLR for was directory polling and file transfer.

    With your SSIS process, are you doing 1 loop and assessing the file type there? It'll be a lot faster than running through all the files 50 times and a lot less work on your part. If you don't want to link 50 different parsing/importing processes to the one loop that directs the files, you can use the main loop to move the files to specific directories and then run one import for all the files in each directory. That might be visually much easier to handle and then you can just run each process one after the other instead of jumping all around depending on which file is read.

    Once I have some processes finished, I should be able to take another look at the source types in SSIS to see which is the best.

    --------------------------------------
    When you encounter a problem, if the solution isn't readily evident go back to the start and check your assumptions.
    --------------------------------------
    It’s unpleasantly like being drunk.
    What’s so unpleasant about being drunk?
    You ask a glass of water. -- Douglas Adams

  • Stefan Krzywicki (9/16/2010)


    jcrawf02 (9/16/2010)


    Just to make it interesting πŸ˜› my DBAs have deigned not to grant BULK INSERT, so, swell though your solution seems to be, Wayne, it no workey.

    In light of that roadblock, I'm still working on it in SSIS.

    Oof, that's a shame since that's the way to do it with XML templates. The only thing I needed the CLR for was directory polling and file transfer.

    With your SSIS process, are you doing 1 loop and assessing the file type there? It'll be a lot faster than running through all the files 50 times and a lot less work on your part. If you don't want to link 50 different parsing/importing processes to the one loop that directs the files, you can use the main loop to move the files to specific directories and then run one import for all the files in each directory. That might be visually much easier to handle and then you can just run each process one after the other instead of jumping all around depending on which file is read.

    Once I have some processes finished, I should be able to take another look at the source types in SSIS to see which is the best.

    Yeah, I think I'm going that route. Jack sent me here to Tim Mitchell's blog, which looks like it does what I want, so I'm going to give that a shot.

    ---------------------------------------------------------
    How best to post your question
    How to post performance problems
    Tally Table: What it is and how it replaces a loop

    "stewsterl 80804 (10/16/2009): I guess when you stop and try to understand the solution provided you not only learn, but save yourself some headaches when you need to make any slight changes."

  • jcrawf02 (9/16/2010)


    Stefan Krzywicki (9/16/2010)


    jcrawf02 (9/16/2010)


    Just to make it interesting πŸ˜› my DBAs have deigned not to grant BULK INSERT, so, swell though your solution seems to be, Wayne, it no workey.

    In light of that roadblock, I'm still working on it in SSIS.

    Oof, that's a shame since that's the way to do it with XML templates. The only thing I needed the CLR for was directory polling and file transfer.

    With your SSIS process, are you doing 1 loop and assessing the file type there? It'll be a lot faster than running through all the files 50 times and a lot less work on your part. If you don't want to link 50 different parsing/importing processes to the one loop that directs the files, you can use the main loop to move the files to specific directories and then run one import for all the files in each directory. That might be visually much easier to handle and then you can just run each process one after the other instead of jumping all around depending on which file is read.

    Once I have some processes finished, I should be able to take another look at the source types in SSIS to see which is the best.

    Yeah, I think I'm going that route. Jack sent me here to Tim Mitchell's blog, which looks like it does what I want, so I'm going to give that a shot.

    Great! I'm glad you have a resource and sorry I wasn't more help.

    I did get a chance to look at the two source types: Flat File and XML. I think you can use either, but I'd test both to see which is faster (and to make sure they both work πŸ™‚ ). The Flat File source has an option for Fixed Width, so that should work for you. The XML source allows parsing by XSD, which should let you do the SSIS version of Wayne's solution.

    If you do try both and they both work, I'd love to hear about speed differences and anything you discover about ease of use, etc...

    I'm about to embark on something similar myself. If I find out anything interesting, I'll let you know.

    --------------------------------------
    When you encounter a problem, if the solution isn't readily evident go back to the start and check your assumptions.
    --------------------------------------
    It’s unpleasantly like being drunk.
    What’s so unpleasant about being drunk?
    You ask a glass of water. -- Douglas Adams
