Importing Large Datasets

  • SQL Server 2008 R2

    We have an application that essentially sits and waits for external text files to arrive. Some of the files can be quite large. The problem is that after one of these large imports, performance goes down the drain... probably due to statistics becoming stale and/or indexes becoming fragmented. This is a C# app that calls a stored procedure to insert the data. My first instinct was to call a stats or index rebuild from the app when one of these big files completes. It turns out that the login this runs under does not have permission to rebuild stats/indexes, though.

    I'm at a loss as to how to deal with these big files when they arrive; there must be a better way. Setting up a scheduled job to do the stats rebuild is difficult since the larger files arrive at random times.

    How are others dealing with this situation?

    Thanks


  • BSavoie (4/23/2015)


    1. How many bytes in the typical "large" file?

    2. How many lines in the typical "large" file?

    3. How many "fields" in the typical "large" file?

    4. What type of file is it? CSV, True CSV, TSV, Fixed Field, or ???

    5. Does the login have privs to run a stored procedure? (Don't worry about other privs... just need to know if it can run a stored procedure).

    Shifting gears a bit, the problems you're running into are the reason why I do all of this through T-SQL and xp_CmdShell, which actually isn't the security risk that most people would have you believe.
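
    For illustration only, here's a minimal sketch of that pattern. The drop directory, staging table, and format file names are hypothetical, and it assumes xp_CmdShell has been enabled and the caller has the necessary bulk-load permissions:

        -- List the files waiting in the drop directory.
        CREATE TABLE #FileList (FileName VARCHAR(256));

        INSERT INTO #FileList (FileName)
        EXEC master..xp_CmdShell 'DIR /B C:\Imports\*.txt';

        DELETE FROM #FileList WHERE FileName IS NULL; -- xp_CmdShell returns a trailing NULL row

        -- Then, for each file found, BULK INSERT it into a staging table
        -- (shown for a single file; in practice this would be built dynamically).
        BULK INSERT dbo.ImportStaging
        FROM 'C:\Imports\SomeLargeFile.txt'
        WITH (FORMATFILE = 'C:\Imports\SomeLargeFile.fmt',
              TABLOCK,
              BATCHSIZE = 50000);

    Note that BULK INSERT needs ADMINISTER BULK OPERATIONS (or the bulkadmin role) in addition to INSERT on the target table, so this typically lives in a proc or job set up by someone with the right privileges.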

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
        Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Hi Jeff, thanks for the reply. I'll work on tracking down those specifics for you by tomorrow. What I can tell you so far is that the files are "positional" flat files. As is so often the case, I've landed unwittingly on this project long after the design decisions were etched in stone by folks no longer working for the agency. Now I'm left to pick up the baton and run with it.

    More tomorrow...

    Thanks, have a great day!


  • BSavoie (4/24/2015)


    Those are actually my favorite types of files. They're called "fixed field" files because every row is the same width and any given field always starts at exactly the same character position on every record. With the use of a "BCP Format File", they're the absolute fastest way to import large volumes of data because neither BULK INSERT nor BCP has to do the work of finding out where the delimiters are.

    It does mean that you need to know what the record layout actually is. Knowing the starting position and the name of each field is enough to figure everything else out. If you don't already have it, ask the file provider for a "record layout" to get that information.

    And, yeah... I'll be happy to help you learn how to do all of this pretty easily. In fact, if you have the record layout in some electronic form, we can usually drop it into Excel, clean it up a bit, and then write a simple formula to auto-magically build the information that we'd put in the BCP Format File.
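
    To make that concrete, here's a minimal sketch of a non-XML BCP format file for a made-up three-field fixed-width layout (a 10-character ID, a 30-character name, and a 2-character code followed by CRLF). The field names and widths are purely hypothetical:

        10.0
        3
        1  SQLCHAR  0  10  ""      1  CustomerID    ""
        2  SQLCHAR  0  30  ""      2  CustomerName  SQL_Latin1_General_CP1_CI_AS
        3  SQLCHAR  0  2   "\r\n"  3  StateCode     SQL_Latin1_General_CP1_CI_AS

    The first line is the format file version (10.0 for SQL Server 2008 R2) and the second is the number of fields. Each remaining row lists the field number, data type, prefix length, field width, terminator, target column number, target column name, and collation. For fixed-width fields the terminator is empty and the width does all the work; only the last field carries the row terminator. You'd then feed this to BULK INSERT via the FORMATFILE option or to bcp via the -f switch.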


  • Awesome, thanks. I have great respect for your reputation, Jeff. I've followed your posts for many years and have tried to live my SQL Server life by the principle of avoiding RBAR! 🙂

    I do have all the layouts in Excel already. There are basically seven different file types, and they are all "fixed field" files. The file sizes range from as small as 100 bytes all the way up to 50 MB or so. The big ones, the 50 MB ones that I'm concerned about, have a 364-byte header and detail records of 25 fields each, at about 390 bytes per record. I have used BCP on other assignments, though never with a format file; most of my experience with BCP has been one-time imports. The problem in this case is that switching would essentially amount to a rewrite of this system.

    Right now a C# app reads the entire contents of each incoming file and sticks it into a SQL Server binary column in a table called "Files"; another C# process then comes along and processes all the new records in the Files table one at a time. I know you're going to tell me this is the worst possible choice for importing data files on so many levels, and of course I wholeheartedly agree, but this is what I've inherited for now. To give you a little more background, they want me to add "WITH (NOLOCK)" to all the sprocs in this process to try to improve performance, and they also wrote a custom scheduler app because they don't like SSIS, so that's the sort of thing I'm up against here.

    I will take all your BCP advice and start working on a replacement "proof of concept" for when this fails to meet processing-time requirements in the not-too-distant future. In the meantime, unless someone has a better idea, to keep this thing afloat I'm thinking of something along the lines of a scheduled job that keeps track of the file count and rebuilds stats and/or indexes when the count exceeds some count/time threshold (roughly the sketch below), or maybe a trigger on the Files table that tracks the file count over time and kicks off a stats/index rebuild. I completely understand that this is a hideous process; that said, there must be a better way to get SQL Server to keep up with all these records being added until I can get it re-architected.
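
    Purely as a sketch of that idea (the table name and threshold are placeholders, and rowmodctr is only an approximate modification counter on 2008 R2):

        -- Run from a frequent Agent job: refresh statistics on the import
        -- table only when enough rows have changed to matter.
        DECLARE @RowsModified BIGINT;

        SELECT @RowsModified = MAX(rowmodctr)
          FROM sys.sysindexes
         WHERE id = OBJECT_ID('dbo.Files');

        IF @RowsModified > 100000  -- threshold is a guess; tune it to the workload
        BEGIN
            UPDATE STATISTICS dbo.Files WITH FULLSCAN;
        END;

    One advantage of doing this in an Agent job is that the T-SQL step runs under the job owner's context rather than the application login's, which sidesteps the missing rebuild permissions.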

