One large vs several small files

  • Files we receive from our vendor are growing larger at a rapid pace. One of the files has surpassed 6 GB. There is no end in sight and they have no plans or means to mitigate this. They send the files and we just have to deal with them. Differential files might be a possibility at some point, but we'll need to process full files on a regular basis regardless.

    My concern is that we are going to get files that will choke the server, or at least bog things down severely.

    One thought I had was to split the large files into smaller ones. Would, say, a 10 GB file be better or worse to process than five 2 GB files staggered over time? I'm thinking that if we do this, we're only processing, say, 4 to 6 GB of data at a time.

    This is on SQL Server 2016, by the way, using SSIS 2015.

    Any thoughts or suggestions?

    Thanks!

  • What exactly are you doing with the files? If you're just streaming them into a database, it shouldn't matter.

  • You can also batch the inserts so that you insert 10,000 rows, then commit. Then repeat.
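    Something along these lines, maybe... just a rough sketch of the batch-and-commit idea, and every table and column name in it is a placeholder. It moves rows out of a staging table in 10,000-row chunks and commits each chunk so the individual transactions (and the log they generate) stay small:

        DECLARE @BatchSize INT = 10000,
                @Rows      INT = 1;

        WHILE @Rows > 0
        BEGIN
            BEGIN TRANSACTION;

            -- Move the next batch from the staging table to the target.
            DELETE TOP (@BatchSize) s
                OUTPUT DELETED.CustomerID, DELETED.SomeValue
                INTO dbo.TargetTable (CustomerID, SomeValue)
            FROM dbo.StagingTable AS s;

            SET @Rows = @@ROWCOUNT;   -- capture before COMMIT resets @@ROWCOUNT

            COMMIT TRANSACTION;
        END;

    (The OUTPUT ... INTO trick assumes the target table has no enabled triggers or foreign keys; if it does, track progress with a key watermark instead.)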

  • It's a bit more than just streaming into a database. There is some processing, ultimately using a Task Factory Upsert to update/insert. Along the way there are some transformations and, of course, error handling.

  • kdbarrett 88655 - Friday, June 15, 2018 10:35 AM

    It's a bit more than just streaming into a database. There is some processing, ultimately using a Task Factory Upsert to update/insert. Along the way there are some transformations and, of course, error handling.

    To be honest, I'd say be very happy they're sending you the full monty and NOT sending you differentials. Most sources I've had to work with don't actually do differentials right even when they think they know how, and then you'd have to request a full true-up file and deal with it anyway.

    I also would NOT use SSIS for such things, even though it can certainly be done there. It's not a bad thing; I just don't care for it. For example, I'd much rather do column matching programmatically than do the click'n'drag mapping that you usually have to do in SSIS.

    When presented with full-monty files, I tend to cheat a bit. I'll do a true-up once a month but, for daily or weekly loads, I use a bit of a different method... I load the previous file and the current file and use EXCEPT to find the differences. I also don't do that in the final database where the data is going. I keep a "Scratch" database for that, and I keep it in the SIMPLE recovery model because it should never contain anything that must persist. That also lets me take huge advantage of the things the SIMPLE recovery model allows, such as minimal logging.
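    The Scratch setup itself is nothing fancy... something like this (defaults everywhere, and the name is whatever you like):

        CREATE DATABASE Scratch;
        GO
        -- SIMPLE recovery so the big staging loads can be minimally logged
        -- and the log never has to be part of a backup chain.
        ALTER DATABASE Scratch SET RECOVERY SIMPLE;
        GO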

    You can also take a shortcut if you have daily files to load. If you keep yesterday's files in tables, then you don't need to load them again to compare against today's files. And, if you do the minimal logging correctly, you can easily build the clustered index on the table as a part of the import instead of doing it separately, which can take a huge amount of time. You just need the clustered index to be in the same order as the lines in the file to take true advantage of that.
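    A rough sketch of that, with made-up column names, delimiters, and paths standing in for the real vendor layout... the table is created empty with its clustered index already in place, TABLOCK keeps the load minimally logged, and the ORDER hint tells SQL Server the file is already sorted by the clustering key so it doesn't have to sort it again:

        USE Scratch;
        GO
        CREATE TABLE dbo.TodayFile
        (
            CustomerID INT          NOT NULL,
            SomeValue  VARCHAR(100) NULL,
            CONSTRAINT PK_TodayFile PRIMARY KEY CLUSTERED (CustomerID)
        );
        GO
        BULK INSERT dbo.TodayFile
        FROM 'C:\Feeds\VendorFull_Today.txt'
        WITH
        (
            TABLOCK,                   -- needed for minimal logging
            ORDER (CustomerID ASC),    -- file is already in clustered-key order
            FIELDTERMINATOR = '|',
            ROWTERMINATOR   = '\r\n',
            FIRSTROW        = 2        -- skip a header row, if the file has one
        );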

    Once you have yesterday's file and today's file loaded, then you can easily isolate the "differences" (EXCEPT is easy and fairly fast although handcrafted methods can sometimes be made faster) and create a much smaller table of differences to validate and merge into the final data on the main database.
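    With the same made-up tables as above, the "differences" step is about this much code:

        -- Rows that are new or changed in today's file end up in a much smaller
        -- Deltas table, and that's what gets validated and upserted into the
        -- final database.
        SELECT CustomerID, SomeValue
        INTO   Scratch.dbo.Deltas
        FROM
        (
            SELECT CustomerID, SomeValue FROM Scratch.dbo.TodayFile
            EXCEPT
            SELECT CustomerID, SomeValue FROM Scratch.dbo.YesterdayFile
        ) AS d;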

    I'll also use BULK INSERT with a format file. Anecdotally, the loads seem faster, but I've never done a measured comparison. You can also document the format file after the last line, which makes the documentation "travel with the code".
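    For what it's worth, a non-XML format file for the made-up two-column layout above would look something like this (13.0 is the format version for SQL Server 2016; the field widths and terminators are guesses):

        13.0
        2
        1   SQLCHAR   0   12    "|"      1   CustomerID   ""
        2   SQLCHAR   0   100   "\r\n"   2   SomeValue    SQL_Latin1_General_CP1_CI_AS
        Only the number of field rows declared on line 2 get read as definitions,
        which is why notes about the vendor layout can live here, after the last line.

    ...and the load just points at it instead of spelling out the terminators in the BULK INSERT itself:

        BULK INSERT Scratch.dbo.TodayFile
        FROM 'C:\Feeds\VendorFull_Today.txt'
        WITH (FORMATFILE = 'C:\Feeds\VendorFull.fmt', TABLOCK);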

    They came up with TF610 in (IIRC) 2008 to allow minimal logging to continue for partially full tables (it also made it so that INSERT/SELECT can be minimally logged). That also means you can use the batch size parameter to control how much you load at one time instead of writing your own hook. To be honest, though, I generally use the parameter to force a load of the whole file. 2 or 3 million rows just aren't a problem. Besides, do you really think you'll do a successful restart if it fails on one of the batches? If you were loading a hundred million rows, I might see trying to restart a failed load, but it's normally a data failure, and that's the trick... don't let it fail. Sequester bad rows as they occur (there are settings for that) and keep the puppy going!
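    Putting that together, it comes out roughly like this (paths and MAXERRORS are placeholders, and on 2016 the engine reportedly applies the TF610 behavior on its own, so the TRACEON is mostly belt-and-suspenders for older instances):

        DBCC TRACEON (610);   -- minimal logging into indexed tables that already contain data

        BULK INSERT Scratch.dbo.TodayFile
        FROM 'C:\Feeds\VendorFull_Today.txt'
        WITH
        (
            FORMATFILE = 'C:\Feeds\VendorFull.fmt',
            TABLOCK,
            -- no BATCHSIZE here, so the whole file goes in as a single batch;
            -- add BATCHSIZE = n if you really do want it committed in chunks
            MAXERRORS  = 1000,                            -- tolerate up to 1,000 bad rows...
            ERRORFILE  = 'C:\Feeds\VendorFull_Today.err'  -- ...and sequester them here instead of failing
        );

        DBCC TRACEOFF (610);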

    Speaking of TF610, there's an article from 2008 that's still very appropriate and has some incredible tips on how to achieve high-performance loads. It's a long article, but I strongly recommend reading every word of it and then setting up some tests with your real data to see how well it can actually be made to work. It not only includes batch size recommendations but also explains why (generally, the bigger the better). Here's the link:
    https://docs.microsoft.com/en-us/previous-versions/sql/sql-server-2008/dd425070(v=sql.100)

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • p.s. To answer the question that was asked... no... I wouldn't split up the files. It's just another "touch" on the data that could screw something up, and it complicates things because you now have more than one file to load. If you had a 200 GB file, then you might want to split it up for parallel loads... just remember that will take twice the space (the original plus the split files) and extra time to accomplish, which might outweigh any advantages, including those of the parallel load (which may actually be worse because it's going to drive the R/W heads on the disks nuts, unless you have a shedload of SSDs that you can expend on things like this, which may negate the need for a parallel load to begin with).

    --Jeff Moden


