Flat File Comparison Approaches

  • We are about to upgrade to SQL Server 2016.  Every day, we receive a "complete" text file for import.  That means the file contains inserts, updates, deletes, AND unchanged rows.  However, we only want to bring in the inserts, updates, and deletes.

    What is the best way of doing this?  The vendor/consultant who does our data processing wants to use Hadoop, even though we currently use SSIS and the destination tables are in SQL Server.  I considered the SCD task in SSIS, but it has a reputation for being slow.  

    Please bear in mind that this text file that we receive daily has over 1 million rows and is about 100 MB in size.  What approach would give us the highest performance?  Should we even be using SSIS for this type of incremental data load?

  • imani_technology - Friday, June 1, 2018 11:10 AM

    We are about to upgrade to SQL Server 2016.  Every day, we receive a "complete" text file for import.  That means the file contains inserts, updates, deletes, AND unchanged rows.  However, we only want to bring in the inserts, updates, and deletes.

    What is the best way of doing this?  The vendor/consultant who does our data processing wants to use Hadoop, even though we currently use SSIS and the destination tables are in SQL Server.  I considered the SCD task in SSIS, but it has a reputation for being slow.  

    Please bear in mind that this text file that we receive daily has over 1 million rows and is about 100 MB in size.  What approach would give us the highest performance?  Should we even be using SSIS for this type of incremental data load?

    Add a 'Transaction Date' column to the import file and discard those rows whose transaction date is before the last import date.

    TransactionDate = Max(DateCreated, DateModified, DateDeleted)
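
    A rough sketch of that filter, assuming the file has already been landed in a staging table and the last successful import date is tracked in a control table (the table and column names below are purely illustrative):

    -- Last successful import date, from a hypothetical control table
    DECLARE @LastImportDate datetime2 =
            (SELECT MAX(ImportDate) FROM dbo.ImportControl);

    -- Discard rows that have not changed since the previous load
    DELETE s
    FROM dbo.Staging_Import AS s
    WHERE s.TransactionDate < @LastImportDate;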

    If you haven't even tried to resolve your issue, please don't expect the hard-working volunteers here to waste their time providing links to answers which you could easily have found yourself.

  • The thing is, we have no control over the actual import file, since it comes from an outside vendor.  However, maybe we can add the Transaction Date column as part of the ETL process.

  • imani_technology - Friday, June 1, 2018 12:08 PM

    The thing is, we have no control over the actual import file, since it comes from an outside vendor.  However, maybe we can add the Transaction Date column as part of the ETL process.

    It needs to be in the source to be of any use.
    If there are no useful date columns in the source, this problem is just going to get worse and worse.
    Is there a unique, ascending primary key in the source? If there is, INSERTs and DELETEs can be handled fairly quickly (assuming DELETEs are identified as 'record exists in table but not in source file').
    But UPDATEs are a big problem. You need to compare all columns looking for differences. With large data volumes, this process is not going to be speedy. In fact, you'd probably be better off doing a full TRUNCATE/RELOAD.
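
    To illustrate the key-based part, here is a minimal sketch, assuming the file has been loaded into a staging table dbo.Staging_Import, the destination is dbo.Target, and RowID is the unique key (all of these names are invented):

    -- INSERTs: key is in the file but not yet in the destination
    INSERT INTO dbo.Target (RowID, Col1, Col2)
    SELECT s.RowID, s.Col1, s.Col2
    FROM dbo.Staging_Import AS s
    WHERE NOT EXISTS (SELECT 1 FROM dbo.Target AS t WHERE t.RowID = s.RowID);

    -- DELETEs: key is in the destination but no longer in the file
    DELETE t
    FROM dbo.Target AS t
    WHERE NOT EXISTS (SELECT 1 FROM dbo.Staging_Import AS s WHERE s.RowID = t.RowID);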

    If you haven't even tried to resolve your issue, please don't expect the hard-working volunteers here to waste their time providing links to answers which you could easily have found yourself.

    100 MB and a million rows isn't all that big for a daily ETL process; using Hadoop would be ridiculous.  Can you give more details about the job?  Is this supposed to be a complete reload of the data set?  Or do you have, say, millions of rows and this is just a subset that the vendor sends you?

    I would assume the rows are somehow marked for insert/update/delete.  Does the vendor send you the unchanged rows as updates?  One option is to keep a copy of yesterday's file in your database and do a compare between the two, throwing out the rows that haven't changed.  Basically, keep yesterday's file in a staging table, load the new file into another staging table and do the compare there.
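
    A minimal sketch of that compare, assuming yesterday's file sits in dbo.Staging_Yesterday and today's file is loaded into dbo.Staging_Today with the same column layout (table names made up):

    -- Rows in today's file that are new or changed since yesterday
    SELECT * FROM dbo.Staging_Today
    EXCEPT
    SELECT * FROM dbo.Staging_Yesterday;

    -- Rows that were in yesterday's file but are missing today (candidate deletes)
    SELECT * FROM dbo.Staging_Yesterday
    EXCEPT
    SELECT * FROM dbo.Staging_Today;

    One nice property of EXCEPT here is that it treats NULLs as equal, so you don't need ISNULL wrappers on every column.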

  • The source file has millions of rows, but only a subset of that data will be included in the ultimate destination table.  

    And I agree, Hadoop is overkill. 

    I wonder if it would be prudent to just bulk copy the text file into a SQL Server table and then take a set-based approach.  Would that work better?
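
    Something along these lines is what I have in mind (just a sketch, assuming the file has been bulk loaded into a staging table first and there is a usable key column; all object and column names are placeholders):

    MERGE dbo.Target AS t
    USING dbo.Staging_Import AS s
        ON t.RowID = s.RowID
    WHEN MATCHED AND (t.Col1 <> s.Col1 OR t.Col2 <> s.Col2) THEN
        -- changed rows (nullable columns would need extra handling in this test)
        UPDATE SET Col1 = s.Col1, Col2 = s.Col2
    WHEN NOT MATCHED BY TARGET THEN
        INSERT (RowID, Col1, Col2) VALUES (s.RowID, s.Col1, s.Col2)
    WHEN NOT MATCHED BY SOURCE THEN
        DELETE;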

  • imani_technology - Friday, June 1, 2018 1:21 PM

    The source file has millions of rows, but only a subset of that data will be included in the ultimate destination table.  

    And I agree, Hadoop is overkill. 

    I wonder if it would be prudent to just bulk copy the text file into a SQL Server table and then take a set-based approach.  Would that work better?

    That would certainly be possible.  What percentage of the file is actually being used?  If you can easily determine that just from the file, without having to do any comparisons, it might be easier to filter those rows out during the import in SSIS.

    If you truly receive everything in a single file, stop looking a gift horse in the mouth.  Import it all to a working table, do whatever validations and indexing you need to, and then swap/rename it with the original table.
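
    A bare-bones sketch of that pattern (file path, delimiters and table names are made up; validation, indexing and error handling omitted):

    -- Load the complete daily file into a working table
    TRUNCATE TABLE dbo.Target_Staging;

    BULK INSERT dbo.Target_Staging
    FROM 'C:\Import\daily_extract.txt'
    WITH (FIELDTERMINATOR = '|', ROWTERMINATOR = '\n', TABLOCK);

    -- Validate and index the working table here, then swap the names
    BEGIN TRANSACTION;
        EXEC sp_rename 'dbo.Target', 'Target_Old';
        EXEC sp_rename 'dbo.Target_Staging', 'Target';
        EXEC sp_rename 'dbo.Target_Old', 'Target_Staging';
    COMMIT TRANSACTION;

    A synonym or ALTER SCHEMA ... TRANSFER swap would also work for the last step; the rename version is just the simplest to read.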

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden - Friday, June 1, 2018 2:23 PM

    If you truly receive everything in a single file, stop looking a gift horse in the mouth.  Import it all to a working table, do whatever validations and indexing you need to, and then swap/rename it with the original table.

    Amen to that, brother 😀
    I have done something similar to this several times in the past, with variable-length rows, mixed datatypes and redundant red tape.

    Far away is close at hand in the images of elsewhere.
    Anon.

  • The problem is, the full file could be partially wrong at times.

  • imani_technology - Monday, June 4, 2018 10:36 AM

    The problem is, the full file could be partially wrong at times.

    You're going to have to expand on that, if you want any reasonable feedback.

    If you haven't even tried to resolve your issue, please don't expect the hard-working volunteers here to waste their time providing links to answers which you could easily have found yourself.

  • imani_technology - Monday, June 4, 2018 10:36 AM

    The problem is, the full file could be partially wrong at times.

    And perhaps the most important question here is: how would you know if any of the data was wrong?  Are the rows marked specifically as updates, inserts, or deletes?  Or is it just a complete copy of all rows that are considered "current"?  If the data is vendor-provided, then the vendor should be responsible for any errors made.  If you have no way to determine what data is incorrect, then to be blunt, it just doesn't matter.  There's nothing you could do about it anyway other than bitch at the vendor when someone finally notices a mistake.

    Steve (aka sgmunson) 🙂 🙂 🙂
    Rent Servers for Income (picks and shovels strategy)
