Best option to insert data from Staging table to main table

  • Hello everyone,

    I have a requirement where I have to load data from a staging table into a main table. The data should be loaded at a particular frequency, I want to insert only new records, and along with the data from staging I also want to insert some additional fields which I should get from other tables.
    The staging table will have millions of records, and new records will be inserted continuously.

    For this I can go with NOT EXISTS or EXCEPT. Which is the better way to achieve this? Are there any other options where this can be done faster?
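
    Something along these lines is what I have in mind, where the table and column names are only placeholders:

    INSERT INTO dbo.MainTable (Id, Col1, Col2)
    SELECT s.Id, s.Col1, s.Col2
    FROM dbo.StagingTable AS s
    WHERE NOT EXISTS (SELECT 1 FROM dbo.MainTable AS m WHERE m.Id = s.Id);

    -- or the EXCEPT version over the same column list
    INSERT INTO dbo.MainTable (Id, Col1, Col2)
    SELECT s.Id, s.Col1, s.Col2 FROM dbo.StagingTable AS s
    EXCEPT
    SELECT m.Id, m.Col1, m.Col2 FROM dbo.MainTable AS m;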

    I have unique identity columns in both tables.

    Thanks.

  • I'd add a field to the staging table that allows me to batch things. So if I loaded 10 records, I'd mark those 10 as "in process", and if 20 more records are added, I'll process them later. This lets me better deal with continuous changes. When I'm done, I'd mark these as loaded.
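
    A minimal sketch of that, assuming a Status column is added to the staging table (the names and batch size are just placeholders):

    -- claim a batch, so rows arriving afterwards are left for the next run
    UPDATE TOP (10000) dbo.StagingTable
    SET Status = 'In Process'
    WHERE Status IS NULL;

    -- ... load the 'In Process' rows into the main table here ...

    -- then flag the batch as finished
    UPDATE dbo.StagingTable
    SET Status = 'Loaded'
    WHERE Status = 'In Process';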

    With that in place, what I'd often do is test either deleting matching records or using NOT EXISTS to insert the new ones. I haven't tested EXCEPT, but some people like it. Honestly, you should test all of these on your system and evaluate the performance impact. However, I like the simplicity of deleting matches, then inserting new rows.
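
    One way to read "delete matches, then insert" as a rough sketch, working off the 'In Process' batch above (placeholder names again):

    -- drop the staged rows that already exist in the target
    DELETE s
    FROM dbo.StagingTable AS s
    INNER JOIN dbo.MainTable AS m ON m.Id = s.Id
    WHERE s.Status = 'In Process';

    -- whatever is left in the batch is new, so insert it
    INSERT INTO dbo.MainTable (Id, Col1, Col2)
    SELECT s.Id, s.Col1, s.Col2
    FROM dbo.StagingTable AS s
    WHERE s.Status = 'In Process';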

    One note: if you have identity columns, can you be sure they match? I worry that changes in the target system will cause you issues here.

    Is the staging table ever emptied?

  • In reply to KGNH - Wednesday, July 4, 2018 3:37 AM:

    Also, I would think about having a hash of the columns in the staging table and a corresponding hash of the columns in the destination table. That way I don't need to compare all of the columns to decide whether a record is new or not. This is in case you don't have a field such as updated_on in the staging table.
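
    A rough illustration of the hash idea; the column names are made up, and non-character columns may need explicit CONVERTs so the computed column stays deterministic:

    ALTER TABLE dbo.StagingTable
        ADD RowHash AS HASHBYTES('SHA2_256', CONCAT(Col1, '|', Col2, '|', Col3)) PERSISTED;

    ALTER TABLE dbo.MainTable
        ADD RowHash AS HASHBYTES('SHA2_256', CONCAT(Col1, '|', Col2, '|', Col3)) PERSISTED;

    -- the new-row check then compares a single (indexable) column instead of every field
    SELECT s.Id
    FROM dbo.StagingTable AS s
    WHERE NOT EXISTS (SELECT 1 FROM dbo.MainTable AS m WHERE m.RowHash = s.RowHash);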

  • In reply to Steve Jones - SSC Editor - Thursday, July 5, 2018 11:03 AM:

    Thanks Steve. In my case it's not continuous: the staging table will be loading continuously, and the data will be moved to the main table by a job which runs at a particular frequency.

    The identity columns in both tables will match, and the data in both tables should match. There won't be any deletion or insertion happening in the main table other than the records coming from staging. Even the staging table is not cleared after moving data to the main table. Since it is a direct transfer of the complete data, we have taken the key columns as identity in both tables; otherwise there would be no need to have that identity in the main table.

    The only thing is that the main table has some additional columns which should also be populated, along with the columns from staging, while moving data from staging to main.
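
    What I have in mind for the load is roughly this (placeholder names; dbo.OtherTable stands for whichever table supplies the additional fields):

    INSERT INTO dbo.MainTable (Id, Col1, Col2, ExtraValue)
    SELECT s.Id, s.Col1, s.Col2, x.ExtraValue
    FROM dbo.StagingTable AS s
    JOIN dbo.OtherTable AS x ON x.SomeKey = s.SomeKey
    WHERE NOT EXISTS (SELECT 1 FROM dbo.MainTable AS m WHERE m.Id = s.Id);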

  • In reply to george_at_sql - Thursday, July 5, 2018 11:11 AM:

    Thanks George. Actually, I don't think we need to compare all the columns. The data in the destination table will only be updated after inserting from source to destination. But once we insert the data into the destination, we need not worry whether the data gets updated or not. Even if a new record comes into the source with the same values, we have to insert it again.

  • This seems to me to be an ETL (Extract, Transform, Load) scenario, and for these I tend to use an SSIS package. You can include the excellent suggestions from Steve and George within the package and use a Lookup transformation to determine whether or not a row should be inserted. You can also redirect rows (in your case, matching rows) from the Lookup transformation to a table for later inspection, should you wish to.
    During the design and testing phases you can add Data Viewers to the data flow paths to look at the data as it is being processed.
    On the negative side, you mentioned 'millions of rows' which (depending on your hardware spec.) could be pretty time-consuming using an SSIS package.
