Loading Non Duplicated Data Using SSIS

  • Comments posted to this topic are about the item Loading Non Duplicated Data Using SSIS

  • The problem is scalability. Aggregations are blocking transformations: they hold up the data flow in the pipeline and keep consuming more and more memory, because they must hold the entire result set in memory to perform the aggregation.

    When tasked with millions or trillions of records, it starts to get ugly (your servers run out of memory, paging to disk sets in, and ultimately the package slows to a crawl).

    The simplest approach in these cases is to dump everything into a temporary SQL Server table and then let T-SQL handle it. SQL is built for data manipulation and management; .NET objects (which is what an SSIS package translates to) are not.
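
    For illustration only, a minimal T-SQL sketch of that "dump to staging, let T-SQL de-duplicate" step (the table and column names here are made up):

        -- Assumes SSIS has already bulk-loaded the raw extract into dbo.Customer_Staging.
        -- SELECT DISTINCT does the de-duplication the Aggregate transform would do,
        -- but inside the database engine instead of the SSIS pipeline.
        INSERT INTO dbo.Customer_Target (FirstName, LastName, Email)
        SELECT DISTINCT FirstName, LastName, Email
        FROM   dbo.Customer_Staging;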

    Thanks & Regards,
    Nakul Vachhrajani.
    http://nakulvachhrajani.com

    Follow me on
    Twitter: @sqltwins

  • It is easier to use the Sort transformation. Check the box 'Remove rows with duplicate sort values'. It is a blocking transformation, but it works fast even with millions of rows! Writing data to a temporary table is also blocking, ..., ..., ... because the work must be done before it is possible to continue. And writing and reading data to and from tempdb takes time!

  • csj - Monday, January 30, 2017 11:18 PM

    It is easier to use the Sort transformation. Check the box 'Remove rows with duplicate sort values'. It is a blocking transformation, but it works fast even with millions of rows! Writing data to a temporary table is also blocking, ..., ..., ... because the work must be done before it is possible to continue. And writing and reading data to and from tempdb takes time!

    Yes, for smaller datasets (a couple of hundred rows), Sort is fine.
    For larger datasets, we run into the same problem - Sort is a blocking transformation and therefore can become a performance bottleneck.

    Thanks & Regards,
    Nakul Vachhrajani.
    http://nakulvachhrajani.com

    Follow me on
    Twitter: @sqltwins

  • Nakul Vachhrajani - Monday, January 30, 2017 10:34 PM

    The problem is scalability. Aggregations are blocking transformations: they hold up the data flow in the pipeline and keep consuming more and more memory, because they must hold the entire result set in memory to perform the aggregation.

    When tasked with millions or trillions of records, it starts to get ugly (your servers run out of memory, paging to disk sets in, and ultimately the package slows to a crawl).

    The simplest approach in these cases is to dump everything into a temporary SQL Server table and then let T-SQL handle it. SQL is built for data manipulation and management; .NET objects (which is what an SSIS package translates to) are not.

    I have to ask: when was the last time you had "trillions of records" in a staging table to load into another table?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Nakul Vachhrajani - Monday, January 30, 2017 11:38 PM

    csj - Monday, January 30, 2017 11:18 PM

    It is easier to use the Sort transformation. Check the box 'Remove rows with duplicate sort values'. It is a blocking transformation, but it works fast even with millions of rows! Writing data to a temporary table is also blocking, ..., ..., ... because the work must be done before it is possible to continue. And writing and reading data to and from tempdb takes time!

    Yes, for smaller datasets (a couple of hundred rows), Sort is fine.
    For larger datasets, we run into the same problem - Sort is a blocking transformation and therefore can become a performance bottleneck.

    So post the T-SQL alternative. 😉

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden - Monday, January 30, 2017 11:51 PM

    Nakul Vachhrajani - Monday, January 30, 2017 10:34 PM

    The problem is scalability. Aggregations are blocking transformations: they hold up the data flow in the pipeline and keep consuming more and more memory, because they must hold the entire result set in memory to perform the aggregation.

    When tasked with millions or trillions of records, it starts to get ugly (your servers run out of memory, paging to disk sets in, and ultimately the package slows to a crawl).

    The simplest approach in these cases is to dump everything into a temporary SQL Server table and then let T-SQL handle it. SQL is built for data manipulation and management; .NET objects (which is what an SSIS package translates to) are not.

    I have to ask: when was the last time you had "trillions of records" in a staging table to load into another table?

    I encountered this as part of the product we are currently working on, where we need to migrate enterprise data from a legacy on-premises application to the cloud. Sorts and Aggregations in SSIS became performance killers, and we had to resort to doing all the data computations in staging (an on-premises SQL Server database) using T-SQL (joins and the rest).

    At a simplistic level, we:
    - Extract: pulled all data into staging
    - Transform: did all cleansing and transformations in T-SQL (a rough sketch follows below)
    - Load: used SSIS only to blindly take data from staging and pump it to the cloud platform
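
    For example, the T-SQL cleansing step that removes duplicates from staging could be as simple as the following sketch (hypothetical table and key columns):

        -- Run in the staging database before SSIS picks the data up.
        -- Delete all but one copy of each row, keyed on the business columns.
        ;WITH Duplicates AS
        (
            SELECT ROW_NUMBER() OVER (PARTITION BY CustomerId, OrderDate, Amount
                                      ORDER BY (SELECT NULL)) AS RowNum
            FROM   dbo.Orders_Staging
        )
        DELETE FROM Duplicates
        WHERE  RowNum > 1;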

    Thanks & Regards,
    Nakul Vachhrajani.
    http://nakulvachhrajani.com

    Follow me on
    Twitter: @sqltwins

  • Nakul Vachhrajani - Monday, January 30, 2017 11:38 PM

    csj - Monday, January 30, 2017 11:18 PM

    It is easier to use the Sort transformation. Check the box 'Remove rows with duplicate sort values'. It is a blocking transformation, but it works fast even with millions of rows! Writing data to a temporary table is also blocking, ..., ..., ... because the work must be done before it is possible to continue. And writing and reading data to and from tempdb takes time!

    Yes, for smaller datasets (a couple of hundred rows), Sort is fine.
    For larger datasets, we run into the same problem - Sort is a blocking transformation and therefore can become a performance bottleneck.

    I don't know what computer you are using, but on my small Surface it takes less than half a minute for more than 6 million rows. On a small server, less than 10 seconds!
    Carsten

  • csj - Tuesday, January 31, 2017 3:05 AM

    Nakul Vachhrajani - Monday, January 30, 2017 11:38 PM

    csj - Monday, January 30, 2017 11:18 PM

    It is easier to use the Sort transformation. Check the box 'Remove rows with duplicate sort values'. It is a blocking transformation, but it works fast even with millions of rows! Writing data to a temporary table is also blocking, ..., ..., ... because the work must be done before it is possible to continue. And writing and reading data to and from tempdb takes time!

    Yes, for smaller datasets (a couple of hundred rows), Sort is fine.
    For larger datasets, we run into the same problem - Sort is a blocking transformation and therefore can become a performance bottleneck.

    I don't know what computer you are using, but on my small Surface it takes less than half a minute for more than 6 million rows. On a small server, less than 10 seconds!
    Carsten

    Keep in mind, this is not only about the number of rows, but also the number of columns and the type of data to be processed.
    I have never encountered data to be imported where it was simple and easy to identify the duplicates and remove them.
    And even though my datasets have been under a few thousand rows, it took more than a few minutes to process them on average.

  • My 6 million rows have 11 columns - Firstname, Lastname, Street, Zipcode, .... With the right data types it takes 8 seconds from a table, and 6.5 seconds from a flat file with no data types changed (so all columns are strings of length 50).

    If I take the columns from the table twice, so that I am sorting on 22 columns, it takes 14 seconds. My computer is a desktop with 32 GB RAM and two quad-core CPUs.

  • carsten.saastamoinen - Tuesday, January 31, 2017 6:48 AM

    My 6 million rows have 11 columns - Firstname, Lastname, Street, Zipcode, .... With the right data types it takes 8 seconds from a table, and 6.5 seconds from a flat file with no data types changed (so all columns are strings of length 50).

    If I take the columns from the table twice, so that I am sorting on 22 columns, it takes 14 seconds. My computer is a desktop with 32 GB RAM and two quad-core CPUs.

    So let me get this straight: you are running a simple 11-column query, with no data type matching/conversion or transformation, on a 32 GB, dual-CPU, 8-core monster (which, if you count hyper-threading, could mean at least double the vCPU count), and you feel that your processing time is a typical scenario?

    Well, I did a test run of a simple SELECT on a 5-column table, with all columns varchar(50) except the last, which is varchar(8000), and a record count of 6.7 million rows, and it takes 36 seconds just to select the data on a SQL VM with 4 CPUs @ 2.40 GHz and 8 GB RAM.

  • vl1969-734655 - Tuesday, January 31, 2017 7:16 AM

    carsten.saastamoinen - Tuesday, January 31, 2017 6:48 AM

    My 6 million rows have 11 columns - Firstname, Lastname, Street, Zipcode, .... With the right data types it takes 8 seconds from a table, and 6.5 seconds from a flat file with no data types changed (so all columns are strings of length 50).

    If I take the columns from the table twice, so that I am sorting on 22 columns, it takes 14 seconds. My computer is a desktop with 32 GB RAM and two quad-core CPUs.

    So let me get this straight: you are running a simple 11-column query, with no data type matching/conversion or transformation, on a 32 GB, dual-CPU, 8-core monster (which, if you count hyper-threading, could mean at least double the vCPU count), and you feel that your processing time is a typical scenario?

    Well, I did a test run of a simple SELECT on a 5-column table, with all columns varchar(50) except the last, which is varchar(8000), and a record count of 6.7 million rows, and it takes 36 seconds just to select the data on a SQL VM with 4 CPUs @ 2.40 GHz and 8 GB RAM.

    We are talking about a staging DB - so typically a BI solution. 32 GB and 8 cores is not a monster, but a small server, especially when we are discussing loading millions of rows. And it is not a normal scenario to look for duplicates on VARCHAR(8000) columns; it is more normal to look for duplicates across 22 columns with different data types - a mix of VARCHAR(20), VARCHAR(40), INT, SMALLINT, DATE, ..... So 14 seconds to find duplicates on 22 columns is a more realistic scenario.

  • I appreciate learning about new widgets in SSIS, as I'm still ramping up with the tool.  This and the Sort technique may come in handy someday, and I thank the posters for these tips.

    Performance is something you can argue about until the cows come home.  It's good to keep it in mind, but it will be different for almost every situation.

  • Nakul Vachhrajani - Monday, January 30, 2017 11:58 PM

    Jeff Moden - Monday, January 30, 2017 11:51 PM

    Nakul Vachhrajani - Monday, January 30, 2017 10:34 PM

    The problem is scalability. Aggregations are blocking transformations: they hold up the data flow in the pipeline and keep consuming more and more memory, because they must hold the entire result set in memory to perform the aggregation.

    When tasked with millions or trillions of records, it starts to get ugly (your servers run out of memory, paging to disk sets in, and ultimately the package slows to a crawl).

    The simplest approach in these cases is to dump everything into a temporary SQL Server table and then let T-SQL handle it. SQL is built for data manipulation and management; .NET objects (which is what an SSIS package translates to) are not.

    I have to ask: when was the last time you had "trillions of records" in a staging table to load into another table?

    I encountered this as part of the product we are currently working on, where we need to migrate enterprise data from a legacy on-premises application to the cloud. Sorts and Aggregations in SSIS became performance killers, and we had to resort to doing all the data computations in staging (an on-premises SQL Server database) using T-SQL (joins and the rest).

    At a simplistic level, we:
    - Extract: pulled all data into staging
    - Transform: did all cleansing and transformations in T-SQL
    - Load: used SSIS only to blindly take data from staging and pump it to the cloud platform

    Understood and agreed, but "trillions" of rows in a staging table?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Shifting gears a bit, I wouldn't use SSIS for any of this other than maybe as a process controller.  Between things like WITH (IGNORE_DUP_KEY = ON) as part of an index definition and some very fast methods for identifying duplicates in T-SQL, along with possibly taking advantage of some high-speed minimal-logging techniques, I'd write a stored procedure for this.
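
    A rough sketch of the index idea, assuming a made-up target table and key column:

        -- A unique index created WITH (IGNORE_DUP_KEY = ON) silently discards
        -- rows whose key already exists (a warning is raised, not an error).
        CREATE UNIQUE CLUSTERED INDEX UQ_Customer_Target
            ON dbo.Customer_Target (CustomerId)
            WITH (IGNORE_DUP_KEY = ON);

        -- The insert then loads only the non-duplicated rows from staging.
        INSERT INTO dbo.Customer_Target (CustomerId, FirstName, LastName)
        SELECT CustomerId, FirstName, LastName
        FROM   dbo.Customer_Staging;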

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
