ETL Process taking a while to execute

  • Hi,

    we have an ETL process that runs every night, and we receive data (text files) of about 4 million rows. Only about 300 thousand rows have been updated, so we have to run a delete statement for the rest of the rows, which takes a while to execute. Does anyone know a way around this? We thought of cursors, but the overhead for cursors is high as well. A temp table has the same problem: tempdb would fill up and we'd hit the same performance issue.

    Thanks,

    FK

  • Cursors = really, really bad idea

    Could you describe what happens in a little more detail please?

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Agree, cursors are bad and we are staying away from those!!

    The current delete statement takes about 872,729 ms (roughly 14.5 minutes), which is kind of high. What the delete does is remove the unchanged data (3.5 million rows) and keep the changed ones (0.5 million rows).

  • Could you describe the ETL process a bit more please?

    You may find it more efficient to copy the rows you want to keep to a temp table, truncate the original table and copy the rows back.
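    As a minimal sketch of that approach (the table name StagingTable and flag column DelFlag are hypothetical, standing in for your actual schema):

    ```sql
    -- 1. Save the rows we want to keep into a temp table.
    SELECT *
    INTO #KeepRows
    FROM StagingTable
    WHERE DelFlag <> 'Y';

    -- 2. Truncate the original table. This is minimally logged, but note that
    --    TRUNCATE TABLE fails if the table is referenced by a foreign key.
    TRUNCATE TABLE StagingTable;

    -- 3. Copy the kept rows back.
    INSERT INTO StagingTable
    SELECT * FROM #KeepRows;

    DROP TABLE #KeepRows;
    ```

    Because only ~0.5 million of the 4 million rows survive, copying 0.5 million rows twice is far cheaper than logging the deletion of 3.5 million rows.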

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
    Well, I think that should work. Thanks.

    Our ETL process is the following:

    We have mainframe files that get loaded at night. The ETL loads each file into a table, and once the data is there we use a slowly changing dimension to check various fields for what was updated or needs to be updated. For the rows we don't want to do anything with, we delete based on a flag column, and then continue loading the rest into the appropriate fact tables. That delete step was taking a while; we created a non-clustered index on the flag column, but it still didn't help. I will try creating a temp table, or even a regular table, and doing the truncate. Makes sense. Thanks.

    Which column did you create the index on? How big is the table? If the table is big and there are many indexes on it, the delete will take a while.

    Did you use a cursor, or just a DELETE statement like this?

    DELETE x
    FROM Tablex x
    INNER JOIN temptable t ON x.col1 = t.col1
    WHERE x.Delcol = 'Y';

    The table is about 4 million rows, so quite big, but the temp table idea seems to be working. Thanks.

    Generally, if you're deleting up to half the table (maybe even more), it's faster to save the rows you want and then truncate the table than to run the delete.

    Less impact on the tran log too, since truncate is a minimally logged operation.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
    I was in a project where we had the same issue, but with Excel files. What we did was move all the SQL validation code and operations in the ETL into stored procedures, and we reduced the run time by almost 80%. It worked. Also, if the file contains information that isn't needed, it can be deleted before doing any operations.

    The truncate solution seems best, but if you really get stuck having to delete millions of rows, one technique that has worked for me is setting a rowcount of something like 50,000 or 100,000 and looping through the delete statement until the affected row count is 0. That keeps the tran log manageable. Each delete statement usually takes only a few seconds, so even if you loop 30-40 times, the total time may only be a few minutes.
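    A sketch of that batched-delete loop, using the DELETE TOP (n) syntax available in SQL Server 2005 and later (the table name StagingTable and flag column DelFlag are hypothetical; on SQL Server 2000 the older SET ROWCOUNT approach would be used instead):

    ```sql
    -- Delete flagged rows in batches of 50,000 so each batch commits
    -- separately and the transaction log stays small.
    WHILE 1 = 1
    BEGIN
        DELETE TOP (50000)
        FROM StagingTable
        WHERE DelFlag = 'Y';

        -- @@ROWCOUNT is the number of rows the last statement affected;
        -- when it hits 0, there is nothing left to delete.
        IF @@ROWCOUNT = 0
            BREAK;
    END;
    ```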

    Be aware that the SCD transform in SSIS generally does not scale well on large datasets.
