Refreshing data - deletion of existing rows slow

  • Hi all, I am posting in this forum because I am after general comments rather than specifics.

    We have an operational database whose data is massaged and imported into a secondary database used for reporting purposes.

    This reporting database has a requirement whereby we take a snapshot of operational data for each month as at the end of each month.

    This is so at any point in time we can run a report and get the same repeatable, historical result for any particular month.

    Each table within the reporting database has as part of its PK a column ss_id (snapshot ID) to allow for this monthly reporting requirement.

    We use SSIS to import the data, and the import runs in an acceptable timeframe - apart from one aspect, which I will get to.

    We also have a secondary requirement which states that the SSIS ETL can be run ad hoc at any time during the month, and that it should replace ('refresh') the existing data for that month within the reporting database, so that the customer has the most up-to-date information available.

    The source of the problem is that we have 70-odd tables in the reporting database, some of which have around 6 million rows for a particular snapshot.

    When we 'refresh' a snapshot we simply delete all the rows in the reporting database for each of these tables where the row has the most recent (i.e. the highest value) ss_id.

    These tables have FK constraints on them which are disabled in order to speed up the deletion process.

    The deletion process itself takes about 20 minutes and re-enabling the constraints takes about 40 minutes, which comes to roughly an hour of wasted time that achieves nothing of significance 😉

    There are a limited number of indexes that support the FK constraints and testing indicates that creating them where they do not currently exist does not provide a significant improvement in the time taken for the deletion of rows.
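
    To make it concrete, the current refresh boils down to something like the following for each table (the table name is just a placeholder, and I'm assuming the re-enable is done WITH CHECK, which is what makes SQL Server re-validate every existing row and is presumably where the 40 minutes goes):

      -- Current per-table pattern (dbo.FactBalance is a placeholder name)
      DECLARE @maxSsId int = (SELECT MAX(ss_id) FROM dbo.FactBalance);

      ALTER TABLE dbo.FactBalance NOCHECK CONSTRAINT ALL;            -- disable FKs

      DELETE FROM dbo.FactBalance
      WHERE  ss_id = @maxSsId;                                       -- ~20 minutes across the big tables

      ALTER TABLE dbo.FactBalance WITH CHECK CHECK CONSTRAINT ALL;   -- re-validate FKs, ~40 minutes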

    I was hoping to find some way of reducing the time taken for this 'refresh' process.

    Things I have tried so far (in no particular order):

    1) Deleting rows in batches by calling DELETE TOP (@batchSize) within a loop and trying different values for @batchSize (first sketch after this list)

    2) Partitioning the tables on ss_id and then (at least for those tables that are not the target of FK constraints) switching the most recent partition out to a staging table which is then dropped (second sketch after this list). We would like to partition our tables anyway, due to a near-future requirement to be able to easily merge and/or archive off data by snapshot ID.

    This works OK for the 47 tables that do not have FKs on them - processing takes a few seconds as opposed to minutes - but these are not the tables with the large number of rows 🙂

    3) Using two filegroups, [primary] and [MostRecentSnapshot]: [primary] containing all rows/partitions up to but not including the most recent snapshot, [MostRecentSnapshot] containing the rows/partition for the most recent snapshot. I thought somehow if I could physically separate the data I could then remove it in one hit - however this has proven to be a dead end.
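
    For reference, the batching in point 1 looked roughly like this (batch size and table name are placeholders):

      DECLARE @batchSize int = 50000;
      DECLARE @maxSsId   int = (SELECT MAX(ss_id) FROM dbo.FactBalance);

      WHILE 1 = 1
      BEGIN
          DELETE TOP (@batchSize)
          FROM dbo.FactBalance
          WHERE ss_id = @maxSsId;

          IF @@ROWCOUNT < @batchSize BREAK;   -- last (partial) batch done
      END;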
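
    And the switch-out in point 2 is along these lines (partition function and table names are placeholders; the staging table has to be empty, identical in structure/indexes and on the same filegroup as the partition being switched out):

      DECLARE @maxSsId         int = (SELECT MAX(ss_id) FROM dbo.SmallSnapshotTable);
      DECLARE @partitionNumber int = $PARTITION.pf_ss_id(@maxSsId);

      ALTER TABLE dbo.SmallSnapshotTable
          SWITCH PARTITION @partitionNumber TO dbo.SmallSnapshotTable_Stage;

      DROP TABLE dbo.SmallSnapshotTable_Stage;   -- the switch itself is metadata-only, hence the few seconds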

    Things which I tried but didn't do a very good job at

    -----------------------------------------------------

    1) Delta'ing each table's data so as to avoid having to remove all rows - only the ones which have been added/modified/deleted since the last ETL run. This has not proven very successful, mainly because of the way I implemented it: it got very complex very fast, so I flagged it for revisiting later if nothing else worked... (rough sketch below)
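
    Just to show the general idea rather than my actual implementation, the shape I was aiming for was a per-table MERGE, something like this (keys, columns and table names are purely illustrative):

      DECLARE @maxSsId int = (SELECT MAX(ss_id) FROM dbo.FactBalance);

      MERGE dbo.FactBalance AS tgt
      USING stage.FactBalance AS src                 -- staged rows for the current snapshot only
          ON  tgt.ss_id       = src.ss_id
          AND tgt.BusinessKey = src.BusinessKey
      WHEN MATCHED AND tgt.Amount <> src.Amount
          THEN UPDATE SET tgt.Amount = src.Amount
      WHEN NOT MATCHED BY TARGET
          THEN INSERT (ss_id, BusinessKey, Amount)
               VALUES (src.ss_id, src.BusinessKey, src.Amount)
      WHEN NOT MATCHED BY SOURCE AND tgt.ss_id = @maxSsId   -- only prune rows in the snapshot being refreshed
          THEN DELETE;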

    Things I haven't tried

    ----------------------

    1) I was thinking of doing away with the FK constraints totally and then staging the data through tables which had the constraints intact - that way I would only be re-enabling constraints against a more 'static' set of data, i.e. 5-7 million rows each time as opposed to a growing number of rows in the base tables.
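
    In other words, something along these lines (schema/table/column names are just placeholders; the base table would carry no FK constraints at all):

      -- Load and validate against the FK-constrained staging table first
      INSERT INTO stage.FactBalance (ss_id, DimAccountKey, Amount)
      SELECT ss_id, DimAccountKey, Amount
      FROM   etl.IncomingFactBalance;        -- FKs on stage.FactBalance are checked here,
                                             -- against 5-7 million rows only

      -- Then a straight copy into the constraint-free base table
      INSERT INTO dbo.FactBalance WITH (TABLOCK) (ss_id, DimAccountKey, Amount)
      SELECT ss_id, DimAccountKey, Amount
      FROM   stage.FactBalance;

      TRUNCATE TABLE stage.FactBalance;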

    So what I am after is some general comments, suggestions and resource references on other possible approaches.

    It may well be that I am stuck with what I have, that is OK, I just need to try and exhaust as many possibilities as I can before giving up.

    thanks to all

    ujb

  • Speaking generally, it really sounds like you're on the right track using partitioning. Andy Novick (MVP, consultant, nice guy) has presented at PASS and our local users group on using partitions to handle hundreds of GB of data, doing just the sort of actions you're talking about here. If you attended this year's PASS Summit you can get his slide deck and, when they're available, a recording of the session. Otherwise, you might swing by his web site, because I think the stuff is there too.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • Thanks for that, good reference, much appreciated, learned some new useful stuff...
