Deduplicating records from monthly table backups?

  • Oh... and backup is the wrong word.  They don't want a backup I don't really mean off-line backup because at any point in time they need to be able to see what the table looked like as of march or april or decemember within a moments notice.  Like with a SQL query on hand.

  • Kevlarmpowered wrote:

    heb1014 wrote:

    Hey Jeff.  Great idea!  I'm going to add this to my notes.

    When you have a new historical period, you create a new table and modify the view definition?

    At the end of the month a job kicks off that does...

    select * into HugeTable_mm_yyyy

    And at 25 billion rows and counting, you can imagine that select into is not exactly... speedy.

    If you had a datestamp column on this table, that was set each time a row was inserted/updated, you could just select rows that have been updated/inserted since the last run.

  • Jonathan AC Roberts wrote:

    Kevlarmpowered wrote:

    heb1014 wrote:

    Hey Jeff.  Great idea!  I'm going to add this to my notes.

    When you have a new historical period, you create a new table and modify the view definition?

    At the end of the month a job kicks off that does...

    select * into HugeTable_mm_yyyy

    And at 25 billion rows and counting, you can imagine that select into is not exactly... speedy.

    If you had a datestamp column on this table, that was set each time a row was inserted/updated, you could just select rows that have been updated/inserted since the last run.

    Nope... no date/time column that could be tied to this.  The only thing that identifies it is the table name.

  • Kevlarmpowered wrote:

    Jonathan AC Roberts wrote:

    Kevlarmpowered wrote:

    heb1014 wrote:

    Hey Jeff.  Great idea!  I'm going to add this to my notes.

    When you have a new historical period, you create a new table and modify the view definition?

    At the end of the month a job kicks off that does...

    select * into HugeTable_mm_yyyy

    And at 25 billion rows and counting, you can imagine that select into is not exactly... speedy.

    If you had a datestamp column on this table, that was set each time a row was inserted/updated, you could just select rows that have been updated/inserted since the last run.

    Nope... no date/time column that could be tied to this.  The only thing that identifies it is the table name.

    Not true... you have a column called "Period" that will do the job nicely for a Partitioned View.

    Kevlarmpowered wrote:

    Oh... and backup is the wrong word.  They don't want a backup I don't really mean off-line backup because at any point in time they need to be able to see what the table looked like as of march or april or decemember within a moments notice.  Like with a SQL query on hand.

    The Partitioned View would keep ALL of it ONLINE and greatly reduce the backups not to mention totally eliminating the arduous process, time, and resources required to copy a 25 Billion row table each month.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Not true... you have a column called "Period" that will do the job nicely for a Partitioned View.

     

    Nope... period is the expected period of posting, not the period when it was entered.  So the expected period could change from month to month because it's not when it 'actually' happened but when it when it is projected to happen.  So some of those periods are pushed out to infinity 12/1/2099 because we may never post.  The period then gets adjusted when we get closer to actually posting until it goes into past tense when it actually happened.

  • What you're being asked for is an SCD2 model.  note this is usually NOT kept within your OLTP solution.

    You would need to keep 2 dates in addition to whatever business data is in the table, essentially the begin and end dates of when this particular version of this particular row were the truth.  essentially every month, if any rows had been modified, expired the old version and insert a new version, which starts when the previous version expired.

     

    https://en.wikipedia.org/wiki/Slowly_changing_dimension

     

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • Matt Miller (4) wrote:

    What you're being asked for is an SCD2 model.  note this is usually NOT kept within your OLTP solution.

    You would need to keep 2 dates in addition to whatever business data is in the table, essentially the begin and end dates of when this particular version of this particular row were the truth.  essentially every month, if any rows had been modified, expired the old version and insert a new version, which starts when the previous version expired.

    https://en.wikipedia.org/wiki/Slowly_changing_dimension

    Yup... that's what I need.  Trying to figure out a safe/easy way to script the existing data from what it is into that SCD.  The how is... the question.

  • You're probably not going to like the answer, but the only thing that comes to mind is a new table and load it using SCD2 on a month by month basis.  this is a LOT of work

    Essentially, assuming OLD and NEW were your working tables:

    1. copy the first month of data from OLD into your NEW table, using the begin of that month as the "begin" date and 12/31/2199 as the end date
    2. compare the next month in OLD to the data in NEW with a end date of131/2199
    3.   if something changes, then expire the row in NEW with the month end at you're processing, then insert the updated row from OLD
    4. increment to another month and go back to step 2.

    once you're all done - NEW has the data you want with the mechanism by which you can see what was true when.

    This can definitely be done in a set-based fashion, but you'd really want  to design this out for performance and look for down or quiet times to run this, because this is going to put some strain on your hardware.

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • Matt Miller (4) wrote:

    You're probably not going to like the answer, but the only thing that comes to mind is a new table and load it using SCD2 on a month by month basis.  this is a LOT of work

    Ya... I was hoping there was an easier way because there are 36+ months of tables and about 8TB of data to go through in total.  It is less than ideal but sounds like the only way is the hard way.

Viewing 9 posts - 16 through 23 (of 23 total)

You must be logged in to reply to this topic. Login to reply