Deduplicating records from monthly table backups?

Question

Post reply

Deduplicating records from monthly table backups?

Kevlarmpowered

Say Hey Kid

Points: 713
More actions
October 2, 2019 at 12:05 am

#3685173

I have a database that has monthly backups of a table for the past five years... i.e., they copied the table at the end of the month to a table called theHugeTable_mm_yyyy. Now as you can imagine, tables are huge... and the majority of the table data didn't change from one month to the other. Anyone have any advice on how to merge all of those records into a single table?
In my mind, we can cut down a significant amount of storage requirements if we switch to an effective dating mechanism. So if a record is new a new record goes in. If the record changes, we add a new record effectively dating it for that month. I'm guessing I'd have to come up with a key and then hashing each row to identify if any of the data has changed in the row to then tag it with an effective change.

Viewing 15 posts - 1 through 15 (of 24 total)

You must be logged in to reply to this topic. Login to reply

jcelko212 32090 SSCrazy Eights Points: 9304 More actions · Answer 1

Have you looked at archive tools? This is not a SQL job.

Please post DDL and follow ANSI/ISO standards when asking for help.

Jonathan AC Roberts SSCoach Points: 18407 More actions · Answer 2

The DDL and some sample data of the type of duplicates you want to move would be helpful.

heb1014 Hall of Fame Points: 3790 More actions · Answer 3

Sounds like a candidate for a temporal table. Have you checked that out?

Phil Parkin SSC Guru Points: 247198 More actions · Answer 4

heb1014 wrote:

Sounds like a candidate for a temporal table. Have you checked that out?

Might fix them up going forward, but won't solve the immediate problem, which sounds like it has two parts:

How do I remove existing redundant data?
How can I fix up my current process so that it works in an incremental, rather than snapshot, way.

But, as an earlier post suggests, we need more detail before being able to help with either.

Jeff Moden SSC Guru Points: 1004682 More actions · Answer 5

jcelko212 32090 wrote:

Have you looked at archive tools? This is not a SQL job.

Actually, it's the perfect job for SQL Server.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Jeff Moden SSC Guru Points: 1004682 More actions · Answer 6

Kevlarmpowered wrote:

I have a database that has monthly backups of a table for the past five years... i.e., they copied the table at the end of the month to a table called theHugeTable_mm_yyyy. Now as you can imagine, tables are huge... and the majority of the table data didn't change from one month to the other. Anyone have any advice on how to merge all of those records into a single table?
In my mind, we can cut down a significant amount of storage requirements if we switch to an effective dating mechanism. So if a record is new a new record goes in. If the record changes, we add a new record effectively dating it for that month. I'm guessing I'd have to come up with a key and then hashing each row to identify if any of the data has changed in the row to then tag it with an effective change.

Are you saying that they copy only the rows for each month to HugeTable_mm_yyyy and that there's one of those tables for every month? If so, most of this problem is already solved. I just need some more information.

Is there a column in each of those tables that reliably depicts what month a row belongs to (ie, DATETIME or DATE or DATETIME2 column)?
Are you using the Enterprise Edition or the Standard Edition of SQL Server?

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Kevlarmpowered Say Hey Kid Points: 713 More actions · Answer 7

They copy the entire table...

HugeTable_03_2019 includes everything from the beginning of time to the end of the month 03/2019.

HugeTable_04_2019 includes everything from the beginning of time to the end of the month 04/2019... all the new records and records that were updated.

HugeTable_09_2019 includes everything from the beginning of time to the end of the month 09/2019... all the new records and records that were updated.

The tables look like this...

CREATE TABLE [HugeTable_02_2019 ] (

[ID] [bigint] IDENTITY(1,1) NOT NULL,

[Account] [varchar](30) NULL,

[Period] [datetime] NULL,

[Amount] [money] NULL,

[varchar](5) NULL,

[Balance] [money] NULL,

[Charge] [money] NULL,

[Name] [varchar](30) NULL)

This is for SQL 2016 Enterprise

Jeff Moden wrote:

Kevlarmpowered wrote:
I have a database that has monthly backups of a table for the past five years... i.e., they copied the table at the end of the month to a table called theHugeTable_mm_yyyy. Now as you can imagine, tables are huge... and the majority of the table data didn't change from one month to the other. Anyone have any advice on how to merge all of those records into a single table?
In my mind, we can cut down a significant amount of storage requirements if we switch to an effective dating mechanism. So if a record is new a new record goes in. If the record changes, we add a new record effectively dating it for that month. I'm guessing I'd have to come up with a key and then hashing each row to identify if any of the data has changed in the row to then tag it with an effective change.
Are you saying that they copy only the rows for each month to HugeTable_mm_yyyy and that there's one of those tables for every month? If so, most of this problem is already solved. I just need some more information.

<li style="list-style-type: none;">
<li style="list-style-type: none;">
Is there a column in each of those tables that reliably depicts what month a row belongs to (ie, DATETIME or DATE or DATETIME2 column)?

<li style="list-style-type: none;">
<li style="list-style-type: none;">
Are you using the Enterprise Edition or the Standard Edition of SQL Server?

This reply was modified 6 years, 2 months ago by Kevlarmpowered.

Cary Hower-563110 Say Hey Kid Points: 682 More actions · Answer 8

Why don't you just keep the latest backup table and delete the rest?

Jeff Moden SSC Guru Points: 1004682 More actions · Answer 9

Better than what Carey Hower just proposed, why not just leave the table named as HugeTable with no date marker and just back that up? What they're doing is insane.

I did ask which Edition of SQL Server they were using but it doesn't actually matter. What I recommend is to use a Partitioned View (NOT a partitioned table because it has some problems that people don't consider and is total overkill for a simple historical table). Regardless of the type of partitioning, if it's done correctly (1 Filegroup with 1 File per period), then you'll be able to stop backing up legacy data that will never change simply by setting the older Filegroups to READ_ONLY and backing then up one final time. After that, you',ll only need to backup the currently active period. This will also make "Get-back-in--business-after-a--disaster" restores MUCH quicker and it will also be easy and much quicker to make a much smaller sub-set of the data for development purposes.

I did such a thing with one of my larger historical/audit tables of 5 years and the backup times dropped from 10 hours to about 6 minutes. We're on year 9 now and it still only takes about 10 minutes to do the backups because we're no longer backing up years of data that will NEVER change. I did screw up and use partitioned tables, which make it more difficult to restore to a development environment and I'm going to switch it to a Partitioned View, but it has saved tons of DR backup tapes and tons of time/resources even though we're going on a decade of legacy data that we're required to keep online forever (and I'm working on THAT requirement, as well).

Best of all, it's very simple to do and very simple to automate. I've not had to lift a finger on it for more than 4 years now. It works like a Timex. ("Takes a lickin' and keeps on ticken').

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

heb1014 Hall of Fame Points: 3790 More actions · Answer 10

Hey Jeff. Great idea! I'm going to add this to my notes.

When you have a new historical period, you create a new table and modify the view definition?

Jeff Moden SSC Guru Points: 1004682 More actions · Answer 11

heb1014 wrote:

Hey Jeff. Great idea! I'm going to add this to my notes.
When you have a new historical period, you create a new table and modify the view definition?

Actually, it's important to have the next month already setup and online so that you don't have to scramble at month end nor do any special timing, etc. And, yes, it's as simple as rebuilding the view dynamically. A consistent naming convention will seriously help there. I also name the Filegroup and the single file it contains to be the same just to keep the automation (and the obviousness of the naming) super simple.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Kevlarmpowered Say Hey Kid Points: 713 More actions · Answer 12

Cary Hower-563110 wrote:

Why don't you just keep the latest backup table and delete the rest?

Because then I will lose what the values were back in March versus December. It's not just new records going in, it's people going back in time and making modifications to previously entered records because of "reasons".

I need to be able to see what the table looked like in March or April or December.

Jeff Moden SSC Guru Points: 1004682 More actions · Answer 13

p.s. If you really want to do it right with Partitioned Views, create each period's FileGroup and File in a separate database using the exact same naming convention but with a "ZZZ" at the beginning of the database name. Some folks are annoyed by all the databases but the "ZZZ" does two things... it first moves the databases to the end of just about any list of databases (especially in object explorer) and it indicates the purpose of the database... "ZZZ" is a common cartoon annotation of someone sleeping and you're putting data "to sleep". 😀

The separate databases have the advantage of not being in the main database, which means your "get back in business" restore of the main database will be MUCH quicker. It will also allow you to easily restore just the last several months of historical data to dev boxes without having to have a huge amount of disk space available for the dev box.

Both of those are huge advantages of Partitioned Views over Partitioned tables.

Some will remind you and possibly complain that you won't be able to have as many periods in a Parttioned View than in a Partitioned Table. While that may be true (I think the limit is 254 partitions in a partitioned view), that will suffice for more than 20 years if the partitions are "monthly" partitions. You can certainly do the trick of combining monthly periods into yearly periods for anything older than, say, 3 years and extend that even more.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Kevlarmpowered Say Hey Kid Points: 713 More actions · Answer 14

heb1014 wrote:

Hey Jeff. Great idea! I'm going to add this to my notes.
When you have a new historical period, you create a new table and modify the view definition?

At the end of the month a job kicks off that does...

select * into HugeTable_mm_yyyy

And at 25 billion rows and counting, you can imagine that select into is not exactly... speedy.