Best practice(s) for refreshing data in several tables

Question

Best practice(s) for refreshing data in several tables

Mike Scalise

SSCrazy Eights

Points: 8145
More actions
June 20, 2019 at 1:15 am

#3655559
Hi,
I need to write a stored procedure to refresh data daily in approximately 500 tables (pulling from data in a separate database obviously). I know there are a bunch of different ways to approach this, but I'm looking to some of you who have done this numerous times to see what you think is the best approach. Before I mention the options I'm thinking, you should know that the source and target databases are on the same SQL Server instance (SQL Server 2016 Ent).
Option 1: Truncate all 500 tables, do INSERTS into the target tables SELECTing from the source tables.
The advantage to this approach in my view is that, because I'd be truncating all of the tables first, there's no concern that any queries run against the target database during the refresh would get data from two different days, depending on how far along the refresh process is and what tables are being queried.
The disadvantage is that, for the entirety of the refresh, there's just simply no data available.
Option 2: SELECT INTO #temp tables from the source for the 500 tables, then TRUNCATE/INSERT INTO the target tables from the temp tables.
The advantage of this approach is that all of the previous day's data is in the database through most of the refresh, and only for the last portion of the SP will there be no data available.
My concern with this approach is that I'm not sure how much faster that would be than Option 1 and I'm not sure what creating temp tables over and over will do to tempdb over time.
Option 3: Do some sort of comparison of each table between source and target and see what adjustments need to be made to synchronize.
Aside from the actually pulling of the data from the source and target, this approach would have to be the fastest in terms of what action queries would need to be run. If on;y 50% of the tables in a given day change, then that's only going to be 50%'s worth of inserts, updates, or deletes that would need to happen. However, I would imagine there is a bit of overhead in pulling the data from each side, and I'm sure the coding for this would be fair bit more complex than a truncate/insert full refresh type of thing. I know there's some ways I could do this like EXCEPT in both directions, maybe the MERGE command (although not my preferred option), etc.
What do you all think? Thanks in advance.

Mike

EDIT: Changed EXISTS to EXCEPT.
- This topic was modified 4 years, 10 months ago by Mike Scalise.
Mike Scalise, PMP
https://www.michaelscalise.com

Viewing 15 posts - 1 through 15 (of 21 total)

You must be logged in to reply to this topic. Login to reply

Phil Parkin SSC Guru Points: 246712 More actions · Answer 1

How about using Jeff's favourite method, SYNONYMs?

Have two versions of all 500 tables, which are truncated/loaded on alternate days. Use synonyms to switch between them every day. Your 'no data available' issue becomes a matter of seconds.

If you haven't even tried to resolve your issue, please don't expect the hard-working volunteers here to waste their time providing links to answers which you could easily have found yourself.

Mike Scalise SSCrazy Eights Points: 8145 More actions · Answer 2

Phil,

Thanks for the reply. I hadn't thought to use synonyms. Do you have a link to an example of how I would use a synonym in this way. I'm close to understanding what you're suggesting but I'm not sure I completely get it.

If I have two version of each table, I get that I can simply update the synonym to switch between versions "a" and "b" and that change takes merely seconds. However, I don't get the truncate/load on alternate days thing. How does that help the situation?

Thanks,

Mike

Mike Scalise, PMP
https://www.michaelscalise.com

Phil Parkin SSC Guru Points: 246712 More actions · Answer 3

Imagine the two sets of tables are named A and B.

It's early morning on a new day, and synonyms are currently pointing at set A, which includes the day before yesterday's data.

The data truncate/load fills up table set B.

On completion, the synonyms are 'repointed' to table set B, which now contains yesterday's data.

Tomorrow, the same thing happens, with A and B reversed.

If you haven't even tried to resolve your issue, please don't expect the hard-working volunteers here to waste their time providing links to answers which you could easily have found yourself.

Phil Parkin SSC Guru Points: 246712 More actions · Answer 4

This is a technique which I have read about, but never implemented, therefore I don't feel qualified to direct you to specific links. I'm just waiting for the opportunity to give it a go myself.

If you haven't even tried to resolve your issue, please don't expect the hard-working volunteers here to waste their time providing links to answers which you could easily have found yourself.

Mike Scalise SSCrazy Eights Points: 8145 More actions · Answer 5

Phil,

Thank you .That makes perfect sense. That's a good solution to the 'no data available' issue. Now it's a question of whether I want to truncate/re-populate or try to work in something like MERGE.

Mike

Mike Scalise, PMP
https://www.michaelscalise.com

ChrisM@Work SSC Guru Points: 186127 More actions · Answer 6

Have you looked into database snapshots for this? It would save you a heck of a lot of coding.

^{“Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw}

For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

Mike Scalise SSCrazy Eights Points: 8145 More actions · Answer 7

Chris,

Thanks for the suggestion! I'm not sure a database snapshot would work for me in this case because the target database is a slightly different structure from the source database.

I'm currently restoring a backup of one of our production databases on another server daily, for two reasons. First, as part of my maintenance plan, it validates/verifies that I have a working and restore-able backup each day. Second, it serves as the source database for this ETL process I'm writing, so I don't need to query my live system to populate the aforementioned target database, which has the slightly different structure. Unfortunately, I can't even use the tablediff tool to generate scripts because I'm comparing a result set with a table (and I think the tool takes two tables--so I would need to temporarily create tables if I wanted to go that route).

My latest thought is to do a combination of DELETE (using EXISTS) + INSERT (using EXCEPT) to ensure the target tables always match the source. Depending on how fast (or slow) that process is, I may test the synonym trick that Phil mentioned before to make things even more efficient.

I guess not having had to write ETL processes in recent times, I wasn't sure if there were new features with SQL Server 2016/17 that may make things easier...but I'm thinking the DELETE/INSERT statements might still be a very viable approach--and be faster than my original full refresh (TRUNCATE/INSERT) approach. I'll need to do some testing to be sure though.

Thanks,

Mike

Mike Scalise, PMP
https://www.michaelscalise.com

ChrisM@Work SSC Guru Points: 186127 More actions · Answer 8

Hi Mike

One of our solutions here uses a snapshot of a replicated database to replace a backup/restore process which used to take hours (about 12 of them). We don't talk directly to the snapshot though, because like you there are differences between the original db and the reporting copy. Instead we have a whole new database which has views of the snapshot tables to provide folks with the data as they expect it to look. It has one or two static tables too. Since it's only small tables and views of the big tables, it's got a much smaller footprint than the original & the snapshot. The key difference between this solution and the one you are proposing is that the io burden in your solution is distributed across two physical databases (which could be on different storage solutions, hence splitting the io burden in two). This is of course impossible to do with our solution.

^{“Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw}

For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

lbhlittleton SSC Enthusiast Points: 153 More actions · Answer 9

Hi,

It might be worth looking at Partition Switching as a possible option as well:

https://www.jamesserra.com/archive/2012/07/sql-server-table-partition-switching/

Jeff Moden SSC Guru Points: 1003873 More actions · Answer 10

lbhlittleton wrote:

Hi, It might be worth looking at Partition Switching as a possible option as well: https://www.jamesserra.com/archive/2012/07/sql-server-table-partition-switching/

How will partition switching help you between databases?

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Jeff Moden SSC Guru Points: 1003873 More actions · Answer 11

I guess what I'd really like to know is why do you have the second set of fully replaceable tables to begin with? What are in these tables and why can't one database simply refer to the tables in another?

That would also prevent the need to dink around with indexes, etc, etc.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Mike Scalise SSCrazy Eights Points: 8145 More actions · Answer 12

Hi,
It might be worth looking at Partition Switching as a possible option as well:
https://www.jamesserra.com/archive/2012/07/sql-server-table-partition-switching/

Thank you! I'll look into that!

This reply was modified 4 years, 10 months ago by Mike Scalise.
This reply was modified 4 years, 10 months ago by Mike Scalise.

Mike Scalise, PMP
https://www.michaelscalise.com

Mike Scalise SSCrazy Eights Points: 8145 More actions · Answer 13

Jeff,

You ask some great questions. Are you suggesting that I create a separate database with only views (that are named with the original names of the tables I would've refreshed) and that simply query the source database for the columns and records I need? Any existing reports/queries against this target database would continue to work because the view names would be the table names (and I would have deleted the tables so the names would be available for the views).

Do I have that right? Then there's no refreshing whatsoever. I'm not sure about performance and if indexed views are event a possibility. I think I need to specify WITH SCHEMABINDING when creating the views, so I don't know how that would work across databases and if I could replace the source database each day without an issue.

Thanks,

Mike

Mike Scalise, PMP
https://www.michaelscalise.com

Mike Scalise SSCrazy Eights Points: 8145 More actions · Answer 14

ChrisM@Work wrote:

Hi Mike One of our solutions here uses a snapshot of a replicated database to replace a backup/restore process which used to take hours (about 12 of them). We don't talk directly to the snapshot though, because like you there are differences between the original db and the reporting copy. Instead we have a whole new database which has views of the snapshot tables to provide folks with the data as they expect it to look. It has one or two static tables too. Since it's only small tables and views of the big tables, it's got a much smaller footprint than the original & the snapshot. The key difference between this solution and the one you are proposing is that the io burden in your solution is distributed across two physical databases (which could be on different storage solutions, hence splitting the io burden in two). This is of course impossible to do with our solution.

Chris,

Thank you. I like your solution a lot. Is my explanation below with an entire database of views similar to what you have? I want to make sure I'm understanding you correctly.

Mike

Mike Scalise, PMP
https://www.michaelscalise.com