Inserting large data volumes into a table as fast as possible

  • I'm doing development on large SCD2 tables (100M+ rows).  When I've fixed bugs and it's time to re-run, I need to restore the data back to a known point, currently via truncate and insert from a backup table.

    This process is taking FOREVER and slowing down my development.

    Sample code to restore the tables:

    /* Need to drop the foreign key constraints */
    ALTER TABLE [new].[RLDX_OPERATIONAL] DROP CONSTRAINT [FK_RLDX_OPERATIONAL_RLDX_AUDIT_TRAIL_rldx_audit_key]
    ALTER TABLE [new].[RLDX_PERSONAL_IDENTIFIERS] DROP CONSTRAINT [FK_RLDX_PERSONAL_IDENTIFIERS_RLDX_AUDIT_TRAIL_rldx_audit_key]
    ALTER TABLE [new].[RLDX_ARCHIVE] DROP CONSTRAINT [FK_RLDX_ARCHIVE_RLDX_AUDIT_TRAIL_rldx_audit_key]
    GO

    TRUNCATE TABLE new.RLDX_PERSONAL_IDENTIFIERS
    GO

    -- Load RLDX_AUDIT_TRAIL and RLDX_OPERATIONAL (not shown here)

    -- This is the big table giving me grief
    TRUNCATE TABLE new.RLDX_ARCHIVE;
    SET IDENTITY_INSERT new.RLDX_ARCHIVE ON
    INSERT INTO new.RLDX_ARCHIVE
    (
    rldx_archive_key
    ,rldx_audit_key
    ,recnum
    ,hospital_type
    ,record_status
    ,facility_identifier
    ,stay_number
    ,episode_sequence_number
    ,episode_start_date
    ,episode_end_date
    ,notification_date
    ,mrn
    ,surname
    ,given_names
    ,birth_date
    ,death_date
    ,age
    ,sex
    ,address
    ,suburb
    ,postcode
    ,state_of_usual_residence
    ,country_of_birth
    ,country_of_birth_sacc
    ,facility_trans_from
    ,facility_trans_to
    ,supi
    ,moh_auid
    ,ahs_hospital
    ,primary_diag_status_flag
    ,updated_person_address_flag
    ,updated_person_name_flag
    ,replica_uuid_epi
    ,replica_uuid_sty
    ,replica_uuid_pcd
    ,replica_uuid_dia
    ,dc_extract_date
    ,rldx_valid_from
    ,rldx_valid_to
    ,rldx_current_record
    )
    SELECT *
    FROM new.BACKUP2_RLDX_ARCHIVE
    ORDER BY
    rldx_archive_key
    SET IDENTITY_INSERT new.RLDX_ARCHIVE OFF
    GO

    -- This is different than the other tables
    -- We MUST set the rldx_archive_key to the MAX(recnum)
    -- so that new row processing gets a valid recnum
    DECLARE @max_recnum BIGINT;
    SELECT @max_recnum = MAX(recnum) FROM new.RLDX_ARCHIVE;
    PRINT 'MAX_RECNUM='+CAST(@max_recnum AS VARCHAR(18));
    DBCC CHECKIDENT('new.RLDX_ARCHIVE', RESEED, @max_recnum);
    DBCC CHECKIDENT('new.RLDX_ARCHIVE');
    GO

    /* Restore foreign key constraints */
    ALTER TABLE [new].[RLDX_OPERATIONAL] WITH CHECK ADD CONSTRAINT [FK_RLDX_OPERATIONAL_RLDX_AUDIT_TRAIL_rldx_audit_key] FOREIGN KEY ([rldx_audit_key])
    REFERENCES [new].[RLDX_AUDIT_TRAIL] ([rldx_audit_key])

    ALTER TABLE [new].[RLDX_PERSONAL_IDENTIFIERS] WITH CHECK ADD CONSTRAINT [FK_RLDX_PERSONAL_IDENTIFIERS_RLDX_AUDIT_TRAIL_rldx_audit_key] FOREIGN KEY ([rldx_audit_key])
    REFERENCES [new].[RLDX_AUDIT_TRAIL] ([rldx_audit_key])

    ALTER TABLE [new].[RLDX_ARCHIVE] WITH CHECK ADD CONSTRAINT [FK_RLDX_ARCHIVE_RLDX_AUDIT_TRAIL_rldx_audit_key] FOREIGN KEY ([rldx_audit_key])
    REFERENCES [new].[RLDX_AUDIT_TRAIL] ([rldx_audit_key])
    GO

    The target table has these constraints:

     CONSTRAINT [PK_RLDX_ARCHIVE] PRIMARY KEY CLUSTERED 
    (
    [rldx_archive_key] ASC
    )WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
    ) ON [PRIMARY]
    GO

    ALTER TABLE [new].[RLDX_ARCHIVE] ADD CONSTRAINT [DF_RLDX_ARCHIVE_recnum] DEFAULT ((-1)) FOR [recnum]
    GO

    ALTER TABLE [new].[RLDX_ARCHIVE] ADD CONSTRAINT [DF_RLDX_ARCHIVE_rldx_valid_from] DEFAULT (getdate()) FOR [rldx_valid_from]
    GO

    ALTER TABLE [new].[RLDX_ARCHIVE] ADD CONSTRAINT [DF_RLDX_ARCHIVE_rldx_valid_to] DEFAULT ('9999-12-31 23:59:59') FOR [rldx_valid_to]
    GO

    ALTER TABLE [new].[RLDX_ARCHIVE] ADD CONSTRAINT [DF_RLDX_ARCHIVE_rldx_current_record] DEFAULT ('Y') FOR [rldx_current_record]
    GO

    ALTER TABLE [new].[RLDX_ARCHIVE] WITH NOCHECK ADD CONSTRAINT [CK_RLDX_ARCHIVE_rldx_current_record] CHECK (([rldx_current_record]='N' OR [rldx_current_record]='Y'))
    GO

    ALTER TABLE [new].[RLDX_ARCHIVE] CHECK CONSTRAINT [CK_RLDX_ARCHIVE_rldx_current_record]
    GO

    And these indexes:

    CREATE NONCLUSTERED INDEX [IX_RLDX_ARCHIVE_natural_keys] ON [new].[RLDX_ARCHIVE]
    (
    [facility_identifier] ASC,
    [stay_number] ASC,
    [episode_sequence_number] ASC,
    [rldx_valid_from] ASC
    )WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
    GO

    CREATE UNIQUE NONCLUSTERED INDEX [IX_RLDX_ARCHIVE_recnum_rldx_valid_from] ON [new].[RLDX_ARCHIVE]
    (
    [recnum] ASC,
    [rldx_valid_from] ASC
    )WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
    GO

    ALTER TABLE [new].[RLDX_ARCHIVE] ADD CONSTRAINT [PK_RLDX_ARCHIVE] PRIMARY KEY CLUSTERED
    (
    [rldx_archive_key] ASC
    )WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
    GO

    Questions:

    1. Should I lose the ORDER BY on the INSERT INTO since the target table has a clustered index on that column anyway?
    2. Should I drop the indexes and/or constraints before the INSERT INTO?  But then I'd need to recreate them after loading the data (see the sketch just after this list).
    3. Would wrapping the INSERT INTO with BEGIN TRANSACTION/COMMIT improve performance by reducing logging?
    4. Would BCP be a better approach?
    5. Some brilliant idea that I can't think of?
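
    For question 2, a rough sketch of what that could look like, using the nonclustered index names from the index DDL above (whether it's a net win depends on how long the rebuilds take on your hardware):

    -- Disable the nonclustered indexes before the load; the clustered PK must stay enabled
    ALTER INDEX [IX_RLDX_ARCHIVE_natural_keys] ON [new].[RLDX_ARCHIVE] DISABLE;
    ALTER INDEX [IX_RLDX_ARCHIVE_recnum_rldx_valid_from] ON [new].[RLDX_ARCHIVE] DISABLE;
    GO

    -- ...TRUNCATE and INSERT as above...

    -- Rebuild them once the load has finished
    ALTER INDEX [IX_RLDX_ARCHIVE_natural_keys] ON [new].[RLDX_ARCHIVE] REBUILD;
    ALTER INDEX [IX_RLDX_ARCHIVE_recnum_rldx_valid_from] ON [new].[RLDX_ARCHIVE] REBUILD;
    GO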

    It's taking about 2 hours just to restore back to the known data point before I can test the new code changes.  Ugh!

    I actually have to do this across two environments, "old" and "new", comparing the table results after running the new and old code.

    I appreciate any ideas on how to improve the performance of re-loading these large tables.

     

  • If the source and destination tables are the same structure, you can swap out the partitions
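
    For two non-partitioned tables with identical structure, that's the ALTER TABLE ... SWITCH form.  A minimal sketch, assuming new.BACKUP2_RLDX_ARCHIVE matches new.RLDX_ARCHIVE column-for-column with the same indexes and constraints (the switch is metadata-only, so it's near-instant, but it leaves the source table empty, so the backup copy would need to be re-filled afterwards):

    -- Target must be empty; both tables must be on the same filegroup with matching structure
    TRUNCATE TABLE new.RLDX_ARCHIVE;
    ALTER TABLE new.BACKUP2_RLDX_ARCHIVE SWITCH TO new.RLDX_ARCHIVE;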

  • Some brilliant idea that I can't think of?

    maybe use "database snapshot"?
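
    A minimal sketch, with DevDB, the logical file name and the snapshot path all placeholders (reverting rolls the whole database back, not just individual tables):

    -- Create the snapshot at the known good point (one sparse file per data file in DevDB)
    CREATE DATABASE DevDB_Baseline
    ON (NAME = DevDB_Data, FILENAME = 'D:\Snapshots\DevDB_Baseline.ss')
    AS SNAPSHOT OF DevDB;
    GO

    -- ...run the tests...

    -- Revert the whole database back to the snapshot point
    USE master;
    RESTORE DATABASE DevDB FROM DATABASE_SNAPSHOT = 'DevDB_Baseline';
    GO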

    I Have Nine Lives You Have One Only
    THINK!

  • The following is going to be a major problem for 100 million rows...

    SET IDENTITY_INSERT new.RLDX_ARCHIVE ON
    INSERT INTO new.RLDX_ARCHIVE

    There's a known issue with SET IDENTITY_INSERT ON... it forces full materialization of the data in TempDB and sorts it.  ARCHIVE tables just don't need the IDENTITY PROPERTY on a column because the IDENTITY property has already been expressed in the original table.  If it's truly an archive table, rebuild the table to NOT use the IDENTITY property. It's nearly as useless as having Foreign Keys on an audit table.

    Second, add WITH (TABLOCK) to the INSERT and change the Recovery Model of the database (unless you have replication or some other junk that relies on the log file being active) to enable minimal logging, which will quite literally make the INSERT operate twice as fast.  I emphasize that you must NOT change to the Bulk Logged recovery model if you have anything that relies on the Transaction Log file.  Still, use the WITH (TABLOCK) to prevent the pain of escalation from row to page to table locks.  The table locks are frequently gonna happen anyway... cut out the middleman.
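
    A sketch of both points combined, assuming the table has been rebuilt without the IDENTITY property (so no SET IDENTITY_INSERT or column list is needed) and using DevDB as a placeholder database name:

    -- Only if nothing on this database relies on the transaction log chain
    ALTER DATABASE DevDB SET RECOVERY BULK_LOGGED;
    GO

    TRUNCATE TABLE new.RLDX_ARCHIVE;

    -- TABLOCK takes the table lock up front and is one of the requirements for a minimally logged INSERT...SELECT.
    -- SELECT * relies on both tables having the same column order, as they do in the original script.
    INSERT INTO new.RLDX_ARCHIVE WITH (TABLOCK)
    SELECT *
    FROM new.BACKUP2_RLDX_ARCHIVE;
    GO

    -- ...then set the recovery model back to whatever it was before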

     

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Could you drop the original table and rename the backup table to have the same name?

  • p.s.  With regard to the original question of...

    I'm doing development on large SCD2 tables (100M+ rows). When I've fixed bugs and it's time to re-run, I need to restore the data back to a known point, currently via truncate and insert from a backup table.

    ... and the hours it's taking to "reset" for the next test... is there no way for you to simply do a database restore on the test database?  Perhaps even copy the tables you're working with to a new test database, back it up in the original condition you want the tables in so the restores only take a minute or two?
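
    A sketch of the second idea, with the database names and paths as placeholders (the one-off copy is slow, but every reset after that is just a small restore that never touches the main dev database):

    -- One-off: copy the work table(s) into a small scratch database (the [new] schema must already exist there)
    -- and back it up at the known good point
    SELECT * INTO RLDX_Scratch.new.RLDX_ARCHIVE FROM DevDB.new.RLDX_ARCHIVE;
    BACKUP DATABASE RLDX_Scratch TO DISK = 'D:\Backups\RLDX_Scratch_baseline.bak' WITH INIT;

    -- Every reset: restoring the small database puts its tables straight back to the baseline
    RESTORE DATABASE RLDX_Scratch FROM DISK = 'D:\Backups\RLDX_Scratch_baseline.bak' WITH REPLACE;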

     

     

     

    --Jeff Moden

  • You can try SELECT * INTO and then rename the table with sp_rename and also create constraints and indexes.
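
    A minimal sketch of that sequence (the staging name is arbitrary; the constraints and indexes come from the DDL posted above):

    -- Load into a brand-new heap, with no indexes or constraints to maintain during the copy
    SELECT *
    INTO new.RLDX_ARCHIVE_stage
    FROM new.BACKUP2_RLDX_ARCHIVE;
    GO

    -- Swap the names over
    DROP TABLE new.RLDX_ARCHIVE;
    EXEC sp_rename 'new.RLDX_ARCHIVE_stage', 'RLDX_ARCHIVE';
    GO

    -- ...then re-create PK_RLDX_ARCHIVE, the defaults, the CHECK constraint, the FK
    -- and the two nonclustered indexes from the DDL shown earlier in the thread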

  • Thanks for all the replies, very much appreciated.

    @desnorton

    If the source and destination tables are the same structure, you can swap out the partitions

    Thanks, I'll Google how to do that and post any follow up questions as a separate thread.

    @handkot:

    maybe use "database snapshot"?

    Thanks, I'll Google to learn more about how to implement this.  There are many tables in this database that should not be affected by the restoration of my tables.  Can I do a "table snapshot"?  Also note I'm not a full DBA of this database.

    @JeffModen:

    There's a known issue with SET IDENTITY_INSERT ON... it forces full materialization of the data in TempDB and sorts it.  ARCHIVE tables just don't need the IDENTITY PROPERTY on a column because the IDENTITY property has already been expressed in the original table.  If it's truly an archive table, rebuild the table to NOT use the IDENTITY property. It's nearly as useless as having Foreign Keys on an audit table.

    Don't be thrown off by "ARCHIVE" or "AUDIT" in the table names.  "If it's truly an archive table" - nah, probably not, it's just the main history table tracking changes in the source data using SCD2 style from/to dates.  Perhaps the table is poorly named - "HISTORY" might be better?

    because the IDENTITY property has already been expressed in the original table

    Apologies, I don't understand this terminology?

    change the Recovery Model of the database (unless you have replication or some other junk that relies on the log file active) to enable minimal logging

    While I'm not a full DBA, I have tested and can see that I can change the recovery model of this Development database from Simple to Bulk Logged.  Would that improve the performance?  I'm not the only user of the dev database, so TBH I don't know the repercussions of "...if you have anything that relies on the Transaction Log file".

    Your comments make me think the modelling of this table (which I control) could be wrong and/or improved?  Perhaps I'm over-using an IDENTITY column as the primary key of the table?  I note that every table should have a PK:

    https://www.isbe.net/Documents/SQL_server_standards.pdf, #17, bullet #4.

    and my go-to is usually an IDENTITY column.

    However, it would be very rare (never?) that a join needs to be made to this table using this surrogate key.  Perhaps the natural keys facility_identifier, stay_number, episode_sequence_number, rldx_valid_from columns would be a better PK?  I then wouldn't need IDENTITY_INSERT ON since there would be no IDENTITY column.  But I don't want the "tail to wag the dog" (the slow performance on restoring the table during development to dictate the PK of this table).  I hope that makes sense?

    @Jonathan AC Roberts:

    Could you drop the original table and rename the backup table to have the same name?

    Good idea, but that's only good for a one-off process.  Doing this, I lose the backup, and since I have to restore many times during development this approach won't work for me.

    @JeffModen (again):

    is there no way for you to simply do a database restore on the test database?  Perhaps even copy the tables you're working with to a new test database, back it up in the original condition you want the tables in so the restores only take a minute or two?

    I just tested and I don't have create database rights on this server.  I need to hit the docs, or just test this with some sample databases, but being lazy I'll just ask you... if I have a test database with 5 tables, and a target database with 100 tables, and I restore a backup from the test database into the target database, does the restore "merge" the backup into the target database?  IOW, does the target database end up with 100 tables, 95 of which were untouched?  Something a proper DBA would know, which I certainly am not.

    @Brahmanand Shukla:

    You can try SELECT * INTO and then rename the table with sp_rename and also create constraints and indexes.

    Great idea; I actually haven't tested the performance of restoring into a heap and then adding the constraints and indexes here, but in the past I have gotten much better performance using that approach.

    BTW, I did use bcp to create a .DAT file, and restored to a test table.  While it was nice being able to track the progress 1000 rows at a time in the CMD window, I didn't see any appreciable performance gain over my existing approach.

  • Jeff Moden wrote:

    is there no way for you to simply do a database restore on the test database?

    If you just want to reload one table instead of the entire database, I find that the command line BCP utility can work wonders for this purpose, especially if you use the -n switch for "native" data.

    for example, to export the table :

    bcp dbname.schemaname.tablename out filename.dat -n -S servername -T

    to re-import the table after truncating it:

    bcp databasename.schemaname.tablename in filename.dat -n -S servername -T -k -E -h "TABLOCK"
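
    (In those commands, -T uses a trusted/Windows connection, -n reads and writes SQL Server native format, -k keeps nulls, -E re-uses the identity values stored in the file, and -h "TABLOCK" takes a table lock during the load, which is one of the requirements for minimal logging.)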

    here's the documentation about the BCP utility:

    https://docs.microsoft.com/en-us/sql/tools/bcp-utility?view=sql-server-2016

     

  • Scott In Sydney wrote:

    @JeffModen:

    There's a known issue with SET IDENTITY_INSERT ON... it forces full materialization of the data in TempDB and sorts it.  ARCHIVE tables just don't need the IDENTITY PROPERTY on a column because the IDENTITY property has already been expressed in the original table.  If it's truly an archive table, rebuild the table to NOT use the IDENTITY property. It's nearly as useless as having Foreign Keys on an audit table.

    Don't be thrown off by "ARCHIVE" or "AUDIT" in the table names.  "If it's truly an archive table" - nah, probably not, it's just the main history table tracking changes in the source data using SCD2 style from/to dates.  Perhaps the table is poorly named - "HISTORY" might be better?

    1. To be sure, it doesn't matter to me if it's an audit table, history table, or other kind of table... if you do a SET IDENTITY_INSERT ON while making a copy of the table, then it's going to do the full materialization of the data in TempDB.  Right now, there's no way around that unless you resort to using SELECT * INTO and then adding the constraints.  Note that if you add a clustered index, the whole heap will be kept until the clustered table is built and then the heap will finally be dropped... but the space it used will not be released.  That could cause your DBAs a bit of angst.
    Scott In Sydney wrote:

    because the IDENTITY property has already been expressed in the original table

    Apologies I don't understand this terminology?

    2. What I meant was that if it's a separate table that the SCD is being formed in (kind of like a Temporal Table), that table shouldn't have the IDENTITY property on the ID column.  If the SCD is being formed in the same table as the active data, then I see no use for such an IDENTITY column.

    Scott In Sydney wrote:

    change the Recovery Model of the database (unless you have replication or some other junk that relies on the log file active) to enable minimal logging

    While I'm not a full DBA I have tested and see that I can change the recovery model in this Development database from Simple to Bulk Logged.  Would that improve the performance?  I'm not the only user of the dev database, so TBH I don't know the repercussions of "...if you have anything that relies on the Transaction Log file".

    3.  There's no advantage to be had in switching from the SIMPLE to the BULK LOGGED Recovery Model: SIMPLE already allows minimally logged bulk operations, so the switch only matters when you need to preserve a transaction log backup chain.

    Scott In Sydney wrote:

    Your comments make me think the modelling of this table (which I control) could be wrong and/or improved?  Perhaps I'm over-using an IDENTITY column as the primary key of the table?  I note that every table should have a PK:

    https://www.isbe.net/Documents/SQL_server_standards.pdf, #17, bullet #4.

    and my go-to is usually an IDENTITY column.

    4.  If your table is a single table SCD (Type 2), you can certainly use an IDENTITY column as a PK.  I don't believe that I'd waste the Clustered Index (CI) on it unless you really need to control fragmentation of the CI AND you don't have "ExpAnsive" UPDATEs in any of your variable width columns.  Remember that you can identify a column as the PK constraint without it being the CI.
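
    A sketch of that split on a freshly built copy of the table, with the clustered index column choice purely illustrative:

    -- Keep the surrogate key as the PK, but make the constraint nonclustered
    ALTER TABLE new.RLDX_ARCHIVE
        ADD CONSTRAINT PK_RLDX_ARCHIVE PRIMARY KEY NONCLUSTERED (rldx_archive_key);

    -- Spend the clustered index on the columns the queries actually seek and range-scan on
    CREATE CLUSTERED INDEX CIX_RLDX_ARCHIVE_natural_keys
        ON new.RLDX_ARCHIVE (facility_identifier, stay_number, episode_sequence_number, rldx_valid_from);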

    Scott In Sydney wrote:

    However, it would be very rare (never?) that a join needs to be made to this table using this surrogate key.  Perhaps the natural keys facility_identifier, stay_number, episode_sequence_number, rldx_valid_from columns would be a better PK?  I then wouldn't need IDENTITY_INSERT ON since there would be no IDENTITY column.  But I don't want the "tail to wag the dog" (the slow performance on restoring the table during development to dictate the PK of this table).  I hope that makes sense?

    5.  If the SCD is a single table SCD (Type 2 SCD), you can't have a PK on any natural columns because they will be duplicated.  It's part of what a TYPE 2 SCD is.  At this point though, I'm thinking that a Type 2 SCD isn't actually what you want.  I'm thinking a Type 6 (a combination of a Type 2 and Type 4) is what you really want.

    https://en.wikipedia.org/wiki/Slowly_changing_dimension

    Scott In Sydney wrote:

    @JeffModen (again):

    is there no way for you to simply do a database restore on the test database?  Perhaps even copy the tables you're working with to a new test database, back it up in the original condition you want the tables in so the restores only take a minute or two?

    I just tested and I don't have create database rights on this server.  I need to hit the docs, or just test this with some sample databases, but being lazy I'll just ask you...if I have a test database with 5 tables, and a target database with 100 tables, and I restore a backup from the test database into the target database, does the restore "merge" the backup into the target database.  IOW, does the target database have 100 tables, 95 of which were untouched?  Something a proper DBA would know, which I certainly am not.

    6.  No.  Unless you're doing piecemeal restores (which you're not), the database would be replaced with the 5-table database.  BUT... there is a trick you could do.  If you have a 5-table database of pristine work tables and you back that database up, then, of course, a restore of that 5-table database would replace anything you did in the 5-table database.

    How can we use this to your advantage?  Well, let's say that one of the tables in the 5-table database is named "DooDah", that same table exists in the 100-table database, and you want to do some tests and "play around" with that table but want to quickly "set it back" for another test.  Rename the "DooDah" table in the 100-table database to something else and make a SYNONYM that points to the "DooDah" table in the 5-table database and have a ball.  When you want to "set it back" for another test, just restore the 5-table database and Bob's your uncle.

    The only disadvantage here is that you can't have FK's between the two databases but you ARE in development and that can temporarily not matter.
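
    A sketch of that wiring, with BigDb and WorkDb as placeholder database names and dbo as the assumed schema:

    USE BigDb;
    GO
    -- Park the original table under a different name
    EXEC sp_rename 'dbo.DooDah', 'DooDah_original';
    GO
    -- The old name now points at the copy living in the small, quickly restorable database
    CREATE SYNONYM dbo.DooDah FOR WorkDb.dbo.DooDah;
    GO

    -- To "set it back" for the next test, just restore WorkDb; BigDb is never touched
    -- RESTORE DATABASE WorkDb FROM DISK = 'D:\Backups\WorkDb_baseline.bak' WITH REPLACE;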

    Scott In Sydney wrote:

    @Brahmanand Shukla:

    You can try SELECT * INTO and then rename the table with sp_rename and also create constraints and indexes.

    Great idea, I actually haven't tested the performance of restoring into a heap, then adding the constraints and indexes.  In the past, I have gotten much better performance using this approach.

    BTW, I did use bcp to create a .DAT file, and restored to a test table.  While it was nice being able to track the progress 1000 rows at a time in the CMD window, I didn't see any appreciable performance gain over my existing approach.

    7.  Again... not the best idea for a 100 million row table because you'll need to add a CI and that's going to take a whole lot of extra room.

    --Jeff Moden
