Data loading best practices for 5-10 million rows
Posted Thursday, October 24, 2013 10:48 PM


Mr or Mrs. 500


Group: General Forum Members
Last Login: Tuesday, March 11, 2014 3:34 AM
Points: 549, Visits: 1,554
Hi Expert,

As this is a huge environment, I want to implement best practices for loading 5-10 million rows. Please advise.

How do I take care of constraints and indexes?

Is it better to drop and recreate the indexes, or to disable and re-enable them?

In both scenarios, if my clustered index is dropped or disabled, no one will be able to access the table, so the data load will break.

I was reading an article which suggested that disabling the clustered index will disable all the associated nonclustered indexes; you then enable the clustered index, do the data load, and enable and rebuild all the nonclustered indexes.

I need your input.


"More Green More Oxygen !! Plant a tree today"
Post #1508324
Posted Friday, October 25, 2013 1:27 AM


SSCertifiable


Group: General Forum Members
Last Login: Today @ 6:48 AM
Points: 7,094, Visits: 12,581
There are no hard and fast rules for this. You must test; it will depend on many variables. How many indexes do you have? How wide are they? What order is the data in when it arrives? The work of dropping all indexes before loading and rebuilding them afterwards can be less than the work of keeping them all organized while loading. Try both ways. Also try a mix: drop all the nonclustered indexes but keep the clustered one, especially if your data set can be sorted into the same order as the clustered index before loading.
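
If it helps, here is a minimal sketch of the disable-then-rebuild variant (keeping the clustered index in place); the table and index names are hypothetical:

-- Hypothetical table/index names, for illustration only.
-- Disable the nonclustered indexes before the load (the clustered index stays up):
ALTER INDEX IX_Sales_CustomerID ON dbo.Sales DISABLE;
ALTER INDEX IX_Sales_OrderDate  ON dbo.Sales DISABLE;

-- ... perform the bulk load here (BULK INSERT, SSIS, INSERT ... SELECT, etc.) ...

-- Rebuilding a disabled index also re-enables it:
ALTER INDEX IX_Sales_CustomerID ON dbo.Sales REBUILD;
ALTER INDEX IX_Sales_OrderDate  ON dbo.Sales REBUILD;
-- Or rebuild every index on the table in one go:
-- ALTER INDEX ALL ON dbo.Sales REBUILD;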

__________________________________________________________________________________________________
There are no special teachers of virtue, because virtue is taught by the whole community. --Plato
Post #1508349
Posted Friday, October 25, 2013 1:33 AM


SSChampion


Group: General Forum Members
Last Login: Today @ 9:09 AM
Points: 13,356, Visits: 10,219
Also set the recovery model to SIMPLE (maybe only during the load, if you want point-in-time recovery during the rest of the day).



How to post forum questions.
Need an answer? No, you need a question.
What’s the deal with Excel & SSIS?

Member of LinkedIn. My blog at LessThanDot.

MCSA SQL Server 2012 - MCSE Business Intelligence
Post #1508353
Posted Friday, October 25, 2013 10:11 AM


SSC-Dedicated


Group: General Forum Members
Last Login: Yesterday @ 8:06 PM
Points: 36,786, Visits: 31,243
Koen Verbeeck (10/25/2013)
Also set the recovery model to SIMPLE (maybe only during the load, if you want point-in-time recovery during the rest of the day).


Ooooohhhh, be careful, now. BULK LOGGED, maybe, but not SIMPLE. SIMPLE breaks the entire log chain. Yeah, you can do a DIFF backup to quickly reestablish the chain, but why break it in the first place?
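
For what it's worth, a minimal sketch of using BULK_LOGGED only for the load window (database name and backup paths are hypothetical):

-- Log backup before the switch, so the bulk-logged interval is bounded:
BACKUP LOG MyDatabase TO DISK = N'D:\Backups\MyDatabase_before_load.trn';

ALTER DATABASE MyDatabase SET RECOVERY BULK_LOGGED;

-- ... perform the minimally logged bulk load here ...

ALTER DATABASE MyDatabase SET RECOVERY FULL;

-- Log backup afterwards so point-in-time recovery resumes from here:
BACKUP LOG MyDatabase TO DISK = N'D:\Backups\MyDatabase_after_load.trn';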


--Jeff Moden
"RBAR is pronounced "ree-bar" and is a "Modenism" for "Row-By-Agonizing-Row".

First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column."

(play on words) "Just because you CAN do something in T-SQL, doesn't mean you SHOULDN'T." --22 Aug 2013

Helpful Links:
How to post code problems
How to post performance problems
Post #1508518
Posted Friday, October 25, 2013 10:13 AM


SSC-Dedicated


Group: General Forum Members
Last Login: Yesterday @ 8:06 PM
Points: 36,786, Visits: 31,243
Minaz (10/24/2013)
Hi Expert,

As this is a huge environment, I want to implement best practices for loading 5-10 million rows. Please advise.

How do I take care of constraints and indexes?

Is it better to drop and recreate the indexes, or to disable and re-enable them?

In both scenarios, if my clustered index is dropped or disabled, no one will be able to access the table, so the data load will break.

I was reading an article which suggested that disabling the clustered index will disable all the associated nonclustered indexes; you then enable the clustered index, do the data load, and enable and rebuild all the nonclustered indexes.

I need your input.


This would be, IMHO, a very good candidate for using a staging table (you should always use a staging table for imported data, for many reasons) and partitioning.
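
As a rough illustration of the staging-plus-partitioning idea, assuming a hypothetical dbo.Sales table partitioned by year and a dbo.Sales_Staging table with an identical structure, the same indexes, a check constraint limiting it to the target year, and sitting on the same filegroup as the target partition:

-- Load and validate the incoming rows in the staging table first;
-- the main table stays fully available while this runs:
TRUNCATE TABLE dbo.Sales_Staging;

INSERT INTO dbo.Sales_Staging (SaleID, SaleDate, Amount)
SELECT SaleID, SaleDate, Amount
FROM dbo.SalesImport;   -- hypothetical source of the 5-10 million rows

-- Then switch the staging table into the empty target partition.
-- This is a metadata-only operation, so it is nearly instant:
ALTER TABLE dbo.Sales_Staging
    SWITCH TO dbo.Sales PARTITION 5;   -- partition number is illustrative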


--Jeff Moden
"RBAR is pronounced "ree-bar" and is a "Modenism" for "Row-By-Agonizing-Row".

First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column."

(play on words) "Just because you CAN do something in T-SQL, doesn't mean you SHOULDN'T." --22 Aug 2013

Helpful Links:
How to post code problems
How to post performance problems
Post #1508520
Posted Monday, October 28, 2013 1:02 AM


Mr or Mrs. 500


Group: General Forum Members
Last Login: Tuesday, March 11, 2014 3:34 AM
Points: 549, Visits: 1,554
Thank you all. Yes, I am also planning to follow the advice and load through a staging table. My main table is partitioned (I did that just now) by year, so it is easy to maintain.

While inserting, will my indexes get fragmented? Just updating the index statistics will not defragment the table, and during the maintenance window I can run the rebuild/reorg index job depending on the fragmentation percentage.

What do you advise?


"More Green More Oxygen !! Plant a tree today"
Post #1508757
Posted Monday, October 28, 2013 9:49 PM


SSCertifiable


Group: General Forum Members
Last Login: Today @ 6:48 AM
Points: 7,094, Visits: 12,581
Minaz (10/28/2013)
While inserting, will my indexes get fragmented? Just updating the index statistics will not defragment the table, and during the maintenance window I can run the rebuild/reorg index job depending on the fragmentation percentage.

Chances are the indexes will become fragmented to some degree, but how much will depend on things like the index fill factor, which columns you are updating, and the columns and key order of the indexes, among other factors.

For example, if you insert a ton of data into a table that has only a clustered index whose leading edge is an identity column, you won't see much fragmentation in that clustered index; nonclustered indexes could still be affected, though. Or say you update many rows in the table but only touch an integer column that was present when all the rows were first inserted; you will also not see much fragmentation, because integer is a fixed-width data type.

But say you insert many rows into a table where the clustered index is on last name; if the data being imported and the data already in the table are distributed in a similar way, the chance of page splits, and of the fragmentation that follows, is high. Or say you will be updating many rows, setting a varchar column that was previously NULL to a value, and the fill factor is 100; the chance of fragmentation is again high.

Even deleting can cause fragmentation, so be aware of that possibility as well.

The good news is that index fragmentation is quite easy to deal with. Check the scripts at http://ola.hallengren.com for a solution that works for many systems with no special effort beyond compiling some stored procedures and functions and setting up a scheduled job, and that is mostly hands-off even if only a small nightly maintenance window exists for the database.
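
If you want to eyeball fragmentation yourself after a load, something along these lines works (dbo.Sales is a hypothetical table name; common guidance is to reorganize above roughly 5-10% fragmentation and rebuild above roughly 30%):

SELECT i.name AS index_name,
       ps.index_type_desc,
       ps.avg_fragmentation_in_percent,
       ps.page_count
FROM sys.dm_db_index_physical_stats(
         DB_ID(), OBJECT_ID(N'dbo.Sales'), NULL, NULL, 'LIMITED') AS ps
JOIN sys.indexes AS i
  ON i.object_id = ps.object_id
 AND i.index_id  = ps.index_id
ORDER BY ps.avg_fragmentation_in_percent DESC;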


__________________________________________________________________________________________________
There are no special teachers of virtue, because virtue is taught by the whole community. --Plato
Post #1509176
Posted Tuesday, October 29, 2013 1:54 PM
Forum Newbie


Group: General Forum Members
Last Login: Thursday, November 7, 2013 7:01 AM
Points: 1, Visits: 7
I also like the idea of utilizing a staging table to get your latest data into the database without interrupting data access.

Furthermore, you might consider using the MERGE syntax to let SQL Server reconcile the differences and commit the DML changes to the primary data table, limiting the changes to only what is actually different. This should reduce the amount of activity across all related indexes, triggers, etc., and again allow the table to stay online for queries. Unfortunately, I've not attempted MERGEs on such a large data set, so by all means test and report back to us.
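
For example, a sketch of that kind of upsert with hypothetical dbo.Sales and dbo.Sales_Staging tables (test on your own data, as noted):

MERGE dbo.Sales AS tgt
USING dbo.Sales_Staging AS src
    ON tgt.SaleID = src.SaleID
-- Only touch rows whose values actually changed:
WHEN MATCHED AND (tgt.SaleDate <> src.SaleDate OR tgt.Amount <> src.Amount)
    THEN UPDATE SET tgt.SaleDate = src.SaleDate,
                    tgt.Amount   = src.Amount
WHEN NOT MATCHED BY TARGET
    THEN INSERT (SaleID, SaleDate, Amount)
         VALUES (src.SaleID, src.SaleDate, src.Amount);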

--Chris
Post #1509558
Posted Tuesday, October 29, 2013 4:54 PM


SSCertifiable


Group: General Forum Members
Last Login: Today @ 6:48 AM
Points: 7,094, Visits: 12,581
Staging tables are fine, but they don't buy you much if you eventually have to load into the final table anyway. If your dataset is clean and reliable, or you have proper error handling in your loader (e.g. SSIS), then loading into the final table directly can far and away outperform loading into a staging table first and then using T-SQL to load into the final table.

MERGE might seem like a good choice conceptually, but there is a ton of evidence out there showing that an INSERT followed by an UPDATE inside a transaction is usually the better performer when handling large datasets. Test for yourself, of course. Also worth mentioning: MERGE has had more than its fair share of bugs opened against it, so make sure you have the latest patches and watch out for the ones that are still active as well.
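
For reference, the INSERT-plus-UPDATE pattern mentioned above might look roughly like this, reusing the same hypothetical dbo.Sales and dbo.Sales_Staging names:

BEGIN TRANSACTION;

    -- Update rows that already exist in the target:
    UPDATE tgt
    SET    tgt.SaleDate = src.SaleDate,
           tgt.Amount   = src.Amount
    FROM   dbo.Sales AS tgt
    JOIN   dbo.Sales_Staging AS src
        ON src.SaleID = tgt.SaleID;

    -- Insert rows that are new:
    INSERT INTO dbo.Sales (SaleID, SaleDate, Amount)
    SELECT src.SaleID, src.SaleDate, src.Amount
    FROM   dbo.Sales_Staging AS src
    WHERE  NOT EXISTS (SELECT 1
                       FROM dbo.Sales AS tgt
                       WHERE tgt.SaleID = src.SaleID);

COMMIT TRANSACTION;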


__________________________________________________________________________________________________
There are no special teachers of virtue, because virtue is taught by the whole community. --Plato
Post #1509608
Posted Sunday, November 3, 2013 7:52 PM


Mr or Mrs. 500


Group: General Forum Members
Last Login: Tuesday, March 11, 2014 3:34 AM
Points: 549, Visits: 1,554
Thank you for sharing your thoughts.


I will try both options: MERGE and a staging table.

Let me first try the staging table and then switch the data into the partitioned table. I will check the performance and report back to you all.





"More Green More Oxygen !! Plant a tree today"
Post #1510962