Looking for best way to do a very large update

Question

Looking for best way to do a very large update

Anye Mercy

SSCrazy

Points: 2134
More actions
July 13, 2010 at 2:41 pm

#235107

The overall problem I am trying to solve:
Given an ad-hoc query that returns a number of records in a table, update a couple fields on all these records. The ad-hoc query can be either very simple or very complicated with many joins, the number of rows returned can be anything from 1 to the entire table depending on the ad-hoc critiera. The table may have up to several million rows, and all the tests below are performed on a table of about 3.5 million rows. All of the fields that could be part of the ad-hoc query are indexed -- but the adhoc query itself isn't the problem.
Edited to add -- this process is performed as part of a web application, so other users may potentially be querying this table while another user is trying to update it -- so short lock time is important.
How far have I gotten?
I rebuilt statistics and indexes before running the tests.
So far if I populate a temporary table with the pkeys of the results of the ad-hoc query, this only takes a few seconds even for several million records. Happiness.
I tried this both with pkeys & fkeys on the temp table and without.
Then I tried joining to this temp table to do the update, and running the update in batches so it isn't trying to update millions of rows at once. I have tried batch sizes from 500 to 10,000 records with not much difference in total execution time (~ 12 min.), batch size of 50k gets it to about 3.5 minutes, 100k gets it to 3 min. I would like to get it even better if I can and am wondering if there is anything else besides tweaking batch size I can do to improve this?
Ex:
update top(BatchSize) MainTable set IntFK7=@someint, UserID=@myid, ChangeDate=@date from maintable M inner join temptable T on M.PkeyField=T.PkeyField
where M.IntFK7 <> @someint
(the above query should be autoincrementing as SomeIntProp changes for each record, @someint is a valid value in the foreign key table, the foreign key table has tens to hundreds of records only)
Here's the basics of MainTable:
CREATE TABLE [dbo].[MainTable](
[Pkey] [bigint] IDENTITY(1,1) NOT NULL,
[IntFK1] [int] NOT NULL,
[IndexedBigint1] [bigint] NOT NULL,
[IndexedBigint2] [bigint] NULL,
[IndexedBigint3] [bigint] NULL,
[Bigint4] [bigint] NULL,
[String1] [nvarchar](4000) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[String2] [nvarchar](255) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[Int1] [int] NULL,
[String3] [nvarchar](4000) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL,
[String4] [nvarchar](4000) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[String5] [nvarchar](4000) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[Bigint5] [bigint] NOT NULL CONSTRAINT [DF_Bigint5] DEFAULT ((0)),
[Int2] [int] NOT NULL CONSTRAINT [DF_Int2] DEFAULT ((0)),
[String3] [nvarchar](4000) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[String4] [nvarchar](4000) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[Date1] [datetime] NULL,
[Date2] [datetime] NULL,
[Date3] [datetime] NULL,
[Date4] [datetime] NULL,
[BigintFK2] [bigint] NULL,
[String5] [nvarchar](4000) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[String6] [nvarchar](4000) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[String7] [nvarchar](4000) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[BigintFK3] [bigint] NULL,
[IntFK4] [int] NOT NULL CONSTRAINT [DF_IntFK4] DEFAULT ((0)),
[IntFK5] [int] NOT NULL CONSTRAINT [DF_IntFK5] DEFAULT ((0)),
[String8] [nvarchar](256) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[Bit1] [bit] NOT NULL CONSTRAINT [DF_Bit1] DEFAULT ((0)),
[SelfFK1] [bigint] NULL,
[IntFK6] [int] NOT NULL,
[IntFK7] [int] NOT NULL CONSTRAINT [DF_IntFK7] DEFAULT ((1)),
[UserID] [int] NOT NULL CONSTRAINT [DF_IntFK8] DEFAULT ((0)),
[ChangeDate] [datetime] NOT NULL CONSTRAINT [DF_ChangeDate] DEFAULT (getdate()),
[String8] [nvarchar](4000) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[Int3] [int] NULL,
[Int4] [int] NULL,
[Int5] [int] NULL,
[Int6] [int] NULL,
[Int7] [int] NULL,
[Bit2] [bit] NULL DEFAULT ((0)),
[Bit3] [bit] NULL DEFAULT ((0)),
[IsDeleted] [bit] NOT NULL CONSTRAINT [DF_IsDeleted] DEFAULT ((0)),
[CreatedBy] [int] NOT NULL,
[CreatedDate] [datetime] NOT NULL CONSTRAINT [DF_CreatedDate] DEFAULT (getdate()),
[ModifiedBy] [int] NOT NULL,
[LastModifiedDate] [datetime] NOT NULL CONSTRAINT [DF_LastModifiedDate] DEFAULT (getdate()),
CONSTRAINT [PK_MainTable] PRIMARY KEY CLUSTERED
(
[Pkey] ASC
)WITH (IGNORE_DUP_KEY = OFF) ON [PRIMARY]
) ON [PRIMARY]
(I didn't include them here because there are too many and they don't directly apply but any field with FK in the name above is an FK to another table.)
The join table is created as:
CREATE TABLE [dbo].[TempTable](
[Pkey] [bigint] NOT NULL,
CONSTRAINT [PK_TempTable] PRIMARY KEY CLUSTERED
(
[Pkey] ASC
)WITH (IGNORE_DUP_KEY = OFF) ON [PRIMARY]
) ON [PRIMARY]
ALTER TABLE [dbo].[TempTable] WITH CHECK ADD CONSTRAINT [FK_TempTable_MainTable] FOREIGN KEY([Pkey]) REFERENCES [dbo].[MainTable] ([Pkey])
(note - the actual name of the table has a guid in it to prevent duplication from simultaneous processes. I am not using a "true" temp table because the same join table is used for several queries controlled from .NET as opposed to in a self-contained stored proc., the table is removed after all processes are complete. Only the very last update query is the one having the performance problems.)
I tend to think using a CTE / Row Number paging solution will be even slower but I could be wrong.
Anyway, if anyone has any ideas to improve the speed of this query beyond the 3 min for 3.5m rows, I would be greatly appreciative.
--
Anye Mercy
"Service Unavailable is not an Error" -- John, ENOM support
"You keep using that word. I do not think it means what you think it means." -- Inigo Montoya in "Princess Bride"
"Civilization exists by geologic consent, subject to change without notice." -- Will Durant

Viewing 10 posts - 1 through 9 (of 9 total)

You must be logged in to reply to this topic. Login to reply

seth delconte SSCertifiable Points: 6388 More actions · Answer 1

You should be able to speed up the update significantly by first removing all indexes. I would script out the indexes to files, remove them, perform the update, then create the indexes again.

_________________________________
seth delconte
http://sqlkeys.com

Anye Mercy SSCrazy Points: 2134 More actions · Answer 2

Hm.. I don't know if that would be a good fit here. The update is performed based on something the user is doing in the application - and it is something they may do many times in the course of a session. I should have mentioned in the OP that this was all part of an application, only my tests are run manually.

Since it is not a single user application I don't think dropping and recreating the indexes each time they update that table will be feasible. If this were a manually performed update only done once in awhile then I could see that being useful.

Thanks for the suggestion!

--
Anye Mercy
"Service Unavailable is not an Error" -- John, ENOM support
"You keep using that word. I do not think it means what you think it means." -- Inigo Montoya in "Princess Bride"
"Civilization exists by geologic consent, subject to change without notice." -- Will Durant

Leo.Miller SSChampion Points: 13048 More actions · Answer 3

Anye Mercy (7/13/2010)
update top(BatchSize) MainTable set IntFK7=@someint, UserID=@myid, ChangeDate=@date from maintable M inner join temptable T on M.PkeyField=T.PkeyField
where M.IntFK7 <> @someint

Two things come to mind:

1) You are changing the stats on the IntFK7 column with each batch update, so each successive batch may perform a little worse than the previous one as the Stats change. A form of paramater sniffing. What happens if you do a WITH RECOMPILE each time around?

2) How does the number of rows being updated change if you ignore the WHERE clause? This clause is of no value unless it is eliminating a significant number of rows, since the worst you will do is set IntFK7 to itself. I would see what happens if you drop the WHERE clause.

Cheers

Leo

Leo
Nothing in life is ever so complicated that with a little work it can't be made more complicated.

Leo.Miller SSChampion Points: 13048 More actions · Answer 4

Leo.Miller (7/13/2010)
2) How does the number of rows being updated change if you ignore the WHERE clause? This clause is of no value

Sorry, ignore this point. I wasn't thinking straight.

Leo

Leo
Nothing in life is ever so complicated that with a little work it can't be made more complicated.

Anye Mercy SSCrazy Points: 2134 More actions · Answer 5

Good catch - I do actually have option(recompile) at the end of the query, but forgot to type it in the window here. I don't think I ever ran it w/o that option on.

--
Anye Mercy
"Service Unavailable is not an Error" -- John, ENOM support
"You keep using that word. I do not think it means what you think it means." -- Inigo Montoya in "Princess Bride"
"Civilization exists by geologic consent, subject to change without notice." -- Will Durant

DataDog Hall of Fame Points: 3305 More actions · Answer 6

use tablockx

create index on temp table

update statistics before update

check index condition before and after update

adjust fillfactor, padindex accordingly

ensure logs are on separate drives to data

ensure data and log files not sharing drives with other databases

lets see the: set showplan_all on

consider dropping and recreating any indexes that struggle with this particular workload

[edit]

check data and log file sizes and filegrowth - ensure no growth during update

check vlf counts also

Anye Mercy SSCrazy Points: 2134 More actions · Answer 7

doobya (7/13/2010)
use tablockx
create index on temp table
update statistics before update
check index condition before and after update
adjust fillfactor, padindex accordingly
ensure logs are on separate drives to data
ensure data and log files not sharing drives with other databases
lets see the: set showplan_all on
consider dropping and recreating any indexes that struggle with this particular workload
[edit]
check data and log file sizes and filegrowth - ensure no growth during update
check vlf counts also

I wrote a big long reply here with the showplan and a bunch of comments and questions and there was an error submitting it... so I will try the shorter version. I'll put the show plan in its own post so if that is the problem it doesn't eat the rest of my reply

The VLF counts are really high (427) -- this was not a topic I was familiar with before so I am researching remediation options and better planning now.

I had thought creating a FK would inherently index the field, but perhaps I am mistaken. I will try explicitly creating an index on the temp table and see if that improves it.

I am nervous about rebuilding statistics or indexes as part of this process because it is not a single user application triggering this, and the performance is only bad in outlying cases. The vast majority of our dbs run this query in a few seconds -- so I am looking for solutions that won't seriously impact the majority but will help the minority, if that makes sense.

Would tablockx prevent other users from querying the table at all when the update is occurring, or does it only lock the affected rows?

I suppose I could do a differential solution based on the estimated # of rows affected (which I already know based on the size of the temp table), but still -- since other users could potentially be querying the maintable while a small portion of it is locked for a short time, and the rebuild options would basically leave it unindexed for the entire duration of the tagging operation -- I think that is risky and could affect other performance in the application.http://www.sqlservercentral.com/Forums/Skins/Classic/Images/RichTextBoxTable/tbl_bottomleft.gif

How do the fillfactor and padindex apply here? I am not knowledgeable enough on these to follow your train of thought.

I could potentially put log files on their own drive, but we have so many databases we cannot have one per drive -- there will always be multiple databases on a drive (we have hundreds, the numbers grow every day because each of our customers' jobs is its own database).

You have given me lots to look into, thank you!

--
Anye Mercy
"Service Unavailable is not an Error" -- John, ENOM support
"You keep using that word. I do not think it means what you think it means." -- Inigo Montoya in "Princess Bride"
"Civilization exists by geologic consent, subject to change without notice." -- Will Durant

Anye Mercy SSCrazy Points: 2134 More actions · Answer 8

Here's the showplan (sorry for the crappy formatting:

update top(100000) MainTable set IntFK7=@tagid, ChangedBy=@userid, ChangedDate=@date from MainTable M inner join TempTableeead95e8x75a5x4f42xb639xbe56e74bd752 T on M.Pkey=T.Pkey where M.IntFK7 <> @tagid option(recompile)440NULLNULL4NULL90000NULLNULLNULL88.34656NULLNULLUPDATE0NULL

|--Assert(WHERE:(CASE WHEN [Expr1023] IS NULL THEN (0) ELSE NULL END))464AssertAssertWHERE:(CASE WHEN [Expr1023] IS NULL THEN (0) ELSE NULL END)NULL9000000.0162988.34656NULLNULLPLAN_ROW01

|--Merge Join(Left Semi Join, MANY-TO-MANY MERGE:([DBName].[dbo].[MainTable].[IntFK7])=([DBName].[dbo].[Tags].[TagID]), RESIDUAL:([DBName].[dbo].[MainTable].[IntFK7]=[DBName].[dbo].[Tags].[TagID]))476Merge JoinLeft Semi JoinMANY-TO-MANY MERGE:([DBName].[dbo].[MainTable].[IntFK7])=([DBName].[dbo].[Tags].[TagID]), RESIDUAL:([DBName].[dbo].[MainTable].[IntFK7]=[DBName].[dbo].[Tags].[TagID])NULL900000.0003130.2408723988.33036[Expr1023]NULLPLAN_ROW01

|--Clustered Index Update(OBJECT:([DBName].[dbo].[MainTable].[PK_MainTable]), OBJECT:([DBName].[dbo].[MainTable].[IX_MainTable_TaggedBy]), OBJECT:([DBName].[dbo].[MainTable].[IX_MainTable_TagID]), SET:([DBName].[dbo].[MainTable].[IntFK7] = RaiseIfNull([@tagid]),[DBName].[dbo].[MainTable].[ChangedBy] = RaiseIfNull([@userid]),[DBName].[dbo].[MainTable].[ChangedDate] = RaiseIfNull([@date])))487Clustered Index UpdateUpdateOBJECT:([DBName].[dbo].[MainTable].[PK_MainTable]), OBJECT:([DBName].[dbo].[MainTable].[IX_MainTable_TaggedBy]), OBJECT:([DBName].[dbo].[MainTable].[IX_MainTable_TagID]), SET:([DBName].[dbo].[MainTable].[IntFK7] = RaiseIfNull([@tagid]),[DBName].[dbo].[MainTable].[ChangedBy] = RaiseIfNull([@userid]),[DBName].[dbo].[MainTable].[ChangedDate] = RaiseIfNull([@date]))NULL10000078.082790.31188.08588[DBName].[dbo].[MainTable].[IntFK7]NULLPLAN_ROW01

| |--Compute Scalar(DEFINE:([Expr1019]=[Expr1019], [Expr1020]=[Expr1020]))498Compute ScalarCompute ScalarDEFINE:([Expr1019]=[Expr1019], [Expr1020]=[Expr1020])[Expr1019]=[Expr1019], [Expr1020]=[Expr1020]10000000.01169.703089[M].[Pkey], [Expr1019], [Expr1020]NULLPLAN_ROW01

| |--Compute Scalar(DEFINE:([Expr1019]=CASE WHEN [Expr1005] THEN (1) ELSE (0) END, [Expr1020]=CASE WHEN [Expr1004] THEN (1) ELSE (0) END))4109Compute ScalarCompute ScalarDEFINE:([Expr1019]=CASE WHEN [Expr1005] THEN (1) ELSE (0) END, [Expr1020]=CASE WHEN [Expr1004] THEN (1) ELSE (0) END)[Expr1019]=CASE WHEN [Expr1005] THEN (1) ELSE (0) END, [Expr1020]=CASE WHEN [Expr1004] THEN (1) ELSE (0) END10000000.01169.703089[M].[Pkey], [Expr1019], [Expr1020]NULLPLAN_ROW01

| |--Compute Scalar(DEFINE:([Expr1004]=CASE WHEN [DBName].[dbo].[MainTable].[IntFK7] as [M].[IntFK7] = [@tagid] THEN (1) ELSE (0) END, [Expr1005]=CASE WHEN [DBName].[dbo].[MainTable].[ChangedBy] as [M].[ChangedBy] = [@userid] THEN (1) ELSE (0) END))41110Compute ScalarCompute ScalarDEFINE:([Expr1004]=CASE WHEN [DBName].[dbo].[MainTable].[IntFK7] as [M].[IntFK7] = [@tagid] THEN (1) ELSE (0) END, [Expr1005]=CASE WHEN [DBName].[dbo].[MainTable].[ChangedBy] as [M].[ChangedBy] = [@userid] THEN (1) ELSE (0) END)[Expr1004]=CASE WHEN [DBName].[dbo].[MainTable].[IntFK7] as [M].[IntFK7] = [@tagid] THEN (1) ELSE (0) END, [Expr1005]=CASE WHEN [DBName].[dbo].[MainTable].[ChangedBy] as [M].[ChangedBy] = [@userid] THEN (1) ELSE (0) END10000000.01169.693089[M].[Pkey], [Expr1004], [Expr1005]NULLPLAN_ROW01

| |--Top(TOP EXPRESSION:((100000)))41211TopTopTOP EXPRESSION:((100000))NULL10000000.01239.683089[M].[Pkey], [M].[IntFK7], [M].[ChangedBy]NULLPLAN_ROW01

| |--Merge Join(Inner Join, MERGE:([T].[Pkey])=([M].[Pkey]), RESIDUAL:([DBName].[dbo].[TempTableeead95e8x75a5x4f42xb639xbe56e74bd752].[Pkey] as [T].[Pkey]=[DBName].[dbo].[MainTable].[Pkey] as [M].[Pkey]))41312Merge JoinInner JoinMERGE:([T].[Pkey])=([M].[Pkey]), RESIDUAL:([DBName].[dbo].[TempTableeead95e8x75a5x4f42xb639xbe56e74bd752].[Pkey] as [T].[Pkey]=[DBName].[dbo].[MainTable].[Pkey] as [M].[Pkey])NULL100000014.306239.673089[M].[Pkey], [M].[IntFK7], [M].[ChangedBy]NULLPLAN_ROW01

| |--Clustered Index Scan(OBJECT:([DBName].[dbo].[TempTableeead95e8x75a5x4f42xb639xbe56e74bd752].[PK_TempTableeead95e8x75a5x4f42xb639xbe56e74bd752] AS [T]), ORDERED FORWARD)41413Clustered Index ScanClustered Index ScanOBJECT:([DBName].[dbo].[TempTableeead95e8x75a5x4f42xb639xbe56e74bd752].[PK_TempTableeead95e8x75a5x4f42xb639xbe56e74bd752] AS [T]), ORDERED FORWARD[T].[Pkey]102390.45.2409033.702623150.2752439[T].[Pkey]NULLPLAN_ROW01

| |--Clustered Index Scan(OBJECT:([DBName].[dbo].[MainTable].[PK_MainTable] AS [M]), WHERE:([DBName].[dbo].[MainTable].[IntFK7] as [M].[IntFK7]<>[@tagid]) ORDERED FORWARD)41513Clustered Index ScanClustered Index ScanOBJECT:([DBName].[dbo].[MainTable].[PK_MainTable] AS [M]), WHERE:([DBName].[dbo].[MainTable].[IntFK7] as [M].[IntFK7]<>[@tagid]) ORDERED FORWARD[M].[Pkey], [M].[IntFK7], [M].[ChangedBy]100000278.55283.702624238.589348[M].[Pkey], [M].[IntFK7], [M].[ChangedBy]NULLPLAN_ROW01

|--Clustered Index Scan(OBJECT:([DBName].[dbo].[Tags].[PK_Tags]), ORDERED FORWARD)4337Clustered Index ScanClustered Index ScanOBJECT:([DBName].[dbo].[Tags].[PK_Tags]), ORDERED FORWARD[DBName].[dbo].[Tags].[TagID]130.0031250.0001713110.0032963[DBName].[dbo].[Tags].[TagID]NULLPLAN_ROW01

Sorry for the emoticons in the middle...that's annoying!

--
Anye Mercy
"Service Unavailable is not an Error" -- John, ENOM support
"You keep using that word. I do not think it means what you think it means." -- Inigo Montoya in "Princess Bride"
"Civilization exists by geologic consent, subject to change without notice." -- Will Durant

DataDog Hall of Fame Points: 3305 More actions · Answer 9

After copy and pasting the plan I spent a couple of minutes trying to figure out where the [Sad] reference came from ...

LOL I guess 🙁 becomes [Sad]

here are some ideas, for what they are worth ...

Idea 1

[DBName].[dbo].[Tags].[PK_Tags]

disable foreign key constraint if never fails

Idea 2

use some kind of identity field on temp table

Idea 3

reduce rowcount of temp table after every update

Idea 4

now you can drop where clause

and simplify everything - hopefully avoiding all those scans

declare @rid int

select top 1

@rid = rid

from (

select top 10000 rid

from temp with (tablockx)

order by rid

) dt

order by rid desc

update mt

set [IntFK7] = @tagid

, [ChangedBy] = @userid

, [ChangedDate] = @date

from MainTable mt with (tablockx)

join temp with (tablockx)

on temp.rid <= @rid

and mt.key = temp.key

delete

from temp with (tablockx)

where rid <= @rid

add a transaction etc.

you either want people to access the data during the update or not

if you do - they will have to put up with slightly wonky reads - use lots of small batches and no tablockx

otherwise - use big batches and plenty of tablockx - they will have to wait