deduping query optimization

Question

Post reply

deduping query optimization

mrpolecat

SSCertifiable

Points: 6964
More actions
September 28, 2007 at 9:01 am

#118729

I have the following deduping query which takes a long time to run and I'm sure there is a better way to write it. The query is run everyday after all of the new records have been entered for the day. There are 2 Million rows in the table and 3K-5K added each day, some through batch import and some through data entry. Any ideas are appreciated.
update tb_leadbox set invalidflag=6 where
uid2 not in
(select uid2 from (
select min(uid2)as uid2,fname,lname ,homephone from tb_leadbox where datediff(dd,entrydate,getdate()) <=365
group by fname,lname,homephone
) t2) and invalidflag <> 5 and datediff(dd,entrydate,getdate()) <=365
tb_leadbox has the following fields pertaining to this query
uid (uniqueidentifier, Not Null) PK
uid2 (numeric(18,0),Not Nill) identity
entrydate (datetime, Null)
fname (varchar(50), Null)
lname (varchar(50), Null)
homephone (varchar(12), Null)
invalidflag (int, Null)
tb_leadbox has the following indexes pertaining to this query
uid2 unique non clustered
uid2, invalidflag unique non clustered
homephone, workphone, invalid flag non clustered
lastname, invalidflag non clustered
homephone, uid2 non clustered
invlaidflag non clustered
fname,lname,homephone non clustered
fname non clustered
homephone non clustered

Viewing 15 posts - 1 through 15 (of 15 total)

You must be logged in to reply to this topic. Login to reply

mrpolecat SSCertifiable Points: 6964 More actions · Answer 1

should have mentioned this is SQL7 and upgrade is not a possibility.

Steve Jones - SSC Editor SSC Guru Points: 734418 More actions · Answer 2

A couple thoughts on this.

First, if you can eliminate functions, you can really speed things up. Since you're looking for stuff compare to today, you might be better off calculating the date that matters, meaning if something's older than today - a year, then update it. That will get you a quick search on that field. The function reduces the optimizer using the index on date.

Second, how many flags are there? Can you do an IN(1, 2, 3,4 ) instead of 5? The usually results in a scan of all data.

Lastly, not sure of your other queries, but if you can combine any indexes into a compound index, then you can gain a couple efficiencies. First less indexes to update and second, they might make some queries faster. Need to be sure you aren't making others slower though.

mrpolecat SSCertifiable Points: 6964 More actions · Answer 3

ok now I have this.

update tb_leadbox set invalidflag=6 where

uid2 not in

(select uid2 from (

select min(uid2)as uid2,fname,lname ,homephone from tb_leadbox where entrydate >= dateadd(dd,-365,getdate())

group by fname,lname,homephone

) t2) and invalidflag in (0,1,2,3,4,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30) and entrydate >= dateadd(dd,-365,getdate())

when I run the execution plan on the old and new together the new takes 49.63% and the old takes 50.37%. There are 5 table spools that occur which each take 17%.

Steve Jones - SSC Editor SSC Guru Points: 734418 More actions · Answer 4

Get rid of the function.

declare @dt datetime

select @dt = dateadd( d, -365, getdate())

update

...

and entrydate < @dt

That gets you a scalar. It's possible that you can't get this to run better because of the distribution of data. It might be that all of the data is mostly not 5s, so it scans anyway.

Not sure what else to change. That min() clause is probably eating up resources, but not sure how to handle that one.

Jeff Moden SSC Guru Points: 1003853 More actions · Answer 5

Hey there PoleCat... long time no "see"...

Is it the goal of this bit of code to update all but the "first" row for each customer what has been entered within the last year having any other code than 5 to a code 6?

Also, what do you mean by "long time"? I just ran your code on a 2 million row test table... took 12 seconds to find and update 6400 rows... how many are you updating?

Last but not least, the column you are updating appears in 4 different indexs... those take time to update... you sure you can't knock a few of those out?

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

ChrisM@Work SSC Guru Points: 186127 More actions · Answer 6

Hi Mr Polecat

Do you need columns other than uid2 in your inner derived table?

update tb_leadbox

set invalidflag=6

where uid2 not in

(select min(uid2) as uid2

from tb_leadbox

where entrydate >= dateadd(dd,-365,getdate())

group by fname,lname,homephone

)

and invalidflag in (0,1,2,3,4,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30)

and entrydate >= dateadd(dd,-365,getdate())

Cheers

ChrisM

^{“Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw}

For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

mrpolecat SSCertifiable Points: 6964 More actions · Answer 7

Jeff Moden (9/29/2007)
Hey there PoleCat... long time no "see"...

compulsive personality. I see an article that interests me and stick around answering questions for a few days then find something else to do.

Is it the goal of this bit of code to update all but the "first" row for each customer what has been entered within the last year having any other code than 5 to a code 6?

Yes

Also, what do you mean by "long time"? I just ran your code on a 2 million row test table... took 12 seconds to find and update 6400 rows... how many are you updating?

1 hour + to update 11K but it is a really old server, 512MB ram and 2 300Mhz processors (yes that's an M not a G)

Last but not least, the column you are updating appears in 4 different indexs... those take time to update... you sure you can't knock a few of those out?

Everytime a new report/process is built a new index is created to speed up that report/process. Yes we could stand to lose a few but I'm not sure how it would effect other users. I am going to try rebuiolding the indices and see if that helps. I've always blamed the code but if it ran ok on your server maybe I need to look elsewhere.

mrpolecat SSCertifiable Points: 6964 More actions · Answer 8

Chris, I think you are right that I could get rid of a nest by removing the extra fields. I will give it a test. I also think a left join to the derived table may be faster than the "not in .."

Jeff Moden SSC Guru Points: 1003853 More actions · Answer 9

really old server, 512MB ram and 2 300Mhz processors (yes that's an M not a G)

I'm thinking that you've squeezed just about all the speed you're gonna get out of that... might be a few tricks left but I'm not sure what they might be...

fyi... NOT IN is gonna be a tiny bit faster than an outer join because is has one less step in execution plan.

I think the big problem is all the reporting indexes you mentioned... and, any triggers on the table?

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

SwePeso SSC-Dedicated Points: 39752 More actions · Answer 10

UPDATEy

SETy.InvalidFlag = 6

FROMtb_LeadBox AS y

LEFT JOIN(

SELECTMIN(UId2) AS UId2

FROMtb_LeadBox

WHEREEntryDate >= DATEADD(YEAR, -1, GETDATE())

GROUP BYFName,

LName,

HomePhone

) AS x ON x.UId2 = y.UId2

WHEREx.UId2 IS NULL

AND y.InvalidFlag IN (0, 1, 2, 3, 4, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30)

AND y.EntryDate >= DATEADD(YEAR, -1, GETDATE())

N 56°04'39.16"
E 12°55'05.25"

mrpolecat SSCertifiable Points: 6964 More actions · Answer 11

I ran this test changing my update to a select

--old query with Steve's suggestion for the date parameter and Chris's suggestion to remove one of the nests.

begin

declare @dt datetime

select @dt = dateadd( d, -365, getdate())

select uid2,invalidflag from tb_leadbox where

uid2 not in

(

select min(uid2)as uid2 from tb_leadbox where entrydate >= @dt

group by fname,lname,homephone

) and invalidflag <> 5 and entrydate >= @dt

end

-- took 1 hr 50 minutes

--new query with join similar to Peter's

begin

declare @dt datetime

select @dt = dateadd( d, -365, getdate())

select t1.uid2,invalidflag

from tb_leadbox t1

left join

(select min(uid2)as uid2 from tb_leadbox where entrydate >= @dt

group by fname,lname,homephone) t2

on t1.uid2=t2.uid2

where t2.uid2 is null and invalidflag <> 5 and entrydate >= @dt

end

--took 1 minute 11 seconds

Quite a perfomance boost for the join

I will test the update next.

Jeff- yes there is one trigger on update.

ChrisM@Work SSC Guru Points: 186127 More actions · Answer 12

mrpolecat (10/2/2007)
.
.
.
-- took 1 hr 50 minutes
--new query with join similar to Peter's
.
.
.
--took 1 minute 11 seconds
Quite a perfomance boost for the join

No kidding Mr Polecat, nice result.

^{“Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw}

For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

mrpolecat SSCertifiable Points: 6964 More actions · Answer 13

Yeah, I just tested the update and it only took 5 minutes 40 seconds.

Jeff Moden SSC Guru Points: 1003853 More actions · Answer 14

Perfect... Nice "kick in" Peter... I'm gonna go back to sucking my thumb now... 😛

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)