deduping query optimization

  • I have the following deduping query which takes a long time to run and I'm sure there is a better way to write it. The query is run everyday after all of the new records have been entered for the day. There are 2 Million rows in the table and 3K-5K added each day, some through batch import and some through data entry. Any ideas are appreciated.

    update tb_leadbox set invalidflag=6 where

    uid2 not in

    (select uid2 from (

    select min(uid2)as uid2,fname,lname ,homephone from tb_leadbox where datediff(dd,entrydate,getdate()) <=365

    group by fname,lname,homephone

    ) t2) and invalidflag <> 5 and datediff(dd,entrydate,getdate()) <=365

    tb_leadbox has the following fields pertaining to this query

    uid (uniqueidentifier, Not Null) PK

    uid2 (numeric(18,0),Not Nill) identity

    entrydate (datetime, Null)

    fname (varchar(50), Null)

    lname (varchar(50), Null)

    homephone (varchar(12), Null)

    invalidflag (int, Null)

    tb_leadbox has the following indexes pertaining to this query

    uid2 unique non clustered

    uid2, invalidflag unique non clustered

    homephone, workphone, invalid flag non clustered

    lastname, invalidflag non clustered

    homephone, uid2 non clustered

    invlaidflag non clustered

    fname,lname,homephone non clustered

    fname non clustered

    homephone non clustered


  • should have mentioned this is SQL7 and upgrade is not a possibility.


  • A couple thoughts on this.

    First, if you can eliminate functions, you can really speed things up. Since you're looking for stuff compare to today, you might be better off calculating the date that matters, meaning if something's older than today - a year, then update it. That will get you a quick search on that field. The function reduces the optimizer using the index on date.

    Second, how many flags are there? Can you do an IN(1, 2, 3,4 ) instead of 5? The usually results in a scan of all data.

    Lastly, not sure of your other queries, but if you can combine any indexes into a compound index, then you can gain a couple efficiencies. First less indexes to update and second, they might make some queries faster. Need to be sure you aren't making others slower though.

  • ok now I have this.

    update tb_leadbox set invalidflag=6 where

    uid2 not in

    (select uid2 from (

    select min(uid2)as uid2,fname,lname ,homephone from tb_leadbox where entrydate >= dateadd(dd,-365,getdate())

    group by fname,lname,homephone

    ) t2) and invalidflag in (0,1,2,3,4,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30) and entrydate >= dateadd(dd,-365,getdate())

    when I run the execution plan on the old and new together the new takes 49.63% and the old takes 50.37%. There are 5 table spools that occur which each take 17%.


  • Get rid of the function.

    declare @dt datetime

    select @dt = dateadd( d, -365, getdate())

    update

    ...

    and entrydate < @dt

    That gets you a scalar. It's possible that you can't get this to run better because of the distribution of data. It might be that all of the data is mostly not 5s, so it scans anyway.

    Not sure what else to change. That min() clause is probably eating up resources, but not sure how to handle that one.

  • Hey there PoleCat... long time no "see"...

    Is it the goal of this bit of code to update all but the "first" row for each customer what has been entered within the last year having any other code than 5 to a code 6?

    Also, what do you mean by "long time"? I just ran your code on a 2 million row test table... took 12 seconds to find and update 6400 rows... how many are you updating?

    Last but not least, the column you are updating appears in 4 different indexs... those take time to update... you sure you can't knock a few of those out?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Hi Mr Polecat

    Do you need columns other than uid2 in your inner derived table?

    update tb_leadbox

    set invalidflag=6

    where uid2 not in

    (select min(uid2) as uid2

    from tb_leadbox

    where entrydate >= dateadd(dd,-365,getdate())

    group by fname,lname,homephone

    )

    and invalidflag in (0,1,2,3,4,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30)

    and entrydate >= dateadd(dd,-365,getdate())

    Cheers

    ChrisM

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • Jeff Moden (9/29/2007)


    Hey there PoleCat... long time no "see"...

    compulsive personality. I see an article that interests me and stick around answering questions for a few days then find something else to do.

    Is it the goal of this bit of code to update all but the "first" row for each customer what has been entered within the last year having any other code than 5 to a code 6?

    Yes

    Also, what do you mean by "long time"? I just ran your code on a 2 million row test table... took 12 seconds to find and update 6400 rows... how many are you updating?

    1 hour + to update 11K but it is a really old server, 512MB ram and 2 300Mhz processors (yes that's an M not a G)

    Last but not least, the column you are updating appears in 4 different indexs... those take time to update... you sure you can't knock a few of those out?

    Everytime a new report/process is built a new index is created to speed up that report/process. Yes we could stand to lose a few but I'm not sure how it would effect other users. I am going to try rebuiolding the indices and see if that helps. I've always blamed the code but if it ran ok on your server maybe I need to look elsewhere.


  • Chris, I think you are right that I could get rid of a nest by removing the extra fields. I will give it a test. I also think a left join to the derived table may be faster than the "not in .."


  • really old server, 512MB ram and 2 300Mhz processors (yes that's an M not a G)

    I'm thinking that you've squeezed just about all the speed you're gonna get out of that... might be a few tricks left but I'm not sure what they might be...

    fyi... NOT IN is gonna be a tiny bit faster than an outer join because is has one less step in execution plan.

    I think the big problem is all the reporting indexes you mentioned... and, any triggers on the table?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • UPDATEy

    SETy.InvalidFlag = 6

    FROMtb_LeadBox AS y

    LEFT JOIN(

    SELECTMIN(UId2) AS UId2

    FROMtb_LeadBox

    WHEREEntryDate >= DATEADD(YEAR, -1, GETDATE())

    GROUP BYFName,

    LName,

    HomePhone

    ) AS x ON x.UId2 = y.UId2

    WHEREx.UId2 IS NULL

    AND y.InvalidFlag IN (0, 1, 2, 3, 4, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30)

    AND y.EntryDate >= DATEADD(YEAR, -1, GETDATE())


    N 56°04'39.16"
    E 12°55'05.25"

  • I ran this test changing my update to a select

    --old query with Steve's suggestion for the date parameter and Chris's suggestion to remove one of the nests.

    begin

    declare @dt datetime

    select @dt = dateadd( d, -365, getdate())

    select uid2,invalidflag from tb_leadbox where

    uid2 not in

    (

    select min(uid2)as uid2 from tb_leadbox where entrydate >= @dt

    group by fname,lname,homephone

    ) and invalidflag <> 5 and entrydate >= @dt

    end

    -- took 1 hr 50 minutes

    --new query with join similar to Peter's

    begin

    declare @dt datetime

    select @dt = dateadd( d, -365, getdate())

    select t1.uid2,invalidflag

    from tb_leadbox t1

    left join

    (select min(uid2)as uid2 from tb_leadbox where entrydate >= @dt

    group by fname,lname,homephone) t2

    on t1.uid2=t2.uid2

    where t2.uid2 is null and invalidflag <> 5 and entrydate >= @dt

    end

    --took 1 minute 11 seconds

    Quite a perfomance boost for the join

    I will test the update next.

    Jeff- yes there is one trigger on update.


  • mrpolecat (10/2/2007)


    .

    .

    .

    -- took 1 hr 50 minutes

    --new query with join similar to Peter's

    .

    .

    .

    --took 1 minute 11 seconds

    Quite a perfomance boost for the join

    No kidding Mr Polecat, nice result.

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • Yeah, I just tested the update and it only took 5 minutes 40 seconds.


  • Perfect... Nice "kick in" Peter... I'm gonna go back to sucking my thumb now... 😛

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Viewing 15 posts - 1 through 15 (of 15 total)

You must be logged in to reply to this topic. Login to reply