De-Duping runtime OLTP data

  • Greetings good folks at SQLSeverCentral

    I am having some performance issues with a job which identifies duplicate account registrations as the account's sign up on our website.

    We have 2 databases on the same physical server: DB1 and DB2

    TableA in DB1 has CustID(PK),Fname,Lname,Address,State,City,Zip,Email,Phone,JoinDate....(Another 30 fields)

    TableB in DB2 has CustID(PK),BirthDate,MotherMaidenName.....(Another 15 fields)

    So, as the customers signup, we have a job which fetches new accounts every 30 mins and puts them in a temporary table and then does a join on DB1.TableA and DB2.TableB on CustID and compares different combinations and flags those new accounts depending on similarity dispositions which are predefined. As I understand this is a RBAR operation.

    e.g. if the new account's Fname,Lname,Password are similar to any account in the JOIN table, then flag record as 'A'

    OR

    if the new account's State and Email (chars before '@') are the same then flag record as 'B'

    and so on....

    At this moment we have around 800K + records in database and the new signups in the 30 min period can vary from 50-60 to 300-400 depending on the time of the day. The job execution takes from 5 to 15 mins and although there are (nolocks) it still causes considerable CPU usage which I would like to avoid.

    Any help in changing the logic or suggestions to improve the performance & efficiency would be deeply appreciated.

    TIA

    Anish

  • The basic concept doesn't sound like RBAR (if done without any kind of a loop). However, it might be better to check for duplicates during signup process and not batch based.

    Regarding performance: please post the code (maybe reduced to a few checks to see the concept) together with table def for the tables involved including all related indexes as well as some sample data so we have something "to play with".

    To perform those checks shold be a matter of seconds, not minutes...



    Lutz
    A pessimist is an optimist with experience.

    How to get fast answers to your question[/url]
    How to post performance related questions[/url]
    Links for Tally Table [/url] , Cross Tabs [/url] and Dynamic Cross Tabs [/url], Delimited Split Function[/url]

  • Thanks LutzM for your response. Here's the code with the table structure:

    [font="Courier New"]/* This the JOIN which I use to grab the latest sign ups */

    SELECT TOP 200

    c.CustID,

    c.Fname,

    c.Lname,

    p.BirthDate,

    p.MotherMaidenName,

    c.Address,

    c.City,

    c.State,

    c.Zip,

    c.Email,

    c.Password,

    c.OpenDateTime,

    c.HomePhone

    FROM

    DB1.dbo.CustomerTable c WITH (NOLOCK)

    LEFT JOIN

    DB2.dbo.crmProfiles p WITH (NOLOCK)

    ON

    c.CustID = p.CustID

    WHERE

    OpenDateTime > @StartVerificationDate AND OpenDateTime <= @EndVerificationDate

    AND

    c.AgentId NOT IN ('XYX', 'ABC')

    ORDER BY

    OPenDateTime

    /*Please note that the PK is CustID which is a char(10) field and there is also a index which includes all the above fields. Following is the temp table which gets the latest account signups every 30 minutes. To reduce the time for processing we cap this table to 200 records */

    CREATE TABLE #JoinedUsers

    (

    JoinedUserID INT IDENTITY,

    CustID CHAR(10),

    Fname VARCHAR(30),

    Lname VARCHAR(30),

    BirthDate SMALLDATETIME,

    MotherMaidenName VARCHAR(50),

    Address VARCHAR(180),

    City VARCHAR(100),

    State CHAR(2),

    Zip VARCHAR(15),

    Email VARCHAR(40),

    Password VARCHAR(10),

    JoinDate DATETIME,

    Phone VARCHAR(25)

    )

    /* For each record in #JoinedUsers we iterate through the following code:

    (This is the block which takes the most time) */

    SELECT

    C.CustID,

    c.Fname,

    c.Lname,

    p.BirthDate,

    p.MotherMaidenName,

    c.Address,

    c.City,

    c.State,

    c.Zip,

    c.Email,

    c.Password,

    c.HomePhone

    FROM

    DB1.dbo.CustomerTable c WITH (NOLOCK)

    LEFT JOIN DB2.dbo.crmProfiles p WITH (NOLOCK)

    ON

    c.CustID = p.CustID,

    DB1.dbo.CustomerTable c2 WITH (NOLOCK)

    WHERE

    c.CustID <> @CustID

    AND c2.CustID = @CustID

    --EXCLUDE ATS CUSTOMERS

    AND c.AgentId <> 'XYZ'

    --search for all rules with OR condition

    --1. FirstLastState

    AND

    (

    (

    (@Fname IS NOT NULL

    AND @Fname <> ''

    AND c.Fname = @Fname)

    AND (@Lname IS NOT NULL

    AND @Lname <> ''

    AND c.Lname = @Lname)

    AND (@State IS NOT NULL

    AND @State <> ''

    AND c.State = @State)

    )

    --2. LastAdrLine1State

    OR(

    (@Lname IS NOT NULL

    AND @Lname <> ''

    AND c.Lname = @Lname)

    AND (

    CHARINDEX(' ', @Address) > 0

    AND SUBSTRING(c.Address, 0, CHARINDEX(' ', c.Address)) =

    SUBSTRING(@Address, 0, CHARINDEX(' ', @Address)))

    AND (@State IS NOT NULL

    AND @State <> ''

    AND c.State = @State)

    )

    --3. MMNZip

    OR(

    (@MotherMaidenName IS NOT NULL

    AND @MotherMaidenName <> ''

    AND p.MotherMaidenName = @MotherMaidenName)

    AND (@Zip IS NOT NULL

    AND c.Zip = @Zip)

    )

    --4. EmailNoDomain

    OR(

    CHARINDEX('@', @Email) > 0

    AND SUBSTRING(c.Email, 0, CHARINDEX('@', c.Email)) =

    SUBSTRING(@Email, 0, CHARINDEX('@', @Email))

    )

    --5. LastDOBState

    OR(

    (@Lname IS NOT NULL

    AND @Lname <> ''

    AND c.Lname = @Lname)

    AND (@BirthDate IS NOT NULL

    AND @BirthDate <> ''

    AND p.BirthDate = @BirthDate)

    AND (@State IS NOT NULL

    AND @State <> ''

    AND c.State = @State)

    )

    --6. PassState

    OR(

    (@Password IS NOT NULL

    AND @Password <> ''

    AND c.Password = @Password)

    AND (@State IS NOT NULL

    AND @State <> ''

    AND c.State = @State)

    )

    --7. Phone

    OR(

    @Phone IS NOT NULL

    AND @Phone <> ''

    AND c.HomePhone = @Phone

    )

    ) --end big AND

    --IF NO MATCHES FOUND, DELETE THIS CUSTOMER FROM #JoinedUsers

    IF @@ROWCOUNT = 0

    DELETE FROM #JoinedUsers WHERE CustID = @CustID

    /* If matches are found then we use it to flag another table which marks the customer as suspicious */[/font]

    LutzM (9/10/2010)


    The basic concept doesn't sound like RBAR (if done without any kind of a loop). However, it might be better to check for duplicates during signup process and not batch based.

    Regarding performance: please post the code (maybe reduced to a few checks to see the concept) together with table def for the tables involved including all related indexes as well as some sample data so we have something "to play with".

    To perform those checks shold be a matter of seconds, not minutes...

  • This is the major cause for the poor performance:

    /* For each record in #JoinedUsers we iterate through the following code:

    (This is the block which takes the most time) */

    That's the reason why I asked you to post the code. Unfortunately, instead of removing some of the checks you obviously removed the cursor in your sample. But the cursor is most probably the major part of the problem. So, please post the complete procedure including the DELETE and "mark suspicious" section.

    In order to test our solutions we would nee table def for DB1.dbo.CustomerTable as well as DB2.dbo.crmProfiles including some sample data.

    Also, please post some sample rows for #JoinedUsers as well as your expected result.

    It would also help if you'd post the actual execution plan for one iteration.



    Lutz
    A pessimist is an optimist with experience.

    How to get fast answers to your question[/url]
    How to post performance related questions[/url]
    Links for Tally Table [/url] , Cross Tabs [/url] and Dynamic Cross Tabs [/url], Delimited Split Function[/url]

  • Hi! Lutz

    Thanks for helping out on the weekend.

    As I had mentioned earlier we are not using a cursors but, temporary tables with WHILE loop constructs.

    I am attaching the stored procs and the table definitions and the sqlplan for the relevant section in a zipped file. I hope its useful.

    Thanks

    Anish

Viewing 5 posts - 1 through 4 (of 4 total)

You must be logged in to reply to this topic. Login to reply