How to Optimize this SQL query of finding Duplicates for Better Database performance

  • Hi All,

    I am new to Database technology, I have come up with the below SQL query with the help of this Fourm Members.

    This SQL query finds duplicate records present in dbo.Changes_DupCk table by comparing it with dbo.Changes and Changes_Frze tables. The three tables are as below.

    1. dbo.Changes_DupCk

    2. dbo.Changes

    3. dbo.Changes_Frze

    The work involves is every week, data from Changes_DupCk table has to be loaded into dbo.Changes and Changes_Frze table. But before loading we have to make sure that dbo.Changes_DupCk table has no dupicate record which is already present in one of the other two tables.

    A Record in Changes_DupCk becomes duplicate if the requestnumber,suffix,AmountTypeCode, EffDate,ExpDate,Amount,AdjustmentType,AdjustmentStatus,AdjustmentUserId matches with a record in dbo.Changes or dbo.Changes_Frze table

    All three have same table structure mentioned below

    CREATE TABLE [dbo].[Changes_Frze](

    [requestNumber] [char](12) NOT NULL,

    [suffix] [char](3) NOT NULL,

    [AmountTypeCode] [char](2) NOT NULL,

    [EffDate] [datetime2](3) NOT NULL,

    [CreationDateTime] [datetime2](3) NOT NULL,

    [ExpDate] [datetime2](3) NULL,

    [Amount] [money] NULL,

    [AdjustmentUserId] [varchar](10) NULL,

    [AdjustmentReason] [varchar](255) NULL,

    [AdjustmentManagerId] [varchar](10) NULL,

    [AdjustmentStatus] [char](1) NOT NULL,

    [AdjustmentType] [char](1) NOT NULL,

    [AdjustmentReasonId] [smallint] NULL

    ) ON [PRIMARY]

    GO

    The SQL query I came up is

    WITH

    cteGetAllRows AS

    ( --=== Combine all the rows as if they were from one table using UNION ALL, marking each row with a table #.

    SELECT *,TableNo = 1 FROM dbo.Changes_DupCk UNION ALL

    SELECT *,TableNo = 2 FROM dbo.Changes UNION ALL

    SELECT *,TableNo = 3 FROM dbo.Changes_Frze

    )

    ,

    cteMarkDupes AS

    ( --=== Number all duplicate rows starting the count over for each row according to the PARTITION BY

    SELECT DupeCheck = ROW_NUMBER() OVER (PARTITION BY [requestNumber],AmountTypeCode,TransactionEffDate,TransactionExpDate,Amount,AdjustmentType,AdjustmentStatus,AdjustmentUserId ORDER BY TransactionEffDate)

    ,*

    FROM cteGetAllRows

    ) --=== Return only the duplicates not including the original of each duplicate.

    SELECT *

    FROM cteMarkDupes

    WHERE DupeCheck > 1

    ;

    This query assumes that all three tables might have duplicates. But in reality, only the Changes_DupCk table may contain duplicate record. Duplicate record in this case means that record is already present in one of other two tables. And dbo.Changes has millions of Records which all are unique. So making an UNION ALL will make my database slow, please suggest me how to optimise this query for better performance.

  • Hi,

    Instead of using UNION ALL you put can use Join and select out the common data from your table and after that you can add a column as a flag which will gets updated as 1 when commom values when matched...

    then you can processed with other values whose flag is 0 so that common data will not come into the scenario of duplication

    _______________________________________________________________
    To get quick answer follow this link:
    http://www.sqlservercentral.com/articles/Best+Practices/61537/

  • Hi Kapil,

    Common data means Duplicate Data or Unique data . please explain How to set Flag over it.

    Could you please provide an example on this scenario.

  • I have written this query for checking duplicates by matching request no . But its not checking the third inner join for Duplicates, It only checks the first inner join for duplicates.

    SELECT * FROM dbo.Changes_DupCk INNER JOIN dbo.Changes ON dbo.FinancialChangesClaims_DupCk.requestNumber=dbo.lChanges.requestNumber INNER JOIN dbo.Changes_ClmFrze ON dbo.Changes_DupCk.requestNumber=dbo.Changes_ClmFrze.requestNumber

    Please advise what is the mistake.

  • For performance questions, we really need to see table definitions, all index definitions and the actual execution plan (saved as a .sqlplan file and attached)

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Hi Gila,

    Please check the attached doc containing the table definition and indexes for all the three tables

    and advise.

    The table "changes" will have millions of data and so is "Changes_DupCk" . Out of the million data in "Changes_DupCk " only few will be actually duplicate, I mean only few records will be already present in "Changes" or "Changes_Frze".

    Please optimize my code to tune the database performance as I am using Union ALL in my query which will impact database as it has more unique datas.

  • Execution plan?

    muralikrishna2489 (1/7/2015)


    Please optimize my code to tune the database performance as I am using Union ALL in my query which will impact database as it has more unique datas.

    Does it? You've tested and proven that union all is slow on larger resultsets and it's due to the union all?

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Hi Gila,

    I dont know how to put a execution plan in SQL. I have created in Doc and attached.

    Regarding UNION ALL, I was advised that Since both Changes and Changes_DupCK table has millions of data , Giving UNION ALL will impact the database speed.

    That is why I am looking for optimizing the code.

    Please advise.

  • muralikrishna2489 (1/7/2015)


    Hi Gila,

    I dont know how to put a execution plan in SQL. I have created in Doc and attached.

    Regarding UNION ALL, I was advised that Since both Changes and Changes_DupCK table has millions of data , Giving UNION ALL will impact the database speed.

    That is why I am looking for optimizing the code.

    Please advise.

    That's not quite an execution plan. You use the key command CTL-M in a query window in SQL SErver Management Studio to enable the capture of an execution plan. Run the query. You'll see a new tab available with a bunch of icons and lines. That's the plan. Right click on it, save as, creating a file with a .sqlplan extension as Gail says.

    ----------------------------------------------------
    The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood... Theodore Roosevelt
    The Scary DBA
    Author of: SQL Server 2017 Query Performance Tuning, 5th Edition and SQL Server Execution Plans, 3rd Edition
    Product Evangelist for Red Gate Software

  • muralikrishna2489 (1/7/2015)


    I was advised that Since both Changes and Changes_DupCK table has millions of data , Giving UNION ALL will impact the database speed.

    May I suggest you read through this: http://www.sqlservercentral.com/articles/Performance+Tuning/115825/

    Is the query slow?

    Has it been tested with production volumes of data? If so, is it unacceptably slow?

    Unless you can answer both questions with 'Yes', this line of inquiry may be a waste of time.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Hi Gila,

    I have not tested with Production data, I will test it and let you know the status shortly.

    Thanks Again.

  • I wonder if MERGE statement can be used in this case. would something like this improves the performance

    Merge [dbo].[Changes] T1

    using

    [dbo].[Changes_DupCk] S

    on

    T1.requestnumber = S.requestnumber AND

    T1.suffix = S.suffix AND

    T1.AmountTypeCode = S.AmountTypeCode AND

    T1.EffDate = S.EffDate AND

    T1.ExpDate = S.ExpDate AND

    T1.Amount= S.Amount AND

    T1.AdjustmentType= S.AdjustmentType AND

    T1.AdjustmentStatus = S.AdjustmentStatus AND

    T1.AdjustmentUserId = S.AdjustmentUserId

    WHEN NOT MATCHED THEN

    INSERT (requestnumber,suffix,AmountTypeCode, EffDate,CreationDateTime,ExpDate,Amount,AdjustmentUserId,

    AdjustmentReason,AdjustmentManagerId,AdjustmentStatus,AdjustmentType,AdjustmentReasonId)

    VALUES(S.requestnumber,S.suffix,S.AmountTypeCode,S.EffDate,S.CreationDateTime,S.ExpDate,S.Amount,S.AdjustmentUserId,S.AdjustmentReason,S.AdjustmentManagerId,S.AdjustmentStatus,S.AdjustmentType,S.AdjustmentReasonId)

  • I wonder if MERGE statement can be used in this case. This script is only for [Changes] table . If it is performance effective we can implement the same for [Changes_Frze].

    Merge [dbo].[Changes] T1

    using

    [dbo].[Changes_DupCk] S

    on

    T1.requestnumber = S.requestnumber AND

    T1.suffix = S.suffix AND

    T1.AmountTypeCode = S.AmountTypeCode AND

    T1.EffDate = S.EffDate AND

    T1.ExpDate = S.ExpDate AND

    T1.Amount= S.Amount AND

    T1.AdjustmentType= S.AdjustmentType AND

    T1.AdjustmentStatus = S.AdjustmentStatus AND

    T1.AdjustmentUserId = S.AdjustmentUserId

    WHEN NOT MATCHED THEN

    INSERT (requestnumber,suffix,AmountTypeCode, EffDate,CreationDateTime,ExpDate,Amount,AdjustmentUserId,

    AdjustmentReason,AdjustmentManagerId,AdjustmentStatus,AdjustmentType,AdjustmentReasonId)

    VALUES(S.requestnumber,S.suffix,S.AmountTypeCode,S.EffDate,S.CreationDateTime,S.ExpDate,S.Amount,S.AdjustmentUserId,S.AdjustmentReason,S.AdjustmentManagerId,S.AdjustmentStatus,S.AdjustmentType,S.AdjustmentReasonId)

  • Hi Laxmi,

    Thanks for the suggestion.

    I am not inserting anything into the table, I just need to Find the count of duplicates. Please tell why I need to use insert statements.

  • If it is only about finding the count of duplicate records then this query would work.

    CREATE TABLE #DuplicateRecordFinder(

    [requestNumCer] [char](12) NOT NULL,

    [suffix] [char](3) NOT NULL,

    [AmountTypeCode] [char](2) NOT NULL,

    [EffDate] [datetime2](3) NOT NULL,

    [CreationDateTime] [datetime2](3) NOT NULL,

    [ExpDate] [datetime2](3) NULL,

    [Amount] [money] NULL,

    [AdjustmentUserId] [varchar](10) NULL,

    [AdjustmentReason] [varchar](255) NULL,

    [AdjustmentManagerId] [varchar](10) NULL,

    [AdjustmentStatus] [char](1) NOT NULL,

    [AdjustmentType] [char](1) NOT NULL,

    [AdjustmentReasonId] [smallint] NULL

    )

    INSERT INTO #DuplicateRecordFinder

    (

    select Changes_DupCk.* from Changes_DupCk A join Changes_Frze B

    on A.requestnumBer=B.requestnumBer and A.suffix =B.suffix and

    A.AmountTypeBode =B.AmountTypeBode and A.EffDate =B.EffDate and

    A.ExpDate =B.ExpDate and A.Amount =B.Amount and A.AdjustmentType =B.AdjustmentType and

    A.AdjustmentStatus =B.AdjustmentStatus and A.AdjustmentUserId =B.AdjustmentUserId

    union all

    select Changes_DupCk.* from Changes_DupCk A join Changes C

    on A.requestnumCer=C.requestnumCer and A.suffix =C.suffix and

    A.AmountTypeCode =C.AmountTypeCode and A.EffDate =C.EffDate and

    A.ExpDate =C.ExpDate and A.Amount =C.Amount and A.AdjustmentType =C.AdjustmentType and

    A.AdjustmentStatus =C.AdjustmentStatus and A.AdjustmentUserId =C.AdjustmentUserId

    )

    select count(*) from #DuplicateRecordFinder

    drop table #DuplicateRecordFinder

Viewing 15 posts - 1 through 15 (of 16 total)

You must be logged in to reply to this topic. Login to reply