How to Optimize this SQL query of finding Duplicates for Better Database performance

Question

How to Optimize this SQL query of finding Duplicates for Better Database performance

Jackson89

Ten Centuries

Points: 1141
More actions
January 6, 2015 at 10:56 pm

#314331

Hi All,
I am new to Database technology, I have come up with the below SQL query with the help of this Fourm Members.
This SQL query finds duplicate records present in dbo.Changes_DupCk table by comparing it with dbo.Changes and Changes_Frze tables. The three tables are as below.
1. dbo.Changes_DupCk
2. dbo.Changes
3. dbo.Changes_Frze
The work involves is every week, data from Changes_DupCk table has to be loaded into dbo.Changes and Changes_Frze table. But before loading we have to make sure that dbo.Changes_DupCk table has no dupicate record which is already present in one of the other two tables.
A Record in Changes_DupCk becomes duplicate if the requestnumber,suffix,AmountTypeCode, EffDate,ExpDate,Amount,AdjustmentType,AdjustmentStatus,AdjustmentUserId matches with a record in dbo.Changes or dbo.Changes_Frze table
All three have same table structure mentioned below
CREATE TABLE [dbo].[Changes_Frze](
[requestNumber] [char](12) NOT NULL,
[suffix] [char](3) NOT NULL,
[AmountTypeCode] [char](2) NOT NULL,
[EffDate] [datetime2](3) NOT NULL,
[CreationDateTime] [datetime2](3) NOT NULL,
[ExpDate] [datetime2](3) NULL,
[Amount] [money] NULL,
[AdjustmentUserId] [varchar](10) NULL,
[AdjustmentReason] [varchar](255) NULL,
[AdjustmentManagerId] [varchar](10) NULL,
[AdjustmentStatus] [char](1) NOT NULL,
[AdjustmentType] [char](1) NOT NULL,
[AdjustmentReasonId] [smallint] NULL
) ON [PRIMARY]
GO
The SQL query I came up is
WITH
cteGetAllRows AS
( --=== Combine all the rows as if they were from one table using UNION ALL, marking each row with a table #.
SELECT *,TableNo = 1 FROM dbo.Changes_DupCk UNION ALL
SELECT *,TableNo = 2 FROM dbo.Changes UNION ALL
SELECT *,TableNo = 3 FROM dbo.Changes_Frze
)
,
cteMarkDupes AS
( --=== Number all duplicate rows starting the count over for each row according to the PARTITION BY
SELECT DupeCheck = ROW_NUMBER() OVER (PARTITION BY [requestNumber],AmountTypeCode,TransactionEffDate,TransactionExpDate,Amount,AdjustmentType,AdjustmentStatus,AdjustmentUserId ORDER BY TransactionEffDate)
,*
FROM cteGetAllRows
) --=== Return only the duplicates not including the original of each duplicate.
SELECT *
FROM cteMarkDupes
WHERE DupeCheck > 1
;
This query assumes that all three tables might have duplicates. But in reality, only the Changes_DupCk table may contain duplicate record. Duplicate record in this case means that record is already present in one of other two tables. And dbo.Changes has millions of Records which all are unique. So making an UNION ALL will make my database slow, please suggest me how to optimise this query for better performance.

Viewing 15 posts - 1 through 15 (of 16 total)

You must be logged in to reply to this topic. Login to reply

kapil_kk SSC-Insane Points: 21316 More actions · Answer 1

Hi,

Instead of using UNION ALL you put can use Join and select out the common data from your table and after that you can add a column as a flag which will gets updated as 1 when commom values when matched...

then you can processed with other values whose flag is 0 so that common data will not come into the scenario of duplication

_______________________________________________________________
To get quick answer follow this link:
http://www.sqlservercentral.com/articles/Best+Practices/61537/

Jackson89 Ten Centuries Points: 1141 More actions · Answer 2

Hi Kapil,

Common data means Duplicate Data or Unique data . please explain How to set Flag over it.

Could you please provide an example on this scenario.

Jackson89 Ten Centuries Points: 1141 More actions · Answer 3

I have written this query for checking duplicates by matching request no . But its not checking the third inner join for Duplicates, It only checks the first inner join for duplicates.

SELECT * FROM dbo.Changes_DupCk INNER JOIN dbo.Changes ON dbo.FinancialChangesClaims_DupCk.requestNumber=dbo.lChanges.requestNumber INNER JOIN dbo.Changes_ClmFrze ON dbo.Changes_DupCk.requestNumber=dbo.Changes_ClmFrze.requestNumber

Please advise what is the mistake.

Gail Shaw SSC Guru Points: 1004514 More actions · Answer 4

For performance questions, we really need to see table definitions, all index definitions and the actual execution plan (saved as a .sqlplan file and attached)

Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

We walk in the dark places no others will enter
We stand on the bridge and no one may pass

Jackson89 Ten Centuries Points: 1141 More actions · Answer 5

Hi Gila,

Please check the attached doc containing the table definition and indexes for all the three tables

and advise.

The table "changes" will have millions of data and so is "Changes_DupCk" . Out of the million data in "Changes_DupCk " only few will be actually duplicate, I mean only few records will be already present in "Changes" or "Changes_Frze".

Please optimize my code to tune the database performance as I am using Union ALL in my query which will impact database as it has more unique datas.

Gail Shaw SSC Guru Points: 1004514 More actions · Answer 6

Execution plan?

muralikrishna2489 (1/7/2015)
Please optimize my code to tune the database performance as I am using Union ALL in my query which will impact database as it has more unique datas.

Does it? You've tested and proven that union all is slow on larger resultsets and it's due to the union all?

Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

We walk in the dark places no others will enter
We stand on the bridge and no one may pass

Jackson89 Ten Centuries Points: 1141 More actions · Answer 7

Hi Gila,

I dont know how to put a execution plan in SQL. I have created in Doc and attached.

Regarding UNION ALL, I was advised that Since both Changes and Changes_DupCK table has millions of data , Giving UNION ALL will impact the database speed.

That is why I am looking for optimizing the code.

Please advise.

Grant Fritchey SSC Guru Points: 399381 More actions · Answer 8

muralikrishna2489 (1/7/2015)
Hi Gila,
I dont know how to put a execution plan in SQL. I have created in Doc and attached.
Regarding UNION ALL, I was advised that Since both Changes and Changes_DupCK table has millions of data , Giving UNION ALL will impact the database speed.
That is why I am looking for optimizing the code.
Please advise.

That's not quite an execution plan. You use the key command CTL-M in a query window in SQL SErver Management Studio to enable the capture of an execution plan. Run the query. You'll see a new tab available with a bunch of icons and lines. That's the plan. Right click on it, save as, creating a file with a .sqlplan extension as Gail says.

"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt

Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning

Gail Shaw SSC Guru Points: 1004514 More actions · Answer 9

muralikrishna2489 (1/7/2015)
I was advised that Since both Changes and Changes_DupCK table has millions of data , Giving UNION ALL will impact the database speed.

May I suggest you read through this: http://www.sqlservercentral.com/articles/Performance+Tuning/115825/

Is the query slow?

Has it been tested with production volumes of data? If so, is it unacceptably slow?

Unless you can answer both questions with 'Yes', this line of inquiry may be a waste of time.

Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

We walk in the dark places no others will enter
We stand on the bridge and no one may pass

Jackson89 Ten Centuries Points: 1141 More actions · Answer 10

Hi Gila,

I have not tested with Production data, I will test it and let you know the status shortly.

Thanks Again.

laxmimanasa11 SSC Enthusiast Points: 112 More actions · Answer 11

I wonder if MERGE statement can be used in this case. would something like this improves the performance

Merge [dbo].[Changes] T1

using

[dbo].[Changes_DupCk] S

on

T1.requestnumber = S.requestnumber AND

T1.suffix = S.suffix AND

T1.AmountTypeCode = S.AmountTypeCode AND

T1.EffDate = S.EffDate AND

T1.ExpDate = S.ExpDate AND

T1.Amount= S.Amount AND

T1.AdjustmentType= S.AdjustmentType AND

T1.AdjustmentStatus = S.AdjustmentStatus AND

T1.AdjustmentUserId = S.AdjustmentUserId

WHEN NOT MATCHED THEN

INSERT (requestnumber,suffix,AmountTypeCode, EffDate,CreationDateTime,ExpDate,Amount,AdjustmentUserId,

AdjustmentReason,AdjustmentManagerId,AdjustmentStatus,AdjustmentType,AdjustmentReasonId)

VALUES(S.requestnumber,S.suffix,S.AmountTypeCode,S.EffDate,S.CreationDateTime,S.ExpDate,S.Amount,S.AdjustmentUserId,S.AdjustmentReason,S.AdjustmentManagerId,S.AdjustmentStatus,S.AdjustmentType,S.AdjustmentReasonId)

laxmimanasa11 SSC Enthusiast Points: 112 More actions · Answer 12

I wonder if MERGE statement can be used in this case. This script is only for [Changes] table . If it is performance effective we can implement the same for [Changes_Frze].

Merge [dbo].[Changes] T1

using

[dbo].[Changes_DupCk] S

on

T1.requestnumber = S.requestnumber AND

T1.suffix = S.suffix AND

T1.AmountTypeCode = S.AmountTypeCode AND

T1.EffDate = S.EffDate AND

T1.ExpDate = S.ExpDate AND

T1.Amount= S.Amount AND

T1.AdjustmentType= S.AdjustmentType AND

T1.AdjustmentStatus = S.AdjustmentStatus AND

T1.AdjustmentUserId = S.AdjustmentUserId

WHEN NOT MATCHED THEN

INSERT (requestnumber,suffix,AmountTypeCode, EffDate,CreationDateTime,ExpDate,Amount,AdjustmentUserId,

AdjustmentReason,AdjustmentManagerId,AdjustmentStatus,AdjustmentType,AdjustmentReasonId)

VALUES(S.requestnumber,S.suffix,S.AmountTypeCode,S.EffDate,S.CreationDateTime,S.ExpDate,S.Amount,S.AdjustmentUserId,S.AdjustmentReason,S.AdjustmentManagerId,S.AdjustmentStatus,S.AdjustmentType,S.AdjustmentReasonId)

Jackson89 Ten Centuries Points: 1141 More actions · Answer 13

Hi Laxmi,

Thanks for the suggestion.

I am not inserting anything into the table, I just need to Find the count of duplicates. Please tell why I need to use insert statements.

laxmimanasa11 SSC Enthusiast Points: 112 More actions · Answer 14

If it is only about finding the count of duplicate records then this query would work.

CREATE TABLE #DuplicateRecordFinder(

[requestNumCer] [char](12) NOT NULL,

[suffix] [char](3) NOT NULL,

[AmountTypeCode] [char](2) NOT NULL,

[EffDate] [datetime2](3) NOT NULL,

[CreationDateTime] [datetime2](3) NOT NULL,

[ExpDate] [datetime2](3) NULL,

[Amount] [money] NULL,

[AdjustmentUserId] [varchar](10) NULL,

[AdjustmentReason] [varchar](255) NULL,

[AdjustmentManagerId] [varchar](10) NULL,

[AdjustmentStatus] [char](1) NOT NULL,

[AdjustmentType] [char](1) NOT NULL,

[AdjustmentReasonId] [smallint] NULL

)

INSERT INTO #DuplicateRecordFinder

(

select Changes_DupCk.* from Changes_DupCk A join Changes_Frze B

on A.requestnumBer=B.requestnumBer and A.suffix =B.suffix and

A.AmountTypeBode =B.AmountTypeBode and A.EffDate =B.EffDate and

A.ExpDate =B.ExpDate and A.Amount =B.Amount and A.AdjustmentType =B.AdjustmentType and

A.AdjustmentStatus =B.AdjustmentStatus and A.AdjustmentUserId =B.AdjustmentUserId

union all

select Changes_DupCk.* from Changes_DupCk A join Changes C

on A.requestnumCer=C.requestnumCer and A.suffix =C.suffix and

A.AmountTypeCode =C.AmountTypeCode and A.EffDate =C.EffDate and

A.ExpDate =C.ExpDate and A.Amount =C.Amount and A.AdjustmentType =C.AdjustmentType and

A.AdjustmentStatus =C.AdjustmentStatus and A.AdjustmentUserId =C.AdjustmentUserId

)

select count(*) from #DuplicateRecordFinder

drop table #DuplicateRecordFinder