Table with 3 million-plus records taking half a minute; can I improve performance?

Question

HildaJ

SSCarpal Tunnel

Points: 4516
August 6, 2015 at 3:53 pm

#306408

Thank you in advance for any help.
I have a table on which I need to run some computations over all the data, but first I need to remove the duplicate records and insert the results into a destination table. The example is below. My table has 3.1 million rows. I have tried both DISTINCT and GROUP BY, but either way the SELECT takes about half a minute to run. I'm wondering if there is a way to improve performance. Users are OK with this time since the process runs overnight, but improving it won't hurt. I do have a clustered index on these fields, but that doesn't seem to help.
SELECT DateYear,
       DateMonth,
       Nbr,
       Nbr1,
       Nbr2,
       Datafield1,
       Datafield2,
       ID
FROM [dbo].[TableName]
WHERE 1 = 1
  AND COALESCE([ID], 0) > 0 -- ID is sometimes NULL; we don't want these records
GROUP BY DateYear,
         DateMonth,
         Nbr,
         Nbr1,
         Nbr2,
         Datafield1,
         Datafield2,
         ID

Andrew P SSCarpal Tunnel Points: 4649 · Answer 1

I don't think you need that COALESCE call. NULL > 0 is not true, so [ID] > 0 should be equivalent. My memory is a little rusty on this, so you may want to confirm.

Either way, the COALESCE call is evaluated on every row. If you want to be explicit about it, try replacing that COALESCE call with ([ID] > 0 AND [ID] IS NOT NULL).

ChrisM@Work SSC Guru Points: 186127 · Answer 2

Get rid of the COALESCE; as Andrew points out, it's not required and will hurt performance.

The speed of this query will depend on whether or not there's an index in place to support a streaming aggregate. With the COALESCE removed, can you post the actual execution plan as an attachment, along with the DDL for the table and indexes? Cheers.
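For anyone following along, here is a minimal sketch of how those details could be captured in SSMS, using the query from the first post with the COALESCE removed. It assumes only the table and column names already shown in the thread; nothing else about the original poster's environment is known.

```sql
-- Report logical reads and CPU/elapsed time in the Messages tab.
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
-- Return the actual execution plan as XML alongside the result set.
SET STATISTICS XML ON;
GO

SELECT DateYear, DateMonth, Nbr, Nbr1, Nbr2, Datafield1, Datafield2, ID
FROM dbo.TableName
WHERE ID > 0
GROUP BY DateYear, DateMonth, Nbr, Nbr1, Nbr2, Datafield1, Datafield2, ID;
GO

SET STATISTICS XML OFF;
SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;
GO

-- List the indexes defined on the table (a quick check before scripting the full DDL).
EXEC sp_helpindex N'dbo.TableName';
```

In the resulting plan, the thing to look for is whether the de-duplication shows up as a Stream Aggregate, a Hash Match (Aggregate), or a Distinct Sort, and whether a separate Sort operator feeds it.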

Sean Lange SSC Guru Points: 286573 · Answer 3

To help clear out the "fuzzies": you don't need to check for IS NOT NULL. Simply ID > 0 will remove any rows where ID is NULL. NULL is not equal to, greater than, or less than 0. It is NULL, so the comparison evaluates to UNKNOWN and the row is filtered out.
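A small sketch to confirm this on a scratch connection. The table variable @Sample and its values are made up for illustration, and ID is assumed to be an int, since the thread does not give the real column type.

```sql
-- Hypothetical sample data; not taken from the original poster's table.
DECLARE @Sample TABLE (ID int NULL);
INSERT INTO @Sample (ID) VALUES (NULL), (0), (-5), (1), (42);

-- Original predicate: NULL is first replaced by 0, then compared.
SELECT ID FROM @Sample WHERE COALESCE(ID, 0) > 0;

-- Simplified predicate: NULL > 0 is UNKNOWN, so NULL rows are dropped anyway.
SELECT ID FROM @Sample WHERE ID > 0;

-- Rows the first predicate keeps but the second does not; this comes back empty.
SELECT ID FROM @Sample WHERE COALESCE(ID, 0) > 0
EXCEPT
SELECT ID FROM @Sample WHERE ID > 0;
```

Both SELECTs return only the rows with ID 1 and 42. The simplified form also leaves the column bare in the WHERE clause, which gives the optimizer a chance to use an index on ID if one exists.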

SolveSQL SSChasing Mays Points: 647 · Answer 4

-- Keep your query as simple as possible
SELECT DISTINCT DateYear, DateMonth, Nbr, Nbr1, Nbr2, Datafield1, Datafield2, ID
FROM [dbo].[TableName]
WHERE [ID] > 0

If possible, try to reduce the number of columns you run DISTINCT against, and then use a join to get the rest of the columns.

HildaJ SSCarpal Tunnel Points: 4516 · Answer 5

Thanks for the advice. I removed the COALESCE and it improved the query by roughly three seconds. Unfortunately, I do have to have all those columns.

ScottPletcher SSC Guru Points: 100949 · Answer 6

GROUP BY will typically perform less well than DISTINCT:

SELECT DISTINCT DateYear,
       DateMonth,
       Nbr,
       Nbr1,
       Nbr2,
       Datafield1,
       Datafield2,
       ID
FROM [dbo].[TableName]
WHERE 1 = 1
  AND ID > 0

ilovedata Old Hand Points: 377 · Answer 7

Since you said it's a nightly job, could you use SSIS and remove the duplicates with the Sort transformation? If it has to be done in T-SQL only, why not use ROW_NUMBER() to number the rows within each group of duplicates and delete (or skip) every row numbered greater than 1?
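A rough sketch of the ROW_NUMBER() idea, written to load the de-duplicated rows into the destination rather than delete from the source. dbo.DestinationTable is a hypothetical name (the thread never names the destination), and its column list is assumed to match the source columns.

```sql
-- Number the rows inside each group of identical rows, then keep only the first one.
WITH Numbered AS
(
    SELECT DateYear, DateMonth, Nbr, Nbr1, Nbr2, Datafield1, Datafield2, ID,
           ROW_NUMBER() OVER (
               PARTITION BY DateYear, DateMonth, Nbr, Nbr1, Nbr2,
                            Datafield1, Datafield2, ID
               -- Rows in a partition are identical, so their relative order does not matter.
               ORDER BY (SELECT NULL)
           ) AS rn
    FROM dbo.TableName
    WHERE ID > 0
)
INSERT INTO dbo.DestinationTable   -- hypothetical destination table
       (DateYear, DateMonth, Nbr, Nbr1, Nbr2, Datafield1, Datafield2, ID)
SELECT DateYear, DateMonth, Nbr, Nbr1, Nbr2, Datafield1, Datafield2, ID
FROM Numbered
WHERE rn = 1;
```

This still has to group or sort all 3.1 million rows, so it is not guaranteed to beat plain DISTINCT; it would need to be timed against the existing query.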

https://www.youtube.com/watch?v=4fl1tN2DSQE

https://www.youtube.com/watch?v=O4pf7m-VTSk

Just a thought. Hope it helps.