August 5, 2011 at 2:54 am
Hi, Everyone.
I need help in improving the performance of our clean up script to remove duplicate data by extracting the unique records from the source table and then inserting these records to a clean table.
I have attached the created scripts for the two tables and the stored procedure used. The clean tables are just a replica of these two tables and we just added an extension "_clean" to the table name.
What we are planning is to clean data per day and transfer it to a clean table then somehow update the clean records to the production server.
Additional Details
* "SampleTable" contains 41,125,020 Million records.
* "SampleTableDetails" contains 1,321,512,023 Billion records.
We tried to execute the stored procedure spSampleCleanup and here is the details of the timing:
Note: that this was on executed for our sample tables with the following records.
* "SampleTable" that contains 205,821 thousand records.
* "SampleTableDetails" that contains 5,129,312 million records.
+ Processed the 205,821 data in SampleTable and inserted the 180,213 records to the SampleTable_Clean table for 4 seconds.
+ Processed the 5,129,312 records in SampleTableDetails and inserted the 4,123,221 records to the SampleTableDetails_Clean table for 12 hours +.
Now, I don't know if this is how far the T-SQL query execution performance can go or there are some other ways to improve this.
Any help is highly appreciated!
August 5, 2011 at 7:55 am
dunno maybe this is more efficient?
INSERT INTO SampleTableDetails_Clean
SELECT *
FROM SampleTableDetails
WHERE SampleTableId IN (
SELECT * FROM @UniqueIds
)
AND DATED BETWEEN @startDate AND @endDate
changes to:
INSERT INTO SampleTableDetails_Clean
SELECT a.*
FROM SampleTableDetails a
JOIN @UniqueIds b
ON a.SampleTableId = b.SampleTableId
WHERE DATED BETWEEN @startDate AND @endDate
dont know about your experience but I've always found joining to be faster than where-in clauses.
edit -- need select a.* not select * where theres a join.
Ben
^ Thats me!
----------------------------------------
01010111011010000110000101110100 01100001 0110001101101111011011010111000001101100011001010111010001100101 01110100011010010110110101100101 011101110110000101110011011101000110010101110010
----------------------------------------
August 5, 2011 at 8:13 am
and another thing (apologies if this is overly simplistic) but if at all possible make sure your server has more (preferably over double) RAM available to it than there is data in that table. otherwise you will find that the server will process 1.1billion rows in x minutes then return the remaining 0.2 billion rows in x days!
and, err, on your new server, add a unique key constraint on the table so you dont have to do this again!
Ben
^ Thats me!
----------------------------------------
01010111011010000110000101110100 01100001 0110001101101111011011010111000001101100011001010111010001100101 01110100011010010110110101100101 011101110110000101110011011101000110010101110010
----------------------------------------
August 5, 2011 at 10:11 am
Thanks for the input Ben. If I am going to re-index the table, would that help also?
August 5, 2011 at 10:51 am
Another approach is to "clean up" the duplicates by using a CTE. For example:
CREATE TABLE #T(Date_Stamp DATETIME,KW1 DECIMAL(5,1), KW2 DECIMAL(5,1))
INSERT INTO #T
SELECT '12/10/2010', 5.3, 3.1 UNION ALL
SELECT '12/10/2010', 5.3, 3.1 UNION ALL
SELECT '12/9/2010', 4, 2 UNION ALL
SELECT '12/8/2010', 3, 1 UNION ALL
SELECT '12/7/2010', 7.4, 5 UNION ALL
SELECT '12/7/2010', 7.4, 5 UNION ALL
SELECT '12/7/2010', 7.4, 5
;with cte
as (select row_number() over(partition by Date_Stamp,KW1,KW2 order by Date_Stamp) as rn,
KW1,KW2
from #T)
SELECT * FROM cte
Result:
rnKW1KW2
17.45.0
27.45.0
37.45.0
13.01.0
14.02.0
15.33.1
25.33.1
now using SELECT * FROM cte WHERE rn > 1 (selects the "unique" row)
Results:
17.45.0
13.01.0
14.02.0
15.33.1
Have not tested the above code on a record set as large as those you are processing but by selecting the "unique" row youl eliminate the need to insert into you "clean" table and might run in a short enough time to be adequate.
August 5, 2011 at 12:14 pm
bitbucket-25253 (8/5/2011)
Another approach is to "clean up" the duplicates by using a CTE. ....Have not tested the above code on a record set as large as those you are processing but by selecting the "unique" row youl eliminate the need to insert into you "clean" table and might run in a short enough time to be adequate.
Thanks bitbucket, but I am already using CTE in my stored procedure..please see attachment on my first post.
August 5, 2011 at 12:24 pm
James Tech (8/5/2011)
bitbucket-25253 (8/5/2011)
Another approach is to "clean up" the duplicates by using a CTE. ....Have not tested the above code on a record set as large as those you are processing but by selecting the "unique" row you eliminate the need to insert into you "clean" table and might run in a short enough time to be adequate.
Thanks bitbucket, but I am already using CTE in my stored procedure..please see attachment on my first post.
Sorry as a matter of general practice, do not like opening attachments, not knowing what is in them. (Phobia against malicious software which may or may not be present)
August 6, 2011 at 1:05 pm
I guess my first three recommendations would be...
1. Stop using a table variable for this. The data is way too big.
2. Partition the target tables by date.
3. Consider creating an archive plan here only the last 3 to 6 months of data is keep in the "in use" table and the rest is kept in a "mostly offline" table.
--Jeff Moden
Change is inevitable... Change for the better is not.
August 7, 2011 at 11:30 pm
Jeff Moden (8/6/2011)
I guess my first three recommendations would be...1. Stop using a table variable for this. The data is way too big.
2. Partition the target tables by date.
3. Consider creating an archive plan here only the last 3 to 6 months of data is keep in the "in use" table and the rest is kept in a "mostly offline" table.
Thanks Jeff and everyone.
1.) What would be the alternative to use for large data instead of table variable?
2.) Can you elaborate further when you partition the target tables by date? Is this using the PARTITION BY still? or physically partition the table?
Also, I notice something in the query that I provide...in the WHERE clause:
..WHERE DATED BETWEEN @startDate AND @endDate.
If I supply a date range e.g. WHERE DATED BETWEEN '2011/07/01 0:0:0' AND '2011/07/01 23:59:59:999', it will retrieve the July 2nd 00:00:00.000 records as well...Any thoughts? This is also applicable with other date range, it will retrieve the records a day after the end date.
3.) Please enlighten me on this item.
Thank you for your help Jeff.
August 8, 2011 at 8:07 am
August 8, 2011 at 5:18 pm
James Tech (8/7/2011)
Jeff Moden (8/6/2011)
I guess my first three recommendations would be...1. Stop using a table variable for this. The data is way too big.
2. Partition the target tables by date.
3. Consider creating an archive plan here only the last 3 to 6 months of data is keep in the "in use" table and the rest is kept in a "mostly offline" table.
Thanks Jeff and everyone.
1.) What would be the alternative to use for large data instead of table variable?
2.) Can you elaborate further when you partition the target tables by date? Is this using the PARTITION BY still? or physically partition the table?
Also, I notice something in the query that I provide...in the WHERE clause:
..WHERE DATED BETWEEN @startDate AND @endDate.
If I supply a date range e.g. WHERE DATED BETWEEN '2011/07/01 0:0:0' AND '2011/07/01 23:59:59:999', it will retrieve the July 2nd 00:00:00.000 records as well...Any thoughts? This is also applicable with other date range, it will retrieve the records a day after the end date.
3.) Please enlighten me on this item.
Thank you for your help Jeff.
Sure...
On 1), a Table Variable will be perceived to contain only 1 row (under most circumstances) when the execution plan is made. That can cause some pretty nasty performance problems because SQL Server may do something like picking a loop join (etc) for smaller tables instead of (possibly) a merge or hash join which usually works better for larger tables.
On 2), I'm talking about partitioning as you would find in example "H" under "Create Table" in Books Online. Physical partitioning across multiple file groups to bring more read-write heads (if your disk system is setup as such) into play would probably also be a benefit.
On 3), The DateTime datatype has a resolution of 3.3 milliseconds and will, off course, round up to the next day for times using 23:59:59.999. You should never use a time when you're trying to get a whole day. If you were trying to get all of the data for all of yesterday without regard to the time of day, then you should use code like the following (assuming that @DesiredDate is a "whole date" with midnight time or no time element)...
 WHERE somedatetimecolumn >= @DesiredDate AND somedatetimecolumn < DATEADD(dd,1,@DesiredDate )
Heh... and don't let anyone talk you into using 23:59:59.997 or any other time because it just makes a mess of things. 🙂
--Jeff Moden
Change is inevitable... Change for the better is not.
Viewing 11 posts - 1 through 11 (of 11 total)
You must be logged in to reply to this topic. Login to reply