Help with deletion of duplicate records

  • Hi there,

    I have a table that has multiple duplicate records. I do not want to truncate the table; I just want the unique records. As you can see below, there is no primary key. The table has 400,000,000 records.

    --===== If the test table already exists, drop it
    IF OBJECT_ID('TempDB..#mytable','U') IS NOT NULL
        DROP TABLE #mytable

    --===== Create the test table
    CREATE TABLE #mytable
    (
        ID varchar(10) NOT NULL,
        name varchar(100) NOT NULL,
        dateinserted datetime
    )

    INSERT INTO #mytable (ID, name, dateinserted)
    SELECT 'TIX123','cocacola',GETDATE() UNION ALL
    SELECT 'TIX123','cocacola',GETDATE() UNION ALL
    SELECT 'TIX999','pepsi',GETDATE() UNION ALL
    SELECT 'TIX523','nbc',GETDATE() UNION ALL
    SELECT 'TIX999','pepsi',GETDATE()

    I want to delete one of the duplicate records for cocacola and pepsi. If it were a small table, I could do this:

    WITH Ordered AS
    (
        SELECT *, ROW_NUMBER() OVER (PARTITION BY name ORDER BY ID) AS RN
        FROM #mytable
        WHERE ID IN ('TIX123','TIX999')
    )
    DELETE FROM Ordered
    WHERE RN > 1;

    How can I improve this query when there are millions of records with different names?

    Thanks

    Rash

  • The only way I've been able to do this before is to delete in batches. The problem is that it's a slow process. I had to do this for multiple tables with hundreds of millions of rows, and it required me to run it manually over a few weekends to complete.
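    A batched delete along these lines (a sketch only; the table and column names are from the OP's sample, and the 10,000 batch size is an arbitrary starting point) keeps each transaction, and therefore the log, small:

    ```sql
    -- Delete duplicates in batches of 10,000 so each transaction stays small.
    -- Loops until a pass removes no more rows.
    DECLARE @rows int = 1;

    WHILE @rows > 0
    BEGIN
        WITH Ordered AS
        (
            SELECT *,
                   ROW_NUMBER() OVER (PARTITION BY ID, name, dateinserted
                                      ORDER BY (SELECT NULL)) AS RN
            FROM #mytable
        )
        DELETE TOP (10000) FROM Ordered
        WHERE RN > 1;

        SET @rows = @@ROWCOUNT;
    END
    ```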

  • So you have a temp table with 400 million rows??? Even if it is actually a persistent table, why do you have a table that large with no primary key? Do you at least have a clustered index? Finding and removing duplicates in a 400 million row heap is going to be painfully slow no matter how you code it.
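    To check whether the table is a heap or has a clustered index (assuming a persistent table named dbo.mytable; substitute the real name):

    ```sql
    -- index_id 0 with type_desc 'HEAP' means there is no clustered index;
    -- index_id 1 is the clustered index.
    SELECT name, index_id, type_desc
    FROM sys.indexes
    WHERE object_id = OBJECT_ID('dbo.mytable');
    ```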

    _______________________________________________________________

    Need help? Help us help you.

    Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

    Need to split a string? Try Jeff Moden's splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

    Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
    Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
    Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
    Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

  • @SSChampion, I do have a clustered index on ID. The table was created by somebody else and had not been maintained until it ran out of space, so I have been assigned to take care of it. I am a newbie trying to figure out the best way to do this.

  • So you have duplicate ID and name. Which dateinserted do you want to keep?


  • @SSChampion, many of these records have the same dateinserted, so we want one record per unique (ID, name, dateinserted).
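    If one row per unique (ID, name, dateinserted) is the goal, the PARTITION BY in your earlier query should include all three columns; a sketch (ORDER BY (SELECT NULL) because it doesn't matter which of the identical rows survives):

    ```sql
    WITH Ordered AS
    (
        SELECT *,
               ROW_NUMBER() OVER (PARTITION BY ID, name, dateinserted
                                  ORDER BY (SELECT NULL)) AS RN
        FROM #mytable
    )
    DELETE FROM Ordered
    WHERE RN > 1;
    ```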

  • Maybe you can aggregate your data into another table, drop the current table and rename the newly created one?

    Since you can't rename a temp table, I changed your sample code to use a persistent table as an example.

    --===== If the test table already exists, drop it
    IF OBJECT_ID('mytable','U') IS NOT NULL
        DROP TABLE mytable

    --===== Create the test table
    CREATE TABLE mytable
    (
        ID varchar(10) NOT NULL,
        name varchar(100) NOT NULL,
        dateinserted datetime
    )

    INSERT INTO mytable (ID, name, dateinserted)
    SELECT 'TIX123','cocacola',GETDATE() UNION ALL
    SELECT 'TIX123','cocacola',GETDATE() - 10 UNION ALL
    SELECT 'TIX999','pepsi',GETDATE() UNION ALL
    SELECT 'TIX523','nbc',GETDATE() UNION ALL
    SELECT 'TIX999','pepsi',GETDATE() - 10

    SELECT ID, name, MIN(dateinserted) AS DateInserted
    INTO SomeTable
    FROM mytable
    GROUP BY ID, name

    DROP TABLE mytable

    EXEC sp_rename 'SomeTable', 'mytable'

    SELECT * FROM mytable

    DROP TABLE mytable


  • Since he has 400 million rows, should this be done with one statement or in batches? Moving hundreds of millions of records is probably going to take a while.

  • JoshDBGuy (4/1/2014)


    Since he has 400 million rows, should this be done with one statement or should this be done in batches? Moving hundreds of millions of records is probably going to take awhile.

    Yes, indeed it will. You could do those inserts in batches to ease some of the pressure, for sure.

    I might even restore a database backup to another server and do this there. Just be careful to capture any data changes made between the backup date and the current time.
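    Batching the copy into the new table might look something like this (a sketch only; mytable and SomeTable are the names from the earlier post, the batch size is arbitrary, and it assumes batches can be bounded by the clustered ID column):

    ```sql
    -- Copy one complete ID range per batch so no (ID, name) group
    -- is ever split across two batches.
    DECLARE @lastID varchar(10) = '';
    DECLARE @nextID varchar(10);

    WHILE 1 = 1
    BEGIN
        -- Upper bound for this batch: roughly 100,000 source rows ahead.
        SELECT @nextID = MAX(ID)
        FROM (SELECT TOP (100000) ID
              FROM mytable
              WHERE ID > @lastID
              ORDER BY ID) AS b;

        IF @nextID IS NULL BREAK;   -- nothing left to copy

        INSERT INTO SomeTable (ID, name, dateinserted)
        SELECT ID, name, MIN(dateinserted)
        FROM mytable
        WHERE ID > @lastID AND ID <= @nextID
        GROUP BY ID, name;

        SET @lastID = @nextID;
    END
    ```

    Keeping each batch to a complete ID range means the GROUP BY never sees a partial group, and each INSERT stays a manageable transaction.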


  • @SSChampion, by batches do you mean, for example, based on certain date ranges?

  • You could try

    WITH Ordered AS
    (
        SELECT *, ROW_NUMBER() OVER (PARTITION BY name ORDER BY ID) AS RN
        FROM #mytable
        WHERE ID IN ('TIX123','TIX999')
    )
    SELECT * FROM Ordered
    WHERE RN = 1

    to move the data to a new table, with SSIS or SELECT ... INTO, then truncate the existing table. I think this would help with the processing time.

    ----------------------------------------------------

  • MMartin1 (5/28/2014)


    You could try

    WITH Ordered AS
    (
        SELECT *, ROW_NUMBER() OVER (PARTITION BY name ORDER BY ID) AS RN
        FROM #mytable
        WHERE ID IN ('TIX123','TIX999')
    )
    SELECT * FROM Ordered
    WHERE RN = 1

    to move the data to a new table, with SSIS or SELECT ... INTO, then truncate the existing table. I think this would help with the processing time.

    This is pretty similar to what the OP tried before they posted. This is going to be horribly slow on 400 million rows.


  • By batches, it is meant that you can try something like:

    WITH Ordered AS
    (
        SELECT *,
               ROW_NUMBER() OVER (PARTITION BY name ORDER BY ID) AS RN,
               ROW_NUMBER() OVER (ORDER BY ID) AS absolute_RN
        FROM #mytable
        WHERE ID IN ('TIX123','TIX999')
    )
    SELECT *
    INTO #MyWorkingTable
    FROM Ordered
    GO

    DECLARE @low bigint;
    DECLARE @high bigint;
    DECLARE @max bigint;

    SELECT @max = COUNT(*) FROM #MyWorkingTable; /* assumes more than 10k records in the working table */

    SET @low = 1;
    SET @high = @low + 9999;

    WHILE @max >= @low
    BEGIN
        DELETE FROM #MyWorkingTable
        WHERE absolute_RN >= @low AND absolute_RN < @high
          AND RN > 1;

        SET @low = @high;
        SET @high = @low + 9999;
    END

    This brings the data, with two new columns, into another table. You work on that table, then truncate the original and bring what is left back into it. Be careful with log size and disk space here, though, since you will be making a copy of a large table. Since you have no keys, I assume there is also no issue with foreign keys? Another thing to consider.
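    The final step described above (truncate the original and bring the survivors back) could be sketched as follows, dropping the two helper columns on the way back:

    ```sql
    -- After the loop above, only RN = 1 rows remain in #MyWorkingTable.
    TRUNCATE TABLE #mytable;

    INSERT INTO #mytable (ID, name, dateinserted)
    SELECT ID, name, dateinserted   -- leave RN and absolute_RN behind
    FROM #MyWorkingTable;
    ```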


  • MMartin1 (5/28/2014)

    You could try

    WITH Ordered AS
    (
        SELECT *, ROW_NUMBER() OVER (PARTITION BY name ORDER BY ID) AS RN
        FROM #mytable
        WHERE ID IN ('TIX123','TIX999')
    )
    SELECT * FROM Ordered
    WHERE RN = 1

    to move the data to a new table, with SSIS or SELECT ... INTO, then truncate the existing table. I think this would help with the processing time.

    This is pretty similar to what the OP tried before they posted. This is going to be horribly slow on 400 million rows.

    I don't think we can overcome that the process will be slow; it's just a question of which of the slow options is fastest.

