What is the best way to load 15 million rows into a table with a PK? There are known duplicates.

  • There is no need to move the data.

    Option 1 - Insert unique values into the original table
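    -- Temporary clustered index so the ROW_NUMBER() PARTITION BY TableID below
    -- can read rows in key order instead of sorting all 15 million of them.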
    CREATE CLUSTERED INDEX CI_Temp ON exportLogTable2(TableID);

    WITH cteData as (
    SELECT TableID
       , ... -- Other fields
       , rn = ROW_NUMBER() OVER(PARTITION BY TableID
               ORDER BY (SELECT NULL) /* or change to some sortable field in the table*/
               )
    FROM exportLogTable2
    )
    INSERT INTO exportLogTable (TableID, ...)
    SELECT TableID, ...
    FROM cteData
    WHERE rn = 1;

    DROP INDEX CI_Temp ON exportLogTable2;

    Option 2 - Delete the duplicates in place, rename the table back to the original name, and add the PK

    CREATE CLUSTERED INDEX CI_Temp ON exportLogTable2(TableID);

    WITH cteDups as (
    SELECT TableID
       , rn = ROW_NUMBER() OVER(PARTITION BY TableID
               ORDER BY (SELECT NULL) /* or change to some sortable field in the table*/
               )
    FROM exportLogTable2
    )
    DELETE FROM cteDups
    WHERE rn > 1;

    DROP INDEX CI_Temp ON exportLogTable2;

    EXEC sp_rename @objname = N'exportLogTable2', @newname = N'exportLogTable';

    ALTER TABLE exportLogTable
    ADD CONSTRAINT PK_exportLogTable
    PRIMARY KEY CLUSTERED (TableID);
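
    Before running that ALTER TABLE, a quick sanity check is worth it (a minimal sketch, assuming TableID is the full key): any row returned means duplicates survived the delete and the PK creation will fail.

    SELECT TableID, COUNT(*) AS DupCount
    FROM exportLogTable
    GROUP BY TableID
    HAVING COUNT(*) > 1;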

  • This isn't really that hard; the most difficult part is identifying 'duplicate' rows. Using a CTE or derived table, add a row_number() partitioned by TableID (and any other columns if needed), and order the rows by some value that works; if you have a date/time column, order by it descending so the latest row survives.


    WITH dups
    AS (
    SELECT ..., row_number() over(Partition By TableID Order By tdate desc) As rn
    FROM exportLogTable2
    )
    INSERT INTO exportLogTable
    SELECT ...
    FROM dups
    WHERE rn = 1;

    Jeffrey Williams
    “We are all faced with a series of great opportunities brilliantly disguised as impossible situations.”

    ― Charles R. Swindoll



  • INSERT INTO dbo.exportLogTable
    SELECT
      ...list_of_all_columns_in_table...
    FROM (
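      -- Ordering by the partition column makes the surviving row arbitrary;
      -- substitute a meaningful ORDER BY if one particular row should win.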
      SELECT *, ROW_NUMBER() OVER(PARTITION BY tableID ORDER BY tableID) AS row_num
      FROM dbo.exportLogTable2
    ) AS elt2
    WHERE row_num = 1;

    SQL DBA, SQL Server MVP (07, 08, 09)
    A socialist is someone who will give you the shirt off *someone else's* back.

  • Hi, because this is a large table with 15 million rows, my suggestion is to split the load into small batches so each transaction stays small. Here is my code:

    -- Copy the data out, keep the original as a backup, and recreate the target
    -- empty with its PK. (The column list is schema-dependent; tableID's type
    -- here is an assumption.)
    SELECT * INTO exportLogTable2 FROM exportLogTable;
    EXEC sp_rename 'exportLogTable', 'exportLogTable_Backup';
    CREATE TABLE exportLogTable (
        tableID INT NOT NULL CONSTRAINT PK_exportLogTable PRIMARY KEY
        , ... -- remaining columns, matching exportLogTable_Backup
    );

    DECLARE @Top INT = 5000;
    WHILE EXISTS (SELECT 1 FROM exportLogTable2)
    BEGIN
        BEGIN TRANSACTION;

        -- One batch: a single row per tableID that is not already in the target.
        INSERT INTO exportLogTable (tableID, ...)
        SELECT TOP (@Top) d.tableID, ...
        FROM (
            SELECT e2.*
                 , ROW_NUMBER() OVER(PARTITION BY e2.tableID
                                     ORDER BY (SELECT NULL)) AS rn
            FROM exportLogTable2 e2
        ) AS d
        WHERE d.rn = 1
            AND NOT EXISTS (SELECT 1 FROM exportLogTable e WHERE e.tableID = d.tableID);

        -- Delete every source row whose key is now in the target (the batch just
        -- inserted plus its duplicates); deleting TOP (@Top) arbitrary rows
        -- instead could silently drop unprocessed data.
        DELETE e2
        FROM exportLogTable2 e2
        WHERE EXISTS (SELECT 1 FROM exportLogTable e WHERE e.tableID = e2.tableID);

        COMMIT TRANSACTION;
    END;

  • Hey everyone,

    I've been on call this week and got really busy with other production issues. However, I wanted to follow up on this issue.

    I went with option 1 from DreNorton, and it worked really well, although it was a very slow process. I RDP'd to the server, initiated the query there, and let it execute all night.

    Option 1 - Insert unique values into the original table (the code is quoted in the first reply above).

    That was certainly a learning experience. I appreciate all of the input. It helps me be a better DBA.

    Things will work out.  Get back up, change some parameters and recode.

  • The "standard" way to do that is to make sure the key doesn't already exist in the table to be insert to.  If you need to, delete dups from the NewData before running the main INSERT (i.e. not as part of the load process itself).


    INSERT INTO dbo.MainTable (TableID, ... )
    SELECT TableID, ....
    FROM dbo.NewDataTable NDT
    WHERE NOT EXISTS(
        SELECT 1
        FROM dbo.MainTable MT
        WHERE MT.TableID = NDT.TableID
        );
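
    A minimal sketch of that pre-load dedupe, assuming TableID is the full key and reusing the ROW_NUMBER() pattern from earlier in the thread; ORDER BY (SELECT NULL) keeps an arbitrary survivor, so substitute a real column if a particular row should win:

    WITH cteDups AS (
    SELECT TableID
       , rn = ROW_NUMBER() OVER(PARTITION BY TableID
               ORDER BY (SELECT NULL))
    FROM dbo.NewDataTable
    )
    DELETE FROM cteDups
    WHERE rn > 1;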

