What is the best way to load 15 million rows into a table with a PK? There are known duplicates.

  • There is no need to move the data.

    Option 1 - Insert unique values into the original table
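    -- Temporary clustered index so the ROW_NUMBER() PARTITION BY TableID below
    -- can read rows in key order instead of sorting all 15 million of them.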
    CREATE CLUSTERED INDEX CI_Temp ON exportLogTable2(TableID);

    WITH cteData as (
    SELECT TableID
       , ... -- Other fields
       , rn = ROW_NUMBER() OVER(PARTITION BY TableID
               ORDER BY (SELECT NULL) /* or change to some sortable field in the table*/
               )
    FROM exportLogTable2
    )
    INSERT INTO exportLogTable (TableID, ...)
    SELECT TableID, ...
    FROM cteData
    WHERE rn = 1;

    DROP INDEX CI_Temp ON exportLogTable2;

    Option 2 - Delete the duplicates in place, rename the table back to the original name, and add the PK

    CREATE CLUSTERED INDEX CI_Temp ON exportLogTable2(TableID);

    WITH cteDups as (
    SELECT TableID
       , rn = ROW_NUMBER() OVER(PARTITION BY TableID
               ORDER BY (SELECT NULL) /* or change to some sortable field in the table*/
               )
    FROM exportLogTable2
    )
    DELETE FROM cteDups
    WHERE rn > 1;

    DROP INDEX CI_Temp ON exportLogTable2;

    EXEC sp_rename @objname = N'exportLogTable2', @newname = N'exportLogTable';

    ALTER TABLE exportLogTable
    ADD CONSTRAINT PK_exportLogTable
    PRIMARY KEY CLUSTERED (TableID);
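
    Before running that ALTER TABLE, a quick sanity check is worth it (a minimal sketch, assuming TableID is the full key): any row returned means duplicates survived the delete and the PK creation will fail.

    SELECT TableID, COUNT(*) AS DupCount
    FROM exportLogTable
    GROUP BY TableID
    HAVING COUNT(*) > 1;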

  • This isn't really that hard; the most difficult part is identifying 'duplicate' rows. Using a CTE or derived table, add a row_number() partitioned by TableID (and any other columns if needed), and order the rows by some value that works; if you have a date/time column, order by it descending so the latest row survives.


    WITH dups
    AS (
    SELECT ..., row_number() over(Partition By TableID Order By tdate desc) As rn
    FROM exportLogTable2
    )
    INSERT INTO exportLogTable
    SELECT ...
    FROM dups
    WHERE rn = 1;

    Jeffrey Williams
    “We are all faced with a series of great opportunities brilliantly disguised as impossible situations.”

    ― Charles R. Swindoll



  • INSERT INTO dbo.exportLogTable
    SELECT
      ...list_of_all_columns_in_table...
    FROM (
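      -- Ordering by the partition column makes the surviving row arbitrary;
      -- substitute a meaningful ORDER BY if one particular row should win.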
      SELECT *, ROW_NUMBER() OVER(PARTITION BY tableID ORDER BY tableID) AS row_num
      FROM dbo.exportLogTable2
    ) AS elt2
    WHERE row_num = 1;

    SQL DBA, SQL Server MVP (07, 08, 09)
    A socialist is someone who will give you the shirt off *someone else's* back.

  • Hi, because this is a large table with 15 million rows, my suggestion is to split the load into small batches so each transaction stays small. Here is my code:

    -- Copy the data out, keep the original as a backup, and recreate the target
    -- empty with its PK. (The column list is schema-dependent; tableID's type
    -- here is an assumption.)
    SELECT * INTO exportLogTable2 FROM exportLogTable;
    EXEC sp_rename 'exportLogTable', 'exportLogTable_Backup';
    CREATE TABLE exportLogTable (
        tableID INT NOT NULL CONSTRAINT PK_exportLogTable PRIMARY KEY
        , ... -- remaining columns, matching exportLogTable_Backup
    );

    DECLARE @Top INT = 5000;
    WHILE EXISTS (SELECT 1 FROM exportLogTable2)
    BEGIN
        BEGIN TRANSACTION;

        -- One batch: a single row per tableID that is not already in the target.
        INSERT INTO exportLogTable (tableID, ...)
        SELECT TOP (@Top) d.tableID, ...
        FROM (
            SELECT e2.*
                 , ROW_NUMBER() OVER(PARTITION BY e2.tableID
                                     ORDER BY (SELECT NULL)) AS rn
            FROM exportLogTable2 e2
        ) AS d
        WHERE d.rn = 1
            AND NOT EXISTS (SELECT 1 FROM exportLogTable e WHERE e.tableID = d.tableID);

        -- Delete every source row whose key is now in the target (the batch just
        -- inserted plus its duplicates); deleting TOP (@Top) arbitrary rows
        -- instead could silently drop unprocessed data.
        DELETE e2
        FROM exportLogTable2 e2
        WHERE EXISTS (SELECT 1 FROM exportLogTable e WHERE e.tableID = e2.tableID);

        COMMIT TRANSACTION;
    END;

  • Hey everyone,

    I've been on call this week and got really busy with other production issues. However, I wanted to follow up on this issue.

    I went with option 1 from DreNorton, and it worked really well, although it was a very slow process. I RDP'd to the server, initiated the query there, and let it execute all night.

    Option 1 - Insert unique values into the original table (the code is quoted in the first reply above).

    That was certainly a learning experience. I appreciate all of the input. It helps me be a better DBA.

    Things will work out.  Get back up, change some parameters and recode.

  • The "standard" way to do that is to make sure the key doesn't already exist in the table to be insert to.  If you need to, delete dups from the NewData before running the main INSERT (i.e. not as part of the load process itself).


    INSERT INTO dbo.MainTable (TableID, ... )
    SELECT TableID, ....
    FROM dbo.NewDataTable NDT
    WHERE NOT EXISTS(
        SELECT 1
        FROM dbo.MainTable MT
        WHERE MT.TableID = NDT.TableID
        );
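
    A minimal sketch of that pre-load dedupe, assuming TableID is the full key and reusing the ROW_NUMBER() pattern from earlier in the thread; ORDER BY (SELECT NULL) keeps an arbitrary survivor, so substitute a real column if a particular row should win:

    WITH cteDups AS (
    SELECT TableID
       , rn = ROW_NUMBER() OVER(PARTITION BY TableID
               ORDER BY (SELECT NULL))
    FROM dbo.NewDataTable
    )
    DELETE FROM cteDups
    WHERE rn > 1;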

