Add ID and Primary Key to Large Table(s)

  • Background:

    * SQL Server 2008 R2

    * Database was created from a third party product. The product writes to the 3 tables I need to change 24 hours a day, 7 days a week, so downtime is not an option; all changes must be done live.

    * Database overall size is ~200 GB

    * The 3 tables I must update make up ~190 GB of that space.

    * Tables have no primary key or ID columns. Therefore, the data is highly fragmented.

    * Of the ~190 GB of space allocated for the tables, there is roughly 70 GB of actual data.

    * Rows of the tables are not guaranteed to be unique. In fact, on one of the tables, tests were run with a small sample of data and duplicates were very much evident.

    What I'm trying to accomplish here is to get an ID column added to the 3 tables and set that ID field as the primary key. Doing so will force the data to become much less fragmented than it is currently, and with purging and new inserts, fragmentation will eventually be nearly non-existent.
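
    For reference, the kind of change I'm after on each table is roughly the sketch below (the column and constraint names are just placeholders, not anything the product defines):

    -- Placeholder names; shown only to illustrate the end goal.
    ALTER TABLE dbo.NetworkWarnings
        ADD ID int IDENTITY(1,1) NOT NULL;

    ALTER TABLE dbo.NetworkWarnings
        ADD CONSTRAINT PK_NetworkWarnings PRIMARY KEY CLUSTERED (ID);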

    Problem:

    Making table changes on tables this large while data is constantly being added poses many risks and can cause data loss. This was tried on a smaller table than these three, and the entire table was lost in the process; a restore from backup was needed to get back to the most recent log backup point.

    Original Solution:

    My original plan was to create a backup of each table and run the script below to migrate the majority of the data temporarily into the new table. I could then update the original table (which would now contain much less data) and then migrate the data back.

    CREATE TABLE #temp
    (
        MsgDate varchar(10)
        ,MsgTime varchar(8)
        ,MsgPriority varchar(30)
        ,MsgHostname varchar(255)
        ,MsgText varchar(2048)
    )

    WHILE ( ( SELECT COUNT(*) FROM NetworkWarningsBackup1 WHERE MsgDate <= '2015-04-01' AND MsgTime <= '15:15:00' ) > 0 )
    BEGIN
        -- Copy the next small batch of old rows into the temp table.
        INSERT INTO #temp
        SELECT TOP 33 MsgDate
            ,MsgTime
            ,MsgPriority
            ,MsgHostname
            ,MsgText
        FROM NetworkWarningsBackup1
        WHERE MsgDate <= '2015-04-01'
            AND MsgTime <= '15:15:00'
        ORDER BY MsgDate, MsgTime, MsgText

        -- Append the batch to the backup table.
        INSERT INTO NetworkWarningsBackup2
        SELECT *
        FROM #temp

        -- Remove the copied rows from the original table.
        DELETE t
        FROM NetworkWarningsBackup1 AS t
        WHERE EXISTS
        (
            SELECT *
            FROM #temp AS t2
            WHERE t.MsgDate = t2.MsgDate
                AND t.MsgTime = t2.MsgTime
                AND t.MsgPriority = t2.MsgPriority
                AND t.MsgHostname = t2.MsgHostname
                AND t.MsgText = t2.MsgText
        )

        TRUNCATE TABLE #temp
    END

    DROP TABLE #temp

    Original Solution Problem:

    The problem with the solution above is the DELETE against the original table using values from the temporary table. When there are duplicate rows and not all of them have been copied into the backup table yet, every copy is removed from the original table because there is nothing unique to tell them apart. For example, if the original table holds three identical rows and only one of them has been inserted into #temp in the current batch, the DELETE ... WHERE EXISTS still removes all three. In my testing, I started with 10,000 rows in the original table and ended up with only 9,959 rows in the backup table.

    Question 1: Is my approach to making these table changes reasonable?

    Question 2a: If so, how can I make sure I don't lose data as part of this temporary migration of the data to my backup tables?

    Question 2b: If not, what would be a better approach that isn't going to cause disruption to the application that INSERTs data 24/7 and won't have any risk of data loss?

    I appreciate you taking the time to read this.

    TJ

  • Can you add a nullable auto increment column to the table? This would have little impact, as the auto increment would only start populating when the next insert occurs. Then you can:

    select * into backuptable from originaltable with (nolock) where newcol is null

    and you will be safe in the knowledge that all the rows with NULL in that column have been copied out. You can then do a data compare using a 3rd party tool (SQL Data Compare by Redgate is one) to verify the data is the same. This additional step is necessary because of the NOLOCK you had to use to ensure the original table could still be inserted into.

    Just a thought!

  • Forgot to add: once you have got to this point, you can then delete from the original table in batches where newcol is null. Keep the batches small to minimise locking, along the lines of the sketch below.

    Once complete, you will have a smaller table to add the indexes to.
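
    Something roughly like this, purely as a sketch (the table name, column name and batch size here are just placeholders):

    -- Rough sketch only: delete the already-copied rows (NewCol IS NULL)
    -- in small batches so each delete holds its locks only briefly.
    WHILE 1 = 1
    BEGIN
        DELETE TOP (1000)
        FROM originaltable
        WHERE NewCol IS NULL;

        -- Stop once no more rows qualify.
        IF @@ROWCOUNT = 0
            BREAK;
    END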

  • First off, thank you for your reply.

    If I'm understanding correctly: where I do my entire move (insert to the backup table and delete from the original) together in batches, your suggestion does all of the copying to the backup table first and then goes back and does the delete from the original table afterwards. This could work, especially since I'm using date (varchar(10)) and time (varchar(8)) fields as my cutoff for the data (e.g. move all data prior to this morning at 8:00 am). I would probably not add the nullable ID column initially, though, and would implement a non-nullable ID column after my move of the data is complete.

    Again, thank you for your reply and fresh perspective. I'll do some tweaking of my script and testing and see what I can come up with.

    TJ

  • Great, let me know whether it is suitable or not after some testing. Yes, the auto increment was mainly to let you be sure which records are new and which records you can get rid of, but as you have the date and time to reference, that would not be necessary at this stage.

  • cunningham (4/24/2015)


    Can you add a nullable auto increment column to the table? This would have little impact, as the auto increment would only start populating when the next insert occurs. Then you can:

    select * into backuptable from originaltable with (nolock) where newcol is null

    and you will be safe in the knowledge that all the rows with NULL in that column have been copied out. You can then do a data compare using a 3rd party tool (SQL Data Compare by Redgate is one) to verify the data is the same. This additional step is necessary because of the NOLOCK you had to use to ensure the original table could still be inserted into.

    Just a thought!

    You can't have an identity column that allows NULL.

    Also, why the NOLOCK here? You run the risk of getting incomplete data.


  • That's why I added:

    you can then do a data compare using a 3rd party tool (SQL Data Compare by Redgate is one) to verify the data is the same. This additional step is necessary because of the NOLOCK you had to use to ensure the original table could still be inserted into.

    I was just giving some ideas.

  • cunningham (4/24/2015)


    That's why I added:

    you can then do a data compare using a 3rd party tool (SQL Data Compare by Redgate is one) to verify the data is the same. This additional step is necessary because of the NOLOCK you had to use to ensure the original table could still be inserted into.

    I was just giving some ideas.

    But the identity column cannot be NULL. It is a requirement of the identity property.

    Why do you think you have to use NOLOCK here? Using that hint opens up the possibility of getting duplicate and/or missing rows from the original table.


  • It's not important that identity can't be null, as the OP isn't going to do that. My phrasing of the question was 'can you add a nullable auto increment' because I wasn't sure whether you could, so I was not stating it as fact.

    NOLOCK because the original table needs to be inserted into all the time, so you would have to verify the data afterwards using a 3rd party tool.

  • cunningham (4/24/2015)


    It's not important that identity can't be null, as the OP isn't going to do that. My phrasing of the question was 'can you add a nullable auto increment' because I wasn't sure whether you could, so I was not stating it as fact.

    NOLOCK because the original table needs to be inserted into all the time, so you would have to verify the data afterwards using a 3rd party tool.

    The only "auto increment" in sql server is identity and it can't be null. Maybe they could use ROW_NUMBER.

    I would much prefer using batches instead of introducing bad data through the use of NOLOCK.
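
    For example, something roughly like this (a sketch only, reusing the column and table names from the original post) would at least give otherwise-identical rows a way to be told apart during the copy and delete:

    -- Sketch only: DupeNum numbers duplicate rows 1, 2, 3, ... within each
    -- group of identical values, so a copy and a later delete can be matched
    -- row-for-row instead of one delete wiping out every duplicate.
    SELECT MsgDate, MsgTime, MsgPriority, MsgHostname, MsgText,
           ROW_NUMBER() OVER (
               PARTITION BY MsgDate, MsgTime, MsgPriority, MsgHostname, MsgText
               ORDER BY (SELECT NULL)
           ) AS DupeNum
    FROM NetworkWarningsBackup1
    WHERE MsgDate <= '2015-04-01'
      AND MsgTime <= '15:15:00';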


  • Don't add an id as a PK; just use MsgDate and MsgTime as the (nonunique) clustering keys. That's how you'll look up and process this data anyway.

    I can't imagine these tables in their current state are being UPDATEd or SELECTed against, only INSERTed to.

    If that's true, or 99.999% true, then I suggest the following, which does involve very brief unavailability, but we should be talking just a few seconds; realistically, I don't think you can do this with "no" downtime.

    1) Create a new table with columns identical to your existing table, but go ahead and add the clustered index on ( MsgDate, MsgTime ), for example:

    CREATE TABLE NetworkWarnings__New
    (
        MsgDate varchar(10)
        ,MsgTime varchar(8)
        ,MsgPriority varchar(30)
        ,MsgHostname varchar(255)
        ,MsgText varchar(2048)
    )

    CREATE CLUSTERED INDEX NetworkWarnings__CL ON NetworkWarnings__New ( MsgDate, MsgTime )
        WITH ( FILLFACTOR = 100 ) ON [PRIMARY] -- change filegroup name if/as needed

    2) Rename the existing table -- this will require an exclusive lock -- and then rename the new table to your existing table name:

    BEGIN TRANSACTION

    -- Take an exclusive table lock so no inserts can slip in between the renames.
    SELECT TOP (0) * FROM NetworkWarnings WITH (TABLOCKX)

    EXEC sp_rename 'NetworkWarnings', 'NetworkWarnings__Old'
    EXEC sp_rename 'NetworkWarnings__New', 'NetworkWarnings'

    COMMIT TRANSACTION

    3) Existing code can then continue inserting into the original table name, but it will only be new data, since all the old data has effectively been moved to the other table.

    4) You can then work with the old data as needed without any further interference to current activity.

    5) After getting the old table rebuilt as you want it, you will have to do another, similar switch to merge the rows that have built up since the rename into the rebuilt "__Old" table.

    6) One final transaction can copy that last handful of new rows into the "__Old" table and run the renames again to make it the true, master log table once more, roughly as outlined below.
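
    A rough outline of that final switch, reusing the names from step 2 ('NetworkWarnings__Staging' is just a placeholder for wherever you want the interim table to end up):

    BEGIN TRANSACTION

    -- Block inserts for the couple of seconds this takes.
    SELECT TOP (0) * FROM NetworkWarnings WITH (TABLOCKX)

    -- Copy the handful of rows inserted since the first rename
    -- into the rebuilt table.
    INSERT INTO NetworkWarnings__Old
    SELECT MsgDate, MsgTime, MsgPriority, MsgHostname, MsgText
    FROM NetworkWarnings

    -- Swap the names so the rebuilt table becomes the live log table again.
    EXEC sp_rename 'NetworkWarnings', 'NetworkWarnings__Staging'
    EXEC sp_rename 'NetworkWarnings__Old', 'NetworkWarnings'

    COMMIT TRANSACTION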

