Restructure 100 Million Row (or more) Tables in Seconds. SRSLY!

  • Comments posted to this topic are about the item Restructure 100 Million Row (or more) Tables in Seconds. SRSLY!

    SQL#: https://SQLsharp.com/ (SQLCLR library of over 340 Functions and Procedures)
    Sql Quantum Lift: https://SqlQuantumLift.com/ (company)
    Sql Quantum Leap: https://SqlQuantumLeap.com/ (blog)
    Info sites: Collations     •     Module Signing     •     SQLCLR

  • Solomon, this is a great post and thanks for sharing! Just this last week I took a similar approach with 4 large tables that had multiple columns being changed from varchar to nvarchar to support our localization effort. Many of the techniques you mention here were used, with the exception of triggers. Given that we were on SQL Server 2008, I was able to use Change Tracking in place of triggers, which resulted in zero modifications to the existing tables (see the sketch following this post). I'd be interested to hear how other folks have used Change Tracking or Change Data Capture. Thanks again for sharing!

    Luke C
    MCSE: Data Platform, MCP, MCTS, MCITP - Database Administrator & Database Developer
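
    For anyone curious about the Change Tracking variant Luke describes, here is a minimal sketch under assumed names (the database, table, and column names are invented for illustration; Change Tracking requires SQL Server 2008+ and a primary key on the tracked table):

    -- Enable Change Tracking at the database level (hypothetical database name)
    ALTER DATABASE MyAppDB
        SET CHANGE_TRACKING = ON (CHANGE_RETENTION = 3 DAYS, AUTO_CLEANUP = ON);

    -- Enable it on the source table being restructured (hypothetical table)
    ALTER TABLE dbo.CustomerInfo
        ENABLE CHANGE_TRACKING WITH (TRACK_COLUMNS_UPDATED = OFF);

    -- During each sync pass, pull the rows changed since the last captured version
    -- and apply them to the new copy (@LastSyncVersion is persisted between runs)
    DECLARE @LastSyncVersion BIGINT = 0;

    SELECT  CT.CustomerID, CT.SYS_CHANGE_OPERATION, Src.FirstName, Src.LastName
    FROM    CHANGETABLE(CHANGES dbo.CustomerInfo, @LastSyncVersion) AS CT
    LEFT JOIN dbo.CustomerInfo AS Src
            ON Src.CustomerID = CT.CustomerID;   -- NULL source columns => the row was deleted

    SELECT CHANGE_TRACKING_CURRENT_VERSION();    -- store this as @LastSyncVersion for the next pass

    The changed keys can then be merged into the new copy of the table in much the same way as the trigger-captured rows, without any modification to the source table itself.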

  • Suppose I have 1 billion records in the table: what is the time complexity? Could you please explain it?

  • This is a very good article. My client here makes enhancements to the application frequently, and I end up adding columns to a table with over 3 million rows. Any suggestions on how I should handle this scenario?

  • The article was well explained and straight to the point.

    The title, on the other hand, is *VERY* misleading (seconds?). Apparently you don't count the time spent setting it all up AND processing it... 🙁


    * Noel

  • Luke C (4/26/2011)


    Solomon, this is a great post and thanks for sharing! Just this last week I took a similar approach with 4 large tables that had multiple columns being changed from varchar to nvarchar to support our localization effort. Many of the techniques you mention here were used, with the exception of triggers. Given that we were on SQL Server 2008, I was able to use Change Tracking in place of triggers, which resulted in zero modifications to the existing tables. I'd be interested to hear how other folks have used Change Tracking or Change Data Capture.

    Hey Luke. Thanks! CDC is an interesting option that I did not think of, since we are on SQL Server 2005. But if I get the chance (someday :-)), I will revise this article to include that option.

    Take care,

    Solomon...

    SQL#: https://SQLsharp.com/ (SQLCLR library of over 340 Functions and Procedures)
    Sql Quantum Lift: https://SqlQuantumLift.com/ (company)
    Sql Quantum Leap: https://SqlQuantumLeap.com/ (blog)
    Info sites: Collations     •     Module Signing     •     SQLCLR

  • sathishsathish81 (4/26/2011)


    Suppose I have 1 billion records in the table: what is the time complexity? Could you please explain it?

    Hello. The amount of time it takes to move the data over varies based on several factors:

    1) How wide is each row

    2) How much activity/contention is there on the table

    3) How fast is the underlying disk subsystem

    4) etc.

    So it really takes some testing on each system to get a decent idea.

    Take care,

    Solomon...

    SQL#: https://SQLsharp.com/ (SQLCLR library of over 340 Functions and Procedures)
    Sql Quantum Lift: https://SqlQuantumLift.com/ (company)
    Sql Quantum Leap: https://SqlQuantumLeap.com/ (blog)
    Info sites: Collations     •     Module Signing     •     SQLCLR

  • jvrakesh-858370 (4/26/2011)


    This is a very good article. My client here makes enhancements to the application frequently, and I end up adding columns to a table with over 3 million rows. Any suggestions on how I should handle this scenario?

    Hello. I don't really understand your question. Can you please explain in more detail? Thanks.

    Take care,

    Solomon...

    SQL#: https://SQLsharp.com/ (SQLCLR library of over 340 Functions and Procedures)
    Sql Quantum Lift: https://SqlQuantumLift.com/ (company)
    Sql Quantum Leap: https://SqlQuantumLeap.com/ (blog)
    Info sites: Collations     •     Module Signing     •     SQLCLR

  • noeld (4/26/2011)


    The article was well explained and straight to the point.

    The title, on the other hand, is *VERY* misleading (seconds?). Apparently you don't count the time spent setting it all up AND processing it... 🙁

    Hi Noel. I am sorry if you feel it was misleading, but I did address this at the beginning of the Overview section of the article. The main intent of "quickly" making the changes is to minimize customer / system impact. The end result is that, from the outside (and from the perspective of the application), the restructuring does indeed take only seconds, since everything but the final swap happens ahead of time (see the sketch following this post). I also stated in the article summary that this is a way to make large-scale changes that does not require a down-time or maintenance window.

    Take care,

    Solomon...

    SQL#: https://SQLsharp.com/ (SQLCLR library of over 340 Functions and Procedures)
    Sql Quantum Lift: https://SqlQuantumLift.com/ (company)
    Sql Quantum Leap: https://SqlQuantumLeap.com/ (blog)
    Info sites: Collations     •     Module Signing     •     SQLCLR
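
    For readers who have not yet read the full article: the "seconds" refers to the final cut-over step, which looks roughly like the sketch below (object names are hypothetical; the bulk copy and the syncing of in-flight changes happen beforehand, as the article describes):

    -- Final cut-over: dbo.CustomerInfo_New already holds all of the data plus any
    -- changes captured while it was being loaded. sp_rename is a metadata-only
    -- operation, so the swap itself completes in seconds.
    BEGIN TRANSACTION;

        EXEC sp_rename N'dbo.CustomerInfo', N'CustomerInfo_Old';
        EXEC sp_rename N'dbo.CustomerInfo_New', N'CustomerInfo';

    COMMIT TRANSACTION;

    -- dbo.CustomerInfo_Old can be dropped once the release has been verified.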

  • Excellent stuff and very detailed. Thanks for sharing. I have done something similar in the past but never thought of using triggers; instead, I quickly compared the data between the old and new tables using the Red Gate data compare tool.

    But I guess using the Change tracking feature makes it even more convenient.

    Thanks!

    Amol Naik

  • Nice article and thanks for posting and sharing your methods.

    Jason...AKA CirqueDeSQLeil
    _______________________________________________
    I have given a name to my pain...MCM SQL Server, MVP
    SQL RNNR
    Posting Performance Based Questions - Gail Shaw
    Learn Extended Events

  • Hi,

    Good article, great topic.

    I agree with the use of Change Tracking in 2008, and would add that triggers are not always straightforward. You must consider referential integrity, cascading events, and the typically more complex object relationships in an enterprise system.

    For simple structures, triggers can be used wisely - but watch them on any mass updates to your source as they can quickly become points of contention.

    For more complex structures that may already have cascading triggers or constraints, things can get unruly fast and you may find yourself spending more time coding than the effort is worth.

    But still, options are our friends, and you've certainly presented some above with triggers, Change Tracking, and Red Gate

    (also newer object level SQL replication services for that matter...)

    One other I use on occasion (usually from a remote host via open datasource/openrowset) but basically the same on the local host, is a simple SQL statement (simple is my best friend - simple is fast, simple is stable - so whenever possible I say go simple).

    If, again, the table/object in question is fairly stand-alone and has good key values (the basics usually required for a trigger-based copy as well), then another approach is to sync the copy from the source using straight T-SQL.

    Example:

    Table1 has a key column named RecordID and a datetime column named EntryDate

    Table2 is an exact duplicate structure of table1 (assumes initial bulk load).

    -- Declare a table variable to hold the keys of the new/changed rows
    DECLARE @TABVAR TABLE (RecordID INT);

    -- Declare storage and set the value for the latest record in your copy table
    DECLARE @synctime DATETIME;
    SELECT @synctime = MAX(EntryDate) FROM Table2;

    -- Fetch NEW IDs for processing (records in source - not yet in copy)
    INSERT INTO @TABVAR (RecordID)
    SELECT RecordID FROM Table1 WHERE EntryDate >= @synctime;

    -- Delete records from your copy table, prepping for the new inbound rows
    -- (sometimes not required if the source is insert-only, but we'll assume some updates also occur)
    DELETE FROM Table2 WHERE RecordID IN (SELECT RecordID FROM @TABVAR);

    -- Insert and sync data
    INSERT INTO Table2
    SELECT * FROM Table1 WHERE RecordID IN (SELECT RecordID FROM @TABVAR);

    Done.

    Run this on an interval during your day and your copy stays in sync.

    It may be similar to trigger code, but it is not bound to your source table and hence has no impact on application code or performance except while the task is executing.

    * The example may be loose, but it gives you the idea.

    Deletes in the source are a consideration, but I suppose in any replication the first thing is a review of structures and limitations and finding the right fit.

    And again, this is simple code that can also be run from a remote host without the need for a linked server or DSN; the remote-host selects just use OPENDATASOURCE / OPENROWSET calls (see the sketch at the end of this post).

    Enjoy and thanks for sharing your insights !
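
    As a rough illustration of the remote-host variant mentioned above, the insert leg of the same sync could look something like this (the server, database, and table names are placeholders, and 'Ad Hoc Distributed Queries' must be enabled via sp_configure for OPENDATASOURCE to work):

    -- Pull new rows straight from the source server without a linked server or DSN
    DECLARE @synctime DATETIME;
    SELECT @synctime = MAX(EntryDate) FROM dbo.Table2;

    INSERT INTO dbo.Table2
    SELECT Src.*
    FROM OPENDATASOURCE(
             'SQLNCLI',
             'Data Source=SourceServerName;Integrated Security=SSPI'
         ).SourceDB.dbo.Table1 AS Src
    WHERE Src.EntryDate >= @synctime;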

  • Great article, Solomon!

    I used to do this with humongous mainframe databases for banks back in the 90s. And yes, you really can implement the change in seconds if you're well prepared. Real 24/7 operations.

    It amazes me how many 'architects' still don't get this sort of technique and 'pollute' databases with silly tables/views as they try to work around the fact that even a 60-second outage is unacceptable.

    BTW - I've used very similar techniques to migrate huge amounts of data from an old system to a new system. Triggers are very cool. 😉

  • Great article Solomon!

    I've often used similar methods during migrations, even on small datasets, achieving sub-second deployment of new releases, which is a great feature for websites (a similar trick can be used on MS IIS by setting up the new site and just switching the root folder and virtual folder at the same time as switching the tables).

    So far I've always developed a migration "application" together with the new developments, making extensive use of the specific contents/structures of the underlying system to optimize the migration/synchronization of data. Your article goes a long way toward formalizing the method (which I like a lot, since it allows me to spend more time on the new developments and less on the migration itself).

    I do, however, avoid triggers like the plague in production code as a matter of principle; they can cause more problems than they're worth (I've been developing SQL since version 6.5 and have learned the hard way). I much prefer rbarbati's approach from the earlier reply. For this I prepare ALL my tables to ALWAYS include a timestamp column, so I can find rows updated/inserted since a certain point. That point is kept in a special table containing just one row with one timestamp field and a binary(8) field holding the previous timestamp; every synchronization job starts by copying the real timestamp into the previous-timestamp field. This gives me a window of timestamps to look at since the previous synchronization, across all tables, with 100% accuracy (timestamps are global to the database and I've so far never seen any problems/bugs). To avoid problems with table relations changing during the migration queries, I even ignore changes made after this timestamp window, so the window really gives a snapshot of data with guaranteed referential integrity, even if users are making changes while the pre-release migration runs (those will be taken care of in the next run). A sketch of this pattern follows at the end of this post.

    One note about indexing this timestamp field, though: on big tables you can create a non-clustered unique index on the timestamp and primary key. Be careful, however, creating indexes on tables with high modification rates; that index is ALWAYS hit by ALL updates/inserts, and it may be wiser to have the migration code do dirty reads (non-blocking) and a SLOW table scan, and deal with the consequences of the dirty reads, rather than penalize production code with the index update. You'll have to test both ways on test servers simulating the high load and measure the penalty. Deletions can of course be found by left-joining the primary keys of the old and new tables. I have used 'delete' flags on some occasions rather than physical deletes, but this places a burden on the application to always filter out those records. It can, however, be a great way to very quickly "delete" records that are heavily referenced by other tables without having to update those references, while also keeping a history of 'old' records and providing a good way of finding deleted records at migration time (the timestamp is set at 'deletion'), so it may be worth it for certain tables in certain applications.

    Combine this with Solomon's windowed copying and you have a hell of a migration platform, with low to almost zero impact on production downtime, which is of more and more importance in today's globally connected world.
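
    A minimal sketch of the timestamp-window idea described in the post above, simplified here to use @@DBTS as the database-wide high-water mark (all object names are invented for illustration; timestamp/rowversion values are global to the database, as noted above):

    -- Hypothetical source table; every table carries its own rowversion column
    CREATE TABLE dbo.Orders
    (
        OrderID   INT IDENTITY(1,1) PRIMARY KEY,
        OrderDate DATETIME NOT NULL,
        RowVer    ROWVERSION
    );

    -- Optional supporting index (weigh it against the write overhead discussed above)
    CREATE UNIQUE NONCLUSTERED INDEX IX_Orders_RowVer ON dbo.Orders (RowVer, OrderID);

    -- One-row control table holding the marker saved by the previous sync run
    CREATE TABLE dbo.SyncWindow (PreviousTS BINARY(8) NOT NULL DEFAULT 0x0);
    INSERT INTO dbo.SyncWindow (PreviousTS) VALUES (0x0);

    -- At the start of each sync run: the window is (previous marker, current high-water mark]
    DECLARE @From BINARY(8), @To BINARY(8);
    SELECT @From = PreviousTS FROM dbo.SyncWindow;
    SET    @To   = CAST(@@DBTS AS BINARY(8));   -- last rowversion value used in this database
    UPDATE dbo.SyncWindow SET PreviousTS = @To;

    -- Rows inserted/updated within the window; anything after @To waits for the next run
    SELECT OrderID, OrderDate
    FROM   dbo.Orders
    WHERE  RowVer > @From AND RowVer <= @To;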

  • In my opinion, adding a trigger to the production database, creating a table in the production database, dropping a table from the production database, or creating a job in the production database IS A RELEASE.

    Every change made in the production database should be considered a release.

    For example, adding a trigger could affect performance and should be planned ahead, tested, released, and monitored. The same applies to the other changes, which makes these steps actually:

    Release I, Release II and Release III

    Where do FKs fit in? (Their names have to be unique at the database level.)

    What happens in the case of replication, if the changed table is published?

    How does this data transfer affect tlog and backup size?

    What happens if there is log shipping set up on the database? How is space affected? How is performance affected?

    Depending on the table structure and the way the transfer is done, indexes can end up heavily fragmented if they are created before the table is populated.

    Although useful in some cases, it seems to me that this approach has limited applicability and hidden risks which should have been mentioned as well.

