Moving data between databases

  • Hello,

    I have a couple of questions regarding a framework that I may want to change regarding the moving of data between databases on the same SQL Server.

    I will provide some background first, then describe my current method, and then maybe you can help me figure out a better way to achieve this goal.

    I have a multi-user win32 application that uses SQL Server extensively. Through user interaction, system processing, and so on, rows are inserted, updated, and deleted pretty much constantly in the database.

    As time goes by, the database grows larger and larger, mainly because I need to retain audit data for tracing purposes regarding transactions, inventories, and so on; I need to be able to say who did what and at what time. This data is available through reports embedded in the application, so they use the same ADO data source as the main application.

    After a while, users complain that some queries take a long time to run, despite regular index and database maintenance, simply because the database keeps growing. This also has an impact when they need to send me a database backup to investigate problems, find bugs, and so on. So to prevent this problem, I designed an archiving feature outside the application.

    There is another database for archiving purposes. It's an exact copy of the application database in terms of schema, except that the identity property is turned off on the archive tables (since I want the IDs to match the live database).

    What we designed is a simple stored procedure that looks into the live database's schema and finds tables that have certain timestamp fields telling us that a record has been deleted. If we find one of these fields, the table is eligible for archiving. For each of these tables, we go day by day and build 2 dynamic SQL statements (using EXEC): one to INSERT INTO the archive table SELECT * FROM the live table WHERE the field <= @MaxDate, and one to DELETE FROM the live table WHERE the field <= @MaxDate. Basically we want to archive all tables, but one day at a time (keeping at most X days of live data in the live database for reporting purposes, which is configurable by the users), and we wait 2 seconds between tables to give some leeway and reduce locks on live tables for the application, which has to keep running during that time.
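    To make that a bit more concrete, here is a simplified sketch of the per-table dynamic SQL (the day-by-day loop is left out, and ArchiveDB / DeletedDate are placeholder names, not our real schema):

        DECLARE @table   sysname,
                @sql     nvarchar(4000),
                @MaxDate datetime;

        -- keep 30 days of live data; the real procedure reads this from configuration
        SET @MaxDate = DATEADD(day, -30, GETDATE());

        -- find every table that carries the "record has been deleted" timestamp column
        DECLARE table_cursor CURSOR LOCAL FAST_FORWARD FOR
            SELECT TABLE_NAME
            FROM INFORMATION_SCHEMA.COLUMNS
            WHERE COLUMN_NAME = 'DeletedDate';

        OPEN table_cursor;
        FETCH NEXT FROM table_cursor INTO @table;

        WHILE @@FETCH_STATUS = 0
        BEGIN
            -- copy the eligible rows into the archive database...
            SET @sql = N'INSERT INTO ArchiveDB.dbo.' + @table
                     + N' SELECT * FROM dbo.' + @table
                     + N' WHERE DeletedDate <= ''' + CONVERT(nvarchar(23), @MaxDate, 121) + N'''';
            EXEC (@sql);

            -- ...then remove them from the live database
            SET @sql = N'DELETE FROM dbo.' + @table
                     + N' WHERE DeletedDate <= ''' + CONVERT(nvarchar(23), @MaxDate, 121) + N'''';
            EXEC (@sql);

            -- give the live application some breathing room between tables
            WAITFOR DELAY '00:00:02';

            FETCH NEXT FROM table_cursor INTO @table;
        END

        CLOSE table_cursor;
        DEALLOCATE table_cursor;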

    This stored procedure can then be scheduled, at the user's discretion, as a simple one-step SQL Server Agent job.
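    For reference, setting that job up looks roughly like this (the job, schedule and procedure names are just examples, not what we actually use):

        -- one-step SQL Server Agent job that runs the archiving procedure nightly
        EXEC msdb.dbo.sp_add_job
            @job_name = N'Archive live data';

        EXEC msdb.dbo.sp_add_jobstep
            @job_name      = N'Archive live data',
            @step_name     = N'Run archiving procedure',
            @subsystem     = N'TSQL',
            @database_name = N'LiveDB',
            @command       = N'EXEC dbo.usp_ArchiveLiveData;';

        EXEC msdb.dbo.sp_add_jobschedule
            @job_name          = N'Archive live data',
            @name              = N'Nightly at 01:00',
            @freq_type         = 4,      -- daily
            @freq_interval     = 1,
            @active_start_time = 010000; -- 01:00:00

        EXEC msdb.dbo.sp_add_jobserver
            @job_name    = N'Archive live data',
            @server_name = N'(local)';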

    As you can imagine, this works great when it is set up when the live database is first created. However, most of the sites I try to get to implement this have waited about 2-3 years on average before doing so. This means that the first time they run it, it takes days to complete. Some tables have about 4,000,000 records by then and only 1,000,000 are to be kept, so for each day we need to move X records, and just locating the records to move takes time.

    So my question is this: how can you implement a solution that works on multiple client sites to effectively move data from one database to another (on the same server)? Not just copy, but move, i.e., add the rows to the archiving environment and remove them from the live environment, without impacting the performance of the live application too much. And it has to work on both 2000 and 2005.

    Thanks a lot and don't hesitate to ask more questions regarding how it's currently designed if it is unclear.

    Greg

  • I guess that you also don't want to change your archiving mechanism too much.

    If you want to reduce the number of rows processed per execution, you need to reduce the number of days to process in each archive step. 

    One possibility is to create a loop for each day between the last archive run and the current value of @MaxDate and archive/delete records between those two dates. This would effectively do your archiving on a day-by-day basis and allow you to catch up for those databases that haven't archived anything yet.

    It might take some time to catch up, and you might want to put some time restrictions in so that it runs at a quiet time of the day (or night) - it will take more days to complete the archive but will have less impact on the current users.
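    For example, a check like this inside the archiving loop would stop the run once you leave the quiet window (the hours are only an example):

        -- inside the existing archiving WHILE loop:
        -- bail out once we are past the 01:00-05:00 quiet window; the next
        -- scheduled run simply picks up where this one left off
        IF DATEPART(hour, GETDATE()) NOT BETWEEN 1 AND 4
            BREAK;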

    It's not a perfect solution but might give you some ideas.

     

    J

  • Hi,

    That's actually the way it's currently set up. We take the minimum date, compare it to @MaxDate, and run a WHILE loop that contains an inner WHILE over the list of tables to archive. When all tables have been processed for a given day, we increment @currentDate by 1 until @currentDate = @MaxDate.
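    For anyone curious, the outer loop skeleton is roughly this (the table and column names are placeholders):

        DECLARE @currentDate datetime,
                @MaxDate     datetime;

        -- oldest day still sitting in the live database
        SELECT @currentDate = MIN(DeletedDate) FROM dbo.SomeAuditedTable;

        -- keep X days of live data; X is configurable in the real procedure
        SET @MaxDate = DATEADD(day, -30, GETDATE());

        WHILE @currentDate <= @MaxDate
        BEGIN
            -- inner WHILE over the list of archivable tables goes here, moving
            -- rows where DeletedDate < DATEADD(day, 1, @currentDate)

            SET @currentDate = DATEADD(day, 1, @currentDate);
        END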

    I'm open to changing the way this works completely if there's something better out there, though. I'm just not sure I know of anything better right now, which is why I was asking. Since it also needs to work on 2000 or 2005, my choices are further restricted to 2000-compatible methods.

    Greg

    What's the actual problem with implementing your archiving strategy as it is? It presumably works once it is up and running, so is there an actual problem, or are you just planning ahead?

     

    J

    I am trying to plan ahead. When we first picked this approach it made sense, because the database activity was mostly due to user interaction. However, with more and more integration on the way and some multi-threaded components that compute things every 5 minutes or so, it becomes harder to schedule these runs during a "real" downtime, because there isn't really any downtime anymore.

    I guess I'm more or less happy with this way of doing things. It works great when you implement it from the beginning; even when you implement it later, the first batch takes a lot of time, but after that you're good to go.

    I was just wondering if there are other ways out there that people have tried, with or without success. Really, this is more of a discussion than a question.

    I have almost exactly the same problem, so I will use this topic too 🙂

    I've got a large (1 TB) database on SQL Server 2000 and need to move "archive" data to a different server (also SQL Server 2000).

    I do not currently have any archiving mechanism on the source server (except for daily backups), so I wonder which would be the best option:

    1. replicate the needed data to the new server and delete it from the source, using SQL Server's replication mechanisms with horizontal/vertical filtering

    2. use DTS (I'm not very familiar with DTS)

    3. write my own T-SQL to insert the data on the new server and delete it from the old one

    Only a few tables are to be archived, and only a certain list of columns needs to be kept.

    Which option would be best to consider?

    regards,
    Anna
  • Hi,

    For the extraction part, for pure speed, you could try a simple SSIS package that exports the required data from your source DB to a Raw File and then an import from the Raw File into the archiving database. This can perform better than doing the export/import on the fly, mainly because you can control the transaction scope of the two operations separately. But the other methods you mention would work too. Plus, Raw File imports/exports are pretty fast to process in SSIS (they often showcase SSIS's speed using Raw Files as the sources).

    For the deletion part, depending on how much data you're removing from the tables, you could build a new table, move the data you want to keep into it (using BULK INSERT in T-SQL, SSIS, or any other fast method), then drop the old table and rename the new one to the old name. Otherwise, you could export the primary keys (assuming your table is not a heap) to another table and perform a DELETE with a JOIN on the keys you just selected; however, this will take much longer if you have a lot of rows to remove at a time.
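    As a rough sketch of the second option (all table and column names are made up, and the batching uses SET ROWCOUNT since it has to run on 2000):

        DECLARE @MaxDate datetime;
        SET @MaxDate = '20080101';   -- example cutoff

        -- capture the keys of the rows to remove
        SELECT OrderID
        INTO dbo.OrdersToDelete
        FROM dbo.Orders
        WHERE OrderDate <= @MaxDate;

        -- delete in batches so each transaction (and the log) stays small
        SET ROWCOUNT 10000;
        WHILE 1 = 1
        BEGIN
            DELETE o
            FROM dbo.Orders AS o
            INNER JOIN dbo.OrdersToDelete AS d ON d.OrderID = o.OrderID;

            IF @@ROWCOUNT = 0 BREAK;
        END
        SET ROWCOUNT 0;

        DROP TABLE dbo.OrdersToDelete;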

    These are more "maintenance-mode" kinds of methods, though; if it's something you need to schedule very often, then plain T-SQL might be enough, depending on the volume of data to move. The thing is that with T-SQL you'll have to deal with remote servers/queries, if I understand correctly. That's why exporting a Raw File to a share accessible to both servers could be a good way to reduce the time needed.
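    For the cross-server T-SQL route, the copy side would look something like this (ARCHIVESRV is a placeholder for an already-configured linked server, and the table/column names are made up):

        DECLARE @MaxDate datetime;
        SET @MaxDate = '20080101';   -- example cutoff

        -- four-part name: linked server, database, schema, table
        INSERT INTO ARCHIVESRV.ArchiveDB.dbo.Orders (OrderID, CustomerID, OrderDate)
        SELECT OrderID, CustomerID, OrderDate
        FROM dbo.Orders
        WHERE OrderDate <= @MaxDate;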

    Hope this helps,

    Greg

    I do not have SSIS anywhere, as my infrastructure is still SQL Server 2000-based (a few dozen instances waiting for an upgrade...).

    So, did you mean DTS?

    I wouldn't say saving data to a file is OK for tables of, say, 200 GB each. That would also take a lot of disk space and I/O for SQL Server to consume, and my maximum daily window for this is 2 hours (according to the SLA).

    Is direct data copying via DTS more efficient than replication?

    regards,
    Anna
    Sorry, I did not get the 2000 thing (hard to believe it's still out there so much 😉).

    Anyway, I'm not too familiar with DTS in terms of performance, though technically I'm pretty sure the process would be fairly similar to SSIS. The idea behind the Raw File export was to avoid a situation where you could lock both SQL Servers (if you use transactions, for example). The Raw File method lets you run the export and the import as two completely separate processes if you want. Some of the numbers from MS and other sites show that Raw Files work really well, but this all depends on your scenario/data too (http://consultingblogs.emc.com/jamiethomson/archive/2006/06/28/SSIS_3A00_-Comparing-performance-of-a-raw-file-against-a-recordset-destination.aspx).

    As far as replication is concerned, I'm definitely no expert; we have only tried a couple of things with it where I work, so I wouldn't really be able to tell you. How would you deal with the deletion part in that scenario?

    Maybe you could run a test load somewhere with replication and DTS and compare the two? Even throw in T-SQL as a third benchmark?
