Data Cleanup

The end of the year is when I do a little data cleanup. Not a lot, but some. Work slows down, with no major deadlines, so I'm able to spend a bit of time organizing myself. I usually go through my downloads and documents folders, deleting files I don't expect to use. I ensure that my jobs removing older log and temp files are running, especially on the laptops. A lot of disuse in 2021 had a few of them filling disks with more files than I had expected.

In a few positions I've had as a DBA, this was a time when we'd look to archive away some older databases, and even some data. It was rare, but always invigorating because we knew queries would run faster in systems when we could archive off data. I haven't ever started with clean databases in a new year, but in a couple of jobs we'd archive off all data that was more than one or two years old into another database. It was available if someone needed it, but the main OLTP databases would contain only one or two years' worth of data, helping queries perform quicker at the start of the year.

Last month I was at the SQL Server and Azure SQL Conference, and someone asked about the future plans for archive and migration of SQL Server data. The person noted they wished they could easily archive off data, specifically using a feature like Stretch Database. That feature wasn't cost-effective, but it might be if you could stretch to another instance rather than Azure. No good answer from Microsoft, but that is something I'd like to see.

Archiving older data, often data that is less queried is a good way to speed up systems. However, doing this seamlessly, and with easy access from applications, is cumbersome. Software should make this easy. I set up a database and then point the archive process to that location. The process manages moving data from a table based on a column value I've specified. Ideally, I could also decide if I allow queries to span the two databases automatically or I require some switch to allow querying of the archived data.

There are plenty of possibilities here, but I suspect we won't see any of them soon. There isn't a lot of extra money to be made by allowing customers to manage their own archival systems, and these days it seems every software vendor is trying to make money from renting features rather than selling them. Archival isn't one I see many customers willing to pay a premium for, so I suspect the solution for most of us is to write better and code and ensure queries perform well, even when we have terabytes of data.