With the introduction of the GDPR, a lot of our prospective clients are demanding more and more information about our data disposal, retention, and usage practices. Both new and existing clients want to look at our internal ecosystem and understand how we manage it. This means I finally got a budget for a disaster recovery test server, a full set of Redgate tools, and the time to implement the long-needed "automated database restore for 200 databases".
Now when a sales guy asks me about our DR testing strategy, I can finally tell them that we have one and it works: every month we restore each database, test it, and report the results to management. This is something my project managers have been complaining about since I started last year, but no one wanted to pay for it out of their budget.
After going through Capex (capital expenditure) approval, figuring out what licensing we needed, and jumping through the usual large-company paperwork, we finally got our server. In fact, four weeks to get a new server fully functioning, licensed, and with the correct Active Directory group permissions is a record for us.
I'll admit, we had to guess a little on capacity. I know there are calculations you can do for capacity planning, and we skipped them (bad DBA). Since this was going to be a DR test server that we would tear down after restoring each database, our plan was simply "our biggest database x 5".
Creating our restores using the Redgate SQL Backup interface was super cool; we scheduled lots of restores for the first Monday of each month. We hit some problems copying and pasting SQL Agent jobs between databases with different numbers of files and filegroups. It was fine for a single data file and a single log file. We altered our jobs and overcame almost every issue. Except one.
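Our jobs were built in the Redgate GUI, but the same pitfall exists with native T-SQL restores: you need one MOVE clause per file in the backup, so a job copied from a one-file database breaks on a multi-file one. A minimal sketch, with made-up database and path names:

```sql
-- First, list the files inside the backup so the restore job matches them
RESTORE FILELISTONLY FROM DISK = N'D:\Backups\MyBigDb.bak';

-- Restore with one MOVE per file; databases with extra filegroups
-- need extra MOVE lines, which is where our copied jobs went wrong
RESTORE DATABASE MyBigDb_Test
FROM DISK = N'D:\Backups\MyBigDb.bak'
WITH MOVE N'MyBigDb_Data' TO N'E:\Data\MyBigDb_Test.mdf',
     MOVE N'MyBigDb_Log'  TO N'F:\Log\MyBigDb_Test.ldf',
     REPLACE, STATS = 10;
```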
On the first round of restores, we hit a database that would not restore. Unfortunately, this was a crucial database common to all of our key customers (McDonalds, Starbucks, and many more). Our first reaction was "touch nothing until we are safe". What do we do if we try anything dangerous and lose the database? How do we know whether the older backups are good either? This is something I learnt in great detail from a Michael Hotek lecture in London in 2005.
So, first thing: "don't panic." Let's try to figure out whether it is the backup process, the restore process, or the database itself. I decided to check the database on the live server. I could not even run sp_updatestats, which started to worry me.
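For anyone retracing the steps, the basic live-server health checks looked something like this (the database name is hypothetical):

```sql
USE MyBigDb;
GO
-- Refresh statistics on every table; this was the first thing that failed
EXEC sp_updatestats;

-- Rule out corruption in the live database itself
DBCC CHECKDB (N'MyBigDb') WITH NO_INFOMSGS;
```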
I started looking at the tables that couldn't update statistics and found that they were all Memory-Optimized Tables (MOT), which was not something we knew the dev team was using. Now, I hope you can forgive me for assuming that Memory-Optimized Tables work a little like DBCC PINTABLE. They don't, and I was wrong.
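If you want to check whether a database is quietly using this feature, the catalog views will tell you. A quick sketch:

```sql
-- List any memory-optimized tables and their durability setting
SELECT name, durability_desc
FROM sys.tables
WHERE is_memory_optimized = 1;

-- Check for a memory-optimized data filegroup (type 'FX')
SELECT name, type_desc
FROM sys.filegroups
WHERE type = 'FX';
```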
I Googled for the answer, looked through all of the resources I have (mostly SQLServerCentral.com), and experimented to find out what could be causing this. As usual, Microsoft's errors can be a little vague, and I came up with nothing.
Given that a core database was in danger, I first tried to rule out the obvious. I scripted out the create statements for all of the MOT tables. I needed to do this anyway: you can only have 8 indexes on an MOT table, and we had performance concerns with certain queries. I later found out why we were using them at all - a senior developer had read that Memory-Optimized Tables would solve his problems, so we had upgraded to Enterprise Edition at huge cost rather than indexing the tables correctly.
I changed the script for the database to change the MOT tables into normal tables with 13 indexes. I assumed that this would fix the issue. This did allow sp_updatestats to start working, but I still couldn't restore the database.
I made a bit of a guess at this point and thought it might be related to the memory-optimized filegroup. I thought to myself, "OK, let's drop the empty filegroup." Unfortunately, the only advice Microsoft gives is to "drop the database and rebuild it" - not very useful when the database holds a few terabytes of data. That was certainly not going to happen on my production server!
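This is the attempt that hits the wall. Once a memory-optimized filegroup has been added, SQL Server won't let you remove it, even if it contains no files and no MOT tables remain. A sketch of the dead end (filegroup name is made up):

```sql
-- This fails for a MEMORY_OPTIMIZED_DATA filegroup, even when empty;
-- the only documented route is dropping and rebuilding the database
ALTER DATABASE MyBigDb REMOVE FILEGROUP MemoryOptimizedFG;
```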
I spent a few hours scratching my head and writing emails asking developers not to release changes in case we had an issue - in effect, a change freeze. Then I thought to look at the target server I was restoring databases to. We never intended for this server to actively run the databases; it was simply there to restore them and report the results.
We had spent about $3,000 on a small server with lots of high-speed drives but only 8GB of RAM. Who needs RAM if you aren't running your databases? My engineer had configured the server (physical, not virtual) with 8GB of RAM, 4TB of disk, and 8 cores. More than enough to run restore tests.
My next step was to look at the SQL Server memory configuration on the target server. It was set to dynamic memory usage, but allowed to use unlimited memory. Yes, I know, "bad practice", but this was a box to dump backups on and prove we have a good restore position. I let my network engineer configure the server and didn't check the settings, pretty much treating it like a workstation rather than a server.
I got lucky at this point: I decided to reduce SQL Server's maximum memory allocation to 4GB, leaving enough headroom for the memory-optimized filegroup, even though there were no files in it. To my surprise, the restore worked. It seems that when the database is initialised, SQL Server wants to reserve a new memory space for the filegroup, but at that point the dynamic memory settings don't come into play, and SQL Server won't release memory in order to create it.
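The fix itself is a one-off configuration change. Something like this (4096 MB was our value; size it for your own headroom):

```sql
-- Cap SQL Server's memory so the OS and the memory-optimized
-- filegroup creation have room to breathe on a small box
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'max server memory (MB)', 4096;
RECONFIGURE;
```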
As a result of this, I've had to reconfigure all of our cloud-based failover sites, but I've also taken away a few points from this.
Number 1. Even though we never have time to DR test, it is possibly one of the most valuable things you can do.
Number 2. Never trust that your dev team aren't going to use something you haven't planned for.
Number 3. Never under-spec your DR test servers.
Number 4. If you use Memory-Optimized Tables, be aware of their limitations before you use them.
DR testing is not glamorous, but it sure as heck frightens your manager when you prove that you don't have a good DR position.