Failover 10,000 databases?

  • Hi All,

    It may sound like a clustering question, but it's also a service restart one.

    We have a Windows 2008 cluster with one instance of SQL Server 2008R2.

    There are 10,000+ databases on it. While user database access is fast, a cluster failover takes almost an hour, even on high-performing hardware. We would like to reduce the failover time.

    1/3 of the databases are not active.

    Almost all of them are in the SIMPLE recovery model.

    I am considering:

    - Setting AUTO_CLOSE ON for inactive databases, or taking them OFFLINE

    - Playing with the recovery interval (both sketched below)
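
    Roughly the commands I have in mind (MyDb is just a placeholder name):

        -- Option 1: auto-close inactive databases, or take them fully offline
        ALTER DATABASE MyDb SET AUTO_CLOSE ON;
        ALTER DATABASE MyDb SET OFFLINE;

        -- Option 2: raise the target recovery interval (in minutes) instance-wide
        EXEC sp_configure 'show advanced options', 1;
        RECONFIGURE;
        EXEC sp_configure 'recovery interval (min)', 5;
        RECONFIGURE;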

    Anything else you might suggest?

    Cheers,

    Kuzey

  • The failover takes so long because of the sheer number of databases. To come online, SQL Server needs to open file handles to all the database files (each database has a minimum of one data file and one log file, so 20,000+ file handles) and perform a consistency check on the databases.

    The best way to decrease the failover time is to remove as many databases as possible. You could place all the inactive databases on another instance. You could also archive them to a backup and restore them when a database is needed.
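
    A minimal sketch of the archive-and-restore idea (database name and backup path are placeholders):

        -- Archive an inactive database to a backup, then drop it from the instance
        BACKUP DATABASE ArchiveCandidateDb
            TO DISK = N'\\ArchiveShare\Backups\ArchiveCandidateDb.bak'
            WITH COMPRESSION, CHECKSUM;
        DROP DATABASE ArchiveCandidateDb;

        -- Restore it on demand when the database is needed again
        RESTORE DATABASE ArchiveCandidateDb
            FROM DISK = N'\\ArchiveShare\Backups\ArchiveCandidateDb.bak'
            WITH RECOVERY;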

    ** Don't mistake the ‘stupidity of the crowd’ for the ‘wisdom of the group’! **
  • Do an ALTER DATABASE or a detach/reattach and move all the inactive or offline databases to a disk that is not part of the cluster. That will reduce the number of databases involved in a failover and the time SQL Server spends failing them over.

  • If those inactive databases are never used, you can take them offline. If they're offline, SQL won't open or recover them on startup.

    Still, with thousands of databases you're looking at a long recovery time. No possibility of moving some of the active databases elsewhere?
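
    Something along these lines, per database (the name is just a placeholder):

        -- An offline database is skipped at startup: no file handles opened, no crash recovery run
        ALTER DATABASE InactiveDb SET OFFLINE WITH ROLLBACK IMMEDIATE;

        -- Bring it back only when it's actually needed
        ALTER DATABASE InactiveDb SET ONLINE;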

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Jason.Reeves (8/22/2013)


    Do an ALTER DATABASE or a detach/reattach and move all the inactive or offline databases to a disk that is not part of the cluster.

    A clustered SQL Server's database files must always be on a disk that is part of the cluster and set as a dependency of the SQL Server service. Otherwise it cannot even see the drive in question.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Recovery time is adversely affected by a large number of VLFs (virtual log files) in a transaction log file (see here), so it is worth checking your databases for this condition (rule of thumb: fewer than 100 VLFs). Scan the error log to see which databases take the longest to recover and compare that against the VLF count of each transaction log.
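
    A quick way to count VLFs on 2008 R2, run per database or looped with sp_MSforeachdb (the column layout below matches 2008 R2 and changes in later versions):

        -- DBCC LOGINFO returns one row per VLF in the current database's log
        DECLARE @vlf TABLE (FileId int, FileSize bigint, StartOffset bigint,
                            FSeqNo int, [Status] int, Parity tinyint, CreateLSN numeric(25,0));
        INSERT INTO @vlf EXEC ('DBCC LOGINFO');
        SELECT DB_NAME() AS db_name, COUNT(*) AS vlf_count FROM @vlf;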

    ---------------------------------------------------------------------

  • HanShi (8/22/2013)


    and perform a consistency check on the databases.

    There is no consistency check performed at startup or automatically at any other time. The only time SQL performs a consistency check is when a user runs one via one or more of the DBCC Check* commands.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • If you've got that many databases and that many are inactive, you'll want to set the inactive ones offline. If they are online when SQL Server starts, it needs to run crash recovery on them. If you set the databases to auto-close, it still needs to bring them online on startup and then close them again, which actually adds work to the process rather than removing it.

    If possible, move some databases to another instance, either on the same cluster or on another cluster, so that there are fewer databases to bring online at once and the instance becomes available faster.

    Also check your VLF counts and keep them as low as possible. Without knowing the size of the databases it's impossible to give an exact number to aim for, but less is better. Also look at your I/O to see if there are lots of reads and writes going on. What do the transactions per second on the system look like? Can the disks handle having this many files read from and written to all at once?
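
    For the transaction rate, one quick check from inside SQL Server (the counters are cumulative, so sample twice over a known interval and take the difference):

        -- Per-database transaction counters, cumulative since startup
        SELECT instance_name AS db, cntr_value
        FROM sys.dm_os_performance_counters
        WHERE counter_name = 'Transactions/sec'
          AND object_name LIKE '%:Databases%';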

  • I'm with Denny here...

    Check the Tlog file utilization and VLFs for each database. I was able to improve the recovery time of a 1TB database just by doing that (over 200 total databases on the failover instance). It went down from a few minutes to a few seconds, for that 1TB database alone, I mean. The Tlog file was too big and fragmented, and that was delaying the recovery.

    Also, check the speed and latency of your Tlog and data LUNs. Slow disks will also slow down the recovery; it will take longer to read the Tlog files.
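
    The usual fix for a fragmented log, roughly (names and sizes are placeholders; do this in a quiet period):

        -- Shrink the fragmented log right down, then regrow it in one large step
        -- so it is rebuilt with far fewer, larger VLFs
        USE BigDb;
        DBCC SHRINKFILE (BigDb_log, 1);   -- target size in MB
        ALTER DATABASE BigDb
            MODIFY FILE (NAME = BigDb_log, SIZE = 8GB);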

  • Kuzey (8/22/2013)


    Hi All,

    It may sound like a clustering question, but it's also a service restart one.

    We have a Windows 2008 cluster with one instance of SQL Server 2008R2.

    There are 10,000+ databases on it. While user database access is fast, a cluster failover takes almost an hour, even on high-performing hardware. We would like to reduce the failover time.

    10,000 databases on a cluster? Not good architecture. I'd like to be corrected if I'm wrong, but is there a reason that many databases need to be on one instance, or even one server? If I saw a couple hundred databases on a cluster, I'd be nervous about the things that can happen.

    - What if one application or database has issues and a restart of SQL Server is required? All databases are impacted.

    - What about disaster recovery possibilities and time frames?

    - Backups, maintenance plans, how would you do an upgrade or control resource utilization... Well, in short, hoping this is not prod. :Whistling:

    To reduce the failover time, as people have already pointed out, please move databases off to other servers. For the inactive ones, I'd recommend taking a backup, taking them offline, and removing them after a couple of months.

  • GilaMonster (8/22/2013)


    Jason.Reeves (8/22/2013)


    Do an ALTER DATABASE or a detach/reattach and move all the inactive or offline databases to a disk that is not part of the cluster.

    A clustered SQL Server's database files must always be on a disk that is part of the cluster and set as a dependency of the SQL Server service. Otherwise it cannot even see the drive in question.

    Could installing a separate instance in the cluster for offline/rarely used databases help?

  • Probably, as it would get databases off of the instance. The ability to do that would of course depend on the application design.

  • sqlsurfer101 (8/22/2013)


    10,000 databases on a cluster? Not good architecture. I'd like to be corrected if I'm wrong, but is there a reason that many databases need to be on one instance, or even one server? If I saw a couple hundred databases on a cluster, I'd be nervous about the things that can happen.

    - What if one application or database has issues and a restart of SQL Server is required? All databases are impacted.

    - What about disaster recovery possibilities and time frames?

    - Backups, maintenance plans, how would you do an upgrade or control resource utilization... Well, in short, hoping this is not prod. :Whistling:

    To reduce the failover time, as people have already pointed out, please move databases off to other servers. For the inactive ones, I'd recommend taking a backup, taking them offline, and removing them after a couple of months.

    Budget restrictions.

    Imagine deploying a cluster in a medium-sized company for every 100 or 500 databases you have, or whatever number you decide is good. How much money would that require? Would the total downtime, if any, ever exceed the actual investment and support cost of a SQL cluster in a year? SQL clusters are not cheap, and the more you have, the more you have to monitor.

    The number of databases is not as important as their size and their memory and CPU utilization. I would not be worried about having thousands of databases if the SQL cluster is properly designed. With a good SAN design, the right SQL Server version, and enough RAM, having thousands of databases should not be a problem. Databases do not move during a failover, unless you use Veritas Cluster or SAN-to-SAN replication.

  • Budget restrictions.

    The number of databases is not as important as their size and their memory and CPU utilization. I would not be worried about having thousands of databases if the SQL cluster is properly designed. With a good SAN design, the right SQL Server version, and enough RAM, having thousands of databases should not be a problem. Databases do not move during a failover, unless you use Veritas Cluster or SAN-to-SAN replication.

    Another tool I used years ago for DR on a high-transaction database came to mind: BackupExec Replication. It basically keeps two (data) files in sync by replicating only the changed bits. It had the lowest impact on servers and network among the solutions we looked at. Even that one may not handle 20,000+ open files.

    K.

  • mrdenny (8/23/2013)


    Probably, as it would get databases off of the instance. The ability to do that would of course depend on the application design.

    The app design can change to a certain extent. I am now actually considering an automated process that takes databases which have not been accessed for a certain time offline, or even moves them to an "archive" instance. I think a dynamic solution is needed to keep the number of active databases low.
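
    Something like this could feed that process; sys.dm_db_index_usage_stats is cleared on every restart, so it's only a rough signal of inactivity:

        -- Databases with no recorded read/write activity since the last restart
        SELECT d.name AS candidate_db
        FROM sys.databases AS d
        WHERE d.database_id > 4   -- skip the system databases
          AND NOT EXISTS (SELECT 1
                          FROM sys.dm_db_index_usage_stats AS u
                          WHERE u.database_id = d.database_id);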
