System databases on an iSCSI LUN

  • Hi All,

    I am preparing to build a new SQL Server 2008 Enterprise server. The machine is very well equipped (Windows 2008 Enterprise, 64GB RAM, 16-core processors, etc). The system volume is a local hard drive, but the data volume is a 300GB iSCSI LUN. This is all for a set of single-purpose databases, for a LOB web app that has been performing poorly.

    So here is my question: What is a best practice for locating your system databases on iSCSI LUNs?

    ---

    I am somewhat suspicious of iSCSI LUNs. In the past I have seen that if you put databases on the LUN and then lose connectivity with an LUN, even for a few seconds, it can cause the SQL service to crash and not recover.

    In my case, it was all the databases, even the system databases, sitting out on the LUNs. I had used the NetApp SnapManager for SQL and it had moved everything out to the SAN. So when someone accidentally unplugged the switch, SQL tried to restart the service in three quick tries, then gave up. When the LUN reconnected, SQL Server was already halted. When I manually started the service, everything came up just fine, but it gave me heartburn. This is hardly a high-availability solution.

    ---

    Today I am looking at a server in an enterprise. We have redundant, round-robin iSCSI paths, multiple NICs, and backup power on switches and the NetApp. The server is about ready for me to install SQL Server. But looking at the logs, I see that a couple of Windows updates installed on the 15th at 3AM... and there dozens of iSCSIPrt errors to accompany it. Apparently whatever update was being applied caused the connections on each NIC to go down for a second or two.

    I am worried that this could happen again and, even with all the redundancy we planned, it could still kill my SQL Server service.

    This is a government website and they are serious about HA. So what do I do here? I have several ideas:

    1. If I locate the system databases on the local C: drive, will the SQL Service will recover if the LUN is disconnected and reconnected?

    2. Let's assume #1 is true, and the SQL Service will keep running. If the LUN containing the databases disappears, can this result in data loss, corruption, torn pages, etc?

    I am thinking that the likelihood of a complete loss of connectivity is now very unlikely, since we have physical redundancy all the way to the SAN. But the errors from the 15th point out that a simple Windows Update could potentially interrupt a LUN connection and cause problems for me.

    What do you think I should do?

  • Hi there. We experienced a very similar problem you did. Our primary SQL server was a physcial box using multiple LUNs for the database storage (1 LUN for the system dbs, 2 LUNs for Application DBs, and a 4th for the SnapInfo LUN. We were getting sporadic disconnects across multiple servers connecting to the SAN over iSCSI. This brought the SQL Service down when the LUN holding the system dbs got disconnected. I ended up reverting the system db's back to local disk which the physical box. As stop gap i write out a daily SQL backup of the system dbs to a share that is replicated to our DR site. The problem ended up being a bug in the OS for the NetApp. When we upgraded to the newer OS the problem disappeared. I am going to give it a few more months to show any problems before I put the system dbs back on LUNs.

    To answer your question. If you have moved the system db's back to locally attached storage on the physical box then the lun's becoming disconnected will have no affect on the SQL Service. You applications databases may get throw suspect. If you using a single LUN for all your application databases and you have a significant number as we do I would recommend just cycling the SQL service once you have confirmed the LUN is available again. This will let the engine bring the dbs back online and verify they are ok. A simple refresh of MGMT Studio will tell you quickly if any of the dbs got damaged which can also be confirmed by the logs.

  • 1) I advise my clients to avoid NetApp stuff like the plague when dealing with SQL Server servers. Way too many problems historically

    2) I have several clients very heavily invested in iSCSI (EqualLogic primarily) with great success. One got up to over 30TB of SQL Server data before their bad design, underprovisioning and hard-hitting caused them to move to a (MUCH more expensive) Fibre Channel SAN.

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • SQLGuru,

    do you have some links or other information about NetApp Iscsi problems. I'm i the unfortunate situation that we use a NetApp and yes we have problems with it. But to convince management that the problem is not on the SQL Server side I need some more arguments.

    Thanks in advance

    [font="Verdana"]Markus Bohse[/font]

  • MarkusB (4/21/2010)


    SQLGuru,

    do you have some links or other information about NetApp Iscsi problems. I'm i the unfortunate situation that we use a NetApp and yes we have problems with it. But to convince management that the problem is not on the SQL Server side I need some more arguments.

    Thanks in advance

    I do not have any links for you - sorry. As a consultant I have come across multiple clients in the past with issues on their storage systems. I do acknowledge that I have not seen many clients using them for SQL storage in the past few years so it is quite reasonable to believe that any issues they may have had could be rectified in later versions of their hardware/firmware. I also note that EVERY SINGLE CLIENT I have EVER had with ANY flavor of SAN has done a VERY poor job of implementing/operating said storage system. I have had more than one meeting with CIOs to tell them their seven figure investment was CAUSING performance issues as opposed to helping to ALEVIATE them. There are many many ways you can go wrong when provisioning a SAN for SQL workloads. I highly recommend getting some professional help before putting new storage online - it is soooooOOOOO much easier to do things right from the get-go than to fix problems after rollout!

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • SQLGuru,

    thanks for your reply and I totally agree with you that a lot of SAN implementations are not very well done.

    Unfortunately I joined my current company just after they had implemented the NetApp storage. And can you believe it, they never did any performance test before moving all the data to the SAN.

    Now we have endless discussions about what is acceptable performance vs. the "other advantages" of having data on the SAN.

    But while performance is not great, at the moment I have bigger worries with Iscsi errors while doing backups to a share on the SAN.

    Lots of EventID 20 "Connection to the target was lost. The initiator will attempt to retry the connection." messages.

    But looking at the bright side of things, these might finally convince management, that they made the wrong choice.

    [font="Verdana"]Markus Bohse[/font]

  • MarkusB (4/22/2010)


    SQLGuru,

    thanks for your reply and I totally agree with you that a lot of SAN implementations are not very well done.

    Unfortunately I joined my current company just after they had implemented the NetApp storage. And can you believe it, they never did any performance test before moving all the data to the SAN.

    Now we have endless discussions about what is acceptable performance vs. the "other advantages" of having data on the SAN.

    But while performance is not great, at the moment I have bigger worries with Iscsi errors while doing backups to a share on the SAN.

    Lots of EventID 20 "Connection to the target was lost. The initiator will attempt to retry the connection." messages.

    But looking at the bright side of things, these might finally convince management, that they made the wrong choice.

    I would actually have been surprised if the company had done ANY perf testing prior to implementation!!

    Have you ensured that every piece of hardware in the system has the latests certified drivers/firmware? It is AMAZING how many problems go away when you get those things addressed. Also review the network utilization metrics - some NICs/switches don't do very well if you start to saturate them. NIC teaming can be problematic if you don't dot the i's and cross the t's correctly. Oh, and have you checked for TCP/IP Chimney being enabled - BAD stuff there

    I very rarely make this explicit suggestion (as opposed to a generic "hire a professional" recommendation) but please drop me a line if you feel as I do that a hired-gun would be helpful to give your environment and systems a performance review. I do this for a living and I doubt you have anything going on there that I haven't seen time and time again. 😎

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • Are you available for a brief discussion on Failover Clustering and Availability Groups? This is a very gray area for me.

    M

    robiski@hotmail.com

Viewing 8 posts - 1 through 7 (of 7 total)

You must be logged in to reply to this topic. Login to reply