You can use the SQL Server dynamic management function sys.dm_io_virtual_file_stats to see IO stall time. Around 7 ms per IO is average, anything above 20 ms is considered poor, and anything above 50 ms is considered critical. Using NetApp we are getting average IO stall times of over 2 seconds (2,245 ms :angry:) for specific data files, including tempdb. And that is averaged over millions of IOs with an average read size of about 150 KB.
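If it helps, here is roughly the query I use. It assumes you have VIEW SERVER STATE permission; the numbers are cumulative since the last SQL Server restart, so treat them as long-run averages rather than a point-in-time sample.

-- Average stall per IO and average read size, per database file (cumulative since restart).
SELECT DB_NAME(vfs.database_id) AS database_name,
       mf.physical_name,
       vfs.num_of_reads,
       vfs.io_stall_read_ms  / NULLIF(vfs.num_of_reads, 0)  AS avg_read_stall_ms,
       vfs.io_stall_write_ms / NULLIF(vfs.num_of_writes, 0) AS avg_write_stall_ms,
       vfs.num_of_bytes_read / NULLIF(vfs.num_of_reads, 0)  AS avg_bytes_per_read
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN sys.master_files AS mf
  ON mf.database_id = vfs.database_id
 AND mf.file_id     = vfs.file_id
ORDER BY avg_read_stall_ms DESC;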
I’m wondering whether the >1 sec average stall time we are seeing for physical disk I/O has to do with block marshaling. We are executing large ETL jobs and BI queries, which pump out a lot of 1 MB sequential reads. NetApp breaks a 1 MB block down into 256 separate 4 KB blocks. When a 4 KB block is updated, NetApp keeps both the before and after versions of the block, which leaves the blocks scattered all over kingdom come. When a 1 MB physical read occurs, NetApp has to look up the location of every 4 KB block, go out and read each one, and reassemble them into a 1 MB block before it can respond. In this case the chain is only as strong as its weakest (slowest) link: the read completes only when the last of those 256 scattered blocks comes back. The NetApp processor doing this may be swamped with these operations (saturation hockey-stick :crazy:), especially when many asynchronous 1 MB PIO requests are sent at once, which is exactly what SQL Server does. (Asynchronous = multiple PIOs are issued in rapid succession without waiting for each one to respond.) That’s my working hypothesis.
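One thing you can at least observe from the SQL Server side is how many of those asynchronous reads are outstanding at any moment and how long they have been waiting. This is just a sketch using sys.dm_io_pending_io_requests (it also needs VIEW SERVER STATE; the join back to the file stats DMF via the file handle is the usual trick, not anything NetApp-specific):

-- Snapshot of async I/O requests SQL Server has issued but not yet seen complete.
SELECT pio.io_type,
       pio.io_pending,                -- nonzero = still pending at the OS/SAN level
       pio.io_pending_ms_ticks,       -- how long this request has been outstanding
       mf.physical_name
FROM sys.dm_io_pending_io_requests AS pio
LEFT JOIN sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
       ON vfs.file_handle = pio.io_handle
LEFT JOIN sys.master_files AS mf
       ON mf.database_id = vfs.database_id
      AND mf.file_id     = vfs.file_id
ORDER BY pio.io_pending_ms_ticks DESC;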
In any case, NetApp is the winner if you are checking off features in a comparison matrix, with the exception of performance. It is OK on OLTP performance, but it is absolutely junk on batch & OLAP performance and should not be considered for those workloads.
If a NetApp employee wants to pick a fight with me over the facts, they should come down to my company and tune our SAN. If they can fix our performance problems, I’d be happy to post a retraction along with the solution. So far we have lived with this performance problem for years.
Here is another possible explanation, but it’s over my head.
We had our meeting with NetApp yesterday and went over the Professional Services findings. Some of the things they listed are tasks we’ve been addressing since the slovol issues began: aligning mis-aligned VMs, adding disks to aggregates (or, in our case, moving VMs to larger aggregates with faster disks). But one thing they confirmed, which was brought to my attention via an off-toasters email discussion hours before (I give that individual much thanks!!), was BURT 393877, “inefficient pre-fetching of metadata blocks delays WAFL Consistency Point.”
Data ONTAP's WAFL filesystem periodically commits user-modified data to the back-end storage media (disk or otherwise) to achieve a Consistency Point (CP). Although a Consistency Point typically takes only a few seconds, a constraint has been designed into the software that all operations needed for a single Consistency Point must be completed within 10 minutes. If a CP has not been completed before a 600-second timer expires, a "WAFL hung" panic is declared, and a core dump is produced to permit diagnosis of the excessive CP delay.

During the processing for a CP, some disk blocks are newly brought into use, as fresh data is stored in the active filesystem, whereas other blocks may be released from use. (A block which is no longer needed in the active filesystem may, however, remain in use in one or more snapshots until all the snapshots which use it are deleted.) Any changes in block usage must be reflected in the accounting information kept in the volume metadata. To make changes in the block accounting, Data ONTAP must read metadata blocks from disk, bringing them into the storage controller's physical memory. Because the freeing of blocks often occurs in a random ordering, the workload of updating the metadata for block frees can be much higher than for updating the metadata to reflect blocks just being brought into use.
For greatest processing efficiency, Data ONTAP makes an effort to pre-fetch blocks of metadata which are likely to be needed for a given Consistency Point. However, in some releases of Data ONTAP, the pre-fetching of metadata is done in an inefficient way, and therefore the processing for the Consistency Point may run slower than it should. This effect can be most pronounced for certain workloads (especially overwrite workloads) in which many blocks may be freed in unpredictable sequences. And the problem may be compounded if other tasks being performed by Data ONTAP attempt intensive use of the storage controller's memory. The competition for memory may cause metadata blocks to be evicted before the Consistency Point is finished with them, leading to buffer thrashing and a heavy disk-read load.
In aggravated cases, the Consistency Point may be slowed so much that it cannot be completed in 10 minutes, thus triggering a "WAFL hung" event.
The BURT doesn’t list any specific workarounds, as, apparently, there are many depending on your environment and what’s causing it. For us, they wanted to take each FAS3160 controller down to the boot prompt and make an environment change. They didn’t say what this change was, because it would have to be undone once a version of Data ONTAP is released that fixes the issue.
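Back on the SQL Server side, my working assumption (mine, not anything the BURT states) is that a stalled Consistency Point on the filer shows up to the database as plain disk-latency waits. A quick way to sanity-check that is to look at the cumulative I/O wait types; this is just a sketch, and the short list of wait types is my usual pick, not an exhaustive one:

-- Cumulative I/O-related waits since the last SQL Server restart.
SELECT wait_type,
       waiting_tasks_count,
       wait_time_ms,
       wait_time_ms / NULLIF(waiting_tasks_count, 0) AS avg_wait_ms
FROM sys.dm_os_wait_stats
WHERE wait_type IN ('PAGEIOLATCH_SH', 'PAGEIOLATCH_EX', 'WRITELOG', 'IO_COMPLETION')
ORDER BY wait_time_ms DESC;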
On a different topic (Hadoop & shared-nothing DBMSs, not SQL Server):
“Let me say this unequivocally: You absolutely should not use a SAN or NAS with Hadoop.” link
“Using RAID on Hadoop slave machines is not recommended, because Hadoop orchestrates data redundancy across all the slave nodes.”
“Instead of relying on a SAN for massive storage and reliability then moving it to a collection of blades for processing, Hadoop handles large data volumes and reliability in the software tier. Since each machine in a Hadoop cluster stores as well as processes data, those machines need to be configured to satisfy both data storage and processing requirements.”