The servers have 512GB of memory with sql allowed to use 348GB. The numerous data files are all on a Netapp SAN device. The main user database is 3.3TB with an aspstate database also receiving a lot of traffic. We have used a large number of LUNs to spread out the data files, with the transaction log for the main database on its own lun and the four tempdb data files and log file all on their own Luns. Going to add four more same-sized tempdb data files this sunday.
Most of the problem queries come from the application ( ORM generated ) and having reporting and OLTP traffic all on one database is a problem --- to be addressed along with a plan to archive/partition millions of older records. Sql agent jobs and batch jobs run by the application scheduler are spread out as much as possible.
Since we have NEVER seen database corruption from these "15-second" messages, it is mainly something upper management gets in a tizzy about. 15 seconds? Nothing is taking 15 seconds.
Sql server 2012 Enterprise 64-bit on Servers 2008R2 64-bit ( prod databases still in 2005 compatibility mode )
More than a few times I have had to sit down with CIOs to explain why their SAN was the CAUSE of poor SQL Server performance, not helping to prevent poor performance. "own LUN", "large number of LUNs", etc often don't mean what DBAs think they do (not saying that is the case here - just what I see over and over again). SOOOO many ways a SAN can be sub-optimally configured for SQL Server IO performance!! And even if you do have a metric butt ton of 15K drives in big RAID 10 sets with NO overlap/sharing at all there is still the front side adapter, switch fabric and HBA(s) that can be bottlenecks. If you are getting the 15 second messages then at least during those times you have severe IO throughput/response time issues. Time-interval file IO stall analysis can help pinpoint which file(s) are getting hammered.
I feel your pain on the ORM front. One magic bullet I have found on those sometimes is a setting that caused ALL values to be passed in as Unicode character strings (i.e. N'12345' instead of 12345 for example. Total killer due to CONVERT_IMPLICITs.
Note that putting too many files (or LUNs) on too few spindles can easily DECREASE IO performance due to increased head thrashing and seek times.
Good luck getting things improved!
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru at GMail