Wanted to check back in... we think we got it. It'll take a few days to pull together everything we did and what Microsoft recommended... but the tweak that seems to have nailed it for us was this: we had made the (in hindsight not so smart) mistake of leaving SQL Server Priority Boost checked on the nodes in our cluster.
If any of you following this have done the same... uncheck it as soon as reasonably possible. Again, in hindsight, if you google around long enough you'll hit the articles that say you shouldn't have this checked in a cluster... but they only say it could cause networking problems, with no details or specific messages...
In our case, we got to the point where we stripped a node down to its bare bones... we uninstalled EVERYTHING on the node that wasn't critical and (with boost on) ran a test where I would run a procedure that reliably causes the 19019 events, with Profiler running. SQL Profiler showed that the cluster service would periodically get dropped as a connection from SQL Server. THIS DROP is what was generating the 19019 errors! On a hunch, one of our DBAs thought that *maybe* this priority boost thing might be choking other processes on the node... the cluster service being one of them... to give way to the higher-priority SQL Server.
Sure enough, we switched off this setting... not a single doggone 19019 since.
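For anyone who prefers a query window to the server properties checkbox, the same setting can be inspected and turned off with sp_configure. This is only a minimal sketch, assuming you have sysadmin rights on the instance; it is not a record of the exact steps we took. As far as I know, priority boost changes only take effect after the SQL Server service is restarted, so on a cluster plan for that.

```sql
-- 'priority boost' is an advanced option, so advanced options must be visible first.
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;

-- Show the current setting: config_value is what is configured,
-- run_value is what the running instance is actually using.
EXEC sp_configure 'priority boost';

-- Turn it off (0 = off, 1 = on).
EXEC sp_configure 'priority boost', 0;
RECONFIGURE;

-- Check again; run_value should drop to 0 once the service has been restarted.
EXEC sp_configure 'priority boost';
```

Run the check on every node's instance, not just the one throwing the errors.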
Like I said, I will try to get back to you folks within a few days, maybe a week, once I've compiled everything we tried and all of Microsoft's recommendations based on our particular environment. In short, in no particular order, the things Microsoft cited in our environment:
1. Spikes in disk activity on our SAN (got better after applying latest drivers/hotfixes)
2. They claimed that the NICs in our nodes were teamed, which is a cluster no-no (our cards were not teamed, period.)
3. They tried to reference an obscure match with Quest's SQL LiteSpeed causing the problem when using native command substitution... not buying this one... we've had LiteSpeed for years and have always used the xp_ procs... not command substitution
4. They suggested that we look at / play with our MAXDOP options... currently set to 0 on each of our nodes. (This was suggested after we mentioned to them that we had stumbled upon the priority boost thing.)
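
On item 4, if you want to see where your own nodes stand, the MAXDOP value can be read with sp_configure as well. Again just a sketch: a value of 0 means SQL Server may use all available processors for a parallel plan. The changed value in the commented-out lines is purely a hypothetical placeholder, not something Microsoft recommended to us.

```sql
-- 'max degree of parallelism' is also an advanced option.
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;

-- Show the current value (0 = use all available processors).
EXEC sp_configure 'max degree of parallelism';

-- To experiment, uncomment and pick a value for your hardware.
-- The 4 below is a made-up example value, not a recommendation.
-- EXEC sp_configure 'max degree of parallelism', 4;
-- RECONFIGURE;
```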
Take it easy -- Mike