Last night we were troubleshooting really poor performance on a core system. One of the things being investigated was a network switch. On every interface of the switch we were seeing flow control pause counters continually incrementing. This is usually an indication that the systems hooked to those interfaces are saying, "Whoa, can't handle the traffic, pause a bit, and let me catch up." But the fact that we were seeing this on every interface, even for brand new servers that were doing absolutely nothing, seemed to indicate something was going on at the switch level. We had vendor support on the phone and they were flagging the counters and saying, "Your servers can't keep up. This may be the cause of your performance problems."
Even though I'm a DBA now, the issue was big enough that I was brought back over yesterday afternoon to provide any assistance I could. I had already done a packet trace. I had seen all the MAC control datagrams saying pause. They looked a lot like this post here. So we immediately said, "Hey, we shouldn't be seeing these. Let's turn off flow control everywhere." We did - on the server and the switch. The flow control messages kept coming. What?!?
What made things even more interesting is that when we looked at the source MAC address of these messages (this is at layer 2), it wasn't the MAC address the OS was recognizing. It was the same vendor, and the address was only 1 or 2 off in the last octet. So it seemed tied to the same converged network adapter (CNA), but we couldn't explain the MAC address any more than we could explain the flow control messages.
Once I had concluded that the servers weren't under load, and so had no reason to be sending pause messages, one of the things I eventually keyed in on was the time quanta. The time quanta was 0. This means "send immediately." Basically, if I send a message saying pause, I set a time quanta for you to wait. If I send another message saying pause before the first one has expired, you overwrite the remaining time with the new value. So a time quanta of 0 basically means: if you were pausing, stop, and give me the data. And we were seeing a ton of these. I still couldn't explain the cause of the messages, but what it did tell me was that the counter incrementing on the switch was a red herring.
Our switch vendor did some research in their case histories and found that at least two of the generation 1 CNAs consistently do this in order to keep traffic flowing. Lo and behold, that's what we have: gen 1 CNAs. So the adapter was automatically sending out the messages independent of the OS, and that explained the different MAC address. So it was not the real problem, and we could safely ignore the flow control pause messages we were seeing. In other words, the track they were taking was a dead end.
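To make the time quanta point concrete, here is a minimal Python sketch of how one might pull the source MAC and the pause time out of a raw frame taken from a packet capture. The field layout (EtherType 0x8808, opcode 0x0001, a 16-bit pause time counted in units of 512 bit times) is standard Ethernet flow control. The function names, the example MAC addresses, and the 10 Gb/s link speed are assumptions made purely for illustration; they are not values from our trace.

```python
import struct

MAC_CONTROL_ETHERTYPE = 0x8808   # EtherType for MAC control frames
PAUSE_OPCODE = 0x0001            # MAC control opcode for PAUSE
BITS_PER_QUANTUM = 512           # one pause quantum = 512 bit times


def parse_pause_frame(frame: bytes):
    """Return (source_mac, quanta) for a PAUSE frame, or None otherwise."""
    if len(frame) < 18:
        return None
    src = frame[6:12]  # bytes 0-5 are the destination, 6-11 the source
    ethertype, opcode, quanta = struct.unpack("!HHH", frame[12:18])
    if ethertype != MAC_CONTROL_ETHERTYPE or opcode != PAUSE_OPCODE:
        return None
    return src.hex(":"), quanta


def pause_seconds(quanta: int, link_bps: int) -> float:
    """Convert pause quanta into seconds for a given link speed."""
    return quanta * BITS_PER_QUANTUM / link_bps


def describe(quanta: int, link_bps: int) -> str:
    """Explain what the receiver of the PAUSE frame is being asked to do."""
    if quanta == 0:
        return "resume: cancel any pause in effect and send immediately"
    micros = pause_seconds(quanta, link_bps) * 1e6
    return f"pause for up to {micros:.1f} microseconds"


def same_adapter_suspect(os_mac: str, frame_mac: str, max_gap: int = 2) -> bool:
    """True if the two MACs match except for a small gap in the last octet."""
    a = bytes.fromhex(os_mac.replace(":", ""))
    b = bytes.fromhex(frame_mac.replace(":", ""))
    return a[:5] == b[:5] and 0 < abs(a[5] - b[5]) <= max_gap


if __name__ == "__main__":
    # Hypothetical frame: made-up source MAC, pause time of 0, zero padding.
    dst = bytes.fromhex("0180c2000001")  # reserved MAC control multicast
    src = bytes.fromhex("020000aabb12")  # made-up address for the example
    frame = dst + src + struct.pack("!HHH", 0x8808, 0x0001, 0) + bytes(42)

    mac, quanta = parse_pause_frame(frame)
    print(mac, "->", describe(quanta, link_bps=10**10))
    print("near the OS MAC:", same_adapter_suspect("02:00:00:aa:bb:10", mac))
```

Under these assumptions, a frame with a time quanta of 0 is reported as a resume rather than a pause, which is the detail that marked the switch counters as a red herring. For scale, the largest possible value (65535 quanta) on a 10 Gb/s link works out to roughly 3.36 milliseconds.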
Here's why I say to do the packet trace especially when things are good. The networking guys indicated they'd been seeing these counters increment all along, ever since we put the switches in. However, no one had collaborated to do the packet traces on these servers to investigate what was going on. On previous support calls the switch vendor had indicated we could ignore these counter increments because they weren't related to whatever issue we were having. Why this time was different, I don't know. However, had we done a packet trace when performance was good, we'd have seen the zero time quanta and the unexplained MAC address then. And we'd have worked on an explanation then. That means we wouldn't have chased that trail in the wee hours of this morning, because we would have known that was expected traffic. It's always a good idea to look at your system carefully when it's working fine. That helps you see what is out of place when things aren't going so well. And this is especially true at the network layer. I can't tell you how many times a packet trace has led me to an issue that was actually at the server level. But that's a post for another time.