As I said, I am really hoping this all comes down to serializable isolation. The code has not changed; in fact, the code causing the problem is over 5 years old. I have been in many environments, and when I see a ton of complicated deadlocks, I normally look at isolation pretty quickly. In this case, although the process has been a bit of a nag, causing most of our very few deadlocks and most of the long blocking, it didn't point me to isolation. My thinking is that under compatibility level 130 the locking has changed for some reason, and it locks more widely than it did under 120. I am just guessing at this point, and I will try to record some stats to prove it. My biggest problem is reproducing the exact type of load in a test environment. As I said in my original post, within two weeks we will release code that gets rid of the serializable isolation, since we have no need for it. I will switch back to 130 at that point and see what happens. When I get some time, I plan to set up a scenario and profile the locks taken during a complicated serializable process under both 120 and 130 to see what the difference is. I will post back to this thread with the results and anything else I find.
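For anyone who wants to try the same comparison, here is a rough sketch of how the lock profiling could be done. It is only an outline, not what I have actually run: the database name YourDb and the session id 55 are placeholders, and it assumes you can run the serializable process in one session while querying sys.dm_tran_locks from another.

```sql
-- Check the current compatibility level (YourDb is a placeholder name).
SELECT name, compatibility_level
FROM sys.databases
WHERE name = N'YourDb';

-- Set the level for this comparison run, then repeat the whole test with 130.
ALTER DATABASE YourDb SET COMPATIBILITY_LEVEL = 120;

-- While the serializable process is running in another session,
-- summarize the locks that session holds or is waiting on.
DECLARE @spid int = 55;  -- hypothetical session id of the process under test

SELECT resource_type,
       request_mode,
       request_status,
       COUNT(*) AS lock_count
FROM sys.dm_tran_locks
WHERE request_session_id = @spid
GROUP BY resource_type, request_mode, request_status
ORDER BY lock_count DESC;
```

Comparing the two result sets side by side should show whether 130 takes more key-range locks (request modes such as RangeS-S or RangeX-X) or escalates to page or object locks where 120 did not.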
By the way, we ran 2016 in a test environment for a couple of months with no issues, but we don't have anywhere close to the production load there and cannot reproduce it. Other than this, I have heard of nothing worse than a few bad plans being produced after upgrading to 2016. Still, I am now very glad that the compatibility level setting exists and that we have the ability to change it.