"I still think 2012 is underperforming compared to 2008R2, particularly given it's running on almost double the hardware our old 2008R2 box was."
This is a big clue. Have you tested the workflows with SQL Server 2008 R2 on the new hardware, or only with SQL Server 2008 R2 on the old hardware and SQL Server 2012 on the new? SQL Server 2008 R2 might struggle just as hard as SQL Server 2012 on "twice the hardware", depending on what that means.
This is what I learned based on internal testing and production system observation across numerous 2, 4, and 8 NUMA node systems. The workloads within scope of our tests:
1) high concurrency ETL (typically twice as many client connections as logical CPUs)
2) higher concurrency batch report workload (driven by reports scheduled on report app servers, concurrency typically at least 4 times the logical CPU count).
Here's what we observed:
1) assuming same core count, increase core count per NUMA node (or drive up query memory allocation concurrency per NUMA node in any manner) = poke the bear of per-NUMA-node query memory allocation spinlock contention. Indicated by CMEMTHREAD waits, and high CPU accompanied by low page lookups/logical reads per % CPU utilized. Can lead to high CPU utilization per core, aligned along NUMA node boundaries (if not across all cores). Resolved by trace flag 8048.
2) assuming same core count, increase NUMA node count = decrease the database cache efficiency for workloads whose queries/connections don't benefit from NUMA node affinity. Basically, imbalanced cache churn rates cause more disk IO than necessary*. Seems to vary with the square of NUMA node count according to my tests. This case is especially relevant when transitioning from Intel servers with one NUMA node per socket to AMD servers with two NUMA nodes per socket. Resolved by trace flag 8015. (Don't use it without trace flag 8048 or you'll poke the #1 bear even harder.)
3) for a given amount of RAM, increase NUMA node count = increased exposure to persistent high foreign memory page count on multiple SQL Server buffer nodes. I haven't seen confirmation of this anywhere, but especially after the SQL Server 2012 SP1 CU3 hotfix for NUMA behavior, I'm convinced that a persistent high foreign page count triggers a reaction within SQLOS, which tries to sort away the foreign pages. If the count can't be sorted away (perhaps because the amount of free memory on each NUMA node doesn't allow it, even if the amount of physical memory on each physical NUMA node is balanced), a significant amount of overhead is added to some SQL Server memory operations. That overhead seems to be hard to pinpoint with typical wait and spinlock measures. (It may be related to a spinlock category that isn't documented anywhere available to me yet.) This is also resolved with trace flag 8015.
4) for a given core count, increase NUMA node count = increased likelihood of imbalanced CPU utilization, aligned with NUMA node boundaries, due to the tendency of parallel workers for a given query to be assigned within a NUMA node. Closely related to number 2. Increases the CXPACKET and SOS_SUSPEND_QUEUE waits for a given workload, as well as signal wait time across all wait types. This is also resolved by trace flag 8015.
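A quick way to check exposure to symptoms #2 and #3 is to look at per-node memory distribution before reaching for the trace flags. This is a sketch, not my test harness; column names are the SQL Server 2012 versions of sys.dm_os_memory_nodes, and earlier builds expose slightly different columns.

```sql
-- Sketch: per-NUMA-node memory distribution (SQL Server 2012 column names).
-- A persistently high foreign_committed_kb on several nodes is symptom #3.
SELECT  memory_node_id,
        virtual_address_space_reserved_kb,
        virtual_address_space_committed_kb,
        foreign_committed_kb
FROM    sys.dm_os_memory_nodes
WHERE   memory_node_id <> 64;   -- node 64 is the DAC's memory node; exclude it

-- If you do test the trace flags, set them as startup parameters
-- (-T8048 -T8015) rather than DBCC TRACEON: 8015 must be in effect
-- before SQLOS builds its memory nodes at startup.
```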
So, during your targeted workloads, you may want to monitor: per-CPU busy; per buffer node PLE, stolen pages, database cache pages, and free pages; and instance-wide stolen pages, reserved memory, and granted workspace memory.
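Most of those numbers can be pulled from sys.dm_os_performance_counters instead of perfmon if that's easier to log. A sketch, using the SQL Server 2012 counter layout (in 2012 the per-node stolen/free/foreign memory counters live under the Memory Node object; in 2008 R2 the equivalents were per-node Buffer Node counters, so adjust the names for your build):

```sql
-- Sketch: the per-node and instance-wide counters mentioned above.
-- For the Buffer Node / Memory Node objects, instance_name is the
-- node id ('000', '001', ...).
SELECT object_name, counter_name, instance_name, cntr_value
FROM   sys.dm_os_performance_counters
WHERE  (object_name LIKE N'%Buffer Node%'
        AND counter_name IN (N'Page life expectancy', N'Database pages'))
   OR  (object_name LIKE N'%Memory Node%'
        AND counter_name IN (N'Stolen Node Memory (KB)',
                             N'Free Node Memory (KB)',
                             N'Foreign Node Memory (KB)'))
   OR  (object_name LIKE N'%Memory Manager%'
        AND counter_name IN (N'Reserved Server Memory (KB)',
                             N'Granted Workspace Memory (KB)',
                             N'Stolen Server Memory (KB)'))
ORDER BY object_name, instance_name, counter_name;
```

Per-CPU busy still has to come from perfmon (or the ring buffers); it isn't in that DMV.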
Use a stored procedure to capture and diff waits and spinlocks in small increments. At live production sites I used 5-minute intervals; for in-house testing we shortened the intervals.
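The capture-and-diff idea looks roughly like this (a sketch, not my production procedure — dbo.wait_snap is a name I made up here; the same pattern works for sys.dm_os_spinlock_stats):

```sql
-- Sketch: snapshot waits on a schedule, then diff consecutive captures.
IF OBJECT_ID(N'dbo.wait_snap') IS NULL
    CREATE TABLE dbo.wait_snap
        (capture_time   datetime2     NOT NULL DEFAULT SYSDATETIME(),
         wait_type      nvarchar(60)  NOT NULL,
         waiting_tasks  bigint        NOT NULL,
         wait_time_ms   bigint        NOT NULL,
         signal_wait_ms bigint        NOT NULL);

-- Run this every interval (e.g. from a SQL Agent job).
INSERT dbo.wait_snap (wait_type, waiting_tasks, wait_time_ms, signal_wait_ms)
SELECT wait_type, waiting_tasks_count, wait_time_ms, signal_wait_time_ms
FROM   sys.dm_os_wait_stats;

-- Diff the two most recent captures for a wait type of interest.
;WITH s AS
 (SELECT *, ROW_NUMBER() OVER (PARTITION BY wait_type
                               ORDER BY capture_time DESC) AS rn
  FROM dbo.wait_snap)
SELECT cur.wait_type,
       cur.waiting_tasks  - prev.waiting_tasks  AS delta_tasks,
       cur.wait_time_ms   - prev.wait_time_ms   AS delta_wait_ms,
       cur.signal_wait_ms - prev.signal_wait_ms AS delta_signal_ms
FROM   s AS cur
JOIN   s AS prev
  ON   prev.wait_type = cur.wait_type AND prev.rn = 2
WHERE  cur.rn = 1
  AND  cur.wait_type = N'CMEMTHREAD';
```

Remember the cumulative counters reset on instance restart (or DBCC SQLPERF clear), so guard the diff against negative deltas.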
Rohit Nayak's post on debugging CMEMTHREAD waits can really help, too - especially if you are skilled with extended events (I'm no good at them yet).
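For the extended events route, the general shape is a session on the wait_info event filtered to CMEMTHREAD, bucketed by call stack. A sketch in SQL Server 2012 syntax — note the wait_type predicate value is a map key that varies by version/build, so look yours up rather than trusting the number below:

```sql
-- Sketch: bucket CMEMTHREAD waits by call stack with a histogram target.
-- Find the CMEMTHREAD map key for your build first:
--   SELECT map_key FROM sys.dm_xe_map_values
--   WHERE name = N'wait_types' AND map_value = N'CMEMTHREAD';
CREATE EVENT SESSION [cmemthread_waits] ON SERVER
ADD EVENT sqlos.wait_info
    (ACTION (package0.callstack, sqlserver.session_id)
     WHERE wait_type = 186)          -- placeholder map key; see lookup above
ADD TARGET package0.histogram
    (SET source = N'package0.callstack', source_type = 1)  -- 1 = action
WITH (MAX_DISPATCH_LATENCY = 5 SECONDS);
```

Symbolizing the call stacks takes trace flag 3656 and the public symbols, which is where it stops being beginner-friendly.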
You may be in a situation where trace flag 8048 helps but 8015 provides no extra benefit. If 8015 helps, I'd warn against using it without 8048. Of course, monitoring CMEMTHREAD waits, spinlocks, per-node PLE and free memory, stolen memory, reserved memory, etc, might just show that neither of these will help at all.
Or, you may be in a situation where 8015 + 8048 could help, but going to a single lazy writer or a single IO completion port makes it a no-go.
I posted this at length here not to bore you or to try to impress you, but because, as far as I know, I'm the only person talking much about this potential optimization for "SMP type" batch workloads that don't require NUMA node connection affinity. Many data warehouses fit this description. And explaining it here lets me practice for an upcoming blog post where I'll include perfmon stats and maybe a graph or two.
*I often see an assumption that the main drivers of widely swinging PLE (when the PLE calculation is legit) are the database block physical read rate and/or first database block write rate. But when stolen memory goes up (this includes the "active/used" portion of query memory grants; the not-yet-used query memory, which may still cache database blocks, sits in "reserved memory"), the database cache typically gets smaller - and then PLE goes down even if the physical read rate and first write rate stay the same. Same pace of blocks marching down the line, but it's a shorter line until they age out.