Here's the thing: the SQL Server 2012 SP1 CU3 fix is probably a good thing - I'm waiting on more details on it. But, the behavior it addresses goes all the way back to SQL Server 2005. No fixes yet for versions SQL Server 2005, 2008, 2008 R2. So for others experiencing similar issues, trace flag 8015 which the kb article http://support.microsoft.com/kb/2819662 may be a solution. But, please consider using startup trace flag 8015 together with startup trace flag 8048 to avoid spinlock contention (and the possibility of a spinlock convoy).
If the MSSQL 2012 SP1 CU3 fix is targeted to the handling of NUMA node foreign and away buffers, there still could be a considerable benefit to TF 8015 + 8048 on that version or later. That's because some workloads just work better with a single large bpool and a single large scheduler group.
Worked for quite a while to reproduce in a test lab the large numbers of persistent foreign pages I was seeing in the field on two, four, and eight NUMA node servers. I wasn't able to*. But, on a four NUMA node server with no foreign pages after achieving max server memory the test workload (thousands of batch queries submitted at concurrency of 120 queries) required ~25% more disk IO and ~10% more elapsed time to complete than the comparative test runs with trace flag 8048 and trace flag 8015 in place.
So, dealing with foreign/away buffers is an important piece of the puzzle, but its far from the only piece of the puzzle. For many systems SQL Server 2008 R2 and before, TF 8048 + 8015 will be the only available resolution until a future CU/SP. But, even then, depending on the scope of the fix therein, TF 8048 + 8015 may offer more performance benefits.
The tradeoff is that systems with TF 8015 in place will have only one lazywriter and one IO completion port, instead of one lazy writer and one IO completion port per NUMA node. But... that's how most SQL Server on VMWare or Hyper-V will also operate. And many systems will not be adversely effected by a single lazy writer or IO completion port (keep watch on checkpoint performance to verify).
Sooner or later I'll post at sql-sasquatch.blogspot.com more details of the test system and perfmon/spin/wait stats from the test workloads. Also diagrams of the mechanics as I understand them, and some deidentified stats from 4 and 8 NUMA node servers in the field. It takes me forever to prepare that stuff, sorry.
*I now believe that the high persistent foreign pages was due to interaction of complex ETL and/or SSIS packages with "right-sized" systems (relative to SQL Server max server memory and other expected SQL Server memory needs). Without much free memory on these systems, SQL Server couldn't continue to perform memory sorting to eliminate the foreign pages. On the other hand, the test lab systems always had super-sized memory, to allow for multiple iterations of tests at various memory configurations. With more free memory available at various max server memory levels, SQL Server was able to sort memory as it wanted and keep the number of foreign memory pages to a minimum.