It might seem that cheap huge memory, multi-core client processors, and All-Flash/SSD storage have met all our hardware performance objectives. But along the route from twenty years ago to today, we took a dead-end path on memory. Back then, large memory was desperately needed to reduce IO to an achievable level. Today, flash-NVMe storage systems can drive 1M IOPS. While there is strong value in having large memory to reduce IOPS to 100K, the incremental value of huge memory in driving IOPS further down to 10K is minor. Anything below 30K IOPS is really just noise.
So is there anything wrong with the massive memory typical of recent generation (client and server) systems? Probably not, nor does reducing memory have any positive effect aside from cost. The main deficiency in recent generation systems is that most of the compute capability of modern processors is wasted in dead cycles waiting on round-trip memory accesses. Many years ago, DRAM manufacturers told system vendors that the multi-bank approach in SDRAM and DDR could scale memory bandwidth going forward at a moderate cost burden. The choice was between: 1) having the minimum number of banks necessary to support the target bandwidth, or 2) having many more banks to allow for low latency as well as bandwidth capability. System vendors, thinking largely in terms of the needs of database transaction processing, were unanimously(?) of the opinion that only the lowest-cost option meeting the bandwidth target was required.
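To make the dead-cycle point concrete, below is a minimal pointer-chase sketch (my own illustration with made-up buffer and iteration sizes, not a vendor benchmark). Each load depends on the previous one, so the core absorbs the full round-trip memory latency on every access; on a typical two-socket server this kind of test reports something on the order of 80-100ns per load, i.e., 250-300 dead cycles at 3GHz.

```c
/* Pointer-chase sketch: serialized dependent loads over a buffer far
   larger than L3, in a random order the prefetcher cannot predict.
   Sizes are illustrative. Compile on Linux with: gcc -O2 chase.c */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (32 * 1024 * 1024)        /* 32M entries x 8 bytes = 256MB   */
#define STEPS (64 * 1024 * 1024)    /* number of dependent loads       */

int main(void)
{
    size_t *next = malloc(N * sizeof *next);
    if (!next) return 1;

    /* Build a random single-cycle permutation (Sattolo's algorithm)
       so the chase visits every slot in unpredictable order.
       (Assumes RAND_MAX >= N, as on glibc; modulo bias ignored.) */
    for (size_t i = 0; i < N; i++)
        next[i] = i;
    srand(1);
    for (size_t i = N - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;
        size_t t = next[i]; next[i] = next[j]; next[j] = t;
    }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    size_t p = 0;
    for (size_t s = 0; s < STEPS; s++)
        p = next[p];                /* each load waits for the previous */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("avg latency: %.1f ns per load (p=%zu)\n", ns / STEPS, p);
    free(next);
    return 0;
}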
Today, we have essentially a mono-culture in the choice of DRAM for main memory, in that all (contemporary) products of a given DDRx generation employ the same number of banks; currently, there are 16 banks in DDR4. Over twenty years, memory latency at the DRAM interface has been essentially unchanged at around 45ns for the random-access full cycle time (tRC). There are distinct memory products for graphics, network switches, and low power, each targeted to the objectives of its respective environment.
We need to admit that the direction of more memory capacity at the lowest cost, followed for the last twenty years, has been pushed far beyond true requirements to ridiculous levels. Consider a system with 2 x 28-core processors and the standard 24 x 64GB DIMMs for 1.5TB of memory. A year ago, the 64GB DIMM was $1,000-1,300 each. Today, the 64GB DIMM is under $400 each. In reality, even very heavy workloads could probably run on 256GB of memory with proper tuning and flash-NVMe storage. Regardless of whether the 64GB DDR4 ECC module is $1,300 or $330, the cost difference between 1.5TB and 256GB of system memory is inconsequential after factoring in the database engine's per-core licensing cost.
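A back-of-envelope comparison, using an assumed per-core license price in the vicinity of $7,000 (enterprise database engines are in this range; the exact figure is not the point): 2 x 28 = 56 cores comes to roughly $390,000 in licensing. The memory delta between 24 x 64GB at $400 per DIMM ($9,600) and 4 x 64GB ($1,600) is about $8,000, on the order of 2% of the license bill.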
If our budget for memory were on the order of $10,000 and we admit that we probably only need about 256GB, i.e., allowing a cost budget of about 4X greater per GB than conventional DDR4 memory, what could be possible? Some years ago, there was an RL-DRAM product/technology that had 16 banks when DDR3 had 8, at perhaps a 2X die-area penalty per unit capacity. A modern version of RL-DRAM would probably have 64 banks. If I understand correctly, the full cycle latency of RL-DRAM was under 10ns? The Intel eDRAM used for graphics memory actually has 128 banks with a cycle time of 3ns? Both are far better than conventional DRAM cycle times of 40ns+.
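As a rough illustration of why bank count matters for latency, here is a toy conflict model (my own sketch with illustrative parameters; it ignores channel and bus contention and assumes each request waits only on its own bank). With random accesses arriving every 5ns, 16 banks at tRC 45ns accumulate tens of nanoseconds of queueing delay per access, while 64 banks at tRC 10ns almost never conflict.

```c
/* Toy simulation of DRAM bank conflicts: independent random accesses
   arrive at a fixed rate; an access to a busy bank must wait until
   that bank's tRC window expires. Parameters are illustrative only. */
#include <stdio.h>
#include <stdlib.h>

static double run(int banks, double trc_ns, double arrival_ns, long n)
{
    double *free_at = calloc(banks, sizeof *free_at);
    double now = 0.0, total_wait = 0.0;
    for (long i = 0; i < n; i++) {
        now += arrival_ns;                  /* next request arrives    */
        int b = rand() % banks;             /* uniform random bank     */
        double start = free_at[b] > now ? free_at[b] : now;
        total_wait += start - now;          /* queueing from conflict  */
        free_at[b] = start + trc_ns;        /* bank busy for full tRC  */
    }
    free(free_at);
    return total_wait / n;
}

int main(void)
{
    long n = 10 * 1000 * 1000;
    srand(1);
    printf("16 banks, tRC 45ns: avg conflict wait %.1f ns\n",
           run(16, 45.0, 5.0, n));
    srand(1);
    printf("64 banks, tRC 10ns: avg conflict wait %.2f ns\n",
           run(64, 10.0, 5.0, n));
    return 0;
}
```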
Would the benefit of employing very low latency memory outweigh the trade-off of lower capacity and higher cost per GB? In the typical multi-processor (socket) system, probably not, because overall memory latency as seen by an individual core has large components beyond the latency at the DRAM chip interface, even with integrated memory controllers. However, a single-die processor in a single-socket system could benefit substantially from low latency memory, even at much smaller capacity. The difference in cost between conventional large-capacity, high-latency memory and smaller-capacity, low-latency memory is probably a wash, but the value of meeting a specific performance objective with many fewer cores yields a big gain in reduced per-core software licensing. In addition, anomalous performance quirks are far less likely on a system with uniform memory than on one with non-uniform memory.
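To put hypothetical numbers on the licensing argument: for a workload dominated by serialized (pointer-chasing) memory accesses, per-core throughput scales roughly as 1/latency. If low latency memory cut the overall load-to-use latency from around 90ns to around 45ns, each core would do about twice the work, so a 16-core single-socket system could match a 32-core system on that workload; at per-core license prices in the thousands of dollars, the 16 cores saved dwarf any memory price premium.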
Note: a good chunk of overall memory latency occurs in the L3 cache. On the Intel Xeon SP (Skylake and Cascade Lake), the L3 adds about 18ns of latency. Perhaps the only meaningful purpose of the L3 in Intel processors is to facilitate cache coherency, as it probably contributes little net cache-hit-rate benefit. Intel did mention that they are going to rethink the L3 strategy for a future generation processor?