Modern CPUs are very fast compared to all things external, including memory (RAM).
It is understandable: CPU clock frequencies have reached the point where it takes several clock ticks for an electrical signal simply to travel from the CPU, across the bus to the RAM chips, and back.
It also complicates life on many levels: multi-level cache hierarchies are built to deliver data closer to the CPU, which in turn require complex synchronization logic in the chip. Programs must be written in a cache-friendly way to avoid wait cycles while data is fetched.
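To make the cache-friendliness point concrete, here is a minimal C sketch (the function names and array size are illustrative, not from any particular codebase) contrasting a traversal that uses each fetched cache line fully with one that wastes most of it:

```c
#include <stddef.h>

#define N 4096

/* Row-major traversal: consecutive accesses touch adjacent addresses,
   so each cache line fetched from memory is used in full. */
long sum_row_major(const int a[N][N])
{
    long sum = 0;
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}

/* Column-major traversal of the same data: consecutive accesses are
   N * sizeof(int) bytes apart, so nearly every access misses the cache
   and the CPU stalls waiting on memory. */
long sum_col_major(const int a[N][N])
{
    long sum = 0;
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            sum += a[i][j];
    return sum;
}
```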
Many of these problems could be avoided if a significant amount of RAM were located directly on the CPU chip. It doesn't have to be an exclusive arrangement: maybe put 1-4 GB on the chip, depending on its class, and allow additional memory to be installed separately.
I'm sure there are good reasons Intel, AMD and the like are not doing this. What are these reasons? Is it that there's no room to spare on the chip?
Answer
Intel's Haswell (or at least those products that incorporate the Iris Pro 5200 GPU) and IBM's POWER7 and POWER8 all include embedded DRAM, "eDRAM".
One important issue that kept eDRAM from being common until recently is that the DRAM fabrication process is not inherently compatible with logic processes, so extra steps must be included (which increase cost and decrease yield) whenever eDRAM is desired. So, there must be a compelling reason for wanting to incorporate it in order to offset this economic disadvantage. Alternatively, the DRAM can be placed on a separate die that is manufactured independently of, but then integrated onto the same package as, the CPU. This provides most of the benefits of locality without the difficulties of manufacturing the two in a truly integrated way.
Another problem is that DRAM, unlike SRAM, does not store its contents indefinitely while power is applied, and reading it destroys the stored data, which must be written back afterwards. Hence, it has to be refreshed periodically and rewritten after every read. And, because a DRAM cell is based on a capacitor, charging or discharging it sufficiently that leakage will not corrupt its value before the next refresh takes some finite amount of time. This charging time is not required with SRAM, which is just a latch; consequently SRAM can be clocked at the same rate as the CPU, whereas DRAM is limited to about 1 GHz if reasonable power consumption is to be maintained. This gives DRAM a higher inherent latency than SRAM, which makes it worthwhile only for the very largest caches, where the reduced miss rate pays off. (Haswell and POWER8 are roughly contemporaneous, and both incorporate up to 128 MB of eDRAM used as an L4 cache.)
Also, as far as latency is concerned, a large part of the difficulty is the physical distance signals must travel. Light can travel only 10 cm in the clock period of a 3 GHz CPU. Of course, signals do not travel in straight lines across the die, nor do they propagate at anything close to the speed of light, owing to the need for buffering and fan-out, which incur propagation delays. So, the maximum distance a memory can be from the CPU while maintaining one clock cycle of latency is a few centimetres at most, limiting the amount of memory that can be accommodated in the available area. Intel's Nehalem processor actually reduced the capacity of the L2 cache versus Penryn partly to improve its latency, which led to higher performance.* If we do not care so much about latency, then there is no reason to put the memory on-package rather than further away, where it is more convenient.
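As a back-of-the-envelope check of that 10 cm figure (pure arithmetic, not a model of real on-die signal propagation):

```c
#include <stdio.h>

int main(void)
{
    const double c = 3.0e8;          /* speed of light, m/s            */
    const double f = 3.0e9;          /* clock frequency, Hz            */
    const double period = 1.0 / f;   /* one clock cycle, ~0.33 ns      */

    /* Distance light covers in one cycle; a round trip halves the
       usable reach, and real signals are slower still. */
    printf("one-way distance per cycle: %.1f cm\n", c * period * 100.0);
    return 0;
}
```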
It should also be noted that the cache hit rate is very high for most workloads: well above 90% in almost all practical cases, and not uncommonly above 99%. So, the benefit of including larger memories on-die is inherently limited to reducing the impact of this few percent of misses. Processors intended for the enterprise server market (such as POWER) typically have enormous caches and can profitably include eDRAM because it helps accommodate the large working sets of many enterprise workloads. Haswell includes it to support the GPU, since textures are large and cannot be accommodated in the caches. These are the use cases for eDRAM today, not typical desktop or HPC workloads, which are well served by conventional cache hierarchies.
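The effect of the hit rate can be put into numbers with the standard average-memory-access-time relation, AMAT = hit time + miss rate × miss penalty; the cycle counts below are illustrative, not measurements of any particular CPU:

```c
#include <stdio.h>

/* AMAT = hit time + miss rate * miss penalty */
static double amat(double hit_cycles, double miss_rate, double miss_penalty_cycles)
{
    return hit_cycles + miss_rate * miss_penalty_cycles;
}

int main(void)
{
    /* Illustrative numbers: a 4-cycle cache hit, a 200-cycle DRAM miss. */
    printf("99%% hit rate: %.1f cycles on average\n", amat(4.0, 0.01, 200.0)); /*  6.0 */
    printf("90%% hit rate: %.1f cycles on average\n", amat(4.0, 0.10, 200.0)); /* 24.0 */
    return 0;
}
```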
To address some issues raised in comments:
These eDRAM caches cannot be used in place of main memory because they are designed as L4 victim caches. This means that they are volatile and effectively content-addressable, so that data stored in them is not treated as residing in any specific location and may be discarded at any time. These properties are difficult to reconcile with the requirement that RAM be directly addressed and persistent, and changing them would make the caches useless for their intended purpose. It is of course possible to embed memories of a more conventional design, as is done in microcontrollers, but this is not justifiable for systems with large memories, since low latency is not as beneficial in main memory as it is in a cache; enlarging or adding a cache is the more worthwhile proposition.
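As a rough illustration of that distinction (every structure and name here is invented for the sketch): a cache entry is found by matching a tag and may simply be absent, whereas a RAM location always exists at its address:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stddef.h>

/* Toy model of a set of cache lines; real hardware is far more involved. */
struct cache_line {
    bool     valid;
    uint64_t tag;
    uint8_t  data[64];
};

/* Content-addressable lookup: compare tags and return NULL on a miss,
   in which case the request must go to the next level of the hierarchy.
   A plain RAM access, by contrast, can never "miss". */
static uint8_t *cache_lookup(struct cache_line *set, int ways, uint64_t tag)
{
    for (int w = 0; w < ways; w++)
        if (set[w].valid && set[w].tag == tag)
            return set[w].data;
    return NULL;
}
```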
As to the possibility of very large caches with capacity on the order of gigabytes: a cache only needs to be about the size of the application's working set. HPC applications may deal with terabyte datasets, but they have good temporal and spatial locality, so their working sets typically are not very large. Applications with large working sets include databases and ERP software, but there is only a limited market for processors optimized for this sort of workload. Unless the software truly needs it, adding more cache provides rapidly diminishing returns. Recently, processors have gained prefetch instructions, which let caches be used more efficiently: these instructions avoid misses caused by the unpredictability of memory access patterns, rather than by the absolute size of the working set, which in most cases is still relatively small.
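For reference, software prefetching of the kind alluded to looks roughly like this in C using the GCC/Clang __builtin_prefetch intrinsic; the prefetch distance of 16 elements is an arbitrary illustrative choice:

```c
#include <stddef.h>

/* Sum an array while asking the hardware to start fetching data a few
   cache lines ahead of where the loop is currently reading. */
long sum_with_prefetch(const long *a, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + 16 < n)
            __builtin_prefetch(&a[i + 16], 0, 1); /* read access, low temporal locality */
        sum += a[i];
    }
    return sum;
}
```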
*The improvement in latency was not due only to the smaller physical size of the cache, but also because the associativity was reduced. There were significant changes to the entire cache hierarchy in Nehalem for several different reasons, not all of which were focused on improving performance. So, while this suffices as an example, it is not a complete account.