Wednesday, 16 July 2014

microprocessor - How can a CPU deliver more than one instruction per cycle?


Wikipedia's Instructions per second page says that an i7 3630QM delivers ~110,000 MIPS at a frequency of 3.2 GHz; that would be (110,000 MIPS / 3,200 MHz) / 4 cores ≈ 8.6 instructions per cycle per core?! How can a single core deliver more than one instruction per cycle?


To my understanding, a pipeline should only be able to deliver one result per clock.


These are my thoughts:



  • Internal frequency is actually higher than 3.2 GHz

  • Some parts of the CPU are asynchronous in a way a humble human like myself cannot understand

  • There are multiple concurrent pipelines per core

  • A pipeline can deliver more than one result per clock, an instruction can skip pipeline stages, and there are multiple prefetchers to keep up

  • I am missing something




Answer



First, as Keelan's comment and Turbo J's answer point out, the measurement was 113,093 Dhrystone MIPS, not native MIPS.


The Ivy Bridge microarchitecture of the i7 3630QM can only commit 4 fused µops per cycle, though it can begin execution of 6 µops per cycle. (The number of fused µops in a trace of code is roughly equal to the number of instructions; some complex instructions are decoded into multiple µops that are not fused and some pairs of instructions can be fused into a single µop, e.g., a compare immediately followed by a conditional jump.)
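
As a rough illustration of that compare-and-branch fusion, consider the loop below; the cmp/jl pair named in the comments is an assumption about typical compiler output, not something the C source guarantees.

    /* The loop's bound check compiles to a compare followed by a
     * conditional jump (e.g., "cmp" then "jl").  Ivy Bridge's decoders
     * can fuse that pair into a single µop, so the two x86
     * instructions together occupy one slot of the 4-wide commit. */
    int sum_first_n(const int *a, int n) {
        int sum = 0;
        for (int i = 0; i < n; i++)   /* i < n: the fused cmp + jl pair */
            sum += a[i];
        return sum;
    }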


Two of your speculations on how multiple instructions can be executed in a single cycle are quite valid and have been used in actual processors. Your first speculation, that a faster internal clock is used, was used in the original Pentium 4's "fireball" ALUs. These ALUs were clocked at twice the frequency of the rest of the core, which was already relatively high.


(This was accomplished by using a staggered ALU in which the lower half of an addition was done in one cycle, allowing a dependent operation to use the lower half of the result in the next cycle. For operations like add, xor, or left shift which only need the lower half of operands to produce the full lower half of the result, such staggering—also known as width-pipelining—allows single cycle result latency as well as single cycle throughput.)
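
A toy software model of such a staggered adder, with invented function names, might look like the following; the point is that add_low_half's result is usable by a dependent operation a half-cycle before the high half exists.

    #include <stdint.h>

    /* Staggered (width-pipelined) 32-bit add modeled as two 16-bit
     * halves: the low half completes in the first half-cycle and its
     * carry feeds the high half in the next half-cycle. */
    static uint16_t add_low_half(uint16_t a, uint16_t b, unsigned *carry_out) {
        uint32_t sum = (uint32_t)a + (uint32_t)b;
        *carry_out = sum >> 16;       /* carry into the high half */
        return (uint16_t)sum;         /* ready a half-cycle early */
    }

    static uint16_t add_high_half(uint16_t a, uint16_t b, unsigned carry_in) {
        return (uint16_t)(a + b + carry_in);
    }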


A somewhat related technique, cascaded ALUs, was used by the HyperSPARC. The HyperSPARC fed the results from two ALUs into a third ALU, allowing two independent operations and a third, dependent operation to be executed in a single cycle.
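
In dataflow terms the cascade looks like the expression below; mapping each add to a named ALU is of course only illustrative.

    /* Cascaded ALUs, HyperSPARC-style: two independent adds execute in
     * ALU0 and ALU1, and ALU2 consumes both results in the same cycle,
     * so the three operations complete in one clock. */
    int cascade(int a, int b, int c, int d) {
        int t0 = a + b;     /* ALU0 */
        int t1 = c + d;     /* ALU1, independent of t0 */
        return t0 + t1;     /* ALU2, dependent on both */
    }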


Your speculation that "there are multiple concurrent pipelines per core" is the other technique that has been used. This type of design is called superscalar and is by far the most common means of increasing the number of operations executed in a single cycle.
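
At its core, superscalar issue is a hazard check between adjacent instructions. A minimal sketch for a simple in-order 2-wide machine, with invented field names (real out-of-order designs rename registers and so remove the write-after-write case), is:

    /* Two adjacent instructions issue down parallel pipelines in the
     * same cycle only if the second neither reads nor rewrites the
     * first's destination register. */
    typedef struct { int dst, src1, src2; } insn_t;

    static int can_dual_issue(insn_t first, insn_t second) {
        return second.src1 != first.dst   /* no read-after-write hazard */
            && second.src2 != first.dst
            && second.dst  != first.dst;  /* no write-after-write hazard */
    }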


There are also a few other odds and ends of instruction execution that might be worth noting. Some operations can be more efficiently performed outside of the ordinary execution units. The technique of move elimination exploits the use of register renaming in out-of-order processors to perform move operations during register renaming; the move simply copies the physical register number from one position in the renaming table (called a register alias table) to another. Not only does this effectively increase execution width but it also removes a dependency. This technique was used early with the stack-based x87, but is now broadly used in Intel's high performance x86 processors. (The use of destructive, two-operand instructions in x86 makes move elimination more helpful than it would be in a typical RISC.)
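
A minimal sketch of the mechanism, assuming a flat array as the register alias table:

    /* Move elimination at rename time: the RAT maps architectural
     * register numbers to physical register numbers, and a register-
     * to-register move never reaches an execution unit; renaming just
     * copies the mapping. */
    enum { NUM_ARCH_REGS = 16 };
    static int rat[NUM_ARCH_REGS];   /* arch reg -> physical reg */

    static void rename_move(int dst, int src) {
        rat[dst] = rat[src];         /* dst becomes an alias of src */
    }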


A technique similar to move elimination is the handling of register zeroing instructions during renaming. By providing a register name that holds the zero value, a register clearing instruction (like xor or subtract with both operands being the same register) can simply insert that name into the renaming table (RAT).
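
Continuing the same style of sketch (PHYS_ZERO is an assumed name for a physical register hard-wired to zero):

    /* Zero idioms handled during renaming: "xor eax, eax" never needs
     * an ALU; the destination's RAT entry simply points at the zero
     * register, which also breaks any dependency on the old value. */
    enum { NUM_ARCH_REGS = 16, PHYS_ZERO = 0 };
    static int rat[NUM_ARCH_REGS];

    static void rename_zero_idiom(int dst) {
        rat[dst] = PHYS_ZERO;
    }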



Another technique used by some x86 processors reduces the cost of push and pop operations. Ordinarily an instruction using the stack pointer would have to wait a full cycle for a previous push or pop to update the stack pointer's value. By recognizing that push and pop only adjust the stack pointer by a small value, one can compute the results of multiple additions/subtractions in parallel. The main delay for addition is carry propagation, but with small values the more significant bits of the base value (in this case the stack pointer) will have at most one carry-in. This allows an optimization similar to that of a carry-select adder to be applied to multiple additions of small values. In addition, since the stack pointer is typically only updated by constants, these updates can be performed earlier in the pipeline, separately from the main execution engine.
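
A toy model of such a stack engine, assuming 64-bit (8-byte) pushes and pops:

    #include <stdint.h>

    /* Push/pop deltas are accumulated as a small offset at decode, so
     * a run of pushes never serializes on ALU updates to the stack
     * pointer; the architectural rsp is synchronized only when another
     * instruction needs its real value. */
    static int64_t sp_offset = 0;        /* pending delta relative to rsp */

    static int64_t decode_push(void) {
        sp_offset -= 8;                  /* done at decode, no ALU */
        return sp_offset;                /* offset of the store address */
    }

    static int64_t decode_pop(void) {
        int64_t load_offset = sp_offset; /* pop reads before adjusting */
        sp_offset += 8;
        return load_offset;
    }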


It is also possible to merge instructions into a single, more complex operation. While the reverse process of splitting instructions into multiple, simpler operations is an old technique, merging instructions (which Intel terms macro-op fusion) can allow the implementation to support operations more complex than those exposed in the instruction set.


On the theoretical side, other techniques have been proposed. Small constants other than zero could be supported in the RAT and some simple operations that use or reliably produce such small values could be handled early. ("Physical Register Inlining", Mikko H. Lipasti et al., 2004, suggested using the RAT as a means of reducing register count, but the idea could be extended to support loading small immediates and simple operations on small numbers.)
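
One way to picture the extension suggested here is a RAT entry that can hold either a physical register number or a small inline constant; everything in this sketch is invented for illustration.

    /* A RAT entry under "physical register inlining": small constants
     * live directly in the table, so loading a small immediate
     * allocates no physical register, and simple operations on inlined
     * values could be folded at rename. */
    typedef struct {
        int is_imm;    /* 1: 'value' is an inline small constant */
        int value;     /* physical register number, or the constant */
    } rat_entry_t;

    static void rename_small_imm(rat_entry_t *entry, int imm) {
        entry->is_imm = 1;
        entry->value  = imm;
    }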


For trace caches (which store sequences of instructions under particular assumptions of control flow), there can be opportunities to merge operations separated by branches and to remove operations that produce unused results within the trace. Caching the optimized trace also amortizes the cost of optimizations such as instruction merging, which might not be worthwhile if they had to be redone each time the instruction stream was fetched.


Value prediction can be used to increase the number of operations that can be executed in parallel by removing dependencies. A stride-based value predictor is similar to the pop/push optimization of a specialized stack engine mentioned earlier. It can compute multiple additions mostly in parallel, removing the serialization. The general idea of value prediction is that with a predicted value, dependent operations can proceed without delay. (Branch direction and target prediction is effectively just a very limited form of value prediction, allowing the fetching of following instructions which are dependent on the "value" of the branch—taken or not—and the next instruction address, another value.)
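
A minimal stride predictor, the flavor of value prediction closest to the stack-engine example above, can be sketched in a few lines:

    #include <stdint.h>

    /* Predict that a value continues its last observed stride.
     * Dependent operations may start speculatively with the prediction
     * and are squashed if the actual value turns out to differ. */
    typedef struct { int64_t last, stride; } stride_pred_t;

    static int64_t predict(const stride_pred_t *p) {
        return p->last + p->stride;
    }

    static void train(stride_pred_t *p, int64_t actual) {
        p->stride = actual - p->last;
        p->last   = actual;
    }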

