A superscalar CPU design implements a form of parallel computing called instruction-level parallelism inside a single CPU, which allows more work to be done at the same clock rate. The CPU executes more than one instruction during a clock cycle by sending multiple instructions at the same time (called instruction dispatching) to duplicate functional units. Each functional unit is an execution resource inside the CPU core, such as an arithmetic logic unit (ALU), a floating point unit (FPU), a bit shifter, or a multiplier.
Most superscalar CPUs are also pipelined, but it's possible to have a non-pipelined superscalar CPU or a pipelined non-superscalar CPU.
The superscalar technique is supported by several features of the CPU core:
- Instructions come from an ordered instruction list.
- CPU hardware can work out which instructions have which data dependencies.
- The CPU can read multiple instructions per clock cycle.
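The dependency check named in the list above can be sketched in a few lines. This is only an illustration, not how any real CPU is built: the `Instr` type, the `hazards` function, and the register names are all made up for the example. It uses the usual hazard names: read-after-write (RAW), write-after-read (WAR), and write-after-write (WAW).

```python
from typing import NamedTuple


class Instr(NamedTuple):
    """Hypothetical instruction: one destination register, some source registers."""
    dest: str
    srcs: tuple


def hazards(first: Instr, second: Instr) -> set:
    """Return the data hazards that `second` has on the earlier instruction `first`."""
    found = set()
    if first.dest in second.srcs:
        found.add("RAW")  # second reads a value that first writes
    if second.dest in first.srcs:
        found.add("WAR")  # second overwrites a value that first still reads
    if second.dest == first.dest:
        found.add("WAW")  # both write the same register
    return found


a = Instr("r1", ("r2", "r3"))  # r1 = r2 + r3
b = Instr("r4", ("r1", "r5"))  # r4 = r1 + r5, needs r1 from a
c = Instr("r6", ("r7", "r8"))  # r6 = r7 + r8, independent of a

print(hazards(a, b))  # a and b cannot run in the same cycle
print(hazards(a, c))  # a and c can run side by side
```

An empty result means the two instructions share no registers in a conflicting way, so they are candidates for running at the same time.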
Each instruction run by a scalar processor changes one or two data items at a time, but each instruction executed by a vector processor handles many data items at once. A superscalar processor is a mixture of the two:
- Each instruction processes one data item.
- There are multiple duplicate functional units inside each CPU core, so that multiple instructions handle independent data items at the same time.
In a superscalar CPU, an instruction dispatcher reads instructions from memory and decides which ones can be run in parallel, dispatching them to the multiple duplicate functional units available inside the CPU.
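A very simplified dispatcher can be sketched as follows. The sketch assumes things the article does not state: instructions are issued strictly in order, every instruction takes one cycle, any functional unit can run any instruction, and `width` is at least 1. The function name `dispatch_in_order` and the `(destination, sources)` tuple format are invented for the example.

```python
def dispatch_in_order(program, width):
    """Group instructions into cycles of at most `width` instructions.

    Each instruction is a (dest, sources) tuple. An instruction joins the
    current cycle only if it has no data dependency on an instruction
    already in that cycle; otherwise a new cycle is started.
    """
    cycles = []
    current = []
    for dest, srcs in program:
        written = {d for d, _ in current}
        read = {s for _, ss in current for s in ss}
        conflict = (
            any(s in written for s in srcs)  # read-after-write
            or dest in written               # write-after-write
            or dest in read                  # write-after-read
        )
        if conflict or len(current) == width:
            cycles.append(current)
            current = []
        current.append((dest, srcs))
    if current:
        cycles.append(current)
    return cycles


program = [
    ("r1", ("r2", "r3")),  # independent
    ("r4", ("r5", "r6")),  # independent
    ("r7", ("r1", "r4")),  # needs r1 and r4
    ("r8", ("r7", "r2")),  # needs r7
]

print(len(dispatch_in_order(program, width=1)))  # 4 cycles on a scalar CPU
print(len(dispatch_in_order(program, width=2)))  # 3 cycles with two units
```

Only the first two instructions can share a cycle, so doubling the width saves one cycle here rather than halving the time.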
Superscalar CPU design is concerned with improving the accuracy of the instruction dispatcher and allowing it to keep the multiple functional units busy at all times. As of 2008, all general-purpose CPUs are superscalar. A typical superscalar CPU may include up to four ALUs, two FPUs, and two SIMD units. If the dispatcher cannot keep all of the units busy, the performance of the CPU will be lower.
Performance improvement in superscalar CPU design is limited by two things:
- The level of built-in parallelism in the instruction list.
- The complexity and time cost of the dispatcher and data dependency checking.
Even with infinitely fast dependency checking inside a normal superscalar CPU, an instruction list that has many dependencies limits the possible performance improvement. The amount of built-in parallelism in the code is therefore a limit of its own.
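One rough way to see this limit is to measure the longest chain of instructions that each need the result of the one before. The sketch below assumes unlimited functional units, one cycle per instruction, and a non-empty program, and it counts only read-after-write dependencies. The names `critical_path_length` and `ilp_upper_bound` are made up for the example.

```python
def critical_path_length(program):
    """Length of the longest chain of read-after-write dependencies.

    Each instruction is a (dest, sources) tuple.
    """
    depth_of = {}  # register -> chain depth of the instruction that last wrote it
    longest = 0
    for dest, srcs in program:
        depth = 1 + max((depth_of.get(s, 0) for s in srcs), default=0)
        depth_of[dest] = depth
        longest = max(longest, depth)
    return longest


def ilp_upper_bound(program):
    """Best possible average instructions per cycle (program must not be empty)."""
    return len(program) / critical_path_length(program)


# Every instruction needs the result of the previous one.
chain = [("r1", ("r0",)), ("r2", ("r1",)), ("r3", ("r2",)), ("r4", ("r3",))]
# No instruction needs a result from another.
independent = [("r1", ("r0",)), ("r2", ("r0",)), ("r3", ("r0",)), ("r4", ("r0",))]

print(ilp_upper_bound(chain))        # 1.0: extra units do not help
print(ilp_upper_bound(independent))  # 4.0: all four could run together
```

The chain cannot go faster than one instruction per cycle no matter how many functional units the CPU has.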
No matter how fast the dispatcher is, there is a practical limit on how many instructions can be dispatched at the same time. Hardware advances will allow more functional units (e.g., ALUs) per CPU core, but the work of checking instruction dependencies grows so quickly that the achievable dispatch width stays fairly small, likely on the order of five to six simultaneously dispatched instructions.
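A simple counting model shows why the checking cost grows quickly. It assumes that each instruction has two source registers and that every source must be compared with the destination of every earlier instruction dispatched in the same cycle. This is a simplified model for illustration only, and `comparators_needed` is a made-up name.

```python
def comparators_needed(width, sources_per_instr=2):
    """Register comparisons needed to check one group of `width` instructions.

    Assumed model: each source of each instruction is compared with the
    destination of every earlier instruction in the same group.
    """
    earlier_pairs = width * (width - 1) // 2  # (earlier, later) instruction pairs
    return sources_per_instr * earlier_pairs


for width in (1, 2, 4, 6, 8):
    print(width, comparators_needed(width))
```

Under this model, going from 2 to 4 instructions per cycle raises the count from 2 to 12 comparisons, and 6 instructions need 30. The count grows with the square of the width, while the number of functional units grows only in step with it.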
Superscalar design is related to, but different from, other techniques for improving CPU performance:
- Simultaneous multithreading (SMT): a technique for improving the overall speed of superscalar CPUs. SMT allows multiple independent threads of execution to make better use of the resources available inside a modern superscalar processor.
- Multi-core processors: superscalar processors differ from multi-core processors in that the multiple redundant functional units are not entire processors. A single superscalar processor is made up of functional units such as the ALU, integer multiplier, integer shifter, and floating point unit (FPU). There may be multiple copies of each functional unit so that many instructions can execute in parallel. A multi-core processor, by contrast, concurrently processes instructions from multiple threads, one thread per core.
- Pipelined processors: a superscalar processor also differs from a pipelined CPU, in which multiple instructions can be in various stages of execution at the same time.
The various alternative techniques are not mutually exclusive; they can be (and frequently are) combined in a single processor. It is therefore possible to design a multicore CPU in which each core is an independent processor with multiple parallel superscalar pipelines. Some multicore processors also include vector capability.