# Instruction Pipelining and Hazards Daniel Sanchez Computer Science and Artificial Intelligence Laboratory M.I.T. ### Princeton Microarchitecture Datapath & Control for 2-cycles-per-instruction #### Princeton Microarchitecture (redrawn) Only one of the phases is active in any cycle ⇒ a lot of datapath is not in use at any given time Sanchez & Emer February 18, 2014 ### Princeton Microarchitecture Overlapped execution Can we overlap instruction fetch and execute? Yes, unless IR contains a Load or Store Which action should be prioritized? Execute What do we do with Fetch? Stall it How? #### Stalling the instruction fetch Princeton Microarchitecture When stall condition is indicated - don't fetch a new instruction and don't change the PC - insert a nop in the IR - set the Memory Address mux to ALU (not shown) What if IR contains a jump or branch instruction? #### Need to stall on branches Princeton Microarchitecture When IR contains a jump or branch-taken - no "structural conflict" for the memory - but we do not have the correct PC value in the PC - memory cannot be used Address Mux setting is irrelevant - insert a nop in the IR - insert the nextPC (branch-target) address in the PC #### Pipelined Princeton Microarchitecture **PCen** #### Pipelined Princeton: Control Table | Opcode | Stall | Ext<br>Sel | B<br>Src | Op<br>Sel | Mem<br>W | Reg<br>W | WB<br>Src | Reg<br>Dst | PC<br>Src1 | PC<br>Src2 | IR<br>Src | MAddr<br>Src | |--------------|-------|------------------|----------|-----------|----------|----------|-----------|------------|------------|------------|-----------|--------------| | ALU | no | * | Reg | Func | no | yes | ALU | rd | pc+4 | npc | mem | рс | | ALUi | no | sE <sub>16</sub> | Imm | Ор | no | yes | ALU | rt | pc+4 | npc | mem | рс | | ALUiu | no | uE <sub>16</sub> | Imm | Ор | no | yes | ALU | rt | pc+4 | npc | mem | рс | | LW | | sE <sub>16</sub> | _ | + | no | yes | Mem | rt | pc+4 | рс | nop | ALU | | SW | yes | sE <sub>16</sub> | Imm | + | yes | no | * | * | pc+4 | рс | nop | ALU | | $BEQZ_{z=0}$ | yes | sE <sub>16</sub> | * | 0? | no | no | * | * | br | npc | nop | * | | $BEQZ_{z=1}$ | | sE <sub>16</sub> | l | 0? | no | no | * | * | pc+4 | npc | mem | рс | | J | yes | * | * | * | no | no | * | * | jabs | npc | nop | * | | JAL | yes | * | * | * | no | yes | PC | R31 | jabs | npc | nop | * | | JR | yes | * | * | * | no | no | * | * | rind | npc | nop | * | | JALR | yes | * | * | * | no | yes | PC | R31 | rind | npc | nop | * | | NOP | no | * | * | * | no | no | * | * | pc+4 | npc | mem | рс | BSrc = Reg / Imm ; WBSrc = ALU / Mem / PC; IRSrc = nop/mem; MAddSrc = pc/ALU RegDst = rt / rd / R31; PCSrc1 = pc+4 / br / rind / jabs; PCSrc2 = pc/nPC stall & IRSrc columns are identical #### Pipelined Princeton Architecture Clock: $t_{C-Princeton} > t_{RF} + t_{ALU} + t_{M} + t_{WB}$ CPI: (1- f) + 2f cycles per instruction where f is the fraction of instructions that cause a stall What is a likely value of f? #### An Ideal Pipeline - All objects go through the same stages - No sharing of resources between any two stages - Propagation delay through all pipeline stages is equal - The scheduling of an object entering the pipeline is not affected by the objects in other stages These conditions generally hold for industrial assembly lines. But what about an instruction pipeline? #### Pipelined Datapath Clock period can be reduced by dividing the execution of an instruction into multiple cycles $$t_C > max \{t_{IM}, t_{RF}, t_{ALU}, t_{DM}, t_{RW}\} ( = t_{DM} probably)$$ However, CPI will increase unless instructions are pipelined ### How to divide the datapath into stages Suppose memory is significantly slower than other stages. In particular, suppose ``` t_{IM} = 10 \text{ units} t_{DM} = 10 \text{ units} t_{ALU} = 5 \text{ units} t_{RF} = 1 \text{ unit} t_{RW} = 1 \text{ unit} ``` Since the slowest stage determines the clock, it may be possible to combine some stages without any loss of performance #### Alternative Pipelining Write-back stage takes much less time than other stages. Suppose we combined it with the memory phase ### Maximum Speedup by Pipelining | | Assumptions | Unpipelined | Pipelined | Speedup | |----|--------------------------------------------------------------------|----------------|----------------|---------| | 1. | $t_{IM} = t_{DM} = 10,$<br>$t_{ALU} = 5,$<br>$t_{RF} = t_{RW} = 1$ | t <sub>C</sub> | t <sub>C</sub> | | | | 4-stage pipeline | 27 | 10 | 2.7 | | 2. | $t_{IM} = t_{DM} = t_{ALU} = t_{RF} = t_{RW} =$ 4-stage pipeline | 5<br>25 | 10 | 2.5 | | 3. | $t_{IM} = t_{DM} = t_{ALU} = t_{RF} = t_{RW} =$ 5-stage pipeline | 5<br>25 | 5 | 5.0 | What seems to be the message here? One can achieve higher speedup with more pipeline stages. February 18, 2014 #### 5-Stage Pipelined Execution Instruction Flow Diagram #### 5-Stage Pipelined Execution Resource Usage Diagram ### Pipelined Execution: ALU Instructions Not quite correct! We need an Instruction Reg (IR) for each stage ### Pipelined MIPS Datapath without jumps ## How instructions can interact with each other in a pipeline - An instruction in the pipeline may need a resource being used by another instruction in the pipeline → structural hazard - An instruction may depend on something produced by an earlier instruction - Dependence may be for a data calculation data hazard - Dependence may be for calculating the next PC control hazard (branches, interrupts) #### Data Hazards r1 is stale. Oops! #### Resolving Data Hazards Strategy 1: Wait for the result to be available by freezing earlier pipeline stages → interlocks Strategy 2: Route data as soon as possible after it is calculated to the earlier pipeline stage > bypass Strategy 3: Speculate on the dependence Two cases: Guessed correctly → do nothing Guessed incorrectly → kill and restart #### Resolving Data Hazards (1) Strategy 1: Wait for the result to be available by freezing earlier pipeline stages → interlocks #### Feedback to Resolve Hazards - Later stages provide dependence information to earlier stages which can *stall (or kill) instructions* - Controlling a pipeline in this manner works provided the instruction at stage i+1 can complete without any interference from instructions in stages 1 to i (otherwise deadlocks may occur) Sanchez & Emer February 18, 2014 #### Interlocks to resolve Data Hazards ### Stalled Stages and Pipeline Bubbles nop ⇒ pipeline bubble #### Interlock Control Logic Compare the source registers of the instruction in the decode stage with the destination register of the uncommitted instructions.