Instruction Pipelining and Hazards

Daniel Sanchez
Computer Science and Artificial Intelligence Laboratory
M.I.T.
Reminder: Harvard-Style Single-Cycle Datapath for MIPS

[Diagram showing the Harvard-Style Single-Cycle Datapath for MIPS]

- PCSrc: br, jabs, pc+4
- Add: 0x4, Inst.
- Memory: addr, inst
- Inst. Memory: 0x4, Add
- RegWrite: clk, rs1, rs2, rd1, we, ws, wd, rd2
- ALU: Imm Ext, ALU Control: z, zero?
- ALU: z, wdata, rdata
- MemWrite: clk, addr
- WBSrc: clk, we
Reminder: Princeton Microarchitecture
Datapath & Control for 2 cycles-per-instruction

March 4, 2021
The same
(mux not shown)

Only one of the phases is active in any cycle
⇒ a lot of datapath not used at any given time
Can we overlap instruction fetch and execute?

Yes, unless IR contains a Load or Store

Which action should be prioritized?  Execute

What do we do with Fetch?  Stall it
When stall condition is indicated

- don’t fetch a new instruction and don’t change the PC
- insert a nop in the IR
- set the Memory Address mux to ALU (not shown)

What if IR contains a jump or branch instruction?
Need to stall on branches

**Princeton Microarchitecture**

When IR contains a jump or taken branch:
- **no “structural conflict” for the memory**
- **but we do not have the correct PC value in the PC**
- **memory cannot be used** – Address Mux setting is irrelevant
- **insert a nop in the IR**
- **insert the nextPC (branch-target) address in the PC**
Pipelined Princeton Microarchitecture

March 4, 2021

MIT 6.823 Spring 2021
## Pipelined Princeton: Control Table

<table>
<thead>
<tr>
<th>Opcode</th>
<th>Stall</th>
<th>ExtSel</th>
<th>Src</th>
<th>B</th>
<th>Sel</th>
<th>OpSel</th>
<th>MemW</th>
<th>RegW</th>
<th>WBSrc</th>
<th>RegDst</th>
<th>PCSrc1</th>
<th>PCSrc2</th>
<th>IRSrc</th>
<th>MAddrSrc</th>
</tr>
</thead>
<tbody>
<tr>
<td>ALU</td>
<td>no</td>
<td>*</td>
<td>Reg</td>
<td>Func</td>
<td>no</td>
<td>yes</td>
<td>ALU</td>
<td>rd</td>
<td>pc+4</td>
<td>npc</td>
<td>mem</td>
<td>pc</td>
<td></td>
<td></td>
</tr>
<tr>
<td>ALUi</td>
<td>no</td>
<td>sE₁₆</td>
<td>Imm</td>
<td>Op</td>
<td>no</td>
<td>yes</td>
<td>ALU</td>
<td>rt</td>
<td>pc+4</td>
<td>npc</td>
<td>mem</td>
<td>pc</td>
<td></td>
<td></td>
</tr>
<tr>
<td>ALUiu</td>
<td>no</td>
<td>uE₁₆</td>
<td>Imm</td>
<td>Op</td>
<td>no</td>
<td>yes</td>
<td>ALU</td>
<td>rt</td>
<td>pc+4</td>
<td>npc</td>
<td>mem</td>
<td>pc</td>
<td></td>
<td></td>
</tr>
<tr>
<td>LW</td>
<td>yes</td>
<td>sE₁₆</td>
<td>Imm</td>
<td>+</td>
<td>no</td>
<td>yes</td>
<td>Mem</td>
<td>rt</td>
<td>pc+4</td>
<td>pc</td>
<td>nop</td>
<td>ALU</td>
<td></td>
<td></td>
</tr>
<tr>
<td>SW</td>
<td>yes</td>
<td>sE₁₆</td>
<td>Imm</td>
<td>+</td>
<td>yes</td>
<td>no</td>
<td>*</td>
<td>*</td>
<td>pc+4</td>
<td>pc</td>
<td>nop</td>
<td>ALU</td>
<td></td>
<td></td>
</tr>
<tr>
<td>BEQZₜ=1</td>
<td>yes</td>
<td>sE₁₆</td>
<td>*</td>
<td>0?</td>
<td>no</td>
<td>no</td>
<td>*</td>
<td>*</td>
<td>br</td>
<td>npc</td>
<td>nop</td>
<td>*</td>
<td></td>
<td></td>
</tr>
<tr>
<td>BEQZₜ=0</td>
<td>no</td>
<td>sE₁₆</td>
<td>*</td>
<td>0?</td>
<td>no</td>
<td>no</td>
<td>*</td>
<td>*</td>
<td>pc+4</td>
<td>npc</td>
<td>mem</td>
<td>pc</td>
<td></td>
<td></td>
</tr>
<tr>
<td>J</td>
<td>yes</td>
<td>*</td>
<td>*</td>
<td>*</td>
<td>no</td>
<td>no</td>
<td>*</td>
<td>*</td>
<td>jabs</td>
<td>npc</td>
<td>nop</td>
<td>*</td>
<td></td>
<td></td>
</tr>
<tr>
<td>JAL</td>
<td>yes</td>
<td>*</td>
<td>*</td>
<td>*</td>
<td>no</td>
<td>yes</td>
<td>PC</td>
<td>R31</td>
<td>jabs</td>
<td>npc</td>
<td>nop</td>
<td>*</td>
<td></td>
<td></td>
</tr>
<tr>
<td>JR</td>
<td>yes</td>
<td>*</td>
<td>*</td>
<td>*</td>
<td>no</td>
<td>no</td>
<td>*</td>
<td>*</td>
<td>rind</td>
<td>npc</td>
<td>nop</td>
<td>*</td>
<td></td>
<td></td>
</tr>
<tr>
<td>JALR</td>
<td>yes</td>
<td>*</td>
<td>*</td>
<td>*</td>
<td>no</td>
<td>yes</td>
<td>PC</td>
<td>R31</td>
<td>rind</td>
<td>npc</td>
<td>nop</td>
<td>*</td>
<td></td>
<td></td>
</tr>
<tr>
<td>NOP</td>
<td>no</td>
<td>*</td>
<td>*</td>
<td>*</td>
<td>no</td>
<td>no</td>
<td>*</td>
<td>*</td>
<td>pc+4</td>
<td>npc</td>
<td>mem</td>
<td>pc</td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

BSrc = Reg/Imm; WBSrc = ALU/Mem/PC; IRSrc = nop/mem; MAddrSrc = pc/ALU
RegDst = rt/rd/R31; PCSrc1 = pc+4/br/rind/jabs; PCSrc2 = pc/nPC

* stall & IRSrc columns are identical
Pipelined Princeton Architecture

Clock: \[ t_{\text{C-Princeton}} > t_{\text{RF}} + t_{\text{ALU}} + t_{\text{M}} + t_{\text{WB}} \]

CPI: \( (1 - f) + 2f \) cycles per instruction

where f is the fraction of instructions that cause a stall

What is a likely value of f?
An Ideal Pipeline

- All objects go through the same stages
- No sharing of resources between any two stages
- Propagation delay through all pipeline stages is equal
- The scheduling of an object entering the pipeline is not affected by the objects in other stages

These conditions generally hold for industrial assembly lines. But what about an instruction pipeline?
Clock period can be reduced by dividing the execution of an instruction into multiple cycles

\[ t_C > \max \{ t_{IM}, t_{RF}, t_{ALU}, t_{DM}, t_{RW} \} \ ( = t_{DM} \text{ probably}) \]

However, CPI will increase unless instructions are pipelined
How to divide datapath into stages

Suppose memory is significantly slower than other stages. For example, suppose

\[
\begin{align*}
t_{IM} &= 10 \text{ units} \\
t_{DM} &= 10 \text{ units} \\
t_{ALU} &= 5 \text{ units} \\
t_{RF} &= 1 \text{ unit} \\
t_{RW} &= 1 \text{ unit}
\end{align*}
\]

Since the slowest stage determines the clock, it may be possible to combine some stages without any loss of performance.
Write-back stage takes much less time than other stages. Suppose we combined it with the memory phase.
Maximum Speedup by Pipelining

<table>
<thead>
<tr>
<th>Assumptions</th>
<th>Unpipelined $t_C$</th>
<th>Pipelined $t_C$</th>
<th>Speedup</th>
</tr>
</thead>
<tbody>
<tr>
<td>1. $t_{IM} = t_{DM} = 10$, $t_{ALU} = 5$, $t_{RF} = t_{RW} = 1$</td>
<td>27</td>
<td>10</td>
<td>2.7</td>
</tr>
<tr>
<td>2. $t_{IM} = t_{DM} = t_{ALU} = t_{RF} = t_{RW} = 5$</td>
<td>25</td>
<td>10</td>
<td>2.5</td>
</tr>
<tr>
<td>3. $t_{IM} = t_{DM} = t_{ALU} = t_{RF} = t_{RW} = 5$</td>
<td>25</td>
<td>5</td>
<td>5.0</td>
</tr>
</tbody>
</table>

What seems to be the message here?

One can achieve higher speedup with more pipeline stages
5-Stage Pipelined Execution

Instruction Flow Diagram

I-Fetch (IF)

Decode, Reg. Fetch (ID)

Execute (EX)

Memory (MA)

Write-Back (WB)

Time:
- t0: Instruction 1
- t1: Instruction 2
- t2: Instruction 3
- t3: Instruction 4
- t4: Instruction 5

PC → Adder

I-Fetch (IF):
- IF1
- IF2
- IF3
- IF4
- IF5

Decode, Reg. Fetch (ID):
- ID1
- ID2
- ID3
- ID4
- ID5

Execute (EX):
- EX1
- EX2
- EX3
- EX4
- EX5

Memory (MA):
- MA1
- MA2
- MA3
- MA4
- MA5

Write-Back (WB):
- WB1
- WB2
- WB3
- WB4
- WB5

Memory:
- GPRs
- Addr
- Rdata

ALU:
- Imm Ext

Adder:
- Addr
- Rdata

Control:
- Write-Back (WB)
5-Stage Pipelined Execution

Resource Usage Diagram

- **I-Fetch (IF)**
  - PC
  - Addr
  - Rdata
  - Inst. Memory

- **Decode, Reg. Fetch (ID)**
  - IR
  - Rs1
  - Rs2
  - Rd1
  - Ws
  - Wd
  - Rd2
  - GPRs
  - Imm
  - Ext

- **Execute (EX)**
  - ALU
  - Addr
  - Rdata
  - Memory
  - Wdata

- **Memory (MA)**

- **Write-Back (WB)**

**Resources**
- **time**
  - t0 I1
  - t1 I2
  - t2 I3
  - t3 I4
  - t4 I5
  - t5 I6
  - t6 I7
  - t7 ...

**March 4, 2021**
Not quite correct!

We need an Instruction Reg (IR) for each stage
Pipelined MIPS Datapath
without jumps

What else is needed?

Control Points Need to Be Connected
How instructions can interact with each other in a pipeline

• An instruction in the pipeline may need a resource being used by another instruction in the pipeline → **structural hazard**

• An instruction may depend on a value produced by an earlier instruction
  
  – Dependence may be for a data calculation → **data hazard**
  
  – Dependence may be for calculating the next PC → **control hazard** (branches, interrupts)
Data Hazards

\[
\begin{align*}
\text{r4} & \leftarrow \text{r1} \ldots \\
\text{r1} & \leftarrow \ldots \\
\end{align*}
\]

\[
\text{r1 is stale. Oops!}
\]

\[
\begin{align*}
\text{r1} & \leftarrow \text{r0} + 10 \\
\text{r4} & \leftarrow \text{r1} + 17 \\
\end{align*}
\]
Resolving Data Hazards

Strategy 1: *Wait for the result to be available by freezing earlier pipeline stages → stall*

Strategy 2: *Route data as soon as possible after it is calculated to the earlier pipeline stage → bypass*

Strategy 3: *Speculate on the dependence*
   
   *Two cases:*
   
   - *Guessed correctly → do nothing*
   - *Guessed incorrectly → kill and restart*
Resolving Data Hazards (1)

Strategy 1:

*Wait for the result to be available by freezing earlier pipeline stages → stall*
Feedback to Resolve Hazards

Later stages provide dependence information to earlier stages which can **stall (or kill) instructions**.

Controlling a pipeline in this manner works provided the instruction at stage $i+1$ can complete without any interference from instructions in stages 1 to $i$ (otherwise deadlocks may occur).
Resolving Data Hazards by Stalling

Stall Condition

... r1 ← r0 + 10
r4 ← r1 + 17
...

March 4, 2021
Stalled Stages and Pipeline Bubbles

\( (I_1) \) \( r_1 \leftarrow (r_0) + 10 \)

\( (I_2) \) \( r_4 \leftarrow (r_1) + 17 \)

\( (I_3) \)

\( (I_4) \)

\( (I_5) \)

\( time \)

\( t_0 \) \( t_1 \) \( t_2 \) \( t_3 \) \( t_4 \) \( t_5 \) \( t_6 \) \( t_7 \) \( \ldots \)

\( IF \)

\( ID \)

\( EX \)

\( MA \)

\( WB \)

\( \text{Resource Usage} \)

\( nop \Rightarrow \text{pipeline bubble} \)

\( L06-26 \)
Compare the source registers of the instruction in the decode stage with the destination register of the uncommitted instructions.
Stall Control Logic

ignoring jumps & branches

Should we always stall if the rs field matches some rd?
not every instruction writes a register ⇒ we
not every instruction reads a register ⇒ re
## Source & Destination Registers

### R-type:

<table>
<thead>
<tr>
<th>op</th>
<th>rs</th>
<th>rt</th>
<th>rd</th>
<th>func</th>
</tr>
</thead>
</table>

### I-type:

<table>
<thead>
<tr>
<th>op</th>
<th>rs</th>
<th>rt</th>
<th>immediate16</th>
</tr>
</thead>
</table>

### J-type:

<table>
<thead>
<tr>
<th>op</th>
<th>immediate26</th>
</tr>
</thead>
</table>

<table>
<thead>
<tr>
<th>Instruction</th>
<th>Description</th>
<th>Source(s)</th>
<th>Destination</th>
</tr>
</thead>
<tbody>
<tr>
<td>ALU</td>
<td>rd ← (rs) func (rt)</td>
<td>rs, rt</td>
<td>rd</td>
</tr>
<tr>
<td>ALUi</td>
<td>rt ← (rs) op imm</td>
<td>rs</td>
<td>rt</td>
</tr>
<tr>
<td>LW</td>
<td>rt ← M [(rs) + imm]</td>
<td>rs</td>
<td>rt</td>
</tr>
<tr>
<td>SW</td>
<td>M [(rs) + imm] ← (rt)</td>
<td>rs, rt</td>
<td></td>
</tr>
<tr>
<td>BZ</td>
<td>cond (rs)</td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td>true: PC ← (PC) + imm</td>
<td>rs</td>
<td></td>
</tr>
<tr>
<td></td>
<td>false: PC ← (PC) + 4</td>
<td>rs</td>
<td></td>
</tr>
<tr>
<td>J</td>
<td>PC ← (PC) + imm</td>
<td></td>
<td></td>
</tr>
<tr>
<td>JAL</td>
<td>r31 ← (PC), PC ← (PC) + imm</td>
<td>31</td>
<td></td>
</tr>
<tr>
<td>JR</td>
<td>PC ← (rs)</td>
<td>rs</td>
<td></td>
</tr>
<tr>
<td>JALR</td>
<td>r31 ← (PC), PC ← (rs)</td>
<td>rs</td>
<td>31</td>
</tr>
</tbody>
</table>
Deriving the Stall Signal

\[ C_{dest} \]

\[ ws = \text{Case opcode} \]
- ALU \( \Rightarrow \) rd
- ALUi, LW \( \Rightarrow \) rt
- JAL, JALR \( \Rightarrow \) R31

\[ we = \text{Case opcode} \]
- ALU, ALUi, LW \( \Rightarrow (ws \neq 0) \)
- JAL, JALR \( \Rightarrow \) on
- ... \( \Rightarrow \) off

\[ C_{re} \]

\[ re1 = \text{Case opcode} \]
- ALU, ALUi, LW \( \Rightarrow \) on
- JR, JALR \( \Rightarrow \) off
- J, JAL \( \Rightarrow \) off

\[ re2 = \text{Case opcode} \]
- ALU, SW \( \Rightarrow \) on
- ... \( \Rightarrow \) off

\[ C_{stall} \]

\[ \text{stall} = ((rs_D == ws_E) \cdot we_E + (rs_D == ws_M) \cdot we_M + (rs_D == ws_W) \cdot we_W) \cdot re1_D + ((rt_D == ws_E) \cdot we_E + (rt_D == ws_M) \cdot we_M + (rt_D == ws_W) \cdot we_W) \cdot re2_D \]

This is not the full story!
Is there any possible data hazard in this instruction sequence?

What if \((r1) + 7 = (r3) + 5\)?
Load & Store Hazards

However, the hazard is avoided because our memory system completes writes in one cycle!

Load/Store hazards are sometimes resolved in the pipeline and sometimes in the memory system itself.

More on this later in the course.
Next lecture: Control Hazards, Bypassing, and Speculation