## Complex Pipelining:

## Out-of-Order Execution, Register Renaming and Exceptions

Daniel Sanchez<br>Computer Science and Artificial Intelligence Laboratory M.I.T.

## CDC 6600-style Scoreboard



## CDC 6600-style Scoreboard

Instructions are issued in order;


## CDC 6600-style Scoreboard

Instructions are issued in order;
An instruction is issued only if


## CDC 6600-style Scoreboard

Instructions are issued in order; An instruction is issued only if<br>- It cannot cause a RAW hazard



## CDC 6600-style Scoreboard

> Instructions are issued in order; An instruction is issued only if
> - It cannot cause a RAW hazard
> $\Rightarrow$ if operands are read immediately then no need to remember sources of instructions in the execute phases


## CDC 6600-style Scoreboard

Instructions are issued in order;
An instruction is issued only if

- It cannot cause a RAW hazard
$\Rightarrow$ if operands are read
immediately then no need to
remember sources of
instructions in the execute
phases
- It cannot cause a WAW hazard



## CDC 6600-style Scoreboard

> Instructions are issued in order; An instruction is issued only if
> - It cannot cause a RAW hazard
> $\Rightarrow$ if operands are read immediately then no need to remember sources of instructions in the execute phases
> - It cannot cause a WAW hazard
> $\Rightarrow$ There can be at most instruction in the execute phase that can write in a


## CDC 6600-style Scoreboard

Instructions are issued in order; An instruction is issued only if

- It cannot cause a RAW hazard
$\Rightarrow$ if operands are read immediately then no need to remember sources of instructions in the execute phases
- It cannot cause a WAW hazard
$\Rightarrow$ There can be at most instruction in the execute phase that can write in a


Scoreboard:
Two bit-vectors

## CDC 6600-style Scoreboard

Instructions are issued in order; An instruction is issued only if

- It cannot cause a RAW hazard
$\Rightarrow$ if operands are read immediately then no need to remember sources of instructions in the execute phases
- It cannot cause a WAW hazard
$\Rightarrow$ There can be at most instruction in the execute phase that can write in a
 particular register

Busy[FU\#]: Indicates FU's availability These bits are hardwired to FU's.

Scoreboard:
Two bit-vectors

## CDC 6600-style Scoreboard

Instructions are issued in order; An instruction is issued only if

- It cannot cause a RAW hazard
$\Rightarrow$ if operands are read immediately then no need to remember sources of instructions in the execute phases
- It cannot cause a WAW hazard
$\Rightarrow$ There can be at most instruction in the execute phase that can write in a
 particular register

Scoreboard:
Two bit-vectors

Busy[FU\#]: Indicates FU's availability These bits are hardwired to FU's.

WP[reg\#]: Records if a write is pending for a register

Set to true by the Issue stage and set to false by the WB stage

## Reminder: Scoreboard Dynamics



## Reminder: Scoreboard Dynamics



## Reminder: Scoreboard Dynamics



## Reminder: Scoreboard Dynamics



## Reminder: Scoreboard Dynamics



## Reminder: Scoreboard Dynamics



## Reminder: Scoreboard Dynamics



## Reminder: Scoreboard Dynamics



## Reminder: Scoreboard Dynamics



Sanchez \& Emer

## Reminder: Scoreboard Dynamics



Sanchez \& Emer

## Reminder: Scoreboard Dynamics

| $\underbrace{\begin{array}{l} \text { Issue } \\ \text { time } \end{array}}$ | Functional Unit Stat Int(1) Add(1) ${ }^{\text {Mult(3) }}$ |  |  |  | Div(4) |  |  |  | WB | WP |  |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| t0 |  |  |  |  | f6 |  |  |  |  | f6 |  |
| $\mathrm{t} 1 \mathrm{I}_{2}$ | f2 |  |  |  |  | f6 |  |  |  | f6, f2 |  |
| t2 |  |  |  |  |  |  | f6 |  | f2 | f6, f2 | $\underline{I}_{2}$ |
| t3 $I_{3}$ |  |  | f0 |  |  |  |  | $f 6$ |  | f6, f0 |  |
| t4 |  |  |  | f0 |  |  |  |  | f6 | f6, f0 | $\underline{\underline{I}}$ |
| t5 $\mathrm{I}_{4}$ |  |  |  | f0 | 8 |  |  |  |  | f0, f8 |  |
| t6 |  |  |  |  |  | f8 |  |  | f0 | f0, f8 | $\underline{I}_{3}$ |
| t7 $\mathrm{I}_{5}$ |  | f10 |  |  |  |  | 88 |  |  | f8, f10 |  |
| t8 |  |  |  |  |  |  |  | f8 | f10 | f8, f10 | $\underline{I}_{5}$ |
| t9 |  |  |  |  |  |  |  |  | f8 | f8 | $\underline{I}_{4}$ |
|  |  |  |  |  |  |  |  |  |  |  |  |
|  |  |  |  |  |  |  |  |  |  |  |  |
|  | DIVD |  |  | f6, |  | f6 | , |  | $f 4$ | Issue |  |
| $I_{2}$ | LD |  |  | f2, |  |  | 5 (r3) |  |  | WP |  |
| $I_{3}$ | MUL |  |  | fo, |  | f2 |  |  | $f 4$ | WP |  |
| $I_{4}$ | DIVD |  |  | f8, |  | $f 6$ |  |  | $f 2$ | WP | WP |
| $I_{5}$ | SUB |  |  | f10, |  | f0 |  |  | f6 | Busy[ |  |

Sanchez \& Emer

## Reminder: Scoreboard Dynamics

| $\begin{aligned} & \text { Issue } \\ & \text { time } \end{aligned}$ | Functional Unit Status Int(1) Add(1) Mult(3) $\operatorname{Div}(4)$ WB |  |  |  |  |  |  |  |  | WP |  |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| t0 ${ }_{1}$ |  |  |  |  | f6 |  |  |  |  | f6 |  |
| $\mathrm{t} 1 \mathrm{I}_{2}$ | f2 |  |  |  |  | f6 |  |  |  | f6, f2 |  |
| t2 |  |  |  |  |  |  | f6 |  | f2 | f6, f2 | $\underline{I}$ |
| t3 $I_{3}$ |  |  | f0 |  |  |  |  | f6 |  | f6, f0 |  |
| t4 |  |  |  | f0 |  |  |  |  | f6 | f6, f0 | $\underline{\underline{I}}$ |
| $t 5{ }_{4}$ |  |  |  | f0 | 8 |  |  |  |  | f0, f8 |  |
| t6 |  |  |  |  |  | f8 |  |  | f0 | f0, f8 | $\underline{I}$ |
| t7 $I_{5}$ |  | f10 |  |  |  |  | 88 |  |  | f8, f10 |  |
| t8 |  |  |  |  |  |  |  | ${ }^{\text {f }}$ | f10 | f8, f10 | $\underline{I}$ |
| t9 |  |  |  |  |  |  |  |  | $f 8$ | f8 | $I_{4}$ |
| $\mathrm{t} 10 \mathrm{I}_{6}$ |  | f6 |  |  |  |  |  |  |  | f6 |  |
|  |  |  |  |  |  |  |  |  |  |  |  |
| $I_{1}$ $I_{2}$ | $\begin{aligned} & \text { DIVD } \\ & \text { LD } \end{aligned}$ |  | $\begin{aligned} & \text { f6, } \\ & \text { f2, } \end{aligned}$ |  |  | f6, 45(r3) |  |  | f4 | Issue checks: |  |
| $\mathrm{I}_{3}$ | MULTD |  |  |  |  | f2, |  |  | $f 4$ | WP[de |  |
| $I_{4}$ | DIVD |  |  |  |  |  |  |  | f2 | WP[src1] or W |  |
| $I_{5}$ | SUBD |  | $\begin{aligned} & \text { f8, } \\ & \text { f10, } \end{aligned}$ |  |  | f0, |  |  | f6 | Busy[ |  |

Sanchez \& Emer

## Reminder: Scoreboard Dynamics



Sanchez \& Emer

## In-Order Issue Limitations: an example



In-order: $1(2, \underline{1})$. . . . . . $234 \underline{3} 5$. . . $\underline{5} 6 \underline{6}$

## In-Order Issue Limitations: an example



## Out-of-Order Issue

How can we address the delay caused by a RAW dependence associated with the next in-order instruction?


## Out-of-Order Issue

How can we address the delay caused by a RAW dependence associated with the next in-order instruction?


Find something else to do!

## Out-of-Order Issue

How can we address the delay caused by a RAW dependence associated with the next in-order instruction?


Find something else to do!

- Issue stage buffer holds multiple instructions waiting to issue.


## Out-of-Order Issue

How can we address the delay caused by a RAW dependence associated with the next in-order instruction?


Find something else to do!

- Issue stage buffer holds multiple instructions waiting to issue.
- Decode adds next instruction to buffer if there is space and the instruction does not cause a WAR or WAW hazard.


## Out-of-Order Issue

How can we address the delay caused by a RAW dependence associated with the next in-order instruction?


Find something else to do!

- Issue stage buffer holds multiple instructions waiting to issue.
- Decode adds next instruction to buffer if there is space and the instruction does not cause a WAR or WAW hazard.
- Can issue any instruction in buffer whose RAW hazards are satisfied (for now at most one dispatch per cycle). A writeback (WB) may enable more instructions.


## In-Order Issue Limitations: an example

|  | F2, | $34(\mathrm{R} 2)$ | latency |  |  |
| :--- | :--- | :--- | :--- | :--- | :--- |
| 1 | LD | 1 |  |  |  |
| 2 | LD | F 4, | $45(\mathrm{R} 3)$ | long |  |
| 3 | MULTD | F 6, | F 4, | F 2 | 3 |
| 4 | SUBD | F, | F 2, | F 2 | 1 |
| 5 | DIVD | F 4, | F 2, | F 8 | 4 |
| 6 | ADDD | F 10, | F 6, | F 4 | 1 |

In-order: $1(2, \underline{1})$. . . . . . $234 \underline{3} 5$. . . $\underline{5} 6 \underline{6}$

## In-Order Issue Limitations: an example

|  | F2, | $34(\mathrm{R} 2)$ | latency |  |  |
| :--- | :--- | :--- | :--- | :--- | :--- |
| 1 | LD | 1 |  |  |  |
| 2 | LD | F 4, | $45(\mathrm{R} 3)$ | long |  |
| 3 | MULTD | F 6, | F 4, | F 2 | 3 |
| 4 | SUBD | F, | F 2, | F 2 | 1 |
| 5 | DIVD | F 4, | F 2, | F 8 | 4 |
| 6 | ADDD | F 10, | F 6, | F 4 | 1 |


| In-order: | $1(2, \underline{1}) . . . . . . \underline{2} 34 \underline{3} 5 . . . \underline{5} 6 \underline{6}$ |
| :--- | :--- |
| Out-of-order: | $1(2, \underline{1}) 4 \underline{4} . . . .23 . . \underline{3} 5 . . . \underline{5} 6 \underline{6}$ |

## In-Order Issue Limitations: an example



| In-order: | $1(2, \underline{1}) . . . . . . \underline{2} 34 \underline{3} 5 . . . \underline{5} 6 \underline{6}$ |
| :--- | :--- |
| Out-of-order: | $1(2, \underline{1}) 4 \underline{4} . . . .23 . . \underline{3} 5 . . . \underline{5} 6 \underline{6}$ |

Out-of-order execution did not allow any significant improvement!

## Instruction-level Parallelism via Renaming

|  |  |  | latency |  |  |
| :--- | :--- | :--- | :--- | :--- | :--- |
| 1 | LD | F2, | $34(\mathrm{R} 2)$ | 1 |  |
| 2 | LD | F 4, | $45(\mathrm{R} 3)$ | long |  |
| 3 | MULTD | F 6, | F 4, | F 2 | 3 |
| 4 | SUBD | F, | F, | F 2 | 1 |
| 5 | DIVD | F 4, | F 2, | F 8 | 4 |
| 6 | ADDD | F 10, | F 6, | F 4 | 1 |

In-order:

$$
1(2, \underline{1}) \text {. . . . . . } \underline{2} 34 \underline{4} \underline{3} 5 \text {. . . } \underline{5} 6 \underline{6}
$$

## Instruction-level Parallelism via Renaming

|  |  |  | latency |  |  |
| :--- | :--- | :--- | :--- | :--- | :--- |
| 1 | LD | F2, | $34(\mathrm{R} 2)$ | 1 |  |
| 2 | LD | F 4, | $45(\mathrm{R} 3)$ | long |  |
| 3 | MULTD | F, | F, | F, | 3 |
| 4 | SUBD | F, | F, | F, | 1 |
| 5 | DIVD | F 4, | F 2, | F 8 | 4 |
| 6 | ADDD | F 10, | F 6, | F 4 | 1 |

In-order: $1(2, \underline{1})$. . . . . . $234 \underline{3} 3$. . . $\underline{5} 6 \underline{6}$

Renaming eliminates WAR and WAW hazards

## Instruction-level Parallelism via Renaming



Renaming eliminates WAR and WAW hazards (renaming $\Rightarrow$ additional storage)

## Instruction-level Parallelism via Renaming



Renaming eliminates WAR and WAW hazards (renaming $\Rightarrow$ additional storage)

## Instruction-level Parallelism via Renaming



Renaming eliminates WAR and WAW hazards (renaming $\Rightarrow$ additional storage)

## Instruction-level Parallelism via Renaming



Renaming eliminates WAR and WAW hazards (renaming $\Rightarrow$ additional storage)

Which feature of an ISA limits the number of instructions in the pipeline?

Which feature of an ISA limits the number of instructions in the pipeline?

## Number of Registers

## How many Instructions can be in the pipeline

Which feature of an ISA limits the number of instructions in the pipeline?

## Number of Registers

Out-of-order dispatch by itself does not provide any significant performance improvement!

## Little's Law

Throughput $(T)=$ Number in Flight (N) / Latency (L)


Example:
4 floating point registers
8 cycles per floating point operation
$\Rightarrow$

## Little's Law

Throughput $(T)=$ Number in Flight $(N) /$ Latency $(L)$


Example:
4 floating point registers
8 cycles per floating point operation
$\Rightarrow \quad 1 / 2$ issues per cycle!

## Overcoming the Lack of Register Names

Floating Point pipelines often cannot be kept filled with small number of registers.

IBM 360 had only 4 Floating Point Registers
Can a microarchitecture use more registers than specified by the ISA without loss of ISA compatibility ?

## Overcoming the Lack of Register Names

Floating Point pipelines often cannot be kept filled with small number of registers.

IBM 360 had only 4 Floating Point Registers
Can a microarchitecture use more registers than specified by the ISA without loss of ISA compatibility ?

Yes, Robert Tomasulo of IBM suggested an ingenious solution in 1967 based on on-the-fly register renaming

## Register Renaming



- Decode does register renaming and adds instructions to the issue stage reorder buffer (ROB)
$\Rightarrow$ renaming makes WAR or WAW hazards impossible
- Any instruction in ROB whose RAW hazards have been satisfied can be dispatched.
$\Rightarrow$ Out-of-order or dataflow execution


## Dataflow execution



Instruction slot is candidate for execution when:
-It holds a valid instruction ("use" bit is set)
-It has not already started execution ("exec" bit is clear)
-Both operands are available (p1 and p2 are set)

## Renaming \& Out-of-order Issue

 An example

| 1 | LD | F2, | 34(R2) |  |
| :--- | :--- | :--- | :--- | :--- |
| 2 | LD | F4, | $45($ R3 $)$ |  |
| 3 | MULTD | F6, | F4, | F2 |
| 4 | SUBD | F8, | F2, | F2 |
| 5 | DIVD | F4, | F2, | F8 |
| 6 | ADDD | F10, | F6, | F4 |

- When are names in sources replaced by data?
-When can a name be reused?


## Renaming \& Out-of-order Issue

 An example

| 1 | LD | F2, | 34(R2) |  |
| :--- | :--- | :--- | :--- | :--- |
| 2 | LD | F4, | $45($ R3 $)$ |  |
| 3 | MULTD | F6, | F4, | F2 |
| 4 | SUBD | F8, | F2, | F2 |
| 5 | DIVD | F4, | F2, | F8 |
| 6 | ADDD | F10, | F6, | F4 |

- When are names in sources replaced by data?

Whenever an FU produces data
-When can a name be reused?

## Renaming \& Out-of-order Issue

 An exampleRenaming table
Reorder buffer

|  | data |
| :---: | :---: |
| F1 |  |
| F2 |  |
| F3 |  |
| F4 |  |
| F5 |  |
| F6 |  |
| F7 |  |
| F8 |  |
|  |  |


| 1 | LD | F2, | $34($ R2 $)$ |  |
| :--- | :--- | :--- | :--- | :--- |
| 2 | LD | F4, | $45($ R3 $)$ |  |
| 3 | MULTD | F6, | F4, | F2 |
| 4 | SUBD | F8, | F2, | F2 |
| 5 | DIVD | F4, | F2, | F8 |
| 6 | ADDD | F10, | F6, | F4 |

- When are names in sources replaced by data?

Whenever an FU produces data
-When can a name be reused?
Whenever an instruction completes

## Renaming \& Out-of-order Issue

 An example

| 1 | LD | F2, | 34(R2) |  |
| :--- | :--- | :--- | :--- | :--- |
| 2 | LD | F4, | $45($ R3 $)$ |  |
| 3 | MULTD | F6, | F4, | F2 |
| 4 | SUBD | F8, | F2, | F2 |
| 5 | DIVD | F4, | F2, | F8 |
| 6 | ADDD | F10, | F6, | F4 |

- When are names in sources replaced by data?

Whenever an FU produces data
-When can a name be reused?
Whenever an instruction completes

## Renaming \& Out-of-order Issue

 An exampleRenaming table
Reorder buffer

| F1 p data |  |
| :---: | :---: |
|  |  |
| F2 | t1 |
| F3 |  |
| F4 |  |
| F5 |  |
| F6 |  |
| F7 |  |
| F8 |  |
|  |  |


| 1 | LD | F2, | $34($ R2) |  |
| :--- | :--- | :--- | :--- | :--- |
| 2 | LD | F4, | $45($ R3 $)$ |  |
| 3 | MULTD | F6, | F4, | F2 |
| 4 | SUBD | F8, | F2, | F2 |
| 5 | DIVD | F4, | F2, | F8 |
| 6 | ADDD | F10, | F6, | F4 |

- When are names in sources replaced by data?

Whenever an FU produces data
-When can a name be reused?
Whenever an instruction completes

## Renaming \& Out-of-order Issue

 An exampleRenaming table
Reorder buffer

| F1 p data |  |
| :---: | :---: |
|  |  |
| F2 | t1 |
| F3 |  |
| F4 |  |
| F5 |  |
| F6 |  |
| F7 |  |
| F8 |  |
|  |  |


| 1 | LD | F2, | 34(R2) |  |
| :--- | :--- | :--- | :--- | :--- |
| 2 | LD | F4, | 45(R3) |  |
| 3 | MULTD | F6, | F4, | F2 |
| 4 | SUBD | F8, | F2, | F2 |
| 5 | DIVD | F4, | F2, | F8 |
| 6 | ADDD | F10, | F6, | F4 |

- When are names in sources replaced by data?

Whenever an FU produces data
-When can a name be reused?
Whenever an instruction completes

## Renaming \& Out-of-order Issue

 An exampleRenaming table
Reorder buffer

| F1 p data |  |
| :---: | :---: |
|  |  |
| F2 | t1 |
| F3 |  |
| F4 |  |
| F5 |  |
| F6 |  |
| F7 |  |
| F8 |  |
|  |  |


| 1 | LD | F2, | $34($ R2) |  |
| :--- | :--- | :--- | :--- | :--- |
| 2 | LD | F4, | $45($ R3 $)$ |  |
| 3 | MULTD | F6, | F4, | F2 |
| 4 | SUBD | F8, | F2, | F2 |
| 5 | DIVD | F4, | F2, | F8 |
| 6 | ADDD | F10, | F6, | F4 |

- When are names in sources replaced by data?

Whenever an FU produces data
-When can a name be reused?
Whenever an instruction completes

## Renaming \& Out-of-order Issue

 An exampleRenaming table
Reorder buffer

| F1 p data |  |
| :---: | :---: |
|  |  |
| F2 | v1 |
| F3 |  |
| F4 |  |
| F5 |  |
| F6 |  |
| F7 |  |
| F8 |  |
|  |  |


| 1 | LD | F2, | $34($ R2) |  |
| :--- | :--- | :--- | :--- | :--- |
| 2 | LD | F4, | $45($ R3 $)$ |  |
| 3 | MULTD | F6, | F4, | F2 |
| 4 | SUBD | F8, | F2, | F2 |
| 5 | DIVD | F4, | F2, | F8 |
| 6 | ADDD | F10, | F6, | F4 |

- When are names in sources replaced by data?

Whenever an FU produces data
-When can a name be reused?
Whenever an instruction completes

## Renaming \& Out-of-order Issue An example



| 1 | LD | F2, | 34(R2) |  |
| :--- | :--- | :--- | :--- | :--- |
| 2 | LD | F4, | $45($ R3 $)$ |  |
| 3 | MULTD | F6, | F4, | F2 |
| 4 | SUBD | F8, | F2, | F2 |
| 5 | DIVD | F4, | F2, | F8 |
| 6 | ADDD | F10, | F6, | F4 |

- When are names in sources replaced by data?

Whenever an FU produces data
-When can a name be reused?
Whenever an instruction completes

## Renaming \& Out-of-order Issue An example



| 1 | LD | F2, | 34(R2) |  |
| :--- | :--- | :--- | :--- | :--- |
| 2 | LD | F4, | $45($ R3 $)$ |  |
| 3 | MULTD | F6, | F4, | F2 |
| 4 | SUBD | F8, | F2, | F2 |
| 5 | DIVD | F4, | F2, | F8 |
| 6 | ADDD | F10, | F6, | F4 |

- When are names in sources replaced by data?

Whenever an FU produces data
-When can a name be reused?
Whenever an instruction completes

## Renaming \& Out-of-order Issue An example



| 1 | LD | F2, | $34($ R2 $)$ |  |
| :--- | :--- | :--- | :--- | :--- |
| 2 | LD | F4, | $45($ R3 $)$ |  |
| 3 | MULTD | F6, | F4, | F2 |
| 4 | SUBD | F8, | F2, | F2 |
| 5 | DIVD | F4, | F2, | F8 |
| 6 | ADDD | F10, | F6, | F4 |

- When are names in sources replaced by data?

Whenever an FU produces data
-When can a name be reused?
Whenever an instruction completes

## Renaming \& Out-of-order Issue An example



| 1 | LD | F2, | $34($ R2 $)$ |  |
| :--- | :--- | :--- | :--- | :--- |
| 2 | LD | F4, | $45($ R3 $)$ |  |
| 3 | MULTD | F6, | F4, | F2 |
| 4 | SUBD | F8, | F2, | F2 |
| 5 | DIVD | F4, | F2, | F8 |
| 6 | ADDD | F10, | F6, | F4 |

- When are names in sources replaced by data?

Whenever an FU produces data
-When can a name be reused?
Whenever an instruction completes

## Renaming \& Out-of-order Issue An example



| 1 | LD | F2, | $34($ R2 $)$ |  |
| :--- | :--- | :--- | :--- | :--- |
| 2 | LD | F4, | $45($ R3 $)$ |  |
| 3 | MULTD | F6, | F4, | F2 |
| 4 | SUBD | F8, | F2, | F2 |
| 5 | DIVD | F4, | F2, | F8 |
| 6 | ADDD | F10, | F6, | F4 |

- When are names in sources replaced by data?

Whenever an FU produces data
-When can a name be reused?
Whenever an instruction completes

## Renaming \& Out-of-order Issue An example



| 1 | LD | F2, | $34($ R2 $)$ |  |
| :--- | :--- | :--- | :--- | :--- |
| 2 | LD | F4, | $45($ R3 $)$ |  |
| 3 | MULTD | F6, | F4, | F2 |
| 4 | SUBD | F8, | F2, | F2 |
| 5 | DIVD | F4, | F2, | F8 |
| 6 | ADDD | F10, | F6, | F4 |

- When are names in sources replaced by data?

Whenever an FU produces data
-When can a name be reused?
Whenever an instruction completes

## Renaming \& Out-of-order Issue An example



| 1 | LD | F2, | $34($ R2 $)$ |  |
| :--- | :--- | :--- | :--- | :--- |
| 2 | LD | F4, | $45($ R3 $)$ |  |
| 3 | MULTD | F6, | F4, | F2 |
| 4 | SUBD | F8, | F2, | F2 |
| 5 | DIVD | F4, | F2, | F8 |
| 6 | ADDD | F10, | F6, | F4 |

- When are names in sources replaced by data?

Whenever an FU produces data
-When can a name be reused?
Whenever an instruction completes

## Renaming \& Out-of-order Issue An example



| 1 | LD | F2, | $34($ R2 $)$ |  |
| :--- | :--- | :--- | :--- | :--- |
| 2 | LD | F4, | $45($ R3 $)$ |  |
| 3 | MULTD | F6, | F4, | F2 |
| 4 | SUBD | F8, | F2, | F2 |
| 5 | DIVD | F4, | F2, | F8 |
| 6 | ADDD | F10, | F6, | F4 |

- When are names in sources replaced by data?

Whenever an FU produces data
-When can a name be reused?
Whenever an instruction completes

## Renaming \& Out-of-order Issue An example



| 1 | LD | F2, | $34($ R2 $)$ |  |
| :--- | :--- | :--- | :--- | :--- |
| 2 | LD | F4, | $45($ R3 $)$ |  |
| 3 | MULTD | F6, | F4, | F2 |
| 4 | SUBD | F8, | F2, | F2 |
| 5 | DIVD | F4, | F2, | F8 |
| 6 | ADDD | F10, | F6, | F4 |

- When are names in sources replaced by data?

Whenever an FU produces data
-When can a name be reused?
Whenever an instruction completes

## Renaming \& Out-of-order Issue An example

| Renaming table p data |  | Reorder buffer |  |  |  |  |  |  |  |  |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
|  |  | Ins\# use exec op p1 src1 p2 src2 |  |  |  |  |  |  |  |  |
| F1 |  |  | 0 |  |  |  |  |  |  |  |
| F2 | v1 | 2 | 1 | 1 | LD |  |  |  |  |  |
| F3 |  | 3 | 1 | 0 | MUL | 0 | t2 |  | 1 | v1 |
| F4 | t5 | 4 | 0 |  |  |  |  |  |  |  |
| F5 |  | 5 | 1 | 0 | DIV | 1 | v |  | 0 | t4 |
| F6 | t3 |  |  |  |  |  |  |  |  |  |
| F7 |  |  |  |  |  |  |  |  |  |  |
| F8 | v4 |  |  |  |  |  |  |  |  |  |
|  |  |  |  |  |  |  |  |  |  |  |
| dat |  |  |  |  |  |  |  |  |  |  |


| 1 | LD | F2, | $34($ R2 $)$ |  |
| :--- | :--- | :--- | :--- | :--- |
| 2 | LD | F4, | $45($ R3 $)$ |  |
| 3 | MULTD | F6, | F4, | F2 |
| 4 | SUBD | F8, | F2, | F2 |
| 5 | DIVD | F4, | F2, | F8 |
| 6 | ADDD | F10, | F6, | F4 |

- When are names in sources replaced by data?

Whenever an FU produces data
-When can a name be reused?
Whenever an instruction completes

## Renaming \& Out-of-order Issue An example

| Renaming table p data |  | Reorder buffer |  |  |  |  |  |  |  |  |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
|  |  | Ins\# use exec op p1 src1 p2 src2 |  |  |  |  |  |  |  |  |
| F1 |  |  | 0 |  |  |  |  |  |  |  |
| F2 | v1 | 2 | 1 | 1 | LD |  |  |  |  |  |
| F3 |  | 3 | 1 | 0 | MUL | 0 | t2 |  | 1 | v1 |
| F4 | t5 | 4 | 0 |  |  |  |  |  |  |  |
| F5 |  | 5 | 1 | 0 | DIV | 1 | v |  | 1 | v4 |
| F6 | t3 |  |  |  |  |  |  |  |  |  |
| F7 |  |  |  |  |  |  |  |  |  |  |
| F8 | v4 |  |  |  |  |  |  |  |  |  |
|  |  |  |  |  |  |  |  |  |  |  |
| dat |  |  |  |  |  |  |  |  |  |  |


| 1 | LD | F2, | $34($ R2 $)$ |  |
| :--- | :--- | :--- | :--- | :--- |
| 2 | LD | F4, | $45($ R3 $)$ |  |
| 3 | MULTD | F6, | F4, | F2 |
| 4 | SUBD | F8, | F2, | F2 |
| 5 | DIVD | F4, | F2, | F8 |
| 6 | ADDD | F10, | F6, | F4 |

- When are names in sources replaced by data?

Whenever an FU produces data
-When can a name be reused?
Whenever an instruction completes

## Renaming \& Out-of-order Issue An example

| Ren | O |  |  |  | ord | r | $f$ |  |  |  |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
|  | da | Ins | use |  | op | p1 | S |  | p2 | src2 |
| F1 |  |  | 0 |  |  |  |  |  |  |  |
| F2 | v1 | 2 | 0 |  |  |  |  |  |  |  |
| F3 |  | 3 | 1 | 0 | MUL | 0 | t2 |  | 1 | v1 |
| F4 | t5 | 4 | 0 |  |  |  |  |  |  |  |
| F5 |  | 5 | 1 | 0 | DIV | 1 | v |  | 1 | v4 |
| F6 | t3 |  |  |  |  |  |  |  |  |  |
| F7 |  |  |  |  |  |  |  |  |  |  |
| F8 | $v 4$ |  |  |  |  |  |  |  |  |  |
|  |  |  |  |  |  |  |  |  |  |  |
| data $\left(\mathrm{v}_{\mathrm{i}}\right) / \operatorname{tag}\left(\mathrm{t}_{\mathrm{i}}\right)$ |  |  |  |  |  |  |  |  |  |  |


| 1 | LD | F2, | $34($ R2 $)$ |  |
| :--- | :--- | :--- | :--- | :--- |
| 2 | LD | F4, | $45($ R3 $)$ |  |
| 3 | MULTD | F6, | F4, | F2 |
| 4 | SUBD | F8, | F2, | F2 |
| 5 | DIVD | F4, | F2, | F8 |
| 6 | ADDD | F10, | F6, | F4 |

- When are names in sources replaced by data?

Whenever an FU produces data
-When can a name be reused?
Whenever an instruction completes

## Renaming \& Out-of-order Issue An example



| 1 | LD | F2, | 34(R2) |  |
| :--- | :--- | :--- | :--- | :--- |
| 2 | LD | F4, | $45($ R3 $)$ |  |
| 3 | MULTD | F6, | F4, | F2 |
| 4 | SUBD | F8, | F2, | F2 |
| 5 | DIVD | F4, | F2, | F8 |
| 6 | ADDD | F10, | F6, | F4 |

- When are names in sources replaced by data?

Whenever an FU produces data
-When can a name be reused?
Whenever an instruction completes

## Data-Driven Execution

Renaming table \& reg file

Reorder buffer


- Instruction template (i.e., tag t ) is allocated by the Decode stage, which also stores the tag in the reg file
- When an instruction completes, its tag is deallocated


## Data-Driven Execution

Renaming
table \&
reg file

Reorder buffer

## Replacing the

 tag by its value is an expensive operation

| Ins\# | use | exec | op | p1 | src1 | p2 | src2 |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
|  |  |  |  |  |  |  |  |
|  |  |  |  |  |  |  |  |
|  |  |  |  |  |  |  |  |
|  |  |  |  |  |  |  |  |
| $t_{2}$ |  |  |  |  |  |  |  |
| $\cdot$ |  |  |  |  |  |  |  |
| $t_{2}$ |  |  |  |  |  |  |  |
| $t_{n}$ |  |  |  |  |  |  |  |

- Instruction template (i.e., tag t ) is allocated by the Decode stage, which also stores the tag in the reg file
- When an instruction completes, its tag is deallocated


## Simplifying Allocation/Deallocation



Instruction buffer is managed circularly
-"exec" bit is set when instruction begins execution
-When an instruction completes its "use" bit is marked free

- $\mathrm{ptr}_{2}$ is incremented only if the "use" bit is marked free


## IBM 360/91 Floating Point Unit

 R. M. Tomasulo, 1967

## Effectiveness?

Renaming and Out-of-order execution was first implemented in 1969 in IBM 360/91 but did not show up in the subsequent models until midnineties.
Why?

## Effectiveness?

Renaming and Out-of-order execution was first implemented in 1969 in IBM 360/91 but did not show up in the subsequent models until midnineties.
Why?

1. Effective on a very small class of programs

## Effectiveness?

Renaming and Out-of-order execution was first implemented in 1969 in IBM 360/91 but did not show up in the subsequent models until midnineties.
Why?

1. Effective on a very small class of programs
2. Did not address the memory latency problem which turned out be a much bigger issue than FU latency
3. Made exceptions imprecise

## Effectiveness?

Renaming and Out-of-order execution was first implemented in 1969 in IBM 360/91 but did not show up in the subsequent models until midnineties.
Why?

1. Effective on a very small class of programs
2. Did not address the memory latency problem which turned out be a much bigger issue than FU latency
3. Made exceptions imprecise

One more problem needed to be solved

## Effectiveness?

Renaming and Out-of-order execution was first implemented in 1969 in IBM 360/91 but did not show up in the subsequent models until midnineties.
Why?

1. Effective on a very small class of programs
2. Did not address the memory latency problem which turned out be a much bigger issue than FU latency
3. Made exceptions imprecise

One more problem needed to be solved
Control transfers

## Effectiveness?

Renaming and Out-of-order execution was first implemented in 1969 in IBM 360/91 but did not show up in the subsequent models until midnineties.
Why?

1. Effective on a very small class of programs
2. Did not address the memory latency problem which turned out be a much bigger issue than FU latency
3. Made exceptions imprecise

One more problem needed to be solved

## Control transfers

More on this in the next lecture

## Precise Exceptions

Exceptions are relatively unlikely events that need special processing, but where adding explicit control flow instructions is not desired, e.g., divide by 0, page fault

Exceptions can be viewed as an implicit conditional subroutine call that is inserted between two instructions.

Therefore, it must appear as if the exception is taken between two instructions (say $\mathrm{I}_{\mathrm{i}}$ and $\mathrm{I}_{\mathrm{i}+1}$ )

- the effect of all instructions up to and including $\mathrm{I}_{\mathrm{i}}$ is complete
- no effect of any instruction after $I_{i}$ has taken place

The handler either aborts the program or restarts it at $\mathrm{I}_{\mathrm{i}+1}$.

## Effect on Exceptions

Out-of-order Completion

| $I_{1}$ | DIVD | f6, | f6, | f4 |
| :--- | :--- | :--- | :--- | :--- |
| $I_{2}$ | LD | f2, | $45(\mathrm{r} 3)$ |  |
| $I_{3}$ | MULTD | f0, | f2, | f4 |
| $I_{4}$ | DIVD | f8, | f6, | f2 |
| $I_{5}$ | SUBD | f10, | f0, | f6 |
| $I_{6}$ | ADDD | f6, | f8, | f2 |

out-of-order comp $1 \begin{array}{llllllllllll} & 2 & \underline{2} & 3 & \underline{1} & 4 & \underline{3} & 5 & \underline{5} & \underline{4} & 6 & \underline{6}\end{array}$

## Effect on Exceptions

Out-of-order Completion

| $I_{1}$ | DIVD | f6, | f6, | f4 |
| :--- | :--- | :--- | :--- | :--- |
| $I_{2}$ | LD | f2, | $45(r 3)$ |  |
| $I_{3}$ | MULTD | f0, | f2, | $\mathrm{f4}$ |
| $I_{4}$ | DIVD | f8, | f6, | f 2 |
| $I_{5}$ | SUBD | f10, | f0, | $\mathrm{f6}$ |
| $I_{6}$ | ADDD | f6, | f8, | f 2 |

out-of-order comp $1 \begin{array}{llllllllllll} & 2 & \underline{2} & 3 & \underline{1} & 4 & \underline{3} & 5 & \underline{5} & \underline{4} & 6 & \underline{6}\end{array}$

Consider exceptions

## Effect on Exceptions

Out-of-order Completion

| $I_{1}$ | DIVD | f6, | f6, | f4 |
| :--- | :--- | :--- | :--- | :--- |
| $I_{2}$ | LD | f2, | $45(r 3)$ |  |
| $I_{3}$ | MULTD | f0, | f2, | $\mathrm{f4}$ |
| $I_{4}$ | DIVD | f8, | f6, | f 2 |
| $I_{5}$ | SUBD | f10, | f0, | $\mathrm{f6}$ |
| $I_{6}$ | ADDD | f6, | f8, | f 2 |

out-of-order comp $1 \begin{array}{llllllllllll} & 2 & \underline{2} & 3 & 1 & 4 & \underline{3} & 5 & \underline{5} & \underline{4} & 6 & \underline{6}\end{array}$
Consider exceptions

## Effect on Exceptions

Out-of-order Completion

| $I_{1}$ | DIVD | f6, | f6, | $\mathrm{f4}$ |
| :--- | :--- | :--- | :--- | :--- |
| $I_{2}$ | LD | f2, | $45(\mathrm{r} 3)$ |  |
| $I_{3}$ | MULTD | f0, | f2, | $\mathrm{f4}$ |
| $I_{4}$ | DIVD | f8, | f6, | f 2 |
| $I_{5}$ | SUBD | f10, | f0, | $\mathrm{f6}$ |
| $I_{6}$ | ADDD | f6, | f8, | f 2 |

$\begin{array}{llllllllllllll}\text { out-of-order comp } & 1 & 2 & \underline{2} & 3 & 1 & 1 & 4 & \underline{3} & 5 & \underline{5} & \underline{4} & 6 & \underline{6} \\ \text { Consider exceptions } & & & & & & \end{array}$

## Effect on Exceptions

Out-of-order Completion

| $I_{1}$ | DIVD | f6, | f6, | f4 |
| :---: | :---: | :---: | :---: | :---: |
| $I_{2}$ | LD | f2, | 45(r3) |  |
| $I_{3}$ | MULTD | f0, | f2, | f4 |
| $I_{4}$ | DIVD | f8, | f6, | f2 |
| $I_{5}$ | SUBD | f10, | f0, | f6 |
| $I_{6}$ | ADDD | f6, | f8, | f2 |



## Effect on Exceptions

Out-of-order Completion

| $I_{1}$ | DIVD | f6, | f6, | f4 |
| :--- | :--- | :--- | :--- | :--- |
| $I_{2}$ | LD | f2, | $45(r 3)$ |  |
| $I_{3}$ | MULTD | f0, | f2, | $\mathrm{f4}$ |
| $I_{4}$ | DIVD | f8, | f6, | f 2 |
| $I_{5}$ | SUBD | f10, | f0, | $\mathrm{f6}$ |
| $I_{6}$ | ADDD | f6, | f8, | f 2 |



## Effect on Exceptions

## Out-of-order Completion

| $I_{1}$ | DIVD | f6, | f6, | f4 |
| :--- | :--- | :--- | :--- | :--- |
| $I_{2}$ | LD | f2, | $45(\mathrm{r3})$ |  |
| $I_{3}$ | MULTD | f0, | f2, | f4 |
| $I_{4}$ | DIVD | f8, | f6, | f2 |
| $I_{5}$ | SUBD | f10, | fo, | f6 |
| $I_{6}$ | ADDD | f6, | f8, | f2 |



Precise exceptions are difficult to implement at high speed

- want to start execution of later instructions before exception checks finished on earlier instructions


## Exceptions

## Exceptions

- Exceptions create a dependence on the value of the next PC


## Exceptions

- Exceptions create a dependence on the value of the next PC
- Options for handling this dependence:


## Exceptions

- Exceptions create a dependence on the value of the next PC
- Options for handling this dependence:
- Stall


## Exceptions

- Exceptions create a dependence on the value of the next PC
- Options for handling this dependence:
- Stall

No

## Exceptions

- Exceptions create a dependence on the value of the next PC
- Options for handling this dependence:
- Stall

No

- Bypass


## Exceptions

- Exceptions create a dependence on the value of the next PC
- Options for handling this dependence:
- Stall

No

- Bypass

No

## Exceptions

- Exceptions create a dependence on the value of the next PC
- Options for handling this dependence:
- Stall
- Bypass
- Find something else to do

No
No

## Exceptions

- Exceptions create a dependence on the value of the next PC
- Options for handling this dependence:
- Stall
- Bypass
- Find something else to do

No
No

## Exceptions

- Exceptions create a dependence on the value of the next PC
- Options for handling this dependence:
- Stall
- Bypass
- Find something else to do
- Change the architecture

No
No
No

## Exceptions

- Exceptions create a dependence on the value of the next PC
- Options for handling this dependence:
- Stall
- Bypass
- Find something else to do
- Change the architecture

No
No
No
Sometimes: Alpha, Multiflow

## Exceptions

- Exceptions create a dependence on the value of the next PC
- Options for handling this dependence:
- Stall
- Bypass
- Find something else to do
- Change the architecture
- Speculate!

No
No
No
Sometimes: Alpha, Multiflow

## Exceptions

- Exceptions create a dependence on the value of the next PC
- Options for handling this dependence:
- Stall
- Bypass
- Find something else to do
- Change the architecture
- Speculate!

No
No
No
Sometimes: Alpha, Multiflow
Most common approach!

## Exceptions

- Exceptions create a dependence on the value of the next PC
- Options for handling this dependence:
- Stall
- Bypass
- Find something else to do
- Change the architecture
- Speculate!

No
No
No
Sometimes: Alpha, Multiflow
Most common approach!

- How can we handle rollback on mis-speculation


## Exceptions

- Exceptions create a dependence on the value of the next PC
- Options for handling this dependence:
- Stall
- Bypass
- Find something else to do
- Change the architecture
- Speculate!

No
No
No
Sometimes: Alpha, Multiflow Most common approach!

- How can we handle rollback on mis-speculation

Delay state update until commit on speculated instructions

## Exceptions

- Exceptions create a dependence on the value of the next PC
- Options for handling this dependence:
- Stall
- Bypass
- Find something else to do
- Change the architecture
- Speculate!

No
No
No
Sometimes: Alpha, Multiflow Most common approach!

- How can we handle rollback on mis-speculation

Delay state update until commit on speculated instructions

- Note: earlier exceptions must override later ones


## Phases of Instruction Execution



## Exception Handling

 (In-Order Five-Stage Pipeline)

## Exception Handling (In-Order Five-Stage Pipeline)



## Exception Handling (In-Order Five-Stage Pipeline)



## Exception Handling (In-Order Five-Stage Pipeline)



## Exception Handling (In-Order Five-Stage Pipeline)



# Exception Handling (In-Order Five-Stage Pipeline) 



# Exception Handling (In-Order Five-Stage Pipeline) 



Hold exception flags in pipeline until commit point (M stage)

## Exception Handling (In-Order Five-Stage Pipeline)



Hold exception flags in pipeline until commit point (M stage)

# Exception Handling (In-Order Five-Stage Pipeline) 



Hold exception flags in pipeline until commit point (M stage)
-If exception at commit:

- update Cause/EPC registers
- kill all stages
- fetch at handler PC


# Exception Handling (In-Order Five-Stage Pipeline) 



Hold exception flags in pipeline until commit point (M stage)
-If exception at commit:

- update Cause/EPC registers
- kill all stages
- fetch at handler PC


## Exception Handling (In-Order Five-Stage Pipeline)



Hold exception flags in pipeline until commit point (M stage)
-If exception at commit:

- update Cause/EPC registers
- kill all stages
- fetch at handler PC


## Exception Handling (In-Order Five-Stage Pipeline)



Hold exception flags in pipeline until commit point (M stage) -If exception at commit:

- update Cause/EPC registers
- kill all stages
- fetch at handler PC


# Exception Handling (In-Order Five-Stage Pipeline) 



- update Cause/EPC registers
- kill all stages
- fetch at handler PC

Inject external interrupts at commit point

# Exception Handling (In-Order Five-Stage Pipeline) 



- update Cause/EPC registers
- kill all stages
- fetch at handler PC

Inject external interrupts at commit point

## In-Order Commit for Precise Exceptions



- Instructions fetched and decoded into instruction reorder buffer in-order
- Execution is out-of-order ( $\Rightarrow$ out-of-order completion)
- Commit (write-back to architectural state, i.e., regfile \& memory, is in-order

Temporary storage needed to hold results before commit (shadow registers and store buffers)

## Extensions for Precise Exceptions

|  | Inst\# | use | exec | op | p1 | src1 | p2 | src2 |  | dest | data | cause |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
|  |  |  |  |  |  |  |  |  |  |  |  |  |
|  |  |  |  |  |  |  |  |  |  |  |  |  |
| $\mathrm{ptr}_{2}$ next to commit |  |  |  |  |  |  |  |  |  |  |  |  |
|  |  |  |  |  |  |  |  |  |  |  |  |  |
|  |  |  |  |  |  |  |  |  |  |  |  |  |
|  |  |  |  |  |  |  |  |  |  |  |  |  |
| next |  |  |  |  | - |  |  |  |  |  |  |  |
| available |  |  |  |  | $\underline{ }$ |  | $\cdots$ |  | , |  |  |  |

- add <pd, dest, data, cause> fields in the instruction template
- commit instructions to reg file and memory in program order $\Rightarrow$ buffers can be maintained circularly
- on exception, clear reorder buffer by resetting $\operatorname{ptr}_{1}=\mathrm{ptr}_{2}$ (stores must wait for commit before updating memory)


## Rollback and Renaming

Register File (now holds only committed state)


Reorder buffer


Register file does not contain renaming tags any more. How does the decode stage find the tag of a source register?

## Rollback and Renaming

Register File (now holds only committed state)


Reorder buffer


Register file does not contain renaming tags any more.
How does the decode stage find the tag of a source register? Search the "dest" field in the reorder buffer

## Renaming Table



Renaming table is a cache to speed up register name lookup.
It needs to be cleared after each exception taken.
When else are valid bits cleared?

## Renaming Table



Renaming table is a cache to speed up register name lookup.
It needs to be cleared after each exception taken.
When else are valid bits cleared?
Control transfers

## Physical Register Files

- Reorder buffers are space inefficient - a data value may be stored in multiple places in the reorder buffer
- idea - keep all data values in a physical register file
- Tag represents the name of the data value and name of the physical register that holds it
- Reorder buffer contains only tags

Thus, 64 data values may be replaced by 8-bit tags for a 256 element physical register file

More on this in later lectures ...

## Branch Penalty

How many instructions need to be killed on a misprediction?

Modern processors may have > 10 pipeline stages between nextPC calculation and branch resolution!

Next fetch started


## Branch Penalty

How many instructions need to be killed on a misprediction?

Modern processors may have > 10 pipeline stages between nextPC calculation and branch resolution!


