# Complex Pipelining: Out-of-Order Execution, Register Renaming, and Exceptions

Daniel Sanchez
Computer Science and Artificial Intelligence Laboratory
M.I.T.

## CDC 6600-style Scoreboard

Instructions are issued in order.

An instruction is issued only if

- It cannot cause a RAW hazard
- It cannot cause a WAW hazard
  - ⇒There can be at most instruction in the execute phase that can write to a particular register

#### WAR hazards are not possible

 Due to in-order issue + operands read immediately



Scoreboard: Two bit-vectors Busy[FU#]: Indicates FU's availability
These bits are hardwired to FU's.

WP[reg#]: Records if a write is pending for a register

Set to true by the Issue stage and set to false by the WB stage

## Reminder: Scoreboard Dynamics

|                                    |    |                   |    |    |    |     |            |                 |             |     | -     |                     |                               |
|------------------------------------|----|-------------------|----|----|----|-----|------------|-----------------|-------------|-----|-------|---------------------|-------------------------------|
| Issue time                         |    | ional U<br>Add(1) |    |    |    |     |            | <sub>/</sub> (4 | ) 1         | WB  | Regis | sters Re<br>for Wri | eserved WE time               |
| t0 <i>I</i> <sub>1</sub>           |    |                   |    |    |    | f6  |            |                 |             |     | f6    |                     |                               |
| $t1 I_2$                           | f2 |                   |    |    |    |     | f6         |                 |             |     | f6,   | f2                  | _                             |
| t2                                 |    |                   |    |    |    |     |            | f6              |             | f2  | f6,   |                     | <u>I</u> 2                    |
| $t\overline{3}$ $I_3$              |    |                   | fO |    |    |     |            |                 | f6          |     | f6,   | f0                  |                               |
| t4                                 |    |                   |    | f0 |    |     |            |                 |             | f6  | f6,   | f0                  | $\underline{I}_1$             |
| t 5 I <sub>4</sub>                 |    |                   |    |    | f0 | f8  |            |                 |             |     | fO,   | f8                  |                               |
| t <del>6</del>                     |    |                   |    |    |    |     | f8         |                 |             | f0  | f0,   | f8                  | <u>I</u> 3                    |
| $t\overline{7}$ $I_5$              |    | f10               |    |    |    |     |            | f8              |             |     | f8,   | f10                 |                               |
| t8                                 |    |                   |    |    |    |     |            |                 | f8          | f10 | f8,   | f10                 | <u>I</u> 5                    |
| t9                                 |    |                   |    |    |    |     |            |                 |             | f8  | f8    |                     | $\underline{I}_{\mathcal{A}}$ |
| t10 <i>I</i> <sub>6</sub>          |    | f6                |    |    |    |     |            |                 |             |     | f6    |                     |                               |
| t1 <u>1</u>                        |    |                   |    |    |    |     |            |                 |             | f6  | f6    |                     | <u>I</u> 6                    |
| $I_1$ FDIV.D $I_2$ FLD $I_3$ FMUL. |    |                   |    |    |    |     |            | )<br>-1         | f6,<br>45(x |     |       |                     |                               |
| $I_3$ FMUL.D                       |    |                   | •  |    |    | •   | f2,<br>f6, | f4<br>f2        |             |     |       |                     |                               |
| $I_5$ FSUB.                        |    |                   |    |    |    |     |            | f0,             |             |     |       |                     |                               |
| $I_6$ FADD.D f6,                   |    |                   |    |    |    | f8, | f2         |                 |             |     |       |                     |                               |

September 27, 2023

MIT 6.5900 Fall 2023

## In-Order Issue Limitations

An example

| 1 | FLD    | f2,  | 34(x1 | 12) | latency<br>1 | 1 2   |
|---|--------|------|-------|-----|--------------|-------|
| 2 | FLD    | f4,  | 45(x1 | 13) | long         |       |
| 3 | FMUL.D | f6,  | f4,   | f2  | 3            | 4 / 3 |
| 4 | FSUB.D | f8,  | f2,   | f2  | 1            |       |
| 5 | FDIV.D | f4,  | f2,   | f8  | 4            | 5     |
| 6 | FADD.D | f10, | f6,   | f4  | 1            | 6     |
|   |        |      |       |     |              |       |

1 (2,1) . . . . . .  $\underline{2}$  3 4  $\underline{4}$   $\underline{3}$  5 . . .  $\underline{5}$  6  $\underline{6}$  In-order restriction prevents instruction 4 In-order:

from being dispatched

## Out-of-Order Issue

How can we address the delay caused by a RAW dependence associated with the next in-order instruction?



- Issue stage buffer holds <u>multiple</u> instructions waiting to issue.
- Decode adds next instruction to buffer if there is space and the instruction does not cause a WAR or WAW hazard.
- Can issue any instruction in buffer whose RAW hazards are satisfied (for now at most one dispatch per cycle).
   Note: A writeback (WB) may enable more instructions.

## **In-Order Issue Limitations**

#### An example

| 1 | FLD    | f2,  | 34(x1 | 12) | latency<br>1 | 1 2   |
|---|--------|------|-------|-----|--------------|-------|
| 2 | FLD    | f4,  | 45(x1 | 13) | long         |       |
| 3 | FMUL.D | f6,  | f4,   | f2  | 3            | 4 / 3 |
| 4 | FSUB.D | f8,  | f2,   | f2  | 1            |       |
| 5 | FDIV.D | f4,  | f2,   | f8  | 4            | 5     |
| 6 | FADD.D | f10, | f6,   | f4  | 1            | 6     |
|   |        |      |       |     |              |       |

In-order: 1(2,1)......234435....566

Out-of-order: 1(2,1) 4 4 . . . . . 2 3 5 . 3 . 5 6 6

WAR/WAW hazards prevent instruction 5 from being dispatched

Out-of-order execution did not produce a significant improvement!

# How many Instructions can be in the pipeline

Throughput is limited by number of instructions in flight, but which feature of an ISA limits the number of instructions in the pipeline?

Out-of-order dispatch by itself does not provide a significant performance improvement!

How can we better understand the impact of number of registers on throughput?

## Little's Law

Throughput  $\overline{T}$  = Number in Flight  $\overline{(N)}$  / Latency  $\overline{(L)}$ 



#### Example:

4 floating point registers

8 cycles per floating point operation

 $\Rightarrow$ 

# Overcoming the Lack of Register Names

Floating Point pipelines often cannot be kept filled with small number of registers.

IBM 360 had only 4 Floating Point Registers

Can a microarchitecture use more registers than specified by the ISA without loss of ISA compatibility?

Yes, Robert Tomasulo of IBM suggested an ingenious solution in 1967 based on on-the-fly register renaming

## Instruction-level Parallelism via Renaming

| 1 | FLD    | f2,  | 34(x1 | 12) | latency<br>1 | 1 2 |
|---|--------|------|-------|-----|--------------|-----|
| 2 | FLD    | f4,  | 45(x1 | 13) | long         |     |
| 3 | FMUL.D | f6,  | f4,   | f2  | 3            | 4 3 |
| 4 | FSUB.D | f8,  | f2,   | f2  | 1            | X   |
| 5 | FDIV.D | f4′, | f2,   | f8  | 4            | 5   |
| 6 | FADD.D | f10, | f6,   | f4′ | 1            | 6   |
|   |        |      |       |     |              |     |

In-order:  $1 (2,\underline{1}) . . . . . . \underline{2} 34\underline{4} \underline{3}5 . . . \underline{5}6\underline{6}$ Out-of-order:  $1 (2,\underline{1}) 4\underline{4}5 . . . . (\underline{2,5}) 3 . . . \underline{3}6\underline{6}$ 

Renaming eliminates WAR and WAW hazards (renaming  $\Rightarrow$  additional storage)

## Handling register dependencies



- Decode does register renaming, providing a new spot for each register write
  - Renaming eliminates WAR and WAW hazards by allowing use of more storage space
- Renamed instructions added to an issue stage structure, called the reorder buffer (ROB). Any instruction in the ROB whose RAW hazards have been satisfied can be dispatched
  - Out-of-order or dataflow execution handles RAW hazards

#### Reorder Buffer Smith and Pleszkun, 1985



#### Instruction slot is candidate for execution when:

- It holds a valid instruction ("use" bit is set)
- It has not already started execution ("exec" bit is clear)
- Both operands are available ("present" bits p1 and p2 are set)

Is it obvious where an architectural register value is?

## Renaming & Out-of-order Issue

Renaming table & reg file

Reorder buffer

 $t_1$   $t_2$ 

*t*<sub>3</sub>

*t*<sub>4</sub>

*t*<sub>5</sub>





- When are names (tags) in sources replaced by data?
- When can a name (tag) be reused?

## Renaming & Out-of-order Issue

An example

Renaming table & reg file

Reorder buffer



- Insert instruction in ROB
- Issue instruction from ROB
- Complete instruction
- Empty ROB entry

| 1 | FLD    | f2,  | 34(x1 | L2) |
|---|--------|------|-------|-----|
| 2 | FLD    | f4,  | 45(x1 | •   |
| 3 | FMUL.D | f6,  | f4,   | f2  |
| 4 | FSUB.D | f8,  | f2,   | f2  |
| 5 | FDIV.D | f4,  | f2,   | f8  |
| 6 | FADD.D | f10, | f6,   | f4  |

 $t_1$ 

 $t_2$ 

*t*<sub>3</sub>

 $t_4$ 

*t*<sub>5</sub>

## Simplifying Allocation/Deallocation



#### Reorder buffer

#### Instruction buffer is managed circularly

- Set "exec" bit when instruction begins execution
- When an instruction completes its "use" bit is marked free
- Increment ptr<sub>2</sub> only if the "use" bit is marked free

L07-15 September 27, 2023 MIT 6.5900 Fall 2023

#### **Data-Driven Execution**



- Instruction template (i.e., tag t) is allocated by the Decode stage, which also stores the tag in the reg file
- When an instruction completes, its tag is deallocated

## IBM 360/91 Floating Point Unit

R. M. Tomasulo, 1967



#### Effectiveness?

Renaming and Out-of-order execution was first implemented in 1969 in IBM 360/91 but was effective only on a very small class of problems and thus did not show up in the subsequent models until mid-nineties. *Why?* 

One more problem needed to be solved...

## Reminder: Precise Exceptions

Exceptions are relatively unlikely events that need special processing, but where adding explicit control flow instructions is not desired, e.g., divide by 0, page fault

Exceptions can be viewed as an implicit conditional subroutine call that is inserted between two instructions.

Therefore, it must appear as if the exception is taken between two instructions (say  $I_i$  and  $I_{i+1}$ )

- the effect of all instructions up to and including I<sub>i</sub> is complete
- no effect of any instruction after I<sub>i</sub> has taken place

The handler either aborts the program or restarts it at  $I_{i+1}$ .

## Effect on Exceptions Out-of-order Completion

| $I_1$ | FDIV.D | f6,  | f6,  | f4  |
|-------|--------|------|------|-----|
| $I_2$ | FLD    | f2,  | 45(x | 13) |
| $I_3$ | FMUL.D | fO,  | f2,  | f4  |
| $I_4$ | FDIV.D | f8,  | f6,  | f2  |
| $I_5$ | FSUB.D | f10, | f0,  | f6  |
| $I_6$ | FADD.D | f6,  | f8,  | f2  |



Precise exceptions are difficult to implement at high speed - want to start execution of later instructions before exception checks finished on earlier instructions

MIT 6.5900 Fall 2023 L07-20 September 27, 2023

## Exceptions

- Exceptions create a dependence on the value of the next PC
- Options for handling this dependence:
  - Stall
  - Bypass
  - Find something else to do
  - Change the architecture
  - Speculate!
- How can we handle rollback on mis-speculation?

Note: earlier exceptions must override later ones





Commit

Hold exception flags in pipeline until commit point (M stage)

- •If exception at commit:
  - update Cause/EPC registers
  - kill all stages
  - fetch at handler PC

Inject external interrupts at commit point

## Phases of Instruction Execution



## In-Order Commit for Precise Exceptions



- Instructions fetched and decoded into instruction reorder buffer in-order
- Execution is out-of-order ( ⇒ out-of-order completion)
- Commit (write-back to architectural state, i.e., regfile & memory) is in-order

Temporary storage needed to hold results before commit (shadow registers and store buffers)

## **Extensions for Precise Exceptions**



#### Reorder buffer

- add <pd, dest, data, cause> fields in the instruction template
- commit instructions to reg file and memory in program order ⇒ buffers can be maintained circularly
- on exception, clear reorder buffer by resetting ptr<sub>1</sub>=ptr<sub>2</sub>
   (stores must wait for commit before updating memory)

## Rollback and Renaming



Register file does not contain renaming tags any more. How does the decode stage find the tag of a source register?

## Renaming Table



Renaming table is a cache to speed up register name lookup. It needs to be cleared after each exception taken. When else are valid bits cleared?

## Physical Register Files

- Reorder buffers are space inefficient a data value may be stored in multiple places in the reorder buffer
- Idea: Keep all data values in a physical register file
  - Tag represents the name of the data value and name of the physical register that holds it
  - Reorder buffer contains only tags

Thus, 64-bit data values may be replaced by 8-bit tags for a 256-element physical register file

More on this in later lectures ...

## Branch Penalty

