Directory-Based Cache Coherence

Mengjia Yan
Computer Science and Artificial Intelligence Lab
M.I.T.
Maintaining Cache Coherence

It is sufficient to have hardware such that

- Only one processor at a time has write permission for a location
- No processor can load a stale copy of the location after a write
Maintaining Cache Coherence

It is sufficient to have hardware such that

• Only one processor at a time has write permission for a location
• No processor can load a stale copy of the location after a write

⇒ A correct approach could be:
Maintaining Cache Coherence

It is sufficient to have hardware such that

- Only one processor at a time has write permission for a location
- No processor can load a stale copy of the location after a write

⇒ A correct approach could be:

**write request:**

The address is *invalidated* in all other caches *before* the write is performed
Maintaining Cache Coherence

It is sufficient to have hardware such that

- Only one processor at a time has write permission for a location
- No processor can load a stale copy of the location after a write

⇒ A correct approach could be:

write request:
  The address is *invalidated* in all other caches
  *before* the write is performed

read request:
  If a dirty copy is found in some cache, a write-back is performed before the memory is read
Directory-Based Coherence
[Censier and Feautrier, 1978]

Snoopy Protocols

- Snoopy schemes broadcast requests over memory bus
- Difficult to scale to large numbers of processors
- Requires additional bandwidth to cache tags for snoop requests
Directory-Based Coherence
[Censier and Feautrier, 1978]

- Snoopy schemes broadcast requests over memory bus
- Difficult to scale to large numbers of processors
- Requires additional bandwidth to cache tags for snoop requests
Directory-Based Coherence
[Censier and Feautrier, 1978]

Snoopy Protocols

- Snoopy schemes broadcast requests over memory bus
- Difficult to scale to large numbers of processors
- Requires additional bandwidth to cache tags for snoop requests

Directory Protocols

- Directory schemes send messages to only those caches that might have the line
- Can scale to large numbers of processors
- Requires extra directory storage to track possible sharers
An MSI Directory Protocol

- Cache states: Modified (M) / Shared (S) / Invalid (I)
An MSI Directory Protocol

- Cache states: Modified (M) / Shared (S) / Invalid (I)
- Directory states:
  - Uncached (Un): No sharers
  - Shared (Sh): One or more sharers with read permission (S)
  - Exclusive (Ex): A single sharer with read & write permissions (M)
An MSI Directory Protocol

- Cache states: Modified (M) / Shared (S) / Invalid (I)
- Directory states:
  - Uncached (Un): No sharers
  - Shared (Sh): One or more sharers with read permission (S)
  - Exclusive (Ex): A single sharer with read & write permissions (M)
- Transient states not drawn for clarity; for now, assume no racing requests
Transitions initiated by processor accesses:

- M
- S
- I

<table>
<thead>
<tr>
<th>Actions</th>
</tr>
</thead>
<tbody>
<tr>
<td>Processor Read (PrRd)</td>
</tr>
<tr>
<td>Processor Write (PrWr)</td>
</tr>
<tr>
<td>Shared Request (ShReq)</td>
</tr>
<tr>
<td>Exclusive Request (ExReq)</td>
</tr>
</tbody>
</table>
Transitions initiated by processor accesses:

- Processor Read (PrRd)
- Processor Write (PrWr)
- Shared Request (ShReq)
- Exclusive Request (ExReq)
Transitions initiated by processor accesses:

- **M**
  - PrWr / ExReq
- **I**
  - PrRd / ShReq

**Actions**

<table>
<thead>
<tr>
<th>Actions</th>
</tr>
</thead>
<tbody>
<tr>
<td>Processor Read (PrRd)</td>
</tr>
<tr>
<td>Processor Write (PrWr)</td>
</tr>
<tr>
<td>Shared Request (ShReq)</td>
</tr>
<tr>
<td>Exclusive Request (ExReq)</td>
</tr>
</tbody>
</table>
MSI Protocol: Caches (1/3)

Transitions initiated by processor accesses:

- PrWr / ExReq
- PrRd / --
- PrRd / ShReq

<table>
<thead>
<tr>
<th>Actions</th>
</tr>
</thead>
<tbody>
<tr>
<td>Processor Read (PrRd)</td>
</tr>
<tr>
<td>Processor Write (PrWr)</td>
</tr>
<tr>
<td>Shared Request (ShReq)</td>
</tr>
<tr>
<td>Exclusive Request (ExReq)</td>
</tr>
</tbody>
</table>
MSI Protocol: Caches (1/3)

Transitions initiated by processor accesses:

- PrWr / ExReq
- PrRd / --
- PrRd / ShReq

### Actions

<table>
<thead>
<tr>
<th>Actions</th>
</tr>
</thead>
<tbody>
<tr>
<td>Processor Read (PrRd)</td>
</tr>
<tr>
<td>Processor Write (PrWr)</td>
</tr>
<tr>
<td>Shared Request (ShReq)</td>
</tr>
<tr>
<td>Exclusive Request (ExReq)</td>
</tr>
</tbody>
</table>
Transitions initiated by processor accesses:

- **M**
  - PrRd / --
  - PrWr / --

- **S**
  - PrWr / ExReq
  - PrRd / --

- **I**
  - PrRd / ShReq

**Actions**

- Processor Read (PrRd)
- Processor Write (PrWr)
- Shared Request (ShReq)
- Exclusive Request (ExReq)
Transitions initiated by directory requests:

- \( \text{M} \)
- \( \text{S} \)
- \( \text{I} \)

<table>
<thead>
<tr>
<th>Actions</th>
</tr>
</thead>
<tbody>
<tr>
<td>Invalidation Request (InvReq)</td>
</tr>
<tr>
<td>Downgrade Request (DownReq)</td>
</tr>
<tr>
<td>Invalidation Response (InvResp)</td>
</tr>
<tr>
<td>Downgrade Response (DownResp)</td>
</tr>
</tbody>
</table>
MSI Protocol: Caches (2/3)

Transitions initiated by directory requests:

- InvReq / InvResp (with data)

Actions:

- Invalidation Request (InvReq)
- Downgrade Request (DownReq)
- Invalidation Response (InvResp)
- Downgrade Response (DownResp)
Transitions initiated by directory requests:

- Invalidation Request (InvReq)
- Downgrade Request (DownReq)
- Invalidation Response (InvResp)
- Downgrade Response (DownResp)
Transitions initiated by directory requests:

- DownReq / DownResp (with data)
- InvReq / InvResp (with data)
- InvReq / InvResp (without data)

<table>
<thead>
<tr>
<th>Actions</th>
</tr>
</thead>
<tbody>
<tr>
<td>Invalidation Request (InvReq)</td>
</tr>
<tr>
<td>Downgrade Request (DownReq)</td>
</tr>
<tr>
<td>Invalidation Response (InvResp)</td>
</tr>
<tr>
<td>Downgrade Response (DownResp)</td>
</tr>
</tbody>
</table>
Transitions initiated by evictions:

- **M**
- **S**
- **I**

<table>
<thead>
<tr>
<th>Actions</th>
</tr>
</thead>
<tbody>
<tr>
<td>Writeback Request (WbReq)</td>
</tr>
</tbody>
</table>

MIT 6.5900 Fall 2023
Transitions initiated by evictions:

- M -> Eviction / WbReq (with data) -> S
- M -> I

<table>
<thead>
<tr>
<th>Actions</th>
</tr>
</thead>
<tbody>
<tr>
<td>Writeback Request (WbReq)</td>
</tr>
</tbody>
</table>
Transitions initiated by evictions:

- Eviction / WbReq (with data)
- Eviction / WbReq (without data)

Actions
- Writeback Request (WbReq)
MSI Protocol: Caches

- Transitions initiated by processor accesses
- Transitions initiated by directory requests
- Transitions initiated by evictions
MSI Protocol: Directory (1/2)

Transitions initiated by data requests:

- Ex
- Sh
- Un
MSI Protocol: Directory (1/2)

Transitions initiated by data requests:

- Ex

- Sh
  - ShReq / Sharers = Sharers + \{P\}; ShResp
  - ShReq / Sharers = \{P\}; ShResp

- Un
Transitions initiated by data requests:

ExReq / Sharers = \{P\}; ExResp

ShReq / Sharers = Sharers + \{P\}; ShResp

ShReq / Sharers = \{P\}; ShResp
Transitions initiated by data requests:

ExReq / Sharers = \{P\}; ExResp

Ex

ExReq / Inv(Sharers \setminus \{P\}); Sharers = \{P\}; ExResp

Sh

ShReq / Sharers = Sharers + \{P\}; ShResp

Un

ShReq / Sharers = \{P\}; ShResp
Transitions initiated by data requests:

ExReq / Sharers = \{P\}; ExResp

Ex

ShReq / Down(Sharer); Sharers = Sharer + \{P\}; ShResp

ExReq / Inv(Sharers - \{P\}); Sharers = \{P\}; ExResp

Sh

ShReq / Sharers = Sharers + \{P\}; ShResp

ShReq / Sharers = \{P\}; ShResp

Un
Transitions initiated by data requests:

ExReq / Sharers = \{P\}; ExResp

Ex

ExReq / Inv(Sharers), Sharers=\{P\}; ExResp

Sh

ShReq / Down(Sharer); Sharers = Sharer + \{P\}; ShResp

ShReq / Sharers = Sharers + \{P\}; ShResp

Un

ShReq / Sharers = \{P\}; ShResp

ExReq / Inv(Sharers - \{P\}); Sharers = \{P\}; ExResp
Transitions initiated by writeback requests:

- Ex
- Sh
- Un
MSI Protocol: Directory (2/2)

Transitions initiated by writeback requests:

WbReq / Sharers = {}; WbResp
Transitions initiated by writeback requests:

WbReq / Sharers = {}; WbResp

WbReq && |Sharers| > 1 / Sharers = Sharers - {P}; WbResp
MSI Protocol: Directory (2/2)

Transitions initiated by writeback requests:

\[
\begin{align*}
\text{Ex} & \quad \text{WbReq} / \text{Sharers} = \{\}; \text{WbResp} \\
\text{Sh} & \quad \text{WbReq} \land |\text{Sharers}| > 1 / \\
& \quad \text{Sharers} = \text{Sharers} - \{P\}; \text{WbResp} \\
& \quad \text{WbReq} \land |\text{Sharers}| = 1 / \\
& \quad \text{Sharers} = \{\}; \text{WbResp} \\
\text{Un} & \quad \text{WbReq} / \text{Sharers} = \{\}; \text{WbResp}
\end{align*}
\]
MSI Directory Protocol Example

Main Memory

Directory

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Sharers</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

Cache 0

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Data</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

Core 0

Cache 1

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Data</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

Core 1

Cache 2

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Data</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

Core 2
MSI Directory Protocol Example

Core 0

Main Memory

Directory

Tag | State | Sharers
--- | --- | ---

Cache 0

Tag | State | Data
--- | --- | ---

Cache 1

Tag | State | Data
--- | --- | ---

Cache 2

Tag | State | Data
--- | --- | ---

Core 1

Core 2

1. LD 0xA
MSI Directory Protocol Example

Cache 0

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Data</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>I-&gt;S</td>
<td></td>
</tr>
</tbody>
</table>

Core 0

1 LD 0xA

Cache 1

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Data</th>
</tr>
</thead>
</table>

Core 1

Cache 2

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Data</th>
</tr>
</thead>
</table>

Core 2
MSI Directory Protocol Example

1. LD 0xA

2. ShReq 0xA
MSI Directory Protocol Example

1. LD 0xA

2. ShReq 0xA
MSI Directory Protocol Example

1. LD 0xA

2. ShReq 0xA

3. Mem[0xA] = 3
MSI Directory Protocol Example

1. LD 0xA
2. ShReq 0xA
3. Mem[0xA] = 3
4. ShResp 0xA, data=3
MSI Directory Protocol Example

1. LD 0xA

2. ShReq 0xA

3. Mem[0xA] = 3

4. ShResp 0xA, data=3
MSI Directory Protocol Example

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Sharers</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>Sh</td>
<td>{0}</td>
</tr>
</tbody>
</table>

Cache 0

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Data</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>S</td>
<td>3</td>
</tr>
</tbody>
</table>

Cache 1

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Data</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

Cache 2

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Data</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>
MSI Directory Protocol Example

Main Memory

Directory

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Sharers</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>Sh</td>
<td>{0}</td>
</tr>
</tbody>
</table>

Cache 0

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Data</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>S</td>
<td>3</td>
</tr>
</tbody>
</table>

Core 0

Cache 1

Core 1

Cache 2

Core 2

1 LD 0xA
**MSI Directory Protocol Example**

### Main Memory

### Directory

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Sharers</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>Sh</td>
<td>{0}</td>
</tr>
</tbody>
</table>

### Cache 0

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Data</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>S</td>
<td>3</td>
</tr>
</tbody>
</table>

### Cache 1

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Data</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>I-&gt;S</td>
<td></td>
</tr>
</tbody>
</table>

### Cache 2

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Data</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>I-&gt;S</td>
<td></td>
</tr>
</tbody>
</table>

1. LD 0xA
MSI Directory Protocol Example

1. LD 0xA
2. ShReq 0xA
MSI Directory Protocol Example

- **Core 0**: Cache 0 - Tag: 0xA, State: S, Data: 3
- **Core 1**: Cache 1
- **Core 2**: Cache 2 - Tag: 0xA, State: I->S

<table>
<thead>
<tr>
<th>Directory</th>
<th>Tag</th>
<th>State</th>
<th>Sharers</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>0xA</td>
<td>Sh</td>
<td>{0,2}</td>
</tr>
</tbody>
</table>

1. **LD 0xA**
2. **ShReq 0xA**
MSI Directory Protocol Example

1. LD 0xA
2. ShReq 0xA
3. Mem[0xA] = 3
MSI Directory Protocol Example

1. LD 0xA
2. ShReq 0xA
3. Mem[0xA] = 3
4. ShResp 0xA, data=3
MSI Directory Protocol Example

1. LD 0xA
2. ShReq 0xA
3. Mem[0xA] = 3
4. ShResp 0xA, data=3
MSI Directory Protocol Example

Core 0

Cache 0

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Data</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>S</td>
<td>3</td>
</tr>
</tbody>
</table>

Core 1

Cache 1

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Data</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>I-&gt;M</td>
<td></td>
</tr>
</tbody>
</table>

Core 2

Cache 2

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Data</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>S</td>
<td>3</td>
</tr>
</tbody>
</table>

AT 0xA

1
MSI Directory Protocol Example

1. ST 0xA

2. ExReq 0xA

- **Main Memory**
  - **Directory**
    - **Tag**: 0xA
    - **State**: Sh
    - **Sharers**: \{0,2\}

- **Cache 0**
  - **Tag**: 0xA
  - **State**: S
  - **Data**: 3

- **Cache 1**
  - **Tag**: 0xA
  - **State**: I->M
  - **Data**: 

- **Cache 2**
  - **Tag**: 0xA
  - **State**: S
  - **Data**: 3
MSI Directory Protocol Example

Core 0

Directory

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Sharers</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>Sh</td>
<td>{0,2}</td>
</tr>
</tbody>
</table>

Cache 0

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Data</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>S</td>
<td>3</td>
</tr>
</tbody>
</table>

Cache 1

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Data</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>I-&gt;M</td>
<td></td>
</tr>
</tbody>
</table>

Cache 2

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Data</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>S</td>
<td>3</td>
</tr>
</tbody>
</table>

Core 1

Core 2

ST 0xA

ExReq 0xA

InvReq 0xA

InvReq 0xA
MSI Directory Protocol Example

Core 0

Main Memory

Directory

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Sharers</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>Sh</td>
<td>{0,2}</td>
</tr>
</tbody>
</table>

Core 1

Cache 0

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Data</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>I</td>
<td>3</td>
</tr>
</tbody>
</table>

Cache 1

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Data</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>I-&gt;M</td>
<td></td>
</tr>
</tbody>
</table>

Cache 2

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Data</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>I</td>
<td>3</td>
</tr>
</tbody>
</table>

Core 2

ExReq 0xA

InvReq 0xA

InvReq 0xA

ST 0xA
MSI Directory Protocol Example

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Sharers</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>Sh</td>
<td>{0,2}</td>
</tr>
</tbody>
</table>

1. **ST 0xA**
2. **ExReq 0xA**
3. **InvReq 0xA**
4. **InvResp 0xA**

Core 0

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Data</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>I</td>
<td>3</td>
</tr>
</tbody>
</table>

Core 1

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Data</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>I-&gt;M</td>
<td></td>
</tr>
</tbody>
</table>

Core 2

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Data</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>I</td>
<td>3</td>
</tr>
</tbody>
</table>
MSI Directory Protocol Example

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Sharers</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>Ex</td>
<td>{1}</td>
</tr>
</tbody>
</table>

1. ST 0xA
2. ExReq 0xA
3. InvReq 0xA
4. InvResp 0xA
MSI Directory Protocol Example

1. **ST 0xA**

2. **ExReq 0xA**

3. **InvReq 0xA**

4. **InvResp 0xA**

5. **Mem[0xA] = 3**

- **Directory**
  - Tag | State | Sharers
  - 0xA | Ex    | {1}

- **Main Memory**

- **Cache 0**
  - Tag | State | Data
  - 0xA | I     | 3

- **Cache 1**
  - Tag | State | Data
  - 0xA | I->M  |

- **Cache 2**
  - Tag | State | Data
  - 0xA | I     | 3
MSI Directory Protocol Example

Core 0

Main Memory

Directory

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Sharers</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>Ex</td>
<td>{1}</td>
</tr>
</tbody>
</table>

Core 1

Cache 0

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Data</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>I</td>
<td>3</td>
</tr>
</tbody>
</table>

Cache 1

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Data</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>I-M</td>
<td></td>
</tr>
</tbody>
</table>

Cache 2

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Data</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>I</td>
<td>3</td>
</tr>
</tbody>
</table>

Core 2

Mem[0xA] = 3

ExReq 0xA

InvReq 0xA

ExResp 0xA

data = 3

InvResp 0xA
MSI Directory Protocol Example

Core 0

Main Memory

Directory

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Sharers</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>Ex</td>
<td>{1}</td>
</tr>
</tbody>
</table>

Core 1

Cache 0

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Data</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>I</td>
<td>3</td>
</tr>
</tbody>
</table>

Cache 1

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Data</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>M</td>
<td>5</td>
</tr>
</tbody>
</table>

Core 2

Cache 2

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Data</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>I</td>
<td>3</td>
</tr>
</tbody>
</table>

Core 0

1. ST 0xA

Core 1

2. ExReq 0xA

3. InvReq 0xA

Core 2

4. InvResp 0xA

5. Mem[0xA] = 3

6. ExResp 0xA

data = 3
Extra Hops and 3-Hop Protocols
Reducing Protocol Latency

- Problem: Data in another cache needs to pass through the directory, adding latency
- Optimization: Forward data to requester directly

![Diagram showing cache and directory states](image-url)
Extra Hops and 3-Hop Protocols
Reducing Protocol Latency

- Problem: Data in another cache needs to pass through the directory, adding latency
- Optimization: Forward data to requester directly

<table>
<thead>
<tr>
<th>Main Memory</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
</tr>
<tr>
<td>Directory</td>
</tr>
<tr>
<td>Tag</td>
</tr>
<tr>
<td>------</td>
</tr>
<tr>
<td>0xA</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>Cache 0</th>
<th>Cache 1</th>
<th>Cache 2</th>
</tr>
</thead>
<tbody>
<tr>
<td>Tag</td>
<td>State</td>
<td>Data</td>
</tr>
<tr>
<td>0xA</td>
<td>M</td>
<td>3</td>
</tr>
</tbody>
</table>

Core 0  Core 1  Core 2

1 ST 0xA
Extra Hops and 3-Hop Protocols
Reducing Protocol Latency

- Problem: Data in another cache needs to pass through the directory, adding latency
- Optimization: Forward data to requester directly

**Diagram**

- **Main Memory**
- **Directory**
  - Tag | State | Sharers
  - 0xA | Ex    | {0}

- **Cache 0**
  - Tag | State | Data
  - 0xA | M     | 3

- **Cache 1**
  - Tag | State | Data
  - 0xA | I->M  |  

- **Cache 2**
  - Tag | State | Data
  - 0xA |       |  

- **Core 0**
- **Core 1**
- **Core 2**
- **ST 0xA**
Extra Hops and 3-Hop Protocols
Reducing Protocol Latency

• Problem: Data in another cache needs to pass through the directory, adding latency
• Optimization: Forward data to requester directly

![Diagram showing cache and directory states]

1. ST 0xA
2. ExReq 0xA

<table>
<thead>
<tr>
<th>Main Memory</th>
</tr>
</thead>
<tbody>
<tr>
<td>Directory</td>
</tr>
<tr>
<td>Tag</td>
</tr>
<tr>
<td>0xA</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>Cache 0</th>
</tr>
</thead>
<tbody>
<tr>
<td>Tag</td>
</tr>
<tr>
<td>0xA</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>Cache 1</th>
</tr>
</thead>
<tbody>
<tr>
<td>Tag</td>
</tr>
<tr>
<td>0xA</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>Cache 2</th>
</tr>
</thead>
<tbody>
<tr>
<td>Tag</td>
</tr>
<tr>
<td></td>
</tr>
</tbody>
</table>
Extra Hops and 3-Hop Protocols
Reducing Protocol Latency

- Problem: Data in another cache needs to pass through the directory, adding latency
- Optimization: Forward data to requester directly

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Sharers</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>Ex-&gt;Ex</td>
<td>{1}</td>
</tr>
</tbody>
</table>

Main Memory

```
<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Data</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>M</td>
<td>3</td>
</tr>
</tbody>
</table>
```

Directory

```
<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Sharers</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>Ex</td>
<td>{0}</td>
</tr>
</tbody>
</table>
```

Core 0

```
<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Data</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>I-&gt;M</td>
<td></td>
</tr>
</tbody>
</table>
```

Core 1

```
<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Data</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>I-&gt;M</td>
<td></td>
</tr>
</tbody>
</table>
```

Core 2

```
<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Data</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>I-&gt;M</td>
<td></td>
</tr>
</tbody>
</table>
```

1. ST 0xA

2. ExReq 0xA
Extra Hops and 3-Hop Protocols
Reducing Protocol Latency

- Problem: Data in another cache needs to pass through the directory, adding latency
- Optimization: Forward data to requester directly

1. **ST 0xA**
2. **ExReq 0xA**
3. **ExFwd 0xA, req=1**

### Diagram

#### Main Memory

#### Directory

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Sharers</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>Ex-&gt;Ex</td>
<td>{1}</td>
</tr>
</tbody>
</table>

#### Cache 0

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Data</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>M</td>
<td>3</td>
</tr>
</tbody>
</table>

#### Cache 1

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Data</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>I-&gt;M</td>
<td></td>
</tr>
</tbody>
</table>

#### Cache 2
Extra Hops and 3-Hop Protocols
Reducing Protocol Latency

- Problem: Data in another cache needs to pass through the directory, adding latency
- Optimization: Forward data to requester directly

1. ST 0xA
2. ExReq 0xA
3. ExFwd 0xA, req=1

Diagram:
- Core 0: Cache 0 (Tag 0xA, State I, Data 3)
- Core 1: Cache 1 (Tag 0xA, State I->M, Data)
- Core 2: Cache 2

Directory:
<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Sharers</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>Ex-&gt;Ex</td>
<td>{1}</td>
</tr>
</tbody>
</table>
Extra Hops and 3-Hop Protocols
Reducing Protocol Latency

- Problem: Data in another cache needs to pass through the directory, adding latency
- Optimization: Forward data to requester directly

### Diagram

- **Main Memory**
- **Directory**
  - | Tag | State | Sharers |
  - | 0xA | Ex->Ex | {1} |

- **Cache 0**
  - Tag: 0xA, State: I, Data: 3

- **Cache 1**
  - Tag: 0xA, State: I->M, Data: 3

- **Cache 2**
  - No data

- **Core 0**
  - Tag: 0xA

- **Core 1**
  - Tag: 0xA, Data: 3

- **Core 2**
  - No data

1. **ST 0xA, data=3**
2. **ExReq 0xA**
3. **ExFwd 0xA, req=1**
4. **ExResp 0xA, data=3**
Extra Hops and 3-Hop Protocols
Reducing Protocol Latency

- Problem: Data in another cache needs to pass through the directory, adding latency
- Optimization: Forward data to requester directly

ExFwd 0xA, req=1
ExReq 0xA
ExResp 0xA, data=3
ST 0xA
Extra Hops and 3-Hop Protocols
Reducing Protocol Latency

• Problem: Data in another cache needs to pass through the directory, adding latency

• Optimization: Forward data to requester directly

3. ExFwd 0xA, req=1
2. ExReq 0xA
4. ExAck 0xA

3. ExResp 0xA, data=3
1. ST 0xA
Extra Hops and 3-Hop Protocols
Reducing Protocol Latency

- Problem: Data in another cache needs to pass through the directory, adding latency
- Optimization: Forward data to requester directly

```
<table>
<thead>
<tr>
<th></th>
<th>Tag</th>
<th>State</th>
<th>Sharers</th>
</tr>
</thead>
<tbody>
<tr>
<td>Directory</td>
<td>0xA</td>
<td>Ex</td>
<td>{1}</td>
</tr>
</tbody>
</table>
```

1. **ST 0xA**
   - **Core 0**
   - **Core 1**
   - **Core 2**

2. **ExReq 0xA**
   - **Core 1**
   - **Core 2**

3. **ExFwd 0xA, req=1**
   - **Core 0**
   - **Core 1**

4. **ExAck 0xA**
   - **Core 1**
   - **Core 2**

5. **ExResp 0xA, data=3**
   - **Core 0**
   - **Core 1**
   - **Core 2**
# MSI Directory Protocol Example

## Main Memory

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Sharers</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>Ex</td>
<td>{1}</td>
</tr>
</tbody>
</table>

## Directory

### Cache 0

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Data</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>I</td>
<td>3</td>
</tr>
</tbody>
</table>

### Cache 1

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Data</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>M-&gt;I</td>
<td>5</td>
</tr>
</tbody>
</table>

### Cache 2

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Data</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>I</td>
<td>3</td>
</tr>
</tbody>
</table>

### Core 0

1. ST 0xB

### Core 1

### Core 2
MSI Directory Protocol Example

2. WbReq 0xA, data=5

1. ST 0xDB

Core 0

Cache 0

- Tag: 0xA
- State: I
- Data: 3

Core 1

Cache 1

- Tag: 0xA
- State: M->I
- Data: 5

Core 2

Cache 2

- Tag: 0xA
- State: I
- Data: 3
MSI Directory Protocol Example

Core 0

Cache 0

Core 1

Cache 1

Core 2

Cache 2

1. ST 0xB

2. WbReq 0xA, data=5

3. Mem[0xA] = 5
MSI Directory Protocol Example

1. ST 0xB

2. WbReq 0xA, data=5

3. Mem[0xA] = 5
MSI Directory Protocol Example

![Diagram showing the MSI Directory Protocol example.

1. ST 0xB
2. WbReq 0xA, data=5
3. Mem[0xA] = 5
4. WbResp 0xA]
MSI Directory Protocol Example

1. ST 0xB
2. WbReq 0xA, data=5
3. Mem[0xA] = 5
4. WbResp 0xA

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Sharers</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>Un</td>
<td>{}</td>
</tr>
</tbody>
</table>
MSI Directory Protocol Example

1. ST 0xB
2. WbReq 0xA, data=5
3. Mem[0xA] = 5
4. WbResp 0xA
5. ExReq 0xB
MSI Directory Protocol Example

1. **ST 0xB**
   - Core 2 places a new entry in itsDirectory with state Ex and sharers {1}.

2. **WbReq 0xA, data=5**
   - Core 1 sends a WbReq to Core 0 with data 5.

3. **Mem[0xA] = 5**
   - Core 0 updates its directory entry for tag 0xA with state I and data 3.

4. **WbResp 0xA**
   - Core 0 sends a WbResp to Core 1.

5. **ExReq 0xB**
   - Core 2 sends an ExReq to Core 0.
MSI Directory Protocol Example

1. ST 0xB
2. WbReq 0xA, data=5
3. Mem[0xA] = 5
4. WbResp 0xA
5. ExReq 0xB
6. Mem[0xB] = 10
MSI Directory Protocol Example

1. ST 0xB
2. WbReq 0xA, data=5
3. Mem[0xA] = 5
4. WbResp 0xA
5. ExReq 0xB
6. Mem[0xB] = 10
7. ExResp 0xB, data=10

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Sharers</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xB</td>
<td>Ex</td>
<td>{1}</td>
</tr>
</tbody>
</table>

Cache 0
- Tag: 0xA
- State: I
- Data: 3

Cache 1
- Tag: 0xB
- State: I->M

Cache 2
- Tag: 0xA
- State: I
- Data: 3

Main Memory
- Mem[0xA] = 5
- Mem[0xB] = 10
MSI Directory Protocol Example

1. ST 0xB
2. WbReq 0xA, data=5
3. Mem[0xA] = 5
4. WbResp 0xA
5. ExReq 0xB
6. Mem[0xB] = 10
7. ExResp 0xB, data=10

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Sharers</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xB</td>
<td>Ex</td>
<td>{1}</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>Cache 0</th>
<th>Tag</th>
<th>State</th>
<th>Data</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>I</td>
<td>3</td>
<td></td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>Cache 1</th>
<th>Tag</th>
<th>State</th>
<th>Data</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>I</td>
<td>3</td>
<td></td>
</tr>
<tr>
<td>0xB</td>
<td>M</td>
<td>10</td>
<td></td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>Cache 2</th>
<th>Tag</th>
<th>State</th>
<th>Data</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>I</td>
<td>3</td>
<td></td>
</tr>
</tbody>
</table>
MSI Directory Protocol Example

Why are 0xA’s wb and 0xB’s req serialized?
MSI Directory Protocol Example

Why are 0xA’s wb and 0xB’s req serialized?

Structural dependence

Why are 0xA’s wb and 0xB’s req serialized?
MSI Directory Protocol Example

Why are 0xA’s wb and 0xB’s req serialized? Possible solutions?

Structural dependence
MSI Directory Protocol Example

**Core 0**
- Cache 0
  - Tag: 0xA
  - State: I
  - Data: 3

**Core 1**
- Cache 1
  - Tag: 0xB
  - State: M
  - Data: 10

**Core 2**
- Cache 2
  - Tag: 0xA
  - State: I
  - Data: 3

Main Memory
- Mem[0xA] = 5
- Mem[0xB] = 10

Diagram:
- **WbReq 0xA, data=5**
- **WbResp 0xA**
- **ST 0xB**
- **ExReq 0xB**
- **ExResp 0xB, data=10**

**Why are 0xA’s wb and 0xB’s req serialized?**
Structural dependence

**Possible solutions?**
Buffer outside of cache to hold write data
Miss Status Holding Register

MSHR – Holds load misses and writes outside of cache

MSHR entry

<table>
<thead>
<tr>
<th>V</th>
<th>X</th>
<th>Addr</th>
<th>Data</th>
</tr>
</thead>
</table>

• On eviction/writeback
  – No free MSHR entry: stall
  – Allocate new MSHR entry
  – When channel available send WBReq and data
  – Deallocate entry on WBResp
Miss Status Holding Register

MSHR – Holds load misses and writes outside of cache

<table>
<thead>
<tr>
<th>MSHR entry</th>
<th>per ld/st slots</th>
</tr>
</thead>
<tbody>
<tr>
<td>V</td>
<td>X</td>
</tr>
</tbody>
</table>

- On cache load miss
  - No free MSHR entry: stall
  - Allocate new MSHR entry
  - Send ShReq (or ExReq)
  - On *Resp forward data to CPU and cache
  - Deallocate MSHR
Miss Status Holding Register

MSHR – Holds load misses and writes outside of cache

<table>
<thead>
<tr>
<th>V</th>
<th>X</th>
<th>Addr</th>
<th>Data</th>
<th>L/S</th>
<th>Inum</th>
<th>Block Offset</th>
</tr>
</thead>
</table>

L13-18
Miss Status Holding Register

MSHR – Holds load misses and writes outside of cache

<table>
<thead>
<tr>
<th>MSHR entry</th>
<th>per ld/st slots</th>
</tr>
</thead>
<tbody>
<tr>
<td>V</td>
<td>X</td>
</tr>
<tr>
<td>V</td>
<td>L/S</td>
</tr>
<tr>
<td>V</td>
<td>L/S</td>
</tr>
</tbody>
</table>

Per ld/st slots allow servicing multiple requests with one entry
Miss Status Holding Register

MSHR – Holds load misses and writes outside of cache

- **MSHR entry**
  - V X Addr Data

- **per ld/st slots**
  - V L/S Inum Block Offset
  - V L/S Inum Block Offset

- **On cache load miss**
  - Look for matching address in MSHRs
    - If not found
      - If no free MSHR entry: stall
      - Allocate new MSHR entry and fill in
    - If found, just fill in per ld/st slot
  - Send ShReq (or ExReq)
  - On *Resp forward data to CPU and cache
  - Deallocate MSHR

Per ld/st slots allow servicing multiple requests with one entry
Directory Organization

• Requirement: Directory needs to keep track of all the cores that are sharing a cache block

• Challenge: For each block, the space needed to hold the list of sharers grows with number of possible sharers...
Flat, Memory-based Directories

- Dedicate a few bits of main memory to store the state and sharers of every line
- Encode sharers using a bit-vector

Main Memory

64 bytes 10 bits

<table>
<thead>
<tr>
<th>State</th>
<th>Sharer Set</th>
</tr>
</thead>
<tbody>
<tr>
<td>Sh</td>
<td>01001100</td>
</tr>
</tbody>
</table>
Flat, Memory-based Directories

- Dedicate a few bits of main memory to store the state and sharers of every line
- Encode sharers using a bit-vector

Main Memory

64 bytes 10 bits

State

Sharer Set

✓ Simple
✗ Slow
✗ Very inefficient with many processors (∼P bits/line)
Sparse Full-Map Directories

- Not every line in the system needs to be tracked – only those in private caches!
- Idea: Organize directory as a cache

<table>
<thead>
<tr>
<th>Way 1</th>
<th>Way 2</th>
<th>Way 3</th>
<th>Way 4</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

Directory Entry Format

- Line Address: 0xF00
- State: Sh
- Sharer Set: 01001100

October 23, 2023
Sparse Full-Map Directories

- Not every line in the system needs to be tracked – only those in private caches!
- Idea: Organize directory as a cache

✓ Low latency, energy-efficient
× Bit-vectors grow with # cores → Area scales poorly
× Limited associativity → Directory-induced invalidations
Directory-Induced Invalidations

- To retain **inclusion**, must invalidate all sharers of an entry before reusing it for another address
- Example: 2-way set-associative sparse directory

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Sharers</th>
<th>Tag</th>
<th>State</th>
<th>Sharers</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>Sh</td>
<td>{0}</td>
<td>0xF</td>
<td>Ex</td>
<td>{1}</td>
</tr>
</tbody>
</table>

Main Memory

Cache 0

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Data</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>S</td>
<td>3</td>
</tr>
</tbody>
</table>

Cache 1

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Data</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xF</td>
<td>M</td>
<td>1</td>
</tr>
</tbody>
</table>

Cache 2

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Data</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>
Directory-Induced Invalidations

- To retain **inclusion**, must invalidate all sharers of an entry before reusing it for another address
- Example: 2-way set-associative sparse directory
## Directory-Induced Invalidations

- To retain **inclusion**, must invalidate all sharers of an entry before reusing it for another address.
- Example: 2-way set-associative sparse directory

### Main Memory

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Sharers</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>Sh</td>
<td>{0}</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Sharers</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xF</td>
<td>Ex</td>
<td>{1}</td>
</tr>
</tbody>
</table>
Directory-Induced Invalidations

- To retain **inclusion**, must invalidate all sharers of an entry before reusing it for another address
- Example: 2-way set-associative sparse directory

<table>
<thead>
<tr>
<th>Main Memory</th>
<th>Directory</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Tag</strong></td>
<td><strong>State</strong></td>
</tr>
<tr>
<td>0xA</td>
<td>Sh</td>
</tr>
</tbody>
</table>

Core 0

- **Cache 0**
  - Tag: 0xA
  - State: S
  - Data: 3

Core 1

- **Cache 1**
  - Tag: 0xF
  - State: M
  - Data: 1

- **ShReq 0xB**

Core 2

- **Cache 2**
  - Tag: 0xB
  - State: I->S

- **LD 0xB**
To retain **inclusion**, must invalidate all sharers of an entry before reusing it for another address.

Example: 2-way set-associative sparse directory
Directory-Induced Invalidations

- To retain **inclusion**, must invalidate all sharers of an entry before reusing it for another address.
- Example: 2-way set-associative sparse directory.

### Diagram

**Main Memory**

**Directory**

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Sharers</th>
<th>Tag</th>
<th>State</th>
<th>Sharers</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>Sh</td>
<td>{0}</td>
<td>0xF</td>
<td>Ex</td>
<td>{1}</td>
</tr>
</tbody>
</table>

**Core 0**

- **Cache 0**
  - Tag: 0xA
  - State: I
  - Data: 3

**Core 1**

- **Cache 1**
  - Tag: 0xF
  - State: M
  - Data: 1

**Core 2**

- **Cache 2**
  - Tag: 0xB
  - State: I->S

**Instructions**

1. **LD 0xB**
2. **ShReq 0xB**
3. **InvReq 0xA**
Directory-Induced Invalidations

- To retain **inclusion**, must invalidate all sharers of an entry before reusing it for another address
- Example: 2-way set-associative sparse directory

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Sharers</th>
<th>Tag</th>
<th>State</th>
<th>Sharers</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>Sh</td>
<td>{0}</td>
<td>0xF</td>
<td>Ex</td>
<td>{1}</td>
</tr>
</tbody>
</table>

1. **LD 0xB**
2. **ShReq 0xB**
3. **InvReq 0xA**
4. **InvResp 0xA**
Directory-Induced Invalidations

- To retain **inclusion**, must invalidate all sharers of an entry before reusing it for another address.
- Example: 2-way set-associative sparse directory.

### Main Memory

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Sharers</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xB</td>
<td>Sh</td>
<td>{2}</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Sharers</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xF</td>
<td>Ex</td>
<td>{1}</td>
</tr>
</tbody>
</table>

### Core 0

- **Cache 0**
  - Tag: 0xA
  - State: I
  - Data: 3

### Core 1

- **Cache 1**
  - Tag: 0xF
  - State: M
  - Data: 1

### Core 2

- **Cache 2**
  - Tag: 0xB
  - State: I->S

---

1. **LD 0xB**
2. **ShReq 0xB**
3. **InvReq 0xA**
4. **InvResp 0xA**
Directory-Induced Invalidations

- To retain **inclusion**, must invalidate all sharers of an entry before reusing it for another address
- Example: 2-way set-associative sparse directory

<table>
<thead>
<tr>
<th>Directory</th>
<th>Tag</th>
<th>State</th>
<th>Sharers</th>
<th>Tag</th>
<th>State</th>
<th>Sharers</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>0xB</td>
<td>Sh</td>
<td>{2}</td>
<td>0xF</td>
<td>Ex</td>
<td>{1}</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>Core 0</th>
<th>Cache 0</th>
<th>Tag</th>
<th>State</th>
<th>Data</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td></td>
<td>0xA</td>
<td>I</td>
<td>3</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>Core 1</th>
<th>Cache 1</th>
<th>Tag</th>
<th>State</th>
<th>Data</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td></td>
<td>0xF</td>
<td>M</td>
<td>1</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>Core 2</th>
<th>Cache 2</th>
<th>Tag</th>
<th>State</th>
<th>Data</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td></td>
<td>0xB</td>
<td>I-&gt;S</td>
<td></td>
</tr>
</tbody>
</table>

1. **LD 0x0B**
2. **ShReq 0x0B**
3. **InvReq 0x0A**
4. **InvResp 0x0A**
5. **Mem[0xB] = 5**

October 23, 2023
Directory-Induced Invalidations

- To retain **inclusion**, must invalidate all sharers of an entry before reusing it for another address
- Example: 2-way set-associative sparse directory

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Sharers</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xB</td>
<td>Sh</td>
<td>{2}</td>
</tr>
<tr>
<td>0xF</td>
<td>Ex</td>
<td>{1}</td>
</tr>
</tbody>
</table>

```
Core 0

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Data</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>I</td>
<td>3</td>
</tr>
</tbody>
</table>

Core 1

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Data</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xF</td>
<td>M</td>
<td>1</td>
</tr>
</tbody>
</table>

Core 2

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Data</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xB</td>
<td>I-&gt;S</td>
<td></td>
</tr>
</tbody>
</table>
```

```
Mem[0xB] = 5
```

```
InvReq 0xA
```

```
InvResp 0xA
```

```
ShReq 0xB
```

```
ShResp 0xB, data=5
```

```
LD 0xB
```
Directory-Induced Invalidations

- To retain **inclusion**, must invalidate all sharers of an entry before reusing it for another address
- Example: 2-way set-associative sparse directory

```
<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Sharers</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xB</td>
<td>Sh</td>
<td>{2}</td>
</tr>
<tr>
<td>0xF</td>
<td>Ex</td>
<td>{1}</td>
</tr>
</tbody>
</table>
```

```
Core 0

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Data</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>I</td>
<td>3</td>
</tr>
</tbody>
</table>

Core 1

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Data</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xF</td>
<td>M</td>
<td>1</td>
</tr>
</tbody>
</table>

Core 2

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Data</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xB</td>
<td>S</td>
<td>5</td>
</tr>
</tbody>
</table>

```

1. LD 0xB
2. ShReq 0xB
3. InvReq 0xA
4. InvResp 0xA
5. Mem[0xB] = 5
6. ShResp 0xB, data=5
Directory-Induced Invalidations

- To retain **inclusion**, must invalidate all sharers of an entry before reusing it for another address.
- Example: 2-way set-associative sparse directory.

How many entries should the directory have?
Inexact Representations of Sharer Sets

- Coarse-grain bit-vectors (e.g., 1 bit per 4 cores)

  Sharer Set

  \[
  \begin{array}{ccccccc}
  & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
  0-3 & 4-7 & 8-11 & 12-15 & 16-19 & 20-23 \\
  \end{array}
  \]

- Limited pointers: Maintain a few sharer pointers, on overflow mark ‘all’ and broadcast (or invalidate another sharer)

  Sharer Set

  \[
  \begin{array}{cccc}
  & 0 & 8 & 14 & 33 \\
  all & sharer 1 & sharer 2 & sharer 3 \\
  \end{array}
  \]

- Allow false positives (e.g., Bloom filters)
Inexact Representations of Sharer Sets

- Coarse-grain bit-vectors (e.g., 1 bit per 4 cores)

```
Sharer Set
0 0 0 0 0 0 0
0-3 4-7 8-11 12-15 16-19 20-23
```

- Limited pointers: Maintain a few sharer pointers, on overflow mark ‘all’ and broadcast (or invalidate another sharer)

```
Sharer Set
0 8 14 33
all sharer 1 sharer 2 sharer 3
```

- Allow false positives (e.g., Bloom filters)

- Reduced area & energy
- Overheads still not scalable (these techniques simply play with constant factors)
- Inexact sharers → Broadcasts, invalidations or spurious invalidations and downgrades
In-Cache Directories

• Common multicore memory hierarchy:
  – 1+ levels of private caches
  – A shared last-level cache
  – Need to enforce coherence among private caches

• Idea: Embed the directory information in shared cache tags
  – Shared cache must be inclusive
In-Cache Directories

- **Common multicore memory hierarchy:**
  - 1+ levels of private caches
  - A shared last-level cache
  - Need to enforce coherence among private caches

- **Idea:** Embed the directory information in shared cache tags
  - Shared cache must be inclusive

- ✔ Avoids tag overheads & separate lookups
- ✗ Can be inefficient if shared cache size $>>$ sum(private cache sizes)

Diagram:
- Main Memory
- Shared cache
  - Private cache
  - Private cache
  - Core 0
  - Core 1
  - Core N
Coherence in Multi-Level Hierarchies

- Can use the same or different protocols to keep coherence across multiple levels
- Key invariant: Ensure sufficient permissions in all intermediate levels
- Example: 8-socket Xeon E7 (8 cores/socket)
Protocol Races

- Directory serializes multiple requests for the same address
  - Same-address requests are queued or NACKed and retried
- But races still exist due to conflicting requests
- Example: Upgrade race

```
<table>
<thead>
<tr>
<th>Dir Tag</th>
<th>State</th>
<th>Sharers</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>Sh</td>
<td>{0,1}</td>
</tr>
</tbody>
</table>
```

```
<table>
<thead>
<tr>
<th>Core 0</th>
<th>Core 1</th>
</tr>
</thead>
<tbody>
<tr>
<td>Cache 0</td>
<td>Cache 1</td>
</tr>
<tr>
<td>Tag</td>
<td>State</td>
</tr>
<tr>
<td>---------</td>
<td>---------</td>
</tr>
<tr>
<td>0xA</td>
<td>S</td>
</tr>
<tr>
<td>0xA</td>
<td>S</td>
</tr>
</tbody>
</table>
```
Protocol Races

- Directory serializes multiple requests for the same address
  - Same-address requests are queued or NACKed and retried
- But races still exist due to conflicting requests
- Example: Upgrade race

<table>
<thead>
<tr>
<th>ReqQ</th>
<th>Directory</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>Tag</td>
</tr>
<tr>
<td></td>
<td>0xA</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>Main Memory</th>
</tr>
</thead>
<tbody>
<tr>
<td>Core 0</td>
</tr>
<tr>
<td>Cache 0</td>
</tr>
<tr>
<td>Tag</td>
</tr>
<tr>
<td>------</td>
</tr>
<tr>
<td>0xA</td>
</tr>
<tr>
<td>Core 1</td>
</tr>
<tr>
<td>Cache 1</td>
</tr>
<tr>
<td>Tag</td>
</tr>
<tr>
<td>------</td>
</tr>
<tr>
<td>0xA</td>
</tr>
</tbody>
</table>

<p>| Core 0      |
| Cache 0     |</p>
<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Data</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>S</td>
<td>3</td>
</tr>
</tbody>
</table>
Protocol Races

- Directory serializes multiple requests for the same address
  - Same-address requests are queued or NACKed and retried
- But races still exist due to conflicting requests
- Example: Upgrade race

---

**Main Memory**

<table>
<thead>
<tr>
<th>ReqQ</th>
<th>Directory</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td></td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Sharers</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>Sh</td>
<td>{0,1}</td>
</tr>
</tbody>
</table>

**Cache 0**

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Data</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>S-&gt;M</td>
<td>3</td>
</tr>
</tbody>
</table>

**Cache 1**

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Data</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>S-&gt;M</td>
<td>3</td>
</tr>
</tbody>
</table>

**Core 0**

1. ST 0xA

**Core 1**

1'. ST 0xA
Protocol Races

- Directory serializes multiple requests for the same address
  - Same-address requests are queued or NACKed and retried
- But races still exist due to conflicting requests
- Example: Upgrade race

Caches 0 and 1 issue simultaneous ExReqs

October 23, 2023
MIT 6.5900 Fall 2023
Protocol Races

- Directory serializes multiple requests for the same address
  - Same-address requests are queued or NACKed and retried
- But races still exist due to conflicting requests
- Example: Upgrade race

---

Caches 0 and 1 issue simultaneous ExReqs
Directory starts serving cache 0’s ExReq, queues cache 1’s
Protocol Races

- Directory serializes multiple requests for the same address
  - Same-address requests are queued or NACKed and retried
- But races still exist due to conflicting requests
- Example: Upgrade race

Caches 0 and 1 issue simultaneous ExReqs
Directory starts serving cache 0’s ExReq, queues cache 1’s
Cache 1 expected ExResp, but got InvReq!
Protocol Races

- Directory serializes multiple requests for the same address
  - Same-address requests are queued or NACKed and retried
- But races still exist due to conflicting requests
- Example: Upgrade race

Caches 0 and 1 issue simultaneous ExReqs
Directory starts serving cache 0’s ExReq, queues cache 1’s

Cache 1 expected ExResp, but got InvReq!
Cache 1 should transition from S->M to I->M and send InvResp
Coherence and Synchronization

**Processor 1**

R ← 1
L: swap (R, mutex);
   if <R> then goto L;
   <critical section>
   M[mutex] ← 0;

**Processor 2**

R ← 1
L: swap (R, mutex);
   if <R> then goto L;
   <critical section>
   M[mutex] ← 0;

**Processor 3**

R ← 1
L: swap (R, mutex);
   if <R> then goto L;
   <critical section>
   M[mutex] ← 0;

CPU-Memory Bus

swap (R, mutex):
R = test&set(mutex)
Coherence and Synchronization

**swap (R, mutex):**
\[ R = \text{test\&set}(mutex) \]

**test\&set(mutex):**
\[
\text{old\_val} = M[\text{mutex}] ; \\
M[\text{mutex}] = 1 ; \\
\text{return old\_val} ;
\]
Our cache coherence protocol will introduce a performance issue here. What is the problem?
Cache coherence protocols will cause \textit{mutex} to \textit{ping-pong} between P1’s and P2’s caches.
Cache coherence protocols will cause \texttt{mutex} to \textit{ping-pong} between P1’s and P2’s caches.

Ping-ponging can be reduced by first reading the \texttt{mutex} location \textit{(non-atomically)} and executing a swap only if it is found to be zero (test&test&set).
Implementing Atomic Instructions

• In general, an atomic read-modify-write instruction requires two memory (bus) operations without intervening memory operations by other processors.

• Implementation options:
  • With snoopy coherence, lock the bus → expensive
  • With directory-based coherence, lock the line in the cache (prevent invalidations or evictions until atomic op finishes) → complex

```c
test&set(mutex):
    old_val = M[mutex];
    M[mutex] = 1;
    return old_val;
```
Implementing Atomic Instructions

• In general, an **atomic read-modify-write** instruction requires two memory (bus) operations without intervening memory operations by other processors.

• Implementation options:
  • *With snoopy coherence, lock the bus* → expensive
  • *With directory-based coherence, lock the line in the cache (prevent invalidations or evictions until atomic op finishes)* → complex

• modern processors use
  - *load-reserve*
  - *store-conditional*

test&set(mutex):
  old_val = M[mutex];
  M[mutex] = 1;
  return old_val;
Load-reserve & Store-conditional

Special register(s) to hold reservation flag and address, and the outcome of store-conditional

Load-reserve R, (a):
<flag, adr> ← <1, a>;
R ← M[a];

Store-conditional (a), R:
if <flag, adr> == <1, a>
then cancel other procs’ reservation on a;
M[a] ← <R>;
status ← succeed;
else status ← fail;

If the cache receives an invalidation to the address in the reserve register, the reserve bit is set to 0

- Several processors may reserve ‘a’ simultaneously
- These instructions are like ordinary loads and stores with respect to the bus traffic
Load-Reserve/Store-Conditional

Swap implemented with Ld-Reserve/St-Conditional

# Swap(R1, mutex):

L: Ld-Reserve R2, (mutex)
   St-Conditional (mutex), R1
   if (status == fail) goto L
   R1 <- R2
The total number of coherence transactions is not necessarily reduced, but splitting an atomic instruction into load-reserve & store-conditional:

- *increases utilization* (and reduces processor stall time), especially in split-transaction buses and directories

- *reduces cache ping-pong effect* because processors trying to acquire a semaphore do not have to perform stores each time
Thank you!

Next Lecture: Consistency and Relaxed Memory Models