Directory-Based Cache Coherence

Daniel Sanchez
Computer Science and Artificial Intelligence Lab
M.I.T.
Maintaining Cache Coherence

It is sufficient to have hardware such that
• only one processor at a time has write permission for a location
• no processor can load a stale copy of the location after a write
Maintaining Cache Coherence

It is sufficient to have hardware such that
• only one processor at a time has write permission for a location
• no processor can load a stale copy of the location after a write

⇒ A correct approach could be:
Maintaining Cache Coherence

It is sufficient to have hardware such that
• only one processor at a time has write permission for a location
• no processor can load a stale copy of the location after a write

⇒ A correct approach could be:

write request:
The address is invalidated in all other caches before the write is performed
Maintaining Cache Coherence

It is sufficient to have hardware such that
• only one processor at a time has write permission for a location
• no processor can load a stale copy of the location after a write

⇒ A correct approach could be:

write request:
   The address is invalidated in all other caches before the write is performed

read request:
   If a dirty copy is found in some cache, a write-back is performed before the memory is read
Directory-Based Coherence
(Censier and Feautrier, 1978)

Snoopy Protocols

- Snoopy schemes broadcast requests over memory bus
- Difficult to scale to large numbers of processors
- Requires additional bandwidth to cache tags for snoop requests
Directory-Based Coherence  
(Censier and Feautrier, 1978)

- Snoopy schemes broadcast requests over memory bus
- Difficult to scale to large numbers of processors
- Requires additional bandwidth to cache tags for snoop requests
Directory-Based Coherence
(Censier and Feautrier, 1978)

Snoopy Protocols

- Snoopy schemes broadcast requests over memory bus
- Difficult to scale to large numbers of processors
- Requires additional bandwidth to cache tags for snoop requests

Directory Protocols

- Directory schemes send messages to only those caches that might have the line
- Can scale to large numbers of processors
- Requires extra directory storage to track possible sharers

April 8, 2021
An MSI Directory Protocol

- Cache states: Modified (M) / Shared (S) / Invalid (I)
An MSI Directory Protocol

- Cache states: Modified (M) / Shared (S) / Invalid (I)
- Directory states:
  - Uncached (Un): No sharers
  - Shared (Sh): One or more sharers with read permission (S)
  - Exclusive (Ex): A single sharer with read & write permissions (M)
An MSI Directory Protocol

- Cache states: Modified (M) / Shared (S) / Invalid (I)
- Directory states:
  - Uncached (Un): No sharers
  - Shared (Sh): One or more sharers with read permission (S)
  - Exclusive (Ex): A single sharer with read & write permissions (M)
- Transient states not drawn for clarity; for now, assume no racing requests
MSI Protocol: Caches (1/3)

Transitions initiated by processor accesses:

- M
- S
- I

<table>
<thead>
<tr>
<th>Actions</th>
</tr>
</thead>
<tbody>
<tr>
<td>Processor Read (PrRd)</td>
</tr>
<tr>
<td>Processor Write (PrWr)</td>
</tr>
<tr>
<td>Shared Request (ShReq)</td>
</tr>
<tr>
<td>Exclusive Request (ExReq)</td>
</tr>
</tbody>
</table>
MSI Protocol: Caches (1/3)

Transitions initiated by processor accesses:

```
M

S

PrRd / ShReq

I

<table>
<thead>
<tr>
<th>Actions</th>
</tr>
</thead>
<tbody>
<tr>
<td>Processor Read (PrRd)</td>
</tr>
<tr>
<td>Processor Write (PrWr)</td>
</tr>
<tr>
<td>Shared Request (ShReq)</td>
</tr>
<tr>
<td>Exclusive Request (ExReq)</td>
</tr>
</tbody>
</table>
```
Transitions initiated by processor accesses:

- Processor Read (PrRd) / Shared Request (ShReq)
- Processor Write (PrWr) / Exclusive Request (ExReq)

### Actions

<table>
<thead>
<tr>
<th>Actions</th>
</tr>
</thead>
<tbody>
<tr>
<td>Processor Read (PrRd)</td>
</tr>
<tr>
<td>Processor Write (PrWr)</td>
</tr>
<tr>
<td>Shared Request (ShReq)</td>
</tr>
<tr>
<td>Exclusive Request (ExReq)</td>
</tr>
</tbody>
</table>
MSI Protocol: Caches (1/3)

Transitions initiated by processor accesses:

- **PrWr / ExReq**
- **PrRd / ShReq**
- **PrRd / --**

<table>
<thead>
<tr>
<th>Actions</th>
</tr>
</thead>
<tbody>
<tr>
<td>Processor Read (PrRd)</td>
</tr>
<tr>
<td>Processor Write (PrWr)</td>
</tr>
<tr>
<td>Shared Request (ShReq)</td>
</tr>
<tr>
<td>Exclusive Request (ExReq)</td>
</tr>
</tbody>
</table>
Transitions initiated by processor accesses:

- Processor Read (PrRd)
- Processor Write (PrWr)
- Shared Request (ShReq)
- Exclusive Request (ExReq)
Transitions initiated by processor accesses:

- PrRd / --
- PrWr / --
- PrWr / ExReq
- PrRd / --
- PrRd / ShReq

Actions

<table>
<thead>
<tr>
<th>Actions</th>
</tr>
</thead>
<tbody>
<tr>
<td>Processor Read (PrRd)</td>
</tr>
<tr>
<td>Processor Write (PrWr)</td>
</tr>
<tr>
<td>Shared Request (ShReq)</td>
</tr>
<tr>
<td>Exclusive Request (ExReq)</td>
</tr>
</tbody>
</table>
MSI Protocol: Caches (2/3)

Transitions initiated by directory requests:

- M (Invalidation Request (InvReq))
- S (Downgrade Request (DownReq))
- I (Invalidation Response (InvResp))
- Downgrade Response (DownResp)
MSI Protocol: Caches (2/3)

Transitions initiated by directory requests:

- **InvReq / InvResp** (with data)

**Actions**

- Invalidation Request (InvReq)
- Downgrade Request (DownReq)
- Invalidation Response (InvResp)
- Downgrade Response (DownResp)
Transitions initiated by directory requests:

- InvReq / InvResp (with data)
- DownReq / DownResp (with data)

### Actions

<table>
<thead>
<tr>
<th>Actions</th>
</tr>
</thead>
<tbody>
<tr>
<td>Invalidation Request (InvReq)</td>
</tr>
<tr>
<td>Downgrade Request (DownReq)</td>
</tr>
<tr>
<td>Invalidation Response (InvResp)</td>
</tr>
<tr>
<td>Downgrade Response (DownResp)</td>
</tr>
</tbody>
</table>
MSI Protocol: Caches (2/3)

Transitions initiated by directory requests:

M (M) → S
InvReq / InvResp (with data)

S (I) → I
InvReq / InvResp (without data)

DownReq / DownResp (with data)

Actions

<table>
<thead>
<tr>
<th>Actions</th>
</tr>
</thead>
<tbody>
<tr>
<td>Invalidation Request (InvReq)</td>
</tr>
<tr>
<td>Downgrade Request (DownReq)</td>
</tr>
<tr>
<td>Invalidation Response (InvResp)</td>
</tr>
<tr>
<td>Downgrade Response (DownResp)</td>
</tr>
</tbody>
</table>
MSI Protocol: Caches (3/3)

Transitions initiated by evictions:

M

S

I

Actions

Writeback Request (WbReq)
MSI Protocol: Caches (3/3)

Transitions initiated by evictions:

- **M** (Main Memory) → Eviction / WbReq (with data) → **S** (Secondary Cache) → **I** (Instruction Cache)

Actions:

- Writeback Request (WbReq)
MSI Protocol: Caches (3/3)

Transitions initiated by evictions:

- From M to S: Eviction / WbReq (with data)
- From S to I: Eviction / WbReq (without data)

Actions:
- Writeback Request (WbReq)
MSI Protocol: Caches

- Transitions initiated by processor accesses
- Transitions initiated by directory requests
- Transitions initiated by evictions
MSI Protocol: Directory (1/2)

Transitions initiated by data requests:

- Ex
- Sh
- Un
Transitions initiated by data requests:

- **Ex**

- **Sh**
  - ShReq / Sharers = Sharers + \{P\}; ShResp

- **Un**
  - ShReq / Sharers = \{P\}; ShResp
Transitions initiated by data requests:

**ExReq** / Sharers = \{P\}; **ExResp**

**Ex**

**Sh**

ShReq / Sharers = Sharers + \{P\}; **ShResp**

ShReq / Sharers = \{P\}; **ShResp**

**Un**
Transitions initiated by data requests:

- **ExReq / Sharers = \{P\}; ExResp**

  - **Ex**

- **ExReq / Inv(Sharers – \{P\}); Sharers = \{P\}; ExResp**

  - **Sh**

- **ShReq / Sharers = Sharers + \{P\}; ShResp**

  - **Un**

- **ShReq / Sharers = \{P\}; ShResp**
Transitions initiated by data requests:

- **ExReq** / Sharers = \{P\}; **ExResp**
- **Ex** → **ShReq** / **Down**(Sharer); Sharers = Sharer + \{P\}; **ShResp**
- **ExReq** / **Inv**(Sharers - \{P\}); Sharers = \{P\}; **ExResp**
- **Sh** → **ShReq** / Sharers = Sharers + \{P\}; **ShResp**
- **ShReq** / Sharers = \{P\}; **ShResp**
- **Un**
Transitions initiated by writeback requests:

Ex

Sh

Un
Transitions initiated by writeback requests:

\[
\text{Ex} \rightarrow \text{Sh} \rightarrow \text{Un} \rightarrow \text{Ex}
\]

\[
\text{WbReq / Sharers} = 
\]

WbResp
MSI Protocol: Directory (2/2)

Transitions initiated by writeback requests:

- **Ex**
  - WbReq / Sharers = {}; WbResp

- **Sh**
  - WbReq && |Sharers| > 1 /
    - Sharers = Sharers - {P}; WbResp

- **Un**

Transitions initiated by writeback requests:

- Ex
  - WbReq / Sharers = {}; WbResp

- Sh
  - WbReq && |Sharers| > 1 / Sharers = Sharers - {P}; WbResp
  - WbReq && |Sharers| == 1 / Sharers = {}; WbResp

- Un
MSI Directory Protocol Example

Main Memory

Directory

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Sharers</th>
</tr>
</thead>
</table>

Cache 0

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Data</th>
</tr>
</thead>
</table>

Core 0

Cache 1

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Data</th>
</tr>
</thead>
</table>

Core 1

Cache 2

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Data</th>
</tr>
</thead>
</table>

Core 2

L14-11
MSI Directory Protocol Example

1. LD 0xA
MSI Directory Protocol Example

1. LD 0xA
MSI Directory Protocol Example

1. LD 0xA

2. ShReq 0xA
MSI Directory Protocol Example

1. LD 0xA

2. ShReq 0xA
MSI Directory Protocol Example

1. **LD 0xA**

2. **ShReq 0xA**

3. **Mem[0xA] = 3**

Main Memory

### Directory

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Sharers</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>Sh</td>
<td>{0}</td>
</tr>
</tbody>
</table>

### Cache 0

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>I-&gt;S</td>
</tr>
</tbody>
</table>

### Cache 1

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Data</th>
</tr>
</thead>
</table>

### Cache 2

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Data</th>
</tr>
</thead>
</table>
MSI Directory Protocol Example

1. LD 0xA
2. ShReq 0xA
3. Mem[0xA] = 3
4. ShResp 0xA, data=3

Directory

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Sharers</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>Sh</td>
<td>{0}</td>
</tr>
</tbody>
</table>

Cache 0

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>I-&gt;S</td>
</tr>
</tbody>
</table>

Cache 1

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Data</th>
</tr>
</thead>
</table>

Cache 2

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Data</th>
</tr>
</thead>
</table>
MSI Directory Protocol Example

1. LD 0xA
2. ShReq 0xA
3. Mem[0xA] = 3
4. ShResp 0xA, data=3
MSI Directory Protocol Example

Main Memory

Directory

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Sharers</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>Sh</td>
<td>{0}</td>
</tr>
</tbody>
</table>

Cache 0

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Data</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>S</td>
<td>3</td>
</tr>
</tbody>
</table>

Core 0

Cache 1

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Data</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

Core 1

Cache 2

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Data</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

Core 2
MSI Directory Protocol Example

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Sharers</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>Sh</td>
<td>{0}</td>
</tr>
</tbody>
</table>

Main Memory

Directory

Cache 0

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Data</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>S</td>
<td>3</td>
</tr>
</tbody>
</table>

Cache 1

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Data</th>
</tr>
</thead>
</table>

Cache 2

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Data</th>
</tr>
</thead>
</table>

Core 0

Core 1

Core 2

1 LD 0xA
MSI Directory Protocol Example

<table>
<thead>
<tr>
<th></th>
<th>Tag</th>
<th>State</th>
<th>Sharers</th>
</tr>
</thead>
<tbody>
<tr>
<td>Main Memory</td>
<td>0xA</td>
<td>Sh</td>
<td>{0}</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>Cache 0</th>
<th>Tag</th>
<th>State</th>
<th>Data</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>0xA</td>
<td>S</td>
<td>3</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>Cache 1</th>
<th>Tag</th>
<th>State</th>
<th>Data</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>0xA</td>
<td>I-&gt;S</td>
<td></td>
</tr>
</tbody>
</table>

Core 0

Core 1

Core 2

1. LD 0xA
MSI Directory Protocol Example

Directory

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Sharers</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>Sh</td>
<td>{0}</td>
</tr>
</tbody>
</table>

Main Memory

Cache 0

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Data</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>S</td>
<td>3</td>
</tr>
</tbody>
</table>

Core 0

Cache 1

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Data</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>I-&gt;S</td>
<td></td>
</tr>
</tbody>
</table>

Core 1

Cache 2

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Data</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

Core 2

1. LD 0xA
2. ShReq 0xA
MSI Directory Protocol Example

1. LD 0xA
2. ShReq 0xA

Directory

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Sharers</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>Sh</td>
<td>{0,2}</td>
</tr>
</tbody>
</table>

Main Memory

Core 0

Cache 0

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Data</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>S</td>
<td>3</td>
</tr>
</tbody>
</table>

Cache 1

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Data</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>I-&gt;S</td>
<td></td>
</tr>
</tbody>
</table>

Cache 2
MSI Directory Protocol Example

1. LD 0xA

2. ShReq 0xA

3. Mem[0xA] = 3
MSI Directory Protocol Example

1. LD 0xA
2. ShReq 0xA
3. Mem[0xA] = 3
4. ShResp 0xA, data=3
### MSI Directory Protocol Example

#### Step 1: LD 0xA
- **Cache 0**
  - Tag: 0xA, State: S, Data: 3
- **Cache 1**
  - Tag: 0xA, State: S, Data: 3
- **Cache 2**
  - Tag: 0xA, State: S, Data: 3

#### Step 2: ShReq 0xA
- **Directory**
  - Tag: 0xA, State: Sh, Sharers: {0, 2}

#### Step 3: Mem[0xA] = 3
- **Main Memory**
  - Mem[0xA] = 3

#### Step 4: ShResp 0xA, data=3
- **Cache 0**
  - Tag: 0xA, State: S, Data: 3
- **Cache 1**
  - Tag: 0xA, State: S, Data: 3
- **Cache 2**
  - Tag: 0xA, State: S, Data: 3
MSI Directory Protocol Example

Main Memory

Directory

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Sharers</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>Sh</td>
<td>{0,2}</td>
</tr>
</tbody>
</table>

Cache 0

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Data</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>S</td>
<td>3</td>
</tr>
</tbody>
</table>

Cache 1

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Data</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>I-&gt;M</td>
<td></td>
</tr>
</tbody>
</table>

Cache 2

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Data</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>S</td>
<td>3</td>
</tr>
</tbody>
</table>

Core 0

Core 1

Core 2

1 ST 0xA
MSI Directory Protocol Example

1. **ST 0x10**

2. **ExReq 0x10**

### Directory

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Sharers</th>
</tr>
</thead>
<tbody>
<tr>
<td>0x10</td>
<td>Sh</td>
<td>{0,2}</td>
</tr>
</tbody>
</table>

### Memory

- **Cache 0**
  - Tag: 0x10
  - State: S
  - Data: 3

- **Cache 1**
  - Tag: 0x10
  - State: I->M
  - Data: 2

- **Cache 2**
  - Tag: 0x10
  - State: S
  - Data: 3
**MSI Directory Protocol Example**

**Main Memory**

**Directory**

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Sharers</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>Sh</td>
<td>{0,2}</td>
</tr>
</tbody>
</table>

Core 0

Cache 0

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Data</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>S</td>
<td>3</td>
</tr>
</tbody>
</table>

Core 1

Cache 1

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Data</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>I-&gt;M</td>
<td></td>
</tr>
</tbody>
</table>

Core 2

Cache 2

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Data</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>S</td>
<td>3</td>
</tr>
</tbody>
</table>

**Step 1:** ST 0xA

**Step 2:** ExReq 0xA

**Step 3:** InvReq 0xA

**Step 3:** InvReq 0xA

**Step 3:** InvReq 0xA
MSI Directory Protocol Example

Directory

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Sharers</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>Sh</td>
<td>{0,2}</td>
</tr>
</tbody>
</table>

Main Memory

Core 0

Cache 0

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Data</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>I</td>
<td>3</td>
</tr>
</tbody>
</table>

Core 1

Cache 1

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Data</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>I-&gt;M</td>
<td></td>
</tr>
</tbody>
</table>

Core 2

Cache 2

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Data</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>I</td>
<td>3</td>
</tr>
</tbody>
</table>

1. ST 0xA
2. ExReq 0xA
3. InvReq 0xA
MSI Directory Protocol Example

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Sharers</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>Sh</td>
<td>{0,2}</td>
</tr>
</tbody>
</table>

Core 0

Main Memory

Directory

Cache 0

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Data</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>I</td>
<td>3</td>
</tr>
</tbody>
</table>

Cache 1

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Data</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>I-&gt;M</td>
<td></td>
</tr>
</tbody>
</table>

Cache 2

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Data</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>I</td>
<td>3</td>
</tr>
</tbody>
</table>

Core 1

Core 2

ST 0xA

ExReq 0xA

InvReq 0xA

InvResp 0xA

InvResp 0xA
MSI Directory Protocol Example

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Sharers</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>Ex</td>
<td>{1}</td>
</tr>
</tbody>
</table>

1. **ST 0xA**
2. **ExReq 0xA**
3. **InvReq 0xA**
4. **InvResp 0xA**

Core 0
- Cache 0
  - Tag: 0xA
  - State: I
  - Data: 3

Core 1
- Cache 1
  - Tag: 0xA
  - State: I->M
  - Data:

Core 2
- Cache 2
  - Tag: 0xA
  - State: I
  - Data: 3

InvReq 0xA
InvResp 0xA
MSI Directory Protocol Example

1. ST 0xA
2. ExReq 0xA
3. InvReq 0xA
4. InvResp 0xA
5. Mem[0xA] = 3

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Sharers</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>Ex</td>
<td>{1}</td>
</tr>
</tbody>
</table>

Cache 0
- Tag: 0xA
- State: I
- Data: 3

Cache 1
- Tag: 0xA
- State: I->M

Cache 2
- Tag: 0xA
- State: I
- Data: 3

Main Memory

Core 0
Core 1
Core 2
MSI Directory Protocol Example

1. ST 0xA

2. ExReq 0xA

3. InvReq 0xA

4. InvResp 0xA

5. Mem[0xA] = 3

6. ExResp 0xA data = 3

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Sharers</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>Ex</td>
<td>{1}</td>
</tr>
</tbody>
</table>

Core 0

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Data</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>I</td>
<td>3</td>
</tr>
</tbody>
</table>

Core 1

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Data</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>I-&gt;M</td>
<td></td>
</tr>
</tbody>
</table>

Core 2

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Data</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>I</td>
<td>3</td>
</tr>
</tbody>
</table>
MSI Directory Protocol Example

1. ST 0xA
2. ExReq 0xA
3. InvReq 0xA
4. InvResp 0xA
5. Mem[0xA] = 3
6. ExResp 0xA

Directory

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Sharers</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>Ex</td>
<td>{1}</td>
</tr>
</tbody>
</table>

Cache 0

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Data</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>I</td>
<td>3</td>
</tr>
</tbody>
</table>

Cache 1

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Data</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>M</td>
<td>5</td>
</tr>
</tbody>
</table>

Cache 2

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Data</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>I</td>
<td>3</td>
</tr>
</tbody>
</table>
MSI Directory Protocol Example

Main Memory

Directory

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Sharers</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>Ex</td>
<td>{1}</td>
</tr>
</tbody>
</table>

Core 0

Cache 0

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Data</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>I</td>
<td>3</td>
</tr>
</tbody>
</table>

Core 1

Cache 1

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Data</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>M-&gt;I</td>
<td>5</td>
</tr>
</tbody>
</table>

Core 2

Cache 2

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Data</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>I</td>
<td>3</td>
</tr>
</tbody>
</table>

1  ST 0xB
**MSI Directory Protocol Example**

1. **ST 0xB**
2. **WbReq 0xA, data=5**
MSI Directory Protocol Example

1. **ST 0xB**

2. **WbReq 0xA, data=5**

3. **Mem[0xA] = 5**
MSI Directory Protocol Example

1. ST 0xB
2. WbReq 0xA, data=5
3. Mem[0xA] = 5
MSI Directory Protocol Example

1. ST 0xB
2. WbReq 0xA, data=5
3. Mem[0xA] = 5
4. WbResp 0xA
MSI Directory Protocol Example

1. ST 0xB

2. WbReq 0xA, data=5

3. Mem[0xA] = 5

4. WbResp 0xA

<table>
<thead>
<tr>
<th>Core 0</th>
<th>Cache 0</th>
<th>Cache 1</th>
<th>Cache 2</th>
</tr>
</thead>
<tbody>
<tr>
<td>Tag</td>
<td>State</td>
<td>State</td>
<td>State</td>
</tr>
<tr>
<td>0xA</td>
<td>I</td>
<td>I-\rightarrow\text{M}</td>
<td>I</td>
</tr>
<tr>
<td>Data</td>
<td>3</td>
<td>3</td>
<td>3</td>
</tr>
</tbody>
</table>

Directory:

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Sharers</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>Un</td>
<td>{}</td>
</tr>
</tbody>
</table>
MSI Directory Protocol Example

1. ST 0xB

2. WbReq 0xA, data=5

3. Mem[0xA] = 5

4. WbResp 0xA

5. ExReq 0xB
MSI Directory Protocol Example

1. ST 0xB
2. WbReq 0xA, data=5
3. Mem[0xA] = 5
4. WbResp 0xA
5. ExReq 0xB
MSI Directory Protocol Example

1. ST 0xB

2. WbReq 0xA, data=5
   - Mem[0xA] = 5

3. Mem[0xA] = 5

4. WbResp 0xA

5. ExReq 0xB
   - Mem[0xB] = 10

6. Mem[0xB] = 10

Directory:

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Sharers</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xB</td>
<td>Ex</td>
<td>{1}</td>
</tr>
</tbody>
</table>

Cache 0:

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Data</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>I</td>
<td>3</td>
</tr>
</tbody>
</table>

Cache 1:

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Data</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>I-&gt;M</td>
<td>3</td>
</tr>
</tbody>
</table>

Cache 2:

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Data</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>I</td>
<td>3</td>
</tr>
</tbody>
</table>
MSI Directory Protocol Example

1. ST 0xB
2. WbReq 0xA, data=5
3. Mem[0xA] = 5
4. WbResp 0xA
5. ExReq 0xB
6. Mem[0xB] = 10
7. ExResp 0xB, data=10

Cache 0
- Tag: 0xA
- State: I
- Data: 3

Cache 1
- Tag: 0xB
- State: I->M
- Data: 5

Cache 2
- Tag: 0xA
- State: I
- Data: 3

Core 0
Core 1
Core 2

Main Memory

Directory

- Tag: 0xB
- State: Ex
- Sharers: {1}
MSI Directory Protocol Example

1. ST 0xB

2. WbReq 0xA, data=5

3. Mem[0xA] = 5

4. WbResp 0xA

5. ExReq 0xB

6. Mem[0xB] = 10

7. ExResp 0xB, data=10
Why are 0xA’s wb and 0xB’s req serialized?
Why are 0xA’s wb and 0xB’s req serialized?

Structural dependence
MSI Directory Protocol Example

Why are 0xA’s wb and 0xB’s req serialized? Why?

Possible solutions?

Structural dependence
MSI Directory Protocol Example

Why are 0xA’s wb and 0xB’s req serialized?

Possible solutions?

Buffer outside of cache to hold write data
Miss Status Handling Register

MSHR – Holds load misses and writes outside of cache

MSHR entry

| V | X | Addr | Data |

• On eviction/writeback
  – No free MSHR entry: stall
  – Allocate new MSHR entry
  – When channel available send WBReq and data
  – Deallocate entry on WBResp
Miss Status Handling Register

MSHR – Holds load misses and writes outside of cache

**MSHR entry**

<table>
<thead>
<tr>
<th>V</th>
<th>X</th>
<th>Addr</th>
<th>Data</th>
</tr>
</thead>
</table>

**per ld/st slots**

<table>
<thead>
<tr>
<th>L/S</th>
<th>Inum</th>
<th>Block Offset</th>
</tr>
</thead>
</table>

- **On cache load miss**
  - No free MSHR entry: stall
  - Allocate new MSHR entry
  - Send ShReq (or ExReq)
  - On *Resp forward data to CPU and cache
  - Deallocate MSHR
Miss Status Handling Register

MSHR – Holds load misses and writes outside of cache

<table>
<thead>
<tr>
<th>MSHR entry</th>
<th>per ld/st slots</th>
</tr>
</thead>
<tbody>
<tr>
<td>V X Addr</td>
<td>L/S Inum Block Offset</td>
</tr>
<tr>
<td>Data</td>
<td></td>
</tr>
</tbody>
</table>

Legend:
- **V**: Valid bit
- **X**: Flush valid bit
- **Addr**: Address
- **Data**: Data
- **L/S**: Load/Store indication
- **Inum**: Instruction number
- **Block Offset**: Block offset
Miss Status Handling Register

MSHR – Holds load misses and writes outside of cache

MSHR entry

V  X  Addr  Data

per ld/st slots

V  L/S  Inum  Block Offset
V  L/S  Inum  Block Offset
V  L/S  Inum  Block Offset
V  L/S  Inum  Block Offset

Per ld/st slots allow servicing multiple requests with one entry
Miss Status Handling Register

MSHR – Holds load misses and writes outside of cache

- On cache load miss
  - Look for matching address is MSHR
    - If not found
      - If no free MSHR entry: stall
      - Allocate new MSHR entry and fill in
    - If found, just fill in per ld/st slot
      - Send ShReq (or ExReq)
      - On *Resp forward data to CPU and cache
      - Deallocate MSHR

Per ld/st slots allow servicing multiple requests with one entry
Directory Organization

- Requirement: Directory needs to keep track of all the cores that are sharing a cache block

- Challenge: For each block, the space needed to hold the list of sharers grows with number of possible sharers...
Flat, Memory-based Directories

- Dedicate a few bits of main memory to store the state and sharers of every line
- Encode sharers using a bit-vector

![Diagram of Main Memory with State and Sharer Set](image)
Flat, Memory-based Directories

- Dedicate a few bits of main memory to store the state and sharers of every line
- Encode sharers using a bit-vector

✓ Simple
✗ Slow
✗ Very inefficient with many processors (~P bits/line)
Sparse Full-Map Directories

- Not every line in the system needs to be tracked – only those in private caches!
- Idea: Organize directory as a cache

```
Way 1  Way 2  Way 3  Way 4
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
```

Directory Entry Format

```
Line Address  State  Sharer Set
0xF00  Sh  0 1 0 0 1 1 0 0
```
Sparse Full-Map Directories

- Not every line in the system needs to be tracked – only those in private caches!
- Idea: Organize directory as a cache

✓ Low latency, energy-efficient
✗ Bit-vectors grow with # cores → Area scales poorly
✗ Limited associativity → Directory-induced invalidations

Directory Entry Format:

<table>
<thead>
<tr>
<th>Line Address</th>
<th>State</th>
<th>Sharer Set</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xF00</td>
<td>Sh</td>
<td>0 1 0 0 1 1 0 0</td>
</tr>
</tbody>
</table>

Diagram:

Way 1 Way 2 Way 3 Way 4

- [Diagram of directory entry format]

- [Diagram of directory structure]
Directory-Induced Invalidations

- To retain inclusion, must invalidate all sharers of an entry before reusing it for another address
- Example: 2-way set-associative sparse directory
Directory-Induced Invalidations

- To retain inclusion, must invalidate all sharers of an entry before reusing it for another address
- Example: 2-way set-associative sparse directory

```
Main Memory
```

```
Directory

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Sharers</th>
<th>Tag</th>
<th>State</th>
<th>Sharers</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>Sh</td>
<td>{0}</td>
<td>0xF</td>
<td>Ex</td>
<td>{1}</td>
</tr>
</tbody>
</table>
```

```
Cache 0
```

```
Tag | State | Data |
-----|-------|------|
0x0A | S     | 3    |
```

```
Cache 1
```

```
Tag | State | Data |
-----|-------|------|
0xF  | M     | 1    |
```

```
Cache 2
```

```
Tag | State | Data |
-----|-------|------|
      |       |      |
```

```
Core 0
```

```
Core 1
```

```
Core 2
```

1 LD 0xB
Directory-Induced Invalidations

- To retain inclusion, must invalidate all sharers of an entry before reusing it for another address
- Example: 2-way set-associative sparse directory

<table>
<thead>
<tr>
<th>Main Memory</th>
</tr>
</thead>
<tbody>
<tr>
<td>Directory</td>
</tr>
<tr>
<td>Tag</td>
</tr>
<tr>
<td>0xA</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>Cache 0</th>
</tr>
</thead>
<tbody>
<tr>
<td>Tag</td>
</tr>
<tr>
<td>0xA</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>Cache 1</th>
</tr>
</thead>
<tbody>
<tr>
<td>Tag</td>
</tr>
<tr>
<td>0xF</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>Cache 2</th>
</tr>
</thead>
<tbody>
<tr>
<td>Tag</td>
</tr>
<tr>
<td>0xB</td>
</tr>
</tbody>
</table>

April 8, 2021
Directory-Induced Invalidations

- To retain inclusion, must invalidate all sharers of an entry before reusing it for another address
- Example: 2-way set-associative sparse directory
Directory-Induced Invalidations

- To retain inclusion, must invalidate all sharers of an entry before reusing it for another address
- Example: 2-way set-associative sparse directory

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Sharers</th>
<th>Tag</th>
<th>State</th>
<th>Sharers</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>Sh</td>
<td>{0}</td>
<td>0xF</td>
<td>Ex</td>
<td>{1}</td>
</tr>
</tbody>
</table>

**Core 0**

- LD 0xB

**Core 1**

- InvReq 0xA
- ShReq 0xB

**Core 2**

- Core Memory
Directory-Induced Invalidations

- To retain inclusion, must invalidate all sharers of an entry before reusing it for another address
- Example: 2-way set-associative sparse directory

### Diagram

**Main Memory**

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Sharers</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>Sh</td>
<td>{0}</td>
</tr>
</tbody>
</table>

**Directory**

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Sharers</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xF</td>
<td>Ex</td>
<td>{1}</td>
</tr>
</tbody>
</table>

**Core 0**

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Data</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>I</td>
<td>3</td>
</tr>
</tbody>
</table>

**Core 1**

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Data</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xF</td>
<td>M</td>
<td>1</td>
</tr>
</tbody>
</table>

**Core 2**

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Data</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xB</td>
<td>I-&gt;S</td>
<td></td>
</tr>
</tbody>
</table>

**Events**

1. **LD 0xB**
2. **ShReq 0xB**
3. **InvReq 0xA**
Directory-Induced Invalidations

- To retain inclusion, must invalidate all sharers of an entry before reusing it for another address
- Example: 2-way set-associative sparse directory

### Example Illustration

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Sharers</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>Sh</td>
<td>{0}</td>
</tr>
<tr>
<td>0xF</td>
<td>Ex</td>
<td>{1}</td>
</tr>
</tbody>
</table>

#### Core 0
- **LD 0xB**
- **Cache 0**
  - Tag: 0xA
  - State: I
  - Data: 3

#### Core 1
- **Cache 1**
  - Tag: 0xF
  - State: M
  - Data: 1

#### Core 2
- **ShReq 0xB**
- **Cache 2**
  - Tag: 0xB
  - State: I->S

#### Main Memory
- **Directory**
  - Tag: 0xA
  - State: Sh
  - Sharers: {0}
  - Tag: 0xF
  - State: Ex
  - Sharers: {1}
Directory-Induced Invalidations

- To retain inclusion, must invalidate all sharers of an entry before reusing it for another address
- Example: 2-way set-associative sparse directory

### Main Memory

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Sharers</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xB</td>
<td>Sh</td>
<td>{2}</td>
</tr>
<tr>
<td>0xF</td>
<td>Ex</td>
<td>{1}</td>
</tr>
</tbody>
</table>

### Core 0

- **Cache 0**
  - Tag: 0xA
  - State: I
  - Data: 3

### Core 1

- **Cache 1**
  - Tag: 0xF
  - State: M
  - Data: 1

- **Cache 2**
  - Tag: 0xB
  - State: I->S

### Core 2

- **Cache 0**
  - Tag: 0xB
  - State: I
  - Data: 0

#### Process:
1. **LD 0xB**
2. **ShReq 0xB**
3. **InvReq 0xA**
4. **InvResp 0xA**
Directory-Induced Invalidations

- To retain inclusion, must invalidate all sharers of an entry before reusing it for another address
- Example: 2-way set-associative sparse directory

```
<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Sharers</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xB</td>
<td>Sh</td>
<td>{2}</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Sharers</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xF</td>
<td>Ex</td>
<td>{1}</td>
</tr>
</tbody>
</table>
```

```
Core 0
Cache 0
<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Data</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>I</td>
<td>3</td>
</tr>
</tbody>
</table>

Core 1
Cache 1
<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Data</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xF</td>
<td>M</td>
<td>1</td>
</tr>
</tbody>
</table>

Core 2
Cache 2
<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Data</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xB</td>
<td>I-&gt;S</td>
<td></td>
</tr>
</tbody>
</table>
```

1. LD 0xB
2. ShReq 0xB
3. InvReq 0xA
4. InvResp 0xA
5. Mem[0xB] = 5

April 8, 2021
Directory-Induced Invalidations

- To retain inclusion, must invalidate all sharers of an entry before reusing it for another address
- Example: 2-way set-associative sparse directory

---

Main Memory

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Sharers</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xB</td>
<td>Sh</td>
<td>{2}</td>
</tr>
</tbody>
</table>

Directory

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Sharers</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xF</td>
<td>Ex</td>
<td>{1}</td>
</tr>
</tbody>
</table>

Cache 0

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Data</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>I</td>
<td>3</td>
</tr>
</tbody>
</table>

Cache 1

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Data</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xF</td>
<td>M</td>
<td>1</td>
</tr>
</tbody>
</table>

Cache 2

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Data</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xB</td>
<td>I-&gt;S</td>
<td></td>
</tr>
</tbody>
</table>

Core 0

- LD 0xB

Core 1

- InvReq 0xA
- InvResp 0xA
- ShResp 0xB, data=5

Core 2

- ShReq 0xB
Directory-Induced Invalidations

- To retain inclusion, must invalidate all sharers of an entry before reusing it for another address
- Example: 2-way set-associative sparse directory

Main Memory

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Sharers</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xB</td>
<td>Sh</td>
<td>{2}</td>
</tr>
</tbody>
</table>

Directory

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Sharers</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xB</td>
<td>Sh</td>
<td>{2}</td>
</tr>
</tbody>
</table>

Mem[0xB] = 5

Cache 0

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Data</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>I</td>
<td>3</td>
</tr>
</tbody>
</table>

Cache 1

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Data</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xF</td>
<td>M</td>
<td>1</td>
</tr>
</tbody>
</table>

Cache 2

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Data</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xB</td>
<td>S</td>
<td>5</td>
</tr>
</tbody>
</table>

Core 0

Core 1

Core 2

1. LD 0xB
2. ShReq 0xB
3. InvReq 0xA
4. InvResp 0xA
5. Mem[0xB] = 5
6. ShResp 0xB, data=5
Directory-Induced Invalidations

- To retain inclusion, must invalidate all sharers of an entry before reusing it for another address
- Example: 2-way set-associative sparse directory

How many entries should the directory have?
Inexact Representations of Sharer Sets

- Coarse-grain bit-vectors (e.g., 1 bit per 4 cores)
  
  Sharer Set
  
<table>
<thead>
<tr>
<th>0</th>
<th>0</th>
<th>0</th>
<th>0</th>
<th>0</th>
<th>0</th>
<th>0</th>
</tr>
</thead>
<tbody>
<tr>
<td>0-3</td>
<td>4-7</td>
<td>8-11</td>
<td>12-15</td>
<td>16-19</td>
<td>20-23</td>
<td></td>
</tr>
</tbody>
</table>

- Limited pointers: Maintain a few sharer pointers, on overflow mark ‘all’ and broadcast (or invalidate another sharer)
  
  Sharer Set
  
<table>
<thead>
<tr>
<th>0</th>
<th>8</th>
<th>14</th>
<th>33</th>
</tr>
</thead>
<tbody>
<tr>
<td>all</td>
<td>sharer 1</td>
<td>sharer 2</td>
<td>sharer 3</td>
</tr>
</tbody>
</table>

- Allow false positives (e.g., Bloom filters)
Inexact Representations of Sharer Sets

- Coarse-grain bit-vectors (e.g., 1 bit per 4 cores)

  Sharer Set
  
<p>| | | | | | | |</p>
<table>
<thead>
<tr>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>0-3</td>
<td>4-7</td>
<td>8-11</td>
<td>12-15</td>
<td>16-19</td>
<td>20-23</td>
<td></td>
</tr>
</tbody>
</table>

- Limited pointers: Maintain a few sharer pointers, on overflow mark ‘all’ and broadcast (or invalidate another sharer)

  Sharer Set
  
<p>| | | | |</p>
<table>
<thead>
<tr>
<th></th>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td>8</td>
<td>14</td>
<td>33</td>
</tr>
<tr>
<td>all</td>
<td>sharer 1</td>
<td>sharer 2</td>
<td>sharer 3</td>
</tr>
</tbody>
</table>

- Allow false positives (e.g., Bloom filters)

  ✓ Reduced area & energy
  ✗ Overheads still not scalable (these techniques simply play with constant factors)
  ✗ Inexact sharers → Broadcasts, invalidations or spurious invalidations and downgrades
Protocol Races

- Directory serializes multiple requests for the same address
  - Same-address requests are queued or NACKed and retried
- But races still exist due to conflicting requests
- Example: Upgrade race

**Diagram:**

- **Main Memory**
  - ReqQ
  - Directory
    - Tag | State | Sharers
    - 0xA  | Sh    | \{0,2\}

- **Cache 0**
  - Tag | State | Data
  - 0xA  | S     | 3

- **Cache 1**
  - Tag | State | Data
  - 0xA  | S     | 3
Protocol Races

- Directory serializes multiple requests for the same address
  - Same-address requests are queued or NACKed and retried
- But races still exist due to conflicting requests
- Example: Upgrade race

<table>
<thead>
<tr>
<th>ReqQ</th>
<th>Directory</th>
</tr>
</thead>
<tbody>
<tr>
<td>Tag</td>
<td>State</td>
</tr>
<tr>
<td>0xA</td>
<td>Sh</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>Cache 0</th>
</tr>
</thead>
<tbody>
<tr>
<td>Tag</td>
</tr>
<tr>
<td>0xA</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>Cache 1</th>
</tr>
</thead>
<tbody>
<tr>
<td>Tag</td>
</tr>
<tr>
<td>0xA</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>Core 0</th>
</tr>
</thead>
<tbody>
<tr>
<td>1 ST 0xA</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>Core 1</th>
</tr>
</thead>
<tbody>
<tr>
<td>1’ ST 0xA</td>
</tr>
</tbody>
</table>
Protocol Races

- Directory serializes multiple requests for the same address
  - Same-address requests are queued or NACKed and retried
- But races still exist due to conflicting requests
- Example: Upgrade race

1. **Directory**
   - **Tag**: 0xA
   - **State**: Sh
   - **Sharers**: \{0, 2\}

2. **Main Memory**

3. **ReqQ**

4. **Directory**

5. **Cache 0**
   - **Tag**: 0xA
   - **State**: S->M
   - **Data**: 3

6. **Core 0**
   - **ST 0xA**

7. **Cache 1**
   - **Tag**: 0xA
   - **State**: S->M
   - **Data**: 3

8. **Core 1**
   - **ST 0xA**
Protocol Races

- Directory serializes multiple requests for the same address
  - Same-address requests are queued or NACKed and retried
- But races still exist due to conflicting requests
- Example: Upgrade race

Caches 0 and 1 issue simultaneous ExReqs

1. ST 0xA
2. ExReq 0xA
2’. ExReq 0xA

Main Memory

Directory

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Sharers</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>Sh</td>
<td>{0,2}</td>
</tr>
</tbody>
</table>

ReqQ

Cache 0

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Data</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>S-&gt;M</td>
<td>3</td>
</tr>
</tbody>
</table>

Cache 1

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Data</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>S-&gt;M</td>
<td>3</td>
</tr>
</tbody>
</table>

Core 0

Core 1

April 8, 2021

MIT 6.823 Spring 2021
Protocol Races

- Directory serializes multiple requests for the same address
  - Same-address requests are queued or NACKed and retried
- But races still exist due to conflicting requests
- Example: Upgrade race

Caches 0 and 1 issue simultaneous ExReqs
Directory starts serving cache 0’s ExReq, queues cache 1’s
Protocol Races

- Directory serializes multiple requests for the same address
  - Same-address requests are queued or NACKed and retried
- But races still exist due to conflicting requests
- Example: Upgrade race

Caches 0 and 1 issue simultaneous ExReqs
Directory starts serving cache 0’s ExReq, queues cache 1’s
Cache 1 expected ExResp, but got InvReq!
Protocol Races

- Directory serializes multiple requests for the same address
  - Same-address requests are queued or NACKed and retried
- But races still exist due to conflicting requests
- Example: Upgrade race

Caches 0 and 1 issue simultaneous ExReqs
Directory starts serving cache 0’s ExReq, queues cache 1’s

Cache 1 expected ExResp, but got InvReq!
Cache 1 should transition from S->M to I->M and send InvResp
Extra Hops and 3-Hop Protocols
Reducing Protocol Latency

- Problem: Data in another cache needs to pass through the directory, adding latency
- Optimization: Forward data to requester directly

```
Main Memory

Directory

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Sharers</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>Ex</td>
<td>{0}</td>
</tr>
</tbody>
</table>

Cache 0

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Data</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>M</td>
<td>3</td>
</tr>
</tbody>
</table>

Core 0

Cache 1

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Data</th>
</tr>
</thead>
</table>

Core 1

Cache 2

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Data</th>
</tr>
</thead>
</table>

Core 2
```
Extra Hops and 3-Hop Protocols
Reducing Protocol Latency

- Problem: Data in another cache needs to pass through the directory, adding latency
- Optimization: Forward data to requester directly

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Sharers</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>Ex</td>
<td>{0}</td>
</tr>
</tbody>
</table>

Core 0
- Tag: 0xA
- State: M
- Data: 3

Core 1
- Tag: 0xA
- State: M
- Data: 3

Core 2
- Tag: 0xA
- State: M
- Data: 3

1 ST 0xA
**Extra Hops and 3-Hop Protocols**

**Reducing Protocol Latency**

- **Problem:** Data in another cache needs to pass through the directory, adding latency
- **Optimization:** Forward data to requester directly

---

**Diagram:**

- **Main Memory**
- **Directory**
  - **Tag:** 0xA
  - **State:** Ex
  - **Sharers:** {0}

- **Cache 0**
  - **Tag:** 0xA
  - **State:** M
  - **Data:** 3

- **Cache 1**
  - **Tag:** 0xA
  - **State:** I->M
  - **Data:**

- **Cache 2**
  - **Tag:** 0xA
  - **State:**
  - **Data:**

- **Core 0**
- **Core 1**
- **Core 2**

**Note:**

1. ST 0xA
Extra Hops and 3-Hop Protocols
Reducing Protocol Latency

- Problem: Data in another cache needs to pass through the directory, adding latency
- Optimization: Forward data to requester directly

![Diagram of memory hierarchy and cache states]

1. ST 0xA
2. ExReq 0xA
Extra Hops and 3-Hop Protocols
Reducing Protocol Latency

- Problem: Data in another cache needs to pass through the directory, adding latency
- Optimization: Forward data to requester directly

Main Memory

Directory

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Sharers</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>Ex-&gt;Ex</td>
<td>{2}</td>
</tr>
</tbody>
</table>

Core 0

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Data</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>M</td>
<td>3</td>
</tr>
</tbody>
</table>

Core 1

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Data</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

Core 2

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Data</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>I-&gt;M</td>
<td></td>
</tr>
</tbody>
</table>
Extra Hops and 3-Hop Protocols
Reducing Protocol Latency

- Problem: Data in another cache needs to pass through the directory, adding latency
- Optimization: Forward data to requester directly

Main Memory

Directory

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Sharers</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>Ex-&gt;Ex</td>
<td>{2}</td>
</tr>
</tbody>
</table>

1. ST 0xA
2. ExReq 0xA
3. ExFwd 0xA, req=2
Extra Hops and 3-Hop Protocols
Reducing Protocol Latency

- Problem: Data in another cache needs to pass through the directory, adding latency
- Optimization: Forward data to requester directly

**Diagram:**

1. **ST 0xA**
2. **ExReq 0xA**
3. **ExFwd 0xA, req=2**

**Directory Table:**

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Sharers</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>Ex-&gt;Ex</td>
<td>{2}</td>
</tr>
</tbody>
</table>
Extra Hops and 3-Hop Protocols
Reducing Protocol Latency

- Problem: Data in another cache needs to pass through the directory, adding latency
- Optimization: Forward data to requester directly

```
Main Memory

Directory
<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Sharers</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>Ex-&gt;Ex</td>
<td>{2}</td>
</tr>
</tbody>
</table>

ExFwd 0xA, req=2
ExReq 0xA
ExResp 0xA, data=3
ST 0xA
```

Core 0  
L14-24
Extra Hops and 3-Hop Protocols
Reducing Protocol Latency

- Problem: Data in another cache needs to pass through the directory, adding latency
- Optimization: Forward data to requester directly

```
Main Memory

Directory

<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Sharers</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>Ex-&gt;Ex</td>
<td>{2}</td>
</tr>
</tbody>
</table>
```

1. ST 0xA
2. ExReq 0xA
3. ExFwd 0xA, req=2
4. ExResp 0xA, data=3
Extra Hops and 3-Hop Protocols
Reducing Protocol Latency

- Problem: Data in another cache needs to pass through the directory, adding latency
- Optimization: Forward data to requester directly

```
+----------------+               +----------------+               +----------------+               +----------------+                   
| Core 0         |               | Core 1         |               | Core 2         |                   
|                |               |                |               |                |                   
| Cache 0        |               | Cache 1        |               | Cache 2        |                   
|                |               |                |               |                |                   
|                |               |                |               |                |                   
|                |               |                |               |                |                   
|                |               |                |               |                |                   
|                |               |                |               |                |                   
|                |               |                |               |                |                   
|                |               |                |               |                |                   
|                |               |                |               |                |                   
|                |               |                |               |                |                   
|                |               |                |               |                |                   
|                |               |                |               |                |                   
|                |               |                |               |                |                   
|                |               |                |               |                |                   
+----------------+               +----------------+               +----------------+                   
```

April 8, 2021
Extra Hops and 3-Hop Protocols
Reducing Protocol Latency

- Problem: Data in another cache needs to pass through the directory, adding latency
- Optimization: Forward data to requester directly

```
<table>
<thead>
<tr>
<th>Core 0</th>
<th>Core 1</th>
<th>Core 2</th>
</tr>
</thead>
<tbody>
<tr>
<td>ST 0xA</td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>
```

```
<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Data</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>I</td>
<td>3</td>
</tr>
<tr>
<td>0xA</td>
<td>M</td>
<td>3</td>
</tr>
</tbody>
</table>
```

```
<table>
<thead>
<tr>
<th>Tag</th>
<th>State</th>
<th>Sharers</th>
</tr>
</thead>
<tbody>
<tr>
<td>0xA</td>
<td>Ex</td>
<td>{2}</td>
</tr>
</tbody>
</table>
```

1. ST 0xA
2. ExReq 0xA
3. ExResp 0xA, data=3
4. ExAck 0xA
5. ExFwd 0xA, req=2
Coherence in Multi-Level Hierarchies

- Can use the same or different protocols to keep coherence across multiple levels
- Key invariant: Ensure sufficient permissions in all intermediate levels
- Example: 8-socket Xeon E7 (8 cores/socket)
In-Cache Directories

- Common multicore memory hierarchy:
  - 1+ levels of private caches
  - A shared last-level cache
  - Need to enforce coherence among private caches

- Idea: Embed the directory information in shared cache tags
  - Shared cache must be inclusive
In-Cache Directories

• Common multicore memory hierarchy:
  – 1+ levels of private caches
  – A shared last-level cache
  – Need to enforce coherence among private caches

• Idea: Embed the directory information in shared cache tags
  – Shared cache must be inclusive

✓ Avoids tag overheads & separate lookups
× Can be inefficient if shared cache size $\gg \sum$(private cache sizes)
Avoiding Protocol Deadlock

- Protocols can cause deadlocks even if network is deadlock-free! *(more on this later)*

Example: Both nodes saturate all intermediate buffers with requests to each other, blocking responses from entering the network.
Avoiding Protocol Deadlock

• Protocols can cause deadlocks even if network is deadlock-free! (*more on this later*)

Example: Both nodes saturate all intermediate buffers with requests to each other, blocking responses from entering the network

• Solution: Separate *virtual networks*
  - Different sets of virtual channels and endpoint buffers
  - Same physical routers and links
Avoiding Protocol Deadlock

- Protocols can cause deadlocks even if network is deadlock-free! (*more on this later*)

![Diagram showing inter-node communication with requests and responses]

Example: Both nodes saturate all intermediate buffers with requests to each other, blocking responses from entering the network

- **Solution:** Separate *virtual networks*
  - Different sets of virtual channels and endpoint buffers
  - Same physical routers and links

- Most protocols require at least 2 virtual networks (for requests and replies), often >2 needed
Implementing Atomic Instructions

• In general, an *atomic read-modify-write* instruction requires two memory operations without intervening memory operations by other processors.
Implementing Atomic Instructions

• In general, an *atomic read-modify-write* instruction requires two memory operations without intervening memory operations by other processors.

• Implementation options:
  • *With snoopy coherence, lock the bus* -> expensive
In general, an atomic read-modify-write instruction requires two memory operations without intervening memory operations by other processors.

Implementation options:
- With snoopy coherence, lock the bus -> expensive
- With directory-based coherence, lock the line in the cache (prevent invalidations or evictions until atomic op finishes) -> complex
Implementing Atomic Instructions

• In general, an atomic read-modify-write instruction requires two memory operations without intervening memory operations by other processors.

• Implementation options:
  • With snoopy coherence, lock the bus -> expensive
  • With directory-based coherence, lock the line in the cache (prevent invalidations or evictions until atomic op finishes) -> complex

• Modern processors often use
  load-reserve
  store-conditional
Load-reserve & Store-conditional

Special register(s) to hold reservation flag and address, and the outcome of store-conditional

Load-reserve R, (a):
<flag, adr> ← <1, a>;
R ← M[a];

Store-conditional (a), R:
if <flag, adr> == <1, a>
then cancel other procs’ reservation on a;
M[a] ← <R>;
status ← succeed;
else status ← fail;

If the cache receives an invalidation to the address in the reserve register, the reserve bit is set to 0

- Several processors may reserve ‘a’ simultaneously
- These instructions are like ordinary loads and stores with respect to the bus traffic
Load-Reserve/Store-Conditional

Swap implemented with Ld-Reserve/St-Conditional

# Swap(R1, mutex):

L: Ld-Reserve R2, (mutex)
   St-Conditional (mutex), R1
   if (status == fail) goto L
   R1 <- R2
The total number of coherence transactions is not necessarily reduced, but splitting an atomic instruction into load-reserve & store-conditional:

- *increases utilization* (and reduces processor stall time), especially in split-transaction buses and directories

- *reduces cache ping-pong effect* because processors trying to acquire a semaphore do not have to perform stores each time
Thank you!

Next Lecture: Consistency and Relaxed Memory Models