Non-transient Side Channels

Mengjia Yan
Fall 2020
Lab Assignment

• Handout on course website
• Each (regular) student will receive an email
  • Solo or 2-person group
  • Individual GitHub repo
  • Info about accessing a server machine
• Listeners who want to try the lab can send us an email

• Advice:
  • Start early. The first step is not to implement the attack, but to reverse engineer the machine.
Recap: Prime+Probe

[Figure: a Sender and a Receiver share one Cache Set (# ways lines) of the Shared Cache, holding Sender lines and Receiver lines. Timeline: the receiver primes the set (Prime), waits while the sender may access its line (Wait, Access), then probes the set (Probe).]

6.888 L5-Non-transient Side Channels
Receive “1” = 8 accesses → 1 miss
Analogy: Bucket/Ball

Each cache set is a bucket that can hold 8 balls.

[Figure: the balls are the sender's and the receiver's addresses; each bucket is a Cache Set with # ways slots in the Shared Cache.]

How many cache lines are there in total in the system?
How to find the bucket used by the sender?
Practical Cache Side Channels
Cache Mapping – Directly Mapped Cache

- Cache mapping can be thought of as a hash table with limited size
- Linear cache set mapping using modular arithmetic

Set Index = (Addr / Block Size) % Number of Sets (assuming byte-addressable memory)

**Physical Address (32 bits):**
- Tag (high order bits, bits 31 to 9): distinguishes addresses in the same set
- Set Index (3 bits, bits 8 to 6)
- Line offset (6 bits, bits 5 to 0)

[Figure: a directly mapped cache with 8 sets (index 0 to 7); each set holds one Tag and one Data line (64 bytes).]

Number of bits for set index = \( \log_2(\text{Number of sets}) \)

**Question:** Given a 1MB L2 with 1024 sets, how many bits are used for the set index?
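The mapping formula above can be sketched in C. This is a minimal illustration, assuming byte addresses and a power-of-two number of sets; the function names `set_index`, `tag_of`, and `index_bits` are hypothetical and not part of the lecture.

```c
#include <assert.h>
#include <stdint.h>

/* Set index of a byte address: (addr / block_size) % num_sets. */
uint64_t set_index(uint64_t addr, uint64_t block_size, uint64_t num_sets) {
    assert(block_size > 0 && num_sets > 0);
    return (addr / block_size) % num_sets;
}

/* Tag: the address bits left above the set index and the line offset. */
uint64_t tag_of(uint64_t addr, uint64_t block_size, uint64_t num_sets) {
    assert(block_size > 0 && num_sets > 0);
    return addr / block_size / num_sets;
}

/* log2 of a power of two, i.e. the number of set index bits. */
unsigned index_bits(uint64_t num_sets) {
    assert(num_sets != 0 && (num_sets & (num_sets - 1)) == 0);
    unsigned bits = 0;
    while (num_sets > 1) {
        num_sets >>= 1;
        bits++;
    }
    return bits;
}
```

With 64-byte lines and 8 sets, address 0x1C0 falls in set 7. For the question above, `index_bits(1024)` gives 10 set index bits.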
Cache Mapping – Set Associative Cache

- Cache mapping can be thought of as a hash table with limited size
- Linear cache set mapping using modular arithmetic

**Physical Address:**
- Tag (high order bits, bits 31 to 9)
- Set Index (3 bits, bits 8 to 6)
- Line offset (6 bits, bits 5 to 0)

**2-way cache:**

[Figure: 8 sets (index 0 to 7); each set has two ways, and each way holds a Tag and a Data line.]

**Question:** How to decide which way to use?

**Answer:** Cache replacement policy.

Find eviction set = find addresses with the same set index bits
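Finding addresses with the same set index bits can be sketched as follows. This is only an illustration, assuming physical addresses, the linear mapping shown above, and no cache slices; `congruent_addresses` is a hypothetical helper name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fill out[0..count) with addresses that share base's set index:
 * consecutive candidates are num_sets * line_size bytes apart, so only
 * their tag bits differ. */
void congruent_addresses(uint64_t base, uint64_t line_size, uint64_t num_sets,
                         uint64_t *out, size_t count) {
    assert(line_size > 0 && num_sets > 0);
    uint64_t stride = line_size * num_sets;
    for (size_t i = 0; i < count; i++)
        out[i] = base + (uint64_t)i * stride;
}
```

For a W-way set, an eviction set needs at least W such addresses.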
Address Translation (4KB page)

- Programmer’s view: Virtual Address (48 bits) = Virtual page number + Page offset (12 bits)
- System’s view: Physical Address (32 bits) = Physical page number + Page offset (12 bits)
- The Page Table maps the virtual page number to the physical page number
- The page offset is copied unchanged
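The translation step can be sketched in C. This is a simplified model, assuming a flat single-level table indexed by virtual page number; real page tables are multi-level, and `translate` and its parameters are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_OFFSET_BITS 12u /* 4KB page */

/* Hypothetical flat lookup: page_table[vpn] holds the physical page number. */
uint32_t translate(uint64_t vaddr, const uint32_t *page_table,
                   uint64_t num_entries) {
    uint64_t vpn = vaddr >> PAGE_OFFSET_BITS;
    uint64_t offset = vaddr & ((1u << PAGE_OFFSET_BITS) - 1);
    assert(vpn < num_entries);
    /* The page offset is copied unchanged into the physical address. */
    return (uint32_t)((page_table[vpn] << PAGE_OFFSET_BITS) | offset);
}
```

Only the page number changes during translation, so the low 12 bits of a virtual address are also the low 12 bits of its physical address.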
Find Eviction Set Using Virtual Addresses

Virtual Address (48 bits):
- Virtual page number (bits 47 to 12)
- Page offset (12 bits, bits 11 to 0)

Physical Address (32 bits), 4KB page:
- Physical page number (bits 31 to 12)
- Page offset (12 bits, bits 11 to 0)

Cache mapping (8 sets):
- Tag (bits 31 to 9)
- Index (3 bits, bits 8 to 6)
- Line offset (6 bits, bits 5 to 0)

Cache mapping (256 sets):
- Tag (bits 31 to 14)
- Set Index (8 bits, bits 13 to 6)
- Line offset (6 bits, bits 5 to 0)

With 8 sets, the whole index lies inside the page offset. With 256 sets, index bits 13 and 12 belong to the physical page number and are not controllable via the virtual address.
Huge Pages

- Huge page size: 2MB or 1GB
  - Number of bits for page offset?

```
Virtual Address, 4KB page:
  bits 47..12  Virtual page number
  bits 11..0   Page offset (12 bits)

Virtual Address, 2MB page:
  bits 47..21  Virtual page number
  bits 20..0   Page offset (21 bits)

Cache mapping (256 sets):
  Tag | Set Index (8 bits) | Line offset (6 bits)
```

With a 2MB page, all 8 set index bits fall inside the page offset.
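How many set index bits an attacker controls through the virtual address can be computed from the three field widths. A minimal sketch, assuming the index sits directly above the line offset; `controllable_index_bits` is a hypothetical name.

```c
#include <assert.h>

/* Number of set index bits that lie inside the page offset, given the
 * widths (in bits) of the page offset, line offset and set index. */
unsigned controllable_index_bits(unsigned page_offset_bits,
                                 unsigned line_offset_bits,
                                 unsigned set_index_bits) {
    assert(page_offset_bits >= line_offset_bits);
    unsigned room = page_offset_bits - line_offset_bits;
    return room < set_index_bits ? room : set_index_bits;
}
```

For 256 sets and 64-byte lines, a 4KB page (12 offset bits) exposes 6 of the 8 index bits, while a 2MB page (21 offset bits) exposes all 8.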

Multi-level Caches

• Motivation:
  • A memory cannot be both large and fast. Adding levels of cache reduces the miss penalty.

A typical configuration of Intel Ivy Bridge. Configurations differ across processor types.

<table>
<thead>
<tr>
<th></th>
<th>L1-I/D cache</th>
<th>L2 cache</th>
<th>L3 cache (LLC)</th>
<th>DRAM</th>
</tr>
</thead>
<tbody>
<tr>
<td>Size</td>
<td>32KB</td>
<td>256KB</td>
<td>1MB/core</td>
<td>16GB</td>
</tr>
<tr>
<td>Associativity (# ways)</td>
<td>4 or 8</td>
<td>8</td>
<td>16</td>
<td>N/A</td>
</tr>
<tr>
<td>Latency (cycles)</td>
<td>1-5</td>
<td>12</td>
<td>~40</td>
<td>~150</td>
</tr>
</tbody>
</table>

• LLC is generally divided into multiple slices
  • A conflict happens only if addresses map to the same slice and the same set

[Figure: cache levels and LLC slice allocation]
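The number of sets per cache follows from size, associativity, and line size. The sketch below assumes 64-byte lines, as in the earlier mapping slides; `sets_in_cache` is a hypothetical helper name.

```c
#include <assert.h>
#include <stdint.h>

/* Number of sets = cache size / (ways * line size). */
uint64_t sets_in_cache(uint64_t size_bytes, uint64_t ways, uint64_t line_size) {
    assert(ways > 0 && line_size > 0);
    assert(size_bytes % (ways * line_size) == 0);
    return size_bytes / (ways * line_size);
}
```

Under that assumption, an 8-way 32KB L1 has 64 sets, the 8-way 256KB L2 has 512 sets, and one 16-way 1MB LLC slice has 1024 sets.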

Eviction Set Construction Algorithm

[Figure: timeline (Time axis) between Sender and Receiver with the steps Access Target Address, Access Candidate Addresses, and Wait; legend: Sender line, Receiver line.]

Problems Due to Replacement Policy

• Self-eviction due to replacement policy
  • An LRU (least recently used) example

Initial: (empty)
Prime: 1 2 3 4 5 6 7 8
Victim access: 9 2 3 4 5 6 7 8
Probe: 9 2 3 4 5 6 7 8
Which to evict?

Probing address 1 misses and evicts address 2, the current LRU line, so every following probe access also misses.

• A small trick:
  • Access addresses in reverse order
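The self-eviction effect can be reproduced with a small simulation of one 8-way set. This sketch assumes a true LRU policy, which real processors only approximate; `lru_set`, `lru_access`, and `probe_misses` are hypothetical names.

```c
#include <assert.h>

#define WAYS 8

/* One cache set with true LRU: lines[0] is least recently used,
 * lines[WAYS-1] is most recently used; 0 marks an empty way. */
typedef struct { int lines[WAYS]; } lru_set;

/* Access addr (non-zero); returns 1 on a miss, 0 on a hit. */
int lru_access(lru_set *s, int addr) {
    int pos = 0, miss = 1;
    for (int i = 0; i < WAYS; i++) {
        if (s->lines[i] == addr) { pos = i; miss = 0; break; }
    }
    /* On a miss pos stays 0, so the LRU line (or an empty way) is dropped. */
    for (int i = pos; i < WAYS - 1; i++)
        s->lines[i] = s->lines[i + 1];
    s->lines[WAYS - 1] = addr;
    return miss;
}

/* Prime with 1..8, let the victim access 9, then probe 1..8 in forward
 * or reverse order; returns the number of probe misses. */
int probe_misses(int reverse) {
    lru_set s = {{0}};
    for (int a = 1; a <= WAYS; a++)
        lru_access(&s, a);
    lru_access(&s, 9);
    int misses = 0;
    for (int k = 0; k < WAYS; k++)
        misses += lru_access(&s, reverse ? WAYS - k : 1 + k);
    return misses;
}
```

Probing in prime order yields 8 misses, while probing in reverse order yields exactly 1 miss.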
Measure Latency of Multiple Accesses

• HW Prefetcher + Out-of-order execution

T1 = rdtsc()
Dummy1 = Ld(Addr1)
......
Dummy8 = Ld(Addr8)
T2 = rdtsc()
Latency = T2 - T1

What we expect: Ld A1, Ld A2, ......, Ld A7, Ld A8 execute one after another along the time axis.

What actually will happen: the loads Ld A1 ... Ld A8 can overlap in time, so the measured latency does not reflect eight serialized accesses.
Out-of-Order Processor

Fetch → Decode → RegRead → Execute → Writeback (Commit)

RegRead: check whether the register to read is ready.

[Figure: Ld A1, Ld A2, ..., Ld A7, Ld A8 overlapping along the time axis]

Question: How to serialize data accesses?
Serialize Data Accesses

• A special instruction “mfence”
  https://www.felixcloutier.com/x86/mfence

• Add data dependency by creating a linked list

```
Dummy1 = Ld(Addr1)   // independent load: the loaded value is discarded
Addr2 = Ld(Addr1)    // linked list: the loaded value is the next address
```

  • Doubly linked list to access addresses in reverse order
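The linked-list idea can be sketched in portable C as a pointer chase. This is an illustration only: it shows the data dependency, not the timing code, and `link_lines` and `chase` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Link lines[0] -> lines[1] -> ... -> lines[n-1] -> lines[0] by storing
 * the address of the next line in the first word of each line. */
void link_lines(void **lines[], size_t n) {
    assert(n > 0);
    for (size_t i = 0; i < n; i++)
        *lines[i] = (void *)lines[(i + 1) % n];
}

/* Follow `steps` links; each load needs the result of the previous one,
 * so the loads cannot be issued in parallel. */
void **chase(void **start, size_t steps) {
    void **p = start;
    while (steps--)
        p = (void **)*p;
    return p;
}
```

A measurement would read the timestamp counter before and after `chase`. Storing a second pointer per line for the previous element would allow the reverse-order walk.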
Handle Noise

• A real-world example: Square-and-Multiply Exponentiation

What you generally see in papers:

```plaintext
r = 1
for i = k-1 down to 0 do        // k = number of bits in the exponent e
    r = sqr(r) mod n
    if e_i == 1 then
        r = mul(r, b) mod n
    end
end
```
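The pseudocode above can be written as runnable C for small operands. This sketch is deliberately not constant time: it keeps the exponent-dependent branch that the attack observes. It assumes a modulus below 2^32 so that products fit in 64 bits, whereas real implementations use multi-precision integers; `modexp` is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

/* Left-to-right square-and-multiply: returns b^e mod n.
 * Requires 0 < n < 2^32 so that products fit in 64 bits. */
uint64_t modexp(uint64_t b, uint64_t e, uint64_t n) {
    assert(n > 0 && n < (1ULL << 32));
    uint64_t r = 1 % n;
    b %= n;
    for (int i = 63; i >= 0; i--) {
        r = (r * r) % n;      /* square on every exponent bit */
        if ((e >> i) & 1)     /* branch depends on the secret exponent */
            r = (r * b) % n;  /* multiply only when the bit is 1 */
    }
    return r;
}
```

For example, `modexp(4, 13, 497)` returns 445. The multiply call happens only for 1 bits, which is what makes the exponent recoverable from the victim's access pattern.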
The Multiply Function

```c
mpi_limb_t
mpihelp_mul( mpi_ptr_t prodp, mpi_ptr_t up, mpi_size_t usize,
             mpi_ptr_t vp, mpi_size_t vsize)
{
    mpi_ptr_t prod_endp = prodp + usize + vsize - 1;
    mpi_limb_t cy;
    struct karatsuba_ctx ctx;

    if( vsize < KARATSUBA_THRESHOLD ) {
        mpi_size_t i;
        mpi_limb_t v_limb;

        if( !vsize )
            return 0;

        /* Multiply by the first limb in V separately, as the result can be
         * stored (not added) to PROD. We also avoid a loop for zeroing. */
        v_limb = vp[0];
        if( v_limb <= 1 ) {
            if( v_limb == 1 )
                MPN_COPY( prodp, up, usize );
            else
                MPN_ZERO( prodp, usize );
            cy = 0;
        }
        else
            cy = mpihelp_mul_1( prodp, up, usize, v_limb );

        prodp[usize] = cy;
        prodp++;

        /* For each iteration in the outer loop, multiply one limb from
         * U with one limb from V, and add it to PROD. */
        for( i = 1; i < vsize; i++ ) {
            v_limb = vp[i];
            if( v_limb <= 1 ) {
                cy = 0;
                if( v_limb == 1 )
                    cy = mpihelp_add_n(prodp, prodp, up, usize);
            }
            else
                cy = mpihelp_addmul_1(prodp, up, usize, v_limb);

            prodp[usize] = cy;
            prodp++;
        }

        return cy;
    }

    memset( &ctx, 0, sizeof ctx );
    mpihelp_mul_karatsuba_case( prodp, up, usize, vp, vsize, &ctx );
    mpihelp_release_karatsuba_ctx( &ctx );
    return *prod_endp;
}
```

Access latencies measured in the probe operation in Prime+Probe. A sequence of "0101011011001" can be deduced as part of the exponent.
Other problems may exist

- Tips for lab assignment
  - Build the attack step-by-step
  - Recommended reading: “Last-Level Cache Side-Channel Attacks are Practical”
  - Ask questions via Piazza
Defenses
Micro-architecture Side Channels

[Figure: Victim (secret-dependent execution) → A Channel (a micro-architecture structure) → Attacker]

Defenses:

- Block creation of signals: Oblivious execution, speculative execution defenses, etc.
- Close the channel: Isolation, etc.
- Block detection of signals: Randomization, etc.

Kiriansky et al. DAWG: a defense against cache timing attacks in speculative execution processors. MICRO’18
Defense Design Considerations

- Security
- Performance
- Portability

The Problem: The ISA Abstraction

- Interface between HW and SW: ISA (instruction set architecture)
  - Advantage: HW optimizations without affecting usability/portability

[Figure: Software (branch, arithmetic instruction, load/store) sits above the ISA; Hardware (caches, DRAM, TLBs, etc.) sits below it.]
DEC — Decrement by 1

<table>
<thead>
<tr>
<th>Opcode</th>
<th>Instruction</th>
<th>Op/En</th>
<th>64-Bit Mode</th>
<th>Compat/LEG Mode</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>FE /1</td>
<td>DEC r/m8</td>
<td>M</td>
<td>Valid</td>
<td>Valid</td>
<td>Decrement r/m8 by 1.</td>
</tr>
<tr>
<td>REX + FE /1</td>
<td>DEC r/m8*</td>
<td>M</td>
<td>Valid</td>
<td>N.E.</td>
<td>Decrement r/m8 by 1.</td>
</tr>
<tr>
<td>FF /1</td>
<td>DEC r/m16</td>
<td>M</td>
<td>Valid</td>
<td>Valid</td>
<td>Decrement r/m16 by 1.</td>
</tr>
<tr>
<td>FF /1</td>
<td>DEC r/m32</td>
<td>M</td>
<td>Valid</td>
<td>Valid</td>
<td>Decrement r/m32 by 1.</td>
</tr>
<tr>
<td>REX.W + FF /1</td>
<td>DEC r/m64</td>
<td>M</td>
<td>Valid</td>
<td>N.E.</td>
<td>Decrement r/m64 by 1.</td>
</tr>
<tr>
<td>48+rw</td>
<td>DEC r16</td>
<td>O</td>
<td>N.E.</td>
<td>Valid</td>
<td>Decrement r16 by 1.</td>
</tr>
<tr>
<td>48+rd</td>
<td>DEC r32</td>
<td>O</td>
<td>N.E.</td>
<td>Valid</td>
<td>Decrement r32 by 1.</td>
</tr>
</tbody>
</table>

* In 64-bit mode, r/m8 cannot be encoded to access the following byte registers if a REX prefix is used: AH, BH, CH, DH.

Instruction Operand Encoding

<table>
<thead>
<tr>
<th>Op/En</th>
<th>Operand 1</th>
<th>Operand 2</th>
<th>Operand 3</th>
<th>Operand 4</th>
</tr>
</thead>
<tbody>
<tr>
<td>M</td>
<td>ModRM:r/m (r, w)</td>
<td>NA</td>
<td>NA</td>
<td>NA</td>
</tr>
<tr>
<td>O</td>
<td>opcode + rd</td>
<td>NA</td>
<td>NA</td>
<td>NA</td>
</tr>
</tbody>
</table>

Description

Subtracts 1 from the destination operand, while preserving the state of the CF flag. The destination operand can be a register or a memory location. This instruction allows a loop counter to be updated without disturbing the CF flag. (To perform a decrement operation that updates the CF flag, use a SUB instruction with an immediate operand of 1.)

This instruction can be used with a LOCK prefix to allow the instruction to be executed atomically.

In 64-bit mode, DEC r16 and DEC r32 are not encodable (because opcodes 48H through 4FH are REX prefixes). Otherwise, the instruction’s 64-bit mode default operation size is 32 bits. Use of the REX.R prefix permits access to additional registers (R8-R15). Use of the REX.W prefix promotes operation to 64 bits.

See the summary chart at the beginning of this section for encoding data and limits.

Operation

DEST ← DEST − 1;
The Problem: The ISA Abstraction

- Interface between HW and SW: ISA

- ISA specifies functionality, not performance/timing
  - Compare Intel Ivy Bridge and Cascade Processor

Example:

```
DEC [addr]
```
Data Oblivious/“Constant time” Programming

Write programs without data-dependent behavior.

`secret` = confidential
`addr1` = public
`addr2` = public

**Original:**

```c
if (secret)
    a = *(addr1);
else
    a = *(addr2);
```

**Data Oblivious:**

```
a ← load (addr1);
b ← load (addr2);
cmov a = (secret) ? a : b;
```
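The data-oblivious version can be approximated in C with a mask-based select. This is a sketch, not a guarantee: a compiler may still emit a branch, and real constant-time code is checked at the assembly level. `oblivious_select` and `oblivious_load` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Returns a if secret != 0, else b, without branching on secret
 * at the source level. */
uint32_t oblivious_select(uint32_t secret, uint32_t a, uint32_t b) {
    /* mask is all ones when secret != 0 and all zeros otherwise */
    uint32_t mask = (uint32_t)0 - (uint32_t)(secret != 0);
    return (a & mask) | (b & ~mask);
}

/* Both addresses are always read, so the memory access pattern does not
 * depend on secret. */
uint32_t oblivious_load(uint32_t secret, const uint32_t *addr1,
                        const uint32_t *addr2) {
    uint32_t a = *addr1;
    uint32_t b = *addr2;
    return oblivious_select(secret, a, b);
}
```

The cost is that both loads always execute, even though only one result is used.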

Programming in Circuit Abstraction

• Program = DAG (“circuit”)
• Operations = nodes (“gates”)
• Data transfers = edges (“wires”)

• Topology must be independent of confidential data
• Each gate’s execution must hide its inputs
• Each wire must hide the value it carries
What assumptions underpin the model?

- **Rule 1:** instruction/gate execution = confidential data-independent
- **Rule 2:** data transfer/wire = confidential data-independent
- **Rule 3:** circuit/program topology = fixed

- secret = confidential
- addr1 = public
- addr2 = public

```c
if (secret)
    a = *(addr1);
else
    a = *(addr2);
```

```
a ← load addr1
b ← load addr2

cmov secret, b, a
```
Today’s machines can violate these assumptions

- **Rule 1:** instruction/gate execution = confidential data-independent
- **Rule 2:** data transfer/wire = confidential data-independent
- **Rule 3:** circuit/program topology = fixed

Violations due to:

- Data-dependent instruction optimizations (e.g., zero-skip, early exit, microcode, silent stores, ...)
- Data at rest optimizations (e.g., compression in register file/uop fusion, cache, page tables, ...)
- Speculative/OoO execution (highlighted against Rule 3)
HW Resource Partition

• Security vs. Quality of Service (QoS)
  • Intel Cache Allocation Technology (CAT)
• Temporal Partition vs. Spatial Partition

• Challenges nowadays:
  • Security domain determination is tricky
  • Scalability: what if #domains > #partitions?
  • How to partition inside cores?
  • Why not execute applications on a single node?
Randomization/Fuzzing

• Introduce noise to time measurement/Make time measurement coarse-grained
  • Pros and cons?
    + Simple and no performance overhead
    + Effective against a group of popular attacks
    ......
    - Not effective against attacks that do not measure time
    - Not effective for victims that cause a big timing difference
    - Affects usability if a benign application needs a fine-grained timer

• Randomize cache mapping functions
  • Pros and cons?
    + Generally low performance overhead (still allows the cache to be shared)
    - Difficult to reason about security
    +/- Can reduce attack bandwidth, but unlikely to eliminate attacks
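Making time measurement coarse-grained, the first defense above, can be illustrated by rounding timestamps. This is a toy sketch under the assumption that the timer is exposed as an integer tick count; `coarsen` is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

/* Round a timestamp down to a multiple of `granularity` ticks. */
uint64_t coarsen(uint64_t timestamp, uint64_t granularity) {
    assert(granularity > 0);
    return timestamp - timestamp % granularity;
}
```

With a granularity of 1000 ticks, readings of 1040 and 1150 both become 1000, so a 40-cycle hit and a 150-cycle miss measured from tick 1000 look the same. A victim that causes a timing difference larger than the granularity remains observable, matching the con listed above.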
Next Lecture: Transient Side Channels