## **Non-transient Side Channels**

Mengjia Yan

Fall 2020





### **Lab Assignment**

- Handout on course website
- Each (regular) student will receive an email
  - Solo or 2-person group
  - Individual GitHub repo
  - Info about accessing a server machine
- Listeners can send us an email if they want to try the lab
- Advice:
  - Start early. The first step is not to implement the attack, but to reverse engineer the machine.











## **Analogy: Bucket/Ball**

- Each cache set is a bucket that can hold 8 balls (# ways = 8)
- Figure: the sender's address and the receiver's addresses map into one cache set of the **shared cache**
- Receive "1" = 8 accesses $\rightarrow$ 1 miss
- How many cache lines in total in the system?
- How to find the bucket used by the sender?



# **Practical Cache Side Channels**





- Can think of cache mapping as a hash table with limited size
- Linear cache set mapping using modular arithmetic (assuming byte-addressable memory)

Set Index = (Addr / Block Size) % Number of Sets

Physical Address:

| Bits 31:9 | Bits 8:6 | Bits 5:0 |
|-----------|----------|----------|
| Tag (high-order bits) | Set Index (3 bits) | Line offset (6 bits) |
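A minimal sketch of the linear mapping above. The constants (64-byte lines, 8 sets) follow the 6-bit line offset and 3-bit set index in the table; the function name `split_address` is illustrative and not from the slides.

```python
BLOCK_SIZE = 64   # bytes per cache line -> 6 line-offset bits
NUM_SETS = 8      # 3 set-index bits


def split_address(addr: int) -> tuple[int, int, int]:
    """Return (tag, set_index, line_offset) for a byte address."""
    line_offset = addr % BLOCK_SIZE
    set_index = (addr // BLOCK_SIZE) % NUM_SETS
    tag = addr // (BLOCK_SIZE * NUM_SETS)
    return tag, set_index, line_offset


if __name__ == "__main__":
    for addr in (0x000, 0x047, 0x1C0, 0x200):
        print(hex(addr), split_address(addr))
```

Addresses `0x000` and `0x200` differ only in the tag, so they land in the same set and compete for it.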




#### 2-way cache

Physical Address:

| Bits 31:9 | Bits 8:6 | Bits 5:0 |
|-----------|----------|----------|
| Tag (high-order bits) | Index (3 bits) | Line offset (6 bits) |

Question: How to decide which way to use?

**Answer: Cache replacement policy.**

Find eviction set == Find addresses with the same set index bits
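Finding addresses with the same set index bits can be sketched as follows, assuming the linear mapping and the geometry used in these slides (64-byte lines, 8 sets, 2 ways). The helper names are illustrative, and the sketch assumes the attacker can reason directly about the addresses the cache uses for indexing.

```python
BLOCK_SIZE = 64
NUM_SETS = 8
WAYS = 2


def set_index(addr: int) -> int:
    return (addr // BLOCK_SIZE) % NUM_SETS


def congruent_addresses(target: int, count: int) -> list[int]:
    """Return `count` line-aligned addresses, distinct from the target's line,
    that map to the same cache set as `target`."""
    stride = BLOCK_SIZE * NUM_SETS           # one stride apart => same set index
    base = target - (target % BLOCK_SIZE)    # start of the target's cache line
    return [base + k * stride for k in range(1, count + 1)]


if __name__ == "__main__":
    eviction_set = congruent_addresses(0x1C0, WAYS)
    print([hex(a) for a in eviction_set])
```

Stepping by `BLOCK_SIZE * NUM_SETS` leaves the index bits unchanged, so `WAYS` such addresses are enough to fill the target's set.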



## **Address Translation (4KB page)**

- Programmer's view: Virtual Address (48 bits) = virtual page number (bits 47:12) + page offset (bits 11:0)
- System's view: Physical Address (32 bits) = physical page number (bits 31:12) + page offset (bits 11:0)
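A toy model of 4KB-page translation, to make explicit which bits survive it. The page table here is a plain dictionary from virtual page number to physical page number, which is a simplification chosen for illustration; the function name `translate` is hypothetical.

```python
OFFSET_BITS = 12                 # 4KB page
PAGE_SIZE = 1 << OFFSET_BITS


def translate(vaddr: int, page_table: dict[int, int]) -> int:
    """Translate a virtual address using a {VPN: PPN} mapping."""
    vpn = vaddr >> OFFSET_BITS
    offset = vaddr & (PAGE_SIZE - 1)
    ppn = page_table[vpn]        # a KeyError here models a page fault
    return (ppn << OFFSET_BITS) | offset


if __name__ == "__main__":
    table = {0x12345: 0xABC}
    print(hex(translate(0x12345678, table)))
```

Only the page offset is copied unchanged, so those are the only physical address bits a user-level program knows for certain.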

















#### **Huge Pages**

- Huge page size: 2MB or 1GB
  - Number of bits for page offset? (21 bits for 2MB, 30 bits for 1GB)

Virtual Address: 4KB page

| Bits 47:12 | Bits 11:0 |
|------------|-----------|
| Virtual page number | Page offset (12 bits) |

Virtual Address: 2MB page

| Bits 47:21 | Bits 20:0 |
|------------|-----------|
| Virtual page number | Page offset (21 bits) |

Cache mapping (256 sets): with 64-byte lines, the set index occupies address bits 13:6. A 4KB page offset covers only bits 11:6 of the index, while a 2MB page offset covers all of it.
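The effect of page size on the attacker's knowledge of the set index can be computed directly. This is a small sketch assuming power-of-two sizes and 64-byte lines; `known_set_index_bits` is an illustrative name.

```python
def known_set_index_bits(page_size: int, num_sets: int, line_size: int = 64) -> int:
    """Number of set-index bits that lie inside the page offset.

    All three arguments are assumed to be powers of two.
    """
    offset_bits = page_size.bit_length() - 1   # log2(page_size)
    line_bits = line_size.bit_length() - 1     # log2(line_size)
    index_bits = num_sets.bit_length() - 1     # log2(num_sets)
    return max(0, min(offset_bits - line_bits, index_bits))


if __name__ == "__main__":
    KB, MB = 1024, 1024 * 1024
    for page in (4 * KB, 2 * MB):
        print(page, known_set_index_bits(page, num_sets=256))
```

With 256 sets, a 4KB page fixes 6 of the 8 index bits and a 2MB page fixes all 8, which is why huge pages simplify eviction set construction.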





- Motivation:
  - A memory cannot be both large and fast. Adding levels of cache reduces the miss penalty.

A typical configuration of Intel Ivy Bridge. Configurations differ across processor types.

Each core has private L1 instruction and data caches (I-L1, D-L1) and a private L2; the LLC is shared by the cores.

|                        | L1-I/D cache | L2 cache | L3 cache (LLC) | DRAM |
|------------------------|--------------|----------|----------------|------|
| Size                   | 32KB         | 256KB    | 1MB/core       | 16GB |
| Associativity (# ways) | 4 or 8       | 8        | 16             | N/A  |
| Latency (cycles)       | 1-5          | 12       | ~40            | ~150 |
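The sizes and associativities in the table determine how many sets and lines each level has. The sketch below assumes 64-byte cache lines (not stated in the table) and treats the 1MB/core LLC figure as the size of one slice, which is also an assumption.

```python
LINE_SIZE = 64                 # bytes per line; assumed, not given in the table
KB, MB = 1024, 1024 * 1024


def num_lines(size_bytes: int) -> int:
    """Total cache lines in a cache of the given size."""
    return size_bytes // LINE_SIZE


def num_sets(size_bytes: int, ways: int) -> int:
    """Number of sets = size / (line size * associativity)."""
    return size_bytes // (LINE_SIZE * ways)


if __name__ == "__main__":
    print("L1 sets:", num_sets(32 * KB, 8))
    print("L2 sets:", num_sets(256 * KB, 8))
    print("LLC sets per 1MB slice:", num_sets(1 * MB, 16))
    print("LLC lines per 1MB slice:", num_lines(1 * MB))
```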

- LLC is generally divided into multiple slices
  - Conflict happens if addresses map to the same slice and the same set
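The conflict condition can be written as a predicate over two physical addresses. The slice hash below is a made-up parity function used only for illustration; real slice hash functions differ and are generally not documented. The geometry constants are assumptions as well.

```python
LINE_BITS = 6      # 64-byte lines (assumption)
SET_BITS = 10      # 1024 sets per slice (assumption)
NUM_SLICES = 2     # assumption


def toy_slice_hash(paddr: int) -> int:
    """Hypothetical slice selector: parity of the bits above the line offset."""
    return bin(paddr >> LINE_BITS).count("1") % NUM_SLICES


def set_index(paddr: int) -> int:
    return (paddr >> LINE_BITS) & ((1 << SET_BITS) - 1)


def conflicts(a: int, b: int) -> bool:
    """Two lines compete only if both the slice and the set index match."""
    return toy_slice_hash(a) == toy_slice_hash(b) and set_index(a) == set_index(b)


if __name__ == "__main__":
    print(conflicts(0x0, 0x10000))   # same set index, different slice
    print(conflicts(0x0, 0x30000))   # same set index, same slice
```

Matching set index bits alone is therefore not sufficient on a sliced LLC; the attacker also has to land in the right slice.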






# **Eviction Set Construction Algorithm**
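One simple way to construct an eviction set is to start from a large candidate pool and drop every address that is not needed for eviction. The sketch below is a quadratic-time illustration of that idea, not necessarily the algorithm drawn on the slide. The timing measurement is replaced by a simulated oracle (`evicts`), and the cache geometry and all function names are assumptions.

```python
LINE = 64
NUM_SETS = 64      # simulated geometry (assumption)
WAYS = 8


def hidden_set(addr: int) -> int:
    """Set mapping the attacker cannot observe directly (simulated as linear)."""
    return (addr // LINE) % NUM_SETS


def evicts(candidates: list[int], target: int) -> bool:
    """Simulated timing test: does accessing `candidates` evict `target`?
    Models an LRU set: eviction needs WAYS other lines in the target's set."""
    same = sum(1 for a in candidates if hidden_set(a) == hidden_set(target))
    return same >= WAYS


def reduce_to_eviction_set(candidates: list[int], target: int) -> list[int]:
    """Remove candidates one at a time while the rest still evict the target."""
    if not evicts(candidates, target):
        raise ValueError("candidate pool does not evict the target")
    working = list(candidates)
    for addr in candidates:
        trial = [a for a in working if a != addr]
        if evicts(trial, target):
            working = trial          # addr was not needed
    return working


if __name__ == "__main__":
    pool = [0x100000 + i * LINE for i in range(1024)]   # distinct cache lines
    minimal = reduce_to_eviction_set(pool, target=0x40)
    print(len(minimal), [hex(a) for a in minimal])
```

On a real machine `evicts` would be a timed access to the target after touching the candidates, so it is noisy and usually repeated several times.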






- Self-eviction due to replacement policy
  - An LRU (least recently used) example

| Step          | Cache set contents |
|---------------|--------------------|
| Initial       | (empty)            |
| Prime         | 1 2 3 4 5 6 7 8    |
| Victim access | 9 2 3 4 5 6 7 8    |
| Probe         | 9 2 3 4 5 6 7 8    |

Which to evict?

- A small trick:
  - Access addresses in reverse order
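The self-eviction effect and the reverse-order trick can be reproduced with a small simulation of one 8-way set under true LRU. This is a model under that assumption; real replacement policies are often only approximations of LRU. Class and function names are illustrative.

```python
from collections import OrderedDict


class LRUSet:
    """One cache set with true LRU replacement."""

    def __init__(self, ways: int = 8):
        self.ways = ways
        self.lines = OrderedDict()            # least recently used entry first

    def access(self, line) -> bool:
        """Access a line; return True on a hit and False on a miss."""
        if line in self.lines:
            self.lines.move_to_end(line)      # now most recently used
            return True
        if len(self.lines) == self.ways:
            self.lines.popitem(last=False)    # evict the LRU line
        self.lines[line] = None
        return False


def probe_misses(probe_order, victim_accessed: bool) -> int:
    """Prime with lines 1..8, optionally let the victim touch line 9, then probe."""
    cache = LRUSet(ways=8)
    for line in range(1, 9):
        cache.access(line)
    if victim_accessed:
        cache.access(9)                       # evicts line 1, the LRU line
    return sum(not cache.access(line) for line in probe_order)


if __name__ == "__main__":
    forward = list(range(1, 9))
    print("forward probe:", probe_misses(forward, True))
    print("reverse probe:", probe_misses(forward[::-1], True))
```

Probing in the prime order turns one victim access into eight misses, because each miss evicts the line probed next; probing in reverse order yields a single miss.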

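# Measure Latency of Multiple Accesses

- HW Prefetcher + Out-of-order execution: the loads below may be prefetched, overlapped, or reordered, so the measured latency need not equal eight serialized accesses

```
T1 = rdtsc()
Dummy1 = Ld(Addr1)
...
Dummy8 = Ld(Addr8)
T2 = rdtsc()
Latency = T2 - T1
```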












- A special instruction "mfence" (orders all earlier loads and stores before later ones)
  - https://www.felixcloutier.com/x86/mfence
- Add data dependency by creating a linked list: instead of loading a dummy value, each load returns the address of the next access

```
Dummy1 = Ld(Addr1)

Addr2 = Ld(Addr1)
```

- Doubly linked list to access addresses in reverse order
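The linked-list idea can be modeled with two pointer maps, one per direction. This is only a sketch of the data structure: in a real attack the pointers are stored in the cache lines themselves and followed with dependent loads, and Python gives no control over that. Function names are illustrative.

```python
def build_chain(addresses):
    """Return (next_ptr, prev_ptr): a circular doubly linked list over addresses."""
    n = len(addresses)
    next_ptr = {addresses[i]: addresses[(i + 1) % n] for i in range(n)}
    prev_ptr = {addresses[i]: addresses[(i - 1) % n] for i in range(n)}
    return next_ptr, prev_ptr


def chase(start, links, steps):
    """Follow `steps` pointers. Each lookup needs the result of the previous
    one, which is the data dependency that serializes the accesses."""
    visited = []
    addr = start
    for _ in range(steps):
        visited.append(addr)
        addr = links[addr]       # the "load" that yields the next address
    return visited


if __name__ == "__main__":
    lines = [0x1000, 0x2000, 0x3000, 0x4000]
    nxt, prv = build_chain(lines)
    print([hex(a) for a in chase(lines[0], nxt, 4)])    # prime order
    print([hex(a) for a in chase(lines[-1], prv, 4)])   # reverse order
```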



# **Handle Noise**

A real-world example: Square-and-Multiply Exponentiation

What you generally see in papers:

$$
\begin{aligned}
&\textbf{for } i = n-1 \textbf{ to } 0 \textbf{ do} \\
&\quad r = \mathrm{sqr}(r) \bmod n \\
&\quad \textbf{if } e_i == 1 \textbf{ then} \\
&\qquad r = \mathrm{mul}(r, b) \bmod n \\
&\quad \textbf{end} \\
&\textbf{end}
\end{aligned}
$$
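The loop above can be written out as runnable code that also records which operation ran in each step. This is a sketch for illustration: the trace string is an addition that stands in for what a cache attacker observes, and the loop bound uses the bit length of the exponent rather than the symbol $n$, which the pseudocode also uses for the modulus.

```python
def square_and_multiply(b: int, e: int, n: int):
    """Left-to-right binary exponentiation.

    Returns (b**e mod n, trace), where trace has one 'S' per square and
    one 'M' per multiply, in execution order.
    """
    r = 1
    trace = []
    for i in reversed(range(e.bit_length())):   # most significant bit first
        r = (r * r) % n
        trace.append("S")
        if (e >> i) & 1:                        # secret-dependent branch
            r = (r * b) % n
            trace.append("M")
    return r, "".join(trace)


if __name__ == "__main__":
    result, trace = square_and_multiply(3, 0b1011, 1000)
    print(result, trace)
```

The multiply runs only for exponent bits equal to 1, so the sequence of operations encodes the exponent.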

# The Multiply Function

```
mpi_limb_t
mpihelp_mul( mpi_ptr_t prodp, mpi_ptr_t up, mpi_size_t usize,
             mpi_ptr_t vp, mpi_size_t vsize)
{
    mpi_ptr_t prod_endp = prodp + usize + vsize - 1;
    mpi_limb_t cy;
    struct karatsuba_ctx ctx;

    if( vsize < KARATSUBA_THRESHOLD ) {
        mpi_size_t i;
        mpi_limb_t v_limb;

        if( !vsize )
            return 0;

        /* Multiply by the first limb in V separately, as the result can be
         * stored (not added) to PROD. We also avoid a loop for zeroing. */
        v_limb = vp[0];
        if( v_limb <= 1 ) {
            if( v_limb == 1 )
                MPN_COPY( prodp, up, usize );
            else
                MPN_ZERO( prodp, usize );
            cy = 0;
        }
        else
            cy = mpihelp_mul_1( prodp, up, usize, v_limb );

        prodp[usize] = cy;
        prodp++;

        /* For each iteration in the outer loop, multiply one limb from
         * U with one limb from V, and add it to PROD. */
        for( i = 1; i < vsize; i++ ) {
            v_limb = vp[i];
            if( v_limb <= 1 ) {
                cy = 0;
                if( v_limb == 1 )
                    cy = mpihelp_add_n(prodp, prodp, up, usize);
            }
            else
                cy = mpihelp_addmul_1(prodp, up, usize, v_limb);

            prodp[usize] = cy;
            prodp++;
        }

        return cy;
    }

    memset( &ctx, 0, sizeof ctx );
    mpihelp_mul_karatsuba_case( prodp, up, usize, vp, vsize, &ctx );
    mpihelp_release_karatsuba_ctx( &ctx );
    return *prod_endp;
}
```


## **Raw Trace**



Access latencies measured in the probe operation in Prime+Probe.

A sequence of "01010111011001" can be deduced as part of the exponent.
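Once the raw latencies have been turned into a clean sequence of square and multiply events, recovering the bits is mechanical. The sketch below assumes that thresholding and noise removal have already happened and that events are encoded as 'S' and 'M' characters; both the encoding and the function name are assumptions.

```python
def decode_bits(trace: str) -> str:
    """Map square ('S') and multiply ('M') events to exponent bits.

    A square followed by a multiply is a 1; a square alone is a 0.
    """
    bits = []
    i = 0
    while i < len(trace):
        if trace[i] != "S":
            raise ValueError(f"unexpected event at position {i}")
        if i + 1 < len(trace) and trace[i + 1] == "M":
            bits.append("1")
            i += 2
        else:
            bits.append("0")
            i += 1
    return "".join(bits)


if __name__ == "__main__":
    print(decode_bits("SSMSSMSSMSMSMSSMSMSSSM"))
```

A real trace contains missed and spurious events, so several traces are usually combined before decoding.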

# There may exist other problems

- Tips for lab assignment
  - Build the attack step-by-step
  - Recommended reading: "Last-Level Cache Side-Channel Attacks are Practical"
  - Ask questions via Piazza

# **Defenses**





## Micro-architecture Side Channels



A Channel (a micro-architecture structure)






















## **Defense Design Considerations**



#### The Problem: The ISA Abstraction

- Interface between HW and SW: ISA
  - Advantage: HW optimizations without affecting usability/portability



#### **DEC** — Decrement by 1

| Opcode        | Instruction | Op/En | 64-Bit Mode | Compat/Leg Mode | Description           |
|---------------|-------------|-------|-------------|-----------------|-----------------------|
| FE /1         | DEC r/m8    | M     | Valid       | Valid           | Decrement r/m8 by 1.  |
| REX + FE /1   | DEC r/m8*   | M     | Valid       | N.E.            | Decrement r/m8 by 1.  |
| FF /1         | DEC r/m16   | M     | Valid       | Valid           | Decrement r/m16 by 1. |
| FF /1         | DEC r/m32   | M     | Valid       | Valid           | Decrement r/m32 by 1. |
| REX.W + FF /1 | DEC r/m64   | M     | Valid       | N.E.            | Decrement r/m64 by 1. |
| 48+rw         | DEC r16     | O     | N.E.        | Valid           | Decrement r16 by 1.   |
| 48+rd         | DEC r32     | O     | N.E.        | Valid           | Decrement r32 by 1.   |

\* In 64-bit mode, r/m8 cannot be encoded to access the following byte registers if a REX prefix is used: AH, BH, CH, DH.

#### **Instruction Operand Encoding**

| Op/En | Operand 1          | Operand 2 | Operand 3 | Operand 4 |
|-------|--------------------|-----------|-----------|-----------|
| M     | ModRM:r/m (r, w)   | NA        | NA        | NA        |
| O     | opcode + rd (r, w) | NA        | NA        | NA        |

#### **Description**

Subtracts 1 from the destination operand, while preserving the state of the CF flag. The destination operand can be a register or a memory location. This instruction allows a loop counter to be updated without disturbing the CF flag. (To perform a decrement operation that updates the CF flag, use a SUB instruction with an immediate operand of 1.)

This instruction can be used with a LOCK prefix to allow the instruction to be executed atomically.

In 64-bit mode, DEC r16 and DEC r32 are not encodable (because opcodes 48H through 4FH are REX prefixes). Otherwise, the instruction's 64-bit mode default operation size is 32 bits. Use of the REX.R prefix permits access to additional registers (R8-R15). Use of the REX.W prefix promotes operation to 64 bits.

See the summary chart at the beginning of this section for encoding data and limits.


From https://www.felixcloutier.com/x86/index.html

- ISA specifies functionality, not performance/timing
  - Compare Intel Ivy Bridge and Cascade Processor
- Layers, top to bottom:
  - Software (branch, arithmetic instruction, load/store), for example `DEC [addr]`
  - ISA (instruction set architecture)
  - Hardware (caches, DRAM, TLBs, etc.)

Write program w/o data-dependent behavior

#### **Original:**

```
// secret: confidential; addr1, addr2: public
if (secret)
    a = *(addr1);
else
    a = *(addr2);
```

#### **Data Oblivious:**

```
a ← load (addr1);
b ← load (addr2);
cmov a = (secret) ? a : b;
```
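The data-oblivious version can be mimicked with a branch-free select built from a bit mask. This sketch shows only the logic: Python offers no timing guarantees, and a real implementation would rely on an instruction such as `cmov`. The 64-bit word size and the function names are assumptions.

```python
MASK64 = (1 << 64) - 1


def oblivious_select(secret_bit: int, a: int, b: int) -> int:
    """Return a if secret_bit is 1, else b, without branching on secret_bit.

    Values are treated as 64-bit unsigned words; secret_bit must be 0 or 1.
    """
    mask = (-secret_bit) & MASK64            # all ones for 1, all zeros for 0
    return (a & mask) | (b & ~mask & MASK64)


def oblivious_load(secret_bit: int, memory: dict, addr1: int, addr2: int) -> int:
    """Load from both public addresses, then select using the secret."""
    a = memory[addr1]                        # both loads always happen
    b = memory[addr2]
    return oblivious_select(secret_bit, a, b)


if __name__ == "__main__":
    mem = {0x10: 111, 0x20: 222}
    print(oblivious_load(1, mem, 0x10, 0x20), oblivious_load(0, mem, 0x10, 0x20))
```

Both addresses are touched regardless of the secret, so the set of accessed cache lines no longer depends on it.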

#### **Programming in Circuit Abstraction**

- Program = DAG ("circuit")
- Operations = nodes ("gates")
- Data transfers = edges ("wires")
- Topology must be confidential data-independent
- Each gate's execution must hide its inputs
- Each wire must hide the value it carries



- **Rule 1:** instruction/gate execution = confidential data-independent
- **Rule 2:** data transfer/wire = confidential data-independent
- **Rule 3:** circuit/program topology = fixed

#### Today's machines can violate these assumptions

**Violations due to:**

- Data-dependent instruction optimizations (e.g., zero-skip, early exit, microcode, silent stores, ...)
- Data at rest optimizations (e.g., compression in register file/uop fusion, cache, page tables, ...)
- Speculative/OoO execution

#### **HW Resource Partition**

- Security vs. Quality of Service (QoS)
  - Intel Cache Allocation Technology (CAT)
- Temporal Partition vs. Spatial Partition
- Challenges nowadays:
  - Security domain determination is tricky
  - Scalability: what if #domains > #partitions?
  - How to partition inside cores?
  - Why not execute applications on a single node?

- Introduce noise to time measurement/Make time measurement coarse-grained
  - Pros and cons?
    - Pro: Simple and no performance overhead
    - Pro: Effective against a group of popular attacks
    - Con: Not effective against attacks that do not measure time
    - Con: Not effective for victims that cause a big timing difference
    - Con: Affects usability if a benign application needs a fine-grained timer
- Randomize cache mapping functions
  - Pros and cons?
    - Pro: Generally low performance overhead (still allows the cache to be shared)
    - Con: Difficult to reason about security
    - Pro/Con: Can reduce attack bandwidth, but unlikely to eliminate attacks
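The coarse-grained timer defense above, and one of its limits, can be sketched with a toy timer that rounds down to a fixed granularity. The latencies used (about 40 cycles for an LLC hit, about 150 for DRAM) come from the configuration table earlier in the slides; the granularity values and function names are assumptions, and the model considers a single measurement only.

```python
def coarse_time(cycles: int, granularity: int) -> int:
    """Round a cycle count down to a multiple of `granularity`."""
    return (cycles // granularity) * granularity


def distinguishable(t_fast: int, t_slow: int, granularity: int, start: int = 0) -> bool:
    """Can one measurement that begins at `start` tell the two durations apart?"""
    t0 = coarse_time(start, granularity)
    fast = coarse_time(start + t_fast, granularity) - t0
    slow = coarse_time(start + t_slow, granularity) - t0
    return fast != slow


if __name__ == "__main__":
    print(distinguishable(40, 150, granularity=1))       # fine-grained timer
    print(distinguishable(40, 150, granularity=1000))    # coarse timer
    print(distinguishable(40, 5000, granularity=1000))   # big timing difference
```

A hit/miss difference disappears under a 1000-cycle timer, while a victim that causes a much larger timing difference remains visible.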

## Next Lecture: Transient Side Channels



