Cache Side Channel Lab

Lab Details

The Cache Attack lab was released on February 15, and is due on March 9.

Collaboration Policy

Our full Academic Honesty policy can be found on the Course Information page of our website. As a reminder, all 6.S983 labs should be completed individually. You may discuss the lab at a high level with a classmate, but you may not work on code together or share any of your code.

Getting Started

Log in to our lab machine unicorn.csail.mit.edu. To connect via SSH, run ssh YourKerberosName@unicorn.csail.mit.edu.

We are using git for version management and assignment submission in this lab. Instructions for setting up git can be found on the course website. You need to redo the first-time setup on your assigned machine.

Introduction

This lab explores the practical exploitation of hardware side channels. In Part 1, you will sample the access latency of different cache levels in the memory subsystem. In Part 2, Dead Drop, you will bypass process isolation by creating a covert channel to send information between two processes. In Part 3, Capture the Flag, you will fine-tune the concepts you used in Dead Drop to steal a secret from a victim program.

This lab will help you leverage the attack concepts you learned in class to build practical attacks that work on real processors. In the process, you will tackle some challenges that we have not covered in class and get a better understanding of some design details of modern system software and hardware. We have designed discussion questions and provided many hints below to help you complete this lab.

Do not use VS Code’s remote ssh plugin to connect to the server! This plugin can introduce a large degree of noise, and may cause your attack to fail.

C/C++ and Assembly

C/C++ are low-level languages that give you more control over the hardware than high-level languages do. Programs written in these languages are compiled to machine code and execute directly on the hardware without additional abstraction layers. When working on microarchitectural attacks, having a high degree of control over the exact instructions being executed is essentially a requirement. If you are not familiar with C/C++, our first recitation goes over the basics; you can also familiarize yourself with the syntax using the recitation materials.

In this lab, we will not ask you to write assembly code. Instead, we use inline assembly to wrap assembly code as C/C++ functions which you can use directly. We provide Python scripts for you to automate the process of launching attacks and plotting results.

Part 1: Timing Analysis (20%)

This part guides you through some common techniques and instructions that we can use to measure latency on processors, which is a prerequisite for timing covert/side channels.

Setting Up Your Environment

You are assigned two CPUs on the lab machine to use. After logging into the lab machine and cloning your repo, you need to modify the SENDER_CPU and RECEIVER_CPU variables in cpu.mk to your assigned CPUs. The two CPUs you have been assigned are a pair of SMT (aka Hyperthreading) thread contexts which run on the same physical core and share multiple hardware resources associated with that core such as private L1 and L2 caches.

You must set SENDER_CPU and RECEIVER_CPU to your assigned values before running code for any portion of this lab! Only after you have configured these variables can you remove the $(error) line from cpu.mk.

Part 1.1: Determine the Machine Architecture

Before exploiting any hardware side channels, you must first understand the machine architecture. There are a few commands that can help with this on Linux.

  • lscpu: Provides information on the type of processor and some summary information about the architecture in the machine.
  • less /proc/cpuinfo: Provides detailed information about each logical processor in the machine.
  • getconf -a | grep CACHE: Displays the system configuration related to the cache, providing detailed information about how the cache is structured. The numbers reported by this command are in bytes (B).

In addition, WikiChip provides a lot of information specific to each processor and architecture. You can find a detailed architecture description of the lab machine (an Intel Cascade Lake processor) here, which additionally provides the raw latency numbers for accessing the different levels of the cache.

1-1 Discussion Question

Caches in modern processors are generally set-associative. Use the commands above and what you learned in 6.1910[6.004] to fill in the blanks in the following table. The L1 line size has been filled in for you.

| Cache | Cache Line Size | Total Size | Number of Ways (Associativity) | Number of Sets | Raw Latency |
| L1    | 64 Bytes        |            |                                |                |             |
| L2    |                 |            |                                |                |             |
| L3    |                 |            |                                |                |             |

You should be able to obtain the information for the first three blank columns directly using the commands above, and you will need to derive the number of sets from the previous columns (number of sets = total size / (line size × number of ways)). Raw latency can be obtained from the WikiChip document.

Part 1.2: Timing a Memory Access

In this part, you will time and report the latency of accessing a cache line that resides in (a) the L1 cache, (b) the L2 cache, (c) the L3 cache, or (d) DRAM. We will guide you through this process with the following exercises.

At a high level, your code should perform the following:

  • Put a cache line into a given structure (L1, L2, L3 cache, or DRAM).
  • Measure the access latency using measure_one_block_access_time.

To help you make sure that you are on the right track, we have provided a reference implementation in the executable file reference. To see the reference output, use the command make run-reference.

Warm-Up on C Syntax

Before we actually get started with timing measurements, you need to familiarize yourself with C syntax and several useful x86 instructions. This section gives a brief explanation of the code in utility.h and main.c. Please read the code in utility.h and understand the following functions.

  • rdtscp and rdtscp64: Read the current timestamp counter of the processor.
  • lfence: Perform a serializing operation.
  • measure_one_block_access_time: Measure the latency of performing one memory access to a given address.
  • clflush: Flush a given address from the cache - later accesses to the address will load from DRAM.
  • print_results and print_results_for_python: Print the collected latency data in different formats.

Further details about common C/C++ constructs can be found in the C/C++ Recitation.

Measuring L1 Latency

We have provided example code in main.c to measure L1 latency. The idea is to:

  1. Perform a load operation on a target address to bring the corresponding cache line into the L1 cache, and then
  2. Measure the access latency by counting the number of cycles it takes to re-access the same target address.
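
The loop might look roughly like the following sketch. This is a minimal illustration, not the exact starter code; in particular, we assume measure_one_block_access_time takes the address to time, so check utility.h for the real signature.

uint64_t l1_latency[SAMPLES];

for (int i = 0; i < SAMPLES; i++) {
    // Step 1: load the target address so its line is cached in L1
    *(volatile char *)target;
    // Step 2: time a repeat access to the same address (an L1 hit)
    l1_latency[i] = measure_one_block_access_time(target);
}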

You can compile the starter code using the command make, and then run it with make run.

1-2 Exercise

Compile and run the starter code and report the L1 latency measurements.

Measuring DRAM Latency

You will now need to figure out how to measure the DRAM access latency. Place an address into DRAM and then use measure_one_block_access_time to measure its latency.

You can leverage the instruction clflush to achieve this goal.
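
For instance, one sampling loop could look like the sketch below, assuming the clflush, lfence, and measure_one_block_access_time helpers from utility.h behave as described above.

for (int i = 0; i < SAMPLES; i++) {
    // Flush the target's line from the entire cache hierarchy so the
    // timed access below must be served by DRAM
    clflush(target);
    lfence(); // keep the flush from overlapping with the timed access
    dram_latency[i] = measure_one_block_access_time(target);
}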

1-3 Exercise

Fill in the code in main.c to populate the array dram_latency and report the median DRAM latency. Compare your results with the reference implementation.

Measuring L2 and L3 Latencies

Next, you should move on to measuring the L2 and L3 cache latencies. Similarly, you need to put a target address into the L2 cache and the L3 cache. Simply accessing the target address will make it reside in the L1 cache; therefore, you need to access other addresses to evict the target address from the L1 cache.
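
A hypothetical sketch for the L2 case is shown below. L1_EVICT_BYTES is an illustrative constant, not a starter-code name; derive it from Discussion Question 1-4 (plus the safety margin suggested later in this section).

for (int i = 0; i < SAMPLES; i++) {
    // Bring the target's line into L1 (and thus also into L2)
    *(volatile char *)target;
    // Touch one address per cache line of a buffer larger than L1,
    // evicting the target from L1 while it stays resident in L2
    for (size_t j = 0; j < L1_EVICT_BYTES; j += 64)
        *(volatile char *)&evict_buf[j];
    l2_latency[i] = measure_one_block_access_time(target);
}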

1-4 Discussion Question

How many cache lines do you need to access to fill up the entire L1 cache and L2 cache respectively? You can derive the answer from the table you filled in Discussion Question 1.

1-5 Exercise

Fill in the code in main.c to populate the arrays l2_latency and l3_latency. Report the median L2 and L3 latencies. Compare your results with the reference implementation.

You should be careful with the mismatch of access granularities. The smallest operational unit in the cache is a cache line. Accessing two integers that fall into the same line (more precisely, within the same cache-line-sized aligned region of memory) will result in a cache hit, and won’t cause an eviction.

You may not be able to reliably evict the target address due to noise. This is common for the following reasons.

  1. Recall that the cache is set-associative; sometimes you may not get enough addresses that map to the same cache set as the target address. This is possible because the cache mapping uses physical addresses, while your software uses virtual addresses.
  2. The cache replacement policy in modern processors is more advanced than what we learned in class. It may smartly decide to keep the target address instead of evicting it.

To work around these two scenarios, you can try (1) using a buffer 1.5x the size you calculated; (2) accessing the eviction buffer multiple times.

Part 1.3: Visualizing Latency Distributions

Microarchitectural side channels are notoriously noisy. The noise comes from various sources, and it is unlikely that anyone could come up with a complete list. Fortunately, there exist several effective methods to combat noise, and the most commonly used is repetition. Specifically, the attacker can repeat the measurement many times to collect a large amount of data and use statistical methods to accurately decode the secret, as long as the noise is not substantial. Such statistical methods range from simply computing the mean or median of the collected data, to signal-processing techniques, to even machine learning methods.
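
For example, since the exercises above ask for median latencies, a small sorting-based helper is enough. This is a sketch of one way to do it, not something provided in utility.h.

#include <stdint.h>
#include <stdlib.h>

// Comparison function for qsort over uint64_t samples
static int cmp_u64(const void *a, const void *b) {
    uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
    return (x > y) - (x < y); // avoids overflow from plain subtraction
}

// Sorts the samples in place and returns the middle element
static uint64_t median(uint64_t *samples, size_t n) {
    qsort(samples, n, sizeof(samples[0]), cmp_u64);
    return samples[n / 2];
}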

We have provided two Python scripts run.py and graph.py to show you an example of how to automatically launch multiple measurements and visualize these measurements.

run.py - A Python script that will generate 100 runs from the reference binary. It will create a folder called data (if one doesn’t already exist) and store all the generated samples there in JSON format. Any existing data in the folder will be overwritten.

graph.py - A Python script that will plot a histogram of the samples collected by run.py. It reads the data from the data folder and generates a PDF of the histogram in a folder called graph.

There exist several knobs - feel free to adjust them however you want.

  • The macro SAMPLES in utility.h is the number of samples to collect for each run.
  • The variable executable_filename in run.py is the name of the binary file to be launched. You can change this variable to ['make', 'run'] to run your own code. Note that your code must print samples using the print_results_for_python function.
  • The variable num_runs in run.py is the number of times to run the binary file.

Note that the Cascade Lake architecture has very similar L1 and L2 latencies; hence it may be difficult to distinguish L1 from L2. However, {L1, L2} vs. L3 vs. DRAM must be clearly distinguishable!

1-6 Exercise

Generate the histogram PDF using your own binary and compare it with the reference implementation. Include the histograms generated by both the reference implementation and your implementation in your report.

1-7 Discussion Question

Based on the generated histogram, report two thresholds, one to distinguish between L2 and L3 latency and the other to distinguish between L3 and DRAM latency.
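
Applying the two thresholds to a single measurement might then look like the sketch below, where THRESH_L2_L3 and THRESH_L3_DRAM stand for the values you report.

// Classify one latency sample using the two experimentally derived
// thresholds (illustrative names, not starter-code constants)
const char *classify(uint64_t cycles) {
    if (cycles < THRESH_L2_L3)   return "L1/L2";
    if (cycles < THRESH_L3_DRAM) return "L3";
    return "DRAM";
}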

Submission and Grading

This part is graded manually based on the following submitted materials: your code and a PDF file.

  • Code: You will need to submit code to your assigned Github repository. Your code should be able to reproduce the histogram you submitted. Due to noise, we will run your code multiple times (5 times) and grade based on the best results. You should feel comfortable submitting your code as long as it can generate the expected results most of the time.

  • Solutions PDF: A PDF submitted to Gradescope. In this PDF file, include the histograms from Part 1.3. Note that you will also use this file to provide answers for the questions in Part 2 and Part 3.

Part 2: Dead Drop: An Evil Chat Client (50%)

With all the preparation in Part 1, now you are ready to build Dead Drop: an evil chat client that can send messages between two processes running on the same machine. To make your life easier, you only need the ability to send integers from 0-255 (inclusive).

We suggest you build a Prime+Probe covert channel targeting the L2 cache, because (1) according to the analysis in Part 1, L2 and L3 latency can be clearly distinguished; and (2) the L2 cache is private to each core, and is thus relatively easier to exploit than the L3 cache, which is shared across cores. You are also welcome to get creative and try other types of covert channels, though we do not recommend this for beginners.

Allowed Code

Unlike Part 1, in Part 2 you will have a lot of freedom in how you perform your attack. There are only a few rules for the evil chat client.

  1. The sender and receiver must be different processes.
  2. The sender and receiver may only use syscalls and the functions accessible from the provided util.h except for system(). There is no way to set up a shared address space between the sender and receiver, nor is there a way to call any obviously-useful-for-chat functions such as Unix sockets.
  3. You may use any x86 instruction in your final submission except for clflush. Because you are not allowed to create a shared memory space between the two processes, traditional Flush+Reload or Evict+Reload attacks cannot be used. Instead, you can explore the possibility of using a Prime+Probe style attack.
  4. If you would like to use some convenience code that stays within the spirit of the lab, please contact the course staff. Obviously, you may not use pre-packaged code from online for building covert channels (e.g., mastik).

Expected Behaviour

The Dead Drop client should behave in the following way. Using tmux, screen, or simply two SSH connections, we can have two different terminals running on the same machine and run the following commands:

Terminal B: $ make run_receiver    // you start the receiver process in a terminal
Terminal B: Please press enter.    // the receiver prompts you to press enter to start listening for messages

Terminal A: $ make run_sender      // you start the sender in another terminal
Terminal A: Please type a message.

Terminal B: $                      // you press Enter in the receiver's terminal
Terminal B: Receiver now listening.

Terminal A: $ 47                  // you type a message and hit enter in the sender's terminal

Terminal B: 47                    // receiver should generate the same message as you entered on the sender's side

Getting Started

The files included for the Dead Drop portion of the lab are under Part2-DeadDrop:

  • sender.c, receiver.c - Template code for your sender and receiver respectively
  • util.h, util.c - Utility functions and library imports you are permitted to use (except for clflush)

Similar to Part 1, you need to compile your code using the command make. To test your chat client, you can open two terminals on the lab machine and run make run_sender in one and make run_receiver in the other to demo your exploit. With these two commands, your processes will run on two specific CPUs, SENDER_CPU and RECEIVER_CPU, a pair of SMT (aka Hyperthreading) thread contexts that share the private L1 and L2 caches. Your implementation only needs to work on your assigned cores: a successful exploit across an arbitrary pair of cores would require a resource shared between all cores on the machine, which makes the problem significantly more challenging.

Part 2.1: Transmitting a Single Bit Across Processes (Optional)

In case you have no clue how to get started building a covert channel to communicate an 8-bit integer, this section provides a tip: start by communicating a 1-bit value first. You may find it useful to read this section to understand how to design a communication protocol and to see several common pitfalls that arise when you actually implement one. Note that this section is optional, and you will receive full credit for Part 2 if you directly implement an 8-bit chat client (Part 2.2). However, if your 8-bit chat client does not work properly, you can get partial credit (10% of this lab) by completing a single-bit chat client.

Communication Protocol

To build a chatting client, we suggest you first construct a communication protocol, which includes:

  1. Encode Step: What is the sender’s action for sending bit 1 and bit 0?
  2. Receive Step: What is the timing measurement strategy used by the receiver, such as the number of addresses to access?
  3. Decode Step: What is the threshold used to distinguish between bit 1 and bit 0?

You should come up with a tentative plan for (1) and (2), and then experimentally derive the threshold used in (3). If you are unable to derive the threshold, meaning that there does not exist a threshold to decode the bit, you will need to revisit your sender and receiver strategy.

To build a single-bit chatting client, you can use the whole L2 cache as the communication media. The receiver can measure the latency of accessing multiple cache lines.
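
To make this concrete, here is one hypothetical shape the protocol could take. busy_wait, THRESHOLD, and the buffer parameters are all illustrative names rather than starter-code functions, and this is only one of many workable designs.

// Sender: to send 1, sweep a buffer large enough to evict the
// receiver's lines from L2; to send 0, idle for a comparable time.
void send_bit(int bit, char *evict_buf, size_t evict_len) {
    if (bit) {
        for (size_t i = 0; i < evict_len; i += 64)
            *(volatile char *)&evict_buf[i];
    } else {
        busy_wait();
    }
}

// Receiver: prime its own buffer into the L2 beforehand, then probe
// it and sum the access latencies. A total above THRESHOLD means the
// sender evicted our lines, i.e., the bit was 1.
int receive_bit(char *probe_buf, size_t probe_len) {
    uint64_t total = 0;
    for (size_t i = 0; i < probe_len; i += 64)
        total += measure_one_block_access_time(&probe_buf[i]);
    return total > THRESHOLD;
}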

Exercise (Optional)

Derive the threshold for the decode operation. Implement a single-bit chatting client.

Common Pitfalls

If the receiver needs to measure the latency of multiple memory accesses, you should pay attention to the following features that can introduce substantial noise to your communication channel.

  • System Calls: Be careful issuing system calls (such as printf) while measuring latencies. System calls have a large amount of overhead, both adding latency and potentially causing significant cache evictions during execution.

  • Cache line size: Be careful with the mismatch of the size of an integer and a cache line. Repeatedly accessing the same cache line will result in cache hits and will not help you to detect the actual cache states.

  • Hardware prefetch: Modern processors have hardware that prefetches data into the cache before it is required, attempting to anticipate future accesses. This may cause spurious cache evictions or mask evictions caused by the victim. For example, if an address has been evicted by the sender and is supposed to reside in L3, but it is prefetched by the hardware into the L2, your measurement will mistakenly conclude that this cache line was never evicted.

    Fortunately, the hardware prefetcher is not that smart. Generally, in most processors, it is only triggered by linear, fixed-stride access patterns within a page. In this lab, we disable the prefetcher on the machine to simplify your implementation.

  • Cache replacement policy: While least recently used (LRU) is the most common example of a cache replacement policy, practical processor implementations often use much more complex policies. You may need to experiment a bit to roughly figure out how eviction and measurement works on your processor.

    Furthermore, be sure to carefully consider how you are accessing your cache lines. With Prime+Probe attacks, it is common to encounter cache-thrashing or self-conflicts, a situation in which the attacker primes the cache set and evicts their own data with subsequent accesses while probing. One way to avoid this problem is to prime the cache by accessing the eviction set in one direction and probe in the reverse direction, as described by Tromer et al. (see the sketch after this list).

  • Noisy or inconsistent results: Because side channels exploit shared resources in unintended ways, the resulting covert channels can be quite noisy. Modern processors often contain optimizations that make them behave differently from the simplified architectures taught in class. This lab requires experimentation to find working approaches and values. You should not expect your solution to work on the first attempt, so be sure to incrementally build up your solutions and verify that each step is working before proceeding.
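
As promised above, the prime-in-one-direction, probe-in-reverse pattern might look like the following sketch, where eviction_set holds WAYS same-set addresses (illustrative names, not starter-code variables).

// Prime: access the eviction set from front to back
for (int i = 0; i < WAYS; i++)
    *(volatile char *)eviction_set[i];

// ... wait for the other process to act ...

// Probe: walk the set in reverse so a line we just re-loaded does not
// evict the next line we are about to measure
uint64_t total = 0;
for (int i = WAYS - 1; i >= 0; i--)
    total += measure_one_block_access_time(eviction_set[i]);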

Part 2.2: Transmitting An 8-bit Integer Across Processes

To transmit an 8-bit integer across two isolated processes, you can build a covert channel using the L2 cache since the sender and receiver will share the L2 via SMT. There is more than one way to do so, and we describe two possible methodologies here. Feel free to explore other methodologies.

  1. A naïve approach is to transmit an 8-bit integer by sending each bit sequentially. If you have already learned how to transmit a single bit in the subsection above, you can re-use your method to transmit 8 bits one at a time. In case you decide to go down this path, we have provided a string-to-binary conversion function and a binary-to-string conversion function in util.h and util.c.

    However, such a naïve approach requires the two processes to synchronize. In our experience, and according to feedback from past semesters, the synchronization is non-trivial to implement.

  2. The second approach is to encode the 8-bit integer using set indices in the L2 cache. Since an 8-bit integer can take any value between 0 and 255, you can encode all possible values using 256 independent cache sets in the L2. This approach does not require non-trivial synchronization between the sender and the receiver, but it does require implementing a prime-evict-probe sequence at set granularity. Based on our experience from previous semesters, we recommend this approach.

    Implementing set-granularity prime-evict-probe sequences requires a deeper understanding of the virtual address abstraction and the use of huge pages. In case you decide to go down this second path, we have provided an allocation of a huge page in the starter code in sender.c. We next briefly discuss the virtual address abstraction and then give a primer on huge pages. You can skip the discussion of huge pages if you do not plan to use this approach.

  3. There exist many other approaches. Several communication protocols have been discussed in An Exploration of L2 Cache Covert Channels in Virtualized Environments.

Linux Hugepages

Keep in mind that the addresses you are dealing with in your C code are virtual addresses. However, physical addresses (which you will not have access to within your code) may be used when indexing into lower-level caches. You may need to consider how address translation interacts with cache indexing/tagging, and run some experiments to see whether this presents a problem for your covert channel implementation. For a review of virtual memory, please refer to 6.191’s (6.004’s) lecture on virtual memory.

The default page size used by most operating systems is 4 KB (2^12 bytes). You may find it useful to use a larger page size to guarantee consecutive physical addresses. In particular, Linux supports Huge TLB pages for private anonymous pages, which allow programs to allocate 2 MB of contiguous physical memory, ensuring 2^21 bytes of consecutive physical addresses.

void *buf = mmap(NULL, BUFF_SIZE, PROT_READ | PROT_WRITE,
                 MAP_POPULATE | MAP_ANONYMOUS | MAP_PRIVATE | MAP_HUGETLB,
                 -1, 0);

if (buf == MAP_FAILED) { // mmap returns (void *)-1 on failure
  perror("mmap() error");
  exit(EXIT_FAILURE);
}

*((char *)buf) = 1; // dummy write to trigger page allocation

You can use the command man mmap to understand the semantics of each argument to this system call.
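
Once the huge page is mapped, you can reason about L2 set indices directly: within a 2 MB huge page, a virtual address and its physical address share their low 21 bits, and on this machine the L2 set-index bits fall inside those 21 bits (verify this against your table from Discussion Question 1-1). A hypothetical helper is sketched below; LINE_SIZE and L2_NUM_SETS are illustrative constants you would fill in yourself.

// Addresses that are (L2_NUM_SETS * LINE_SIZE) bytes apart map to the
// same L2 set, and offset (set * LINE_SIZE) selects which set.
char *nth_address_in_set(char *hugepage, int set, int n) {
    return hugepage + (size_t)set * LINE_SIZE
                    + (size_t)n * L2_NUM_SETS * LINE_SIZE;
}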

Monitoring Hugepage Usage

You can see if your huge page is being allocated or not by watching the status of /proc/meminfo. Namely, if you run cat /proc/meminfo | grep HugePages_, you should see the number of HugePages_Free decrease by 1 when your code is using one.

Discussion Question (Optional)

Given a 64-bit virtual address, fill in the table below. The table is intended to help you figure out how to find addresses that map to the same L2 cache set.

| Page Size                        | 4KB | 2MB |
| Page Offset Bits                 |     |     |
| Page Number Bits                 |     |     |
| L2 Set Index Bits                |     |     |
| L2 Set Index Fully Under Control |     |     |

2-1 Discussion Question

Describe your communication protocol.

2-2 Exercise

Implement a chat client that can communicate an 8-bit integer. Make sure to refer to the “Common Pitfalls” section when debugging.

Submission, Grading and Checkoff

You will submit the following items for this part.

  • Push the final version of your code to your assigned Github repository.
  • In your report, include your answers for the discussion questions.

As mentioned before, if the 8-bit chat client does not work, we will give partial credit (10% of the whole lab) if you have completed a single-bit chat client.

Part 3: Capture the Flag (30%)

In this part of the lab, you will attempt to extract a secret value from a victim program, getting a taste of a Capture the Flag (CTF) problem. Future labs will follow a similar pattern. Note that we have deliberately designed Part 3 to be a relatively simple CTF problem. If you finished Part 2, this part should not take you much time.

To keep the lab manageable, we have created a simple program, victim-N.c, that performs a secret-dependent memory access (pseudocode below). The N depends on how many accesses the victim makes to the cache set whose index equals the flag. For now, we explain the victim code assuming the victim accesses 4 addresses; we explain the other victim binaries at the end. Your task will be to write a program, attacker.c, that determines this secret value.

// Set flag to random integer in the 
// range [0, NUM_L2_CACHE_SETS)
int flag = random(0, NUM_L2_CACHE_SETS);
printf("%d\n", flag);

// Allocate a large memory buffer
char *buf = get_buffer();

// Find N addresses in buf that all map to the cache set 
// with an index of flag to create a partial eviction set
char *eviction_set[N];
get_partial_eviction_set(eviction_set, flag);

// Main loop
while (true) {
    for (int i = 0; i < N; i++) {
        // Access the eviction address
        (*(eviction_set[i]))++;
    }
}
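
At a high level, your attacker can Prime+Probe every L2 set and report the set the victim keeps hammering. The outline below is a hypothetical sketch: prime_set, probe_set, and busy_wait are illustrative helpers you would write yourself, not starter-code functions.

int best_set = -1;
uint64_t worst_latency = 0;

for (int set = 0; set < NUM_L2_CACHE_SETS; set++) {
    prime_set(buf, set);                // fill this set with our own lines
    busy_wait();                        // give the victim time to run
    uint64_t lat = probe_set(buf, set); // total latency of re-probing our lines
    if (lat > worst_latency) {          // the victim's set probes slowest
        worst_latency = lat;
        best_set = set;
    }
}
printf("Flag guess: %d\n", best_set);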

3-1 Exercise

Complete the code in attacker.c to successfully extract the secret values from victim-[2-4].

Using tmux, screen, or simply two SSH connections, you should be able to run make run_victim-N in one terminal and make run_attacker in another terminal. If you have problems running the victim binary, you may need to run chmod +x victim-N within your lab directory.

Submission and Grading

As with Dead Drop, you will upload your code to your assigned Github repository.

For Part 3, we have provided 3 different victim binaries, namely victim-2, victim-3, and victim-4. The binaries access different numbers of eviction addresses that map to the same cache set as the flag’s index. For example, victim-4 will access 4 eviction addresses, while victim-2 will access only 2. Hence, the number of accesses the victim makes acts as a proxy for the difficulty of Part 3 of the lab. You will get 75% partial credit for making your attacker successfully capture the flag for victim-4 and full credit (100%) for victim-3. If you’d like an additional (ungraded) challenge, you can also try tackling victim-2.

Acknowledgements

Contributors: Miles Dai, Weon Taek Na, Joseph Ravichandran, Mengjia Yan, Peter Deutsch.

The original Dead Drop lab (Part 2 of this lab) was developed by Christopher Fletcher for CS 598CLF at UIUC. The starting code and lab handout are both heavily adapted from his work.