Lab 7: RISC-V Processor with DRAM and Caches

Lab 7 due date: Friday, November 10, 2017, at 11:59:59 PM EST.

Your deliverables for Lab 7 are:

Introduction

Now you have a 6-stage pipelined RISC-V processor with branch target and direction predictors (a BTB and a BHT). Unfortunately, your processor is limited to running programs that fit in its small, fast memory (block RAM). This works fine for the small benchmark programs we have been running, such as a 250-item quicksort, but most interesting applications are (much) larger than the current 256 KB. In this lab we will see how to plug in a bigger but slower memory (with a slightly different interface), and then cache accesses to the slow memory using the fast, small memory. Big memory (DDR3 DRAM) is great for storing large programs, but it can hurt performance since DRAM has comparatively long read latencies.

This lab focuses on using DRAM instead of block RAM for main program and data storage, so that larger programs can be run, and on adding caches to reduce the performance penalty of long-latency DRAM loads.

First, you will write a translator module that translates CPU memory requests into DRAM requests. This module vastly expands your program storage space, but your programs will run much more slowly because you read from DRAM in almost every cycle. Next, you will implement a cache to reduce the number of times you need to read from DRAM, thereby improving your processor's performance.

DRAM Interface

DDR3 memory has a 64-bit wide data bus, but eight 64-bit chunks are sent per transfer, so effectively it acts like a 512-bit-wide memory. DDR3 memories have high throughput, but they also have high latencies for reads.

We provide a DDR3 memory controller; you connect to it through the MemoryClient interface. The typedefs provided for you in this lab use types from BSV's Memory package (see the BSV reference guide or the source code at $BLUESPECDIR/BSVSource/Misc/Memory.bsv). Here are some of the typedefs related to DDR3 memory in src/includes/MemTypes.bsv:

typedef 24 DDR3AddrSize;
typedef Bit#(DDR3AddrSize) DDR3Addr;
typedef 512 DDR3DataSize;
typedef Bit#(DDR3DataSize) DDR3Data;
typedef TDiv#(DDR3DataSize, 8) DDR3DataBytes;
typedef Bit#(DDR3DataBytes) DDR3ByteEn;
typedef TDiv#(DDR3DataSize, DataSize) DDR3DataWords;

// The below typedef is equivalent to this:
// typedef struct {
//     Bool        write;
//     Bit#(64)    byteen;
//     Bit#(24)    address;
//     Bit#(512)   data;
// } DDR3_Req deriving (Bits, Eq);
typedef MemoryRequest#(DDR3AddrSize, DDR3DataSize) DDR3_Req;

// The below typedef is equivalent to this:
// typedef struct {
//     Bit#(512)   data;
// } DDR3_Resp deriving (Bits, Eq);
typedef MemoryResponse#(DDR3DataSize) DDR3_Resp;

// The below typedef is equivalent to this:
// interface DDR3_Client;
//     interface Get#( DDR3_Req )  request;
//     interface Put#( DDR3_Resp ) response;
// endinterface;
typedef MemoryClient#(DDR3AddrSize, DDR3DataSize) DDR3_Client;
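
To make these widths concrete, here is a sketch of how a 32-bit CPU byte address could map onto these types. This is an assumption for illustration; the exact mapping used by the provided mkWideMemFromDDR3 wrapper may differ. Each DDR3 address names one 512-bit line, i.e., 64 bytes or 16 32-bit words.

```bsv
// Sketch only: assumed address split for a 512-bit (64-byte) DDR3 line.
function DDR3Addr toDDR3Addr(Bit#(32) byteAddr);
    return truncate(byteAddr >> 6);  // drop the 6 byte-offset bits within a line
endfunction

function Bit#(4) wordInLine(Bit#(32) byteAddr);
    return byteAddr[5:2];            // which of the 16 32-bit words in the line
endfunction
```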

DDR3_Req

The requests for DDR3 reads and writes are different from requests for FPGAMemory. The biggest difference is the byte enable signal, byteen.
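
For illustration, a full-line write sets all 64 bits of byteen, while a read clears write (the byteen and data fields are then ignored by the memory). This is a sketch; lineAddr and lineData are placeholder names, not identifiers from the provided code:

```bsv
// Hypothetical values -- lineAddr :: DDR3Addr, lineData :: DDR3Data
DDR3_Req wrReq = DDR3_Req { write: True,  byteen: maxBound, address: lineAddr, data: lineData };
DDR3_Req rdReq = DDR3_Req { write: False, byteen: 0,        address: lineAddr, data: ? };
```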

DDR3_Resp

DDR3 memory only sends responses for reads, just like FPGAMemory. The memory response type is a structure -- so instead of directly receiving a Bit#(512) value, you will have to access the data field of the response in order to get the Bit#(512) value.
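
For example, one way to pick a 32-bit word out of a response (a sketch; offset is assumed to have been saved when the request was issued):

```bsv
rule getDDR3Resp;
    DDR3_Resp resp = ddr3RespFifo.first;
    ddr3RespFifo.deq;
    // reinterpret the 512-bit payload as 16 32-bit words
    Vector#(16, Bit#(32)) words = unpack(resp.data);
    Bit#(32) word = words[offset];  // offset :: Bit#(4), saved at request time
endrule
```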

Example Code

Here is some example code showing how to construct the FIFOs for a DDR3 memory interface along with the initialization interface for DDR3. This example code is provided in src/DDR3Example.bsv.

import GetPut::*;
import ClientServer::*;
import Memory::*;
import CacheTypes::*;
import WideMemInit::*;
import MemUtil::*;
import Vector::*;

// other packages and type definitions

(* synthesize *)
module mkProc#(Fifo#(2,DDR3_Req) ddr3ReqFifo, Fifo#(2,DDR3_Resp) ddr3RespFifo)(Proc);
// These two FIFOs are the interface to the real DRAM
	Ehr#(2, Addr)  pcReg <- mkEhr(?);
	CsrFile         csrf <- mkCsrFile;
	// other processor state and components
	//...
	
	// wrap DDR3 to WideMem interface
	WideMem           wideMemWrapper <- mkWideMemFromDDR3( ddr3ReqFifo, ddr3RespFifo );
	// split the WideMem interface into two (multiplexed use)
	// The splitter only takes action after reset (i.e., when memReady && csrf.started);
	// otherwise the guard may fail, and we could get garbage DDR3 responses.
	Vector#(2, WideMem)     wideMems <- mkSplitWideMem( memReady && csrf.started, wideMemWrapper );
	// Instruction cache should use wideMems[1]
	// Data cache should use wideMems[0]
	
	// some garbage may get into ddr3RespFifo during soft reset
	// this rule drains all such garbage
	rule drainMemResponses( !csrf.started );
		ddr3RespFifo.deq;
	endrule
	
	// other rules
	
	method ActionValue#(CpuToHostData) cpuToHost if(csrf.started);
		let ret <- csrf.cpuToHost;
		return ret;
	endmethod
	
	// ddr3RespFifo being empty is part of the guard, ensuring all garbage has been drained
	method Action hostToCpu(Bit#(32) startpc) if ( !csrf.started && memReady && !ddr3RespFifo.notEmpty );
		csrf.start(0); // only 1 core, id = 0
		pcReg[0] <= startpc;
	endmethod
	
endmodule

In the above example code, ddr3ReqFifo and ddr3RespFifo serve as the interface to the real DDR3 DRAM. In simulation, these FIFOs are hooked up to a module called mkSimMem that simulates the DRAM; we instantiate it for you, so you do not have to worry about it.

In the example code, we use module mkWideMemFromDDR3 to translate DDR3_Req and DDR3_Resp types to a more friendly WideMem interface defined in src/includes/CacheTypes.bsv.

Sharing the DRAM Interface

The example code exposes a single interface with the DRAM, but you have two modules that will be using it: an instruction cache and a data cache. If they both send requests to ddr3ReqFifo and they both get responses from ddr3RespFifo, it is possible for their responses to get mixed up. To handle this, you need a separate FIFO to keep track of the order the responses should come back in. Each load request is paired with an enqueue into the ordering FIFO that says who should get the response.
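
The ordering idea can be sketched as follows. This is illustrative only: the provided mkSplitWideMem already implements it, and the FIFO depth and the respFifos name here are assumptions, not part of the provided code:

```bsv
// Tag each outstanding load with its client (0 = data cache, 1 = instruction cache)
FIFO#(Bit#(1)) orderQ <- mkSizedFIFO(8);  // depth chosen arbitrarily

// Issuing a load on behalf of client i:
//   ddr3ReqFifo.enq(req);
//   orderQ.enq(fromInteger(i));

// Routing each response back to the client that requested it:
rule routeResp;
    let who = orderQ.first;        orderQ.deq;
    let resp = ddr3RespFifo.first; ddr3RespFifo.deq;
    respFifos[who].enq(resp);      // respFifos: per-client response FIFOs (assumed)
endrule
```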

To simplify this for you, we have provided the module mkSplitWideMem to split the DDR3 FIFOs into two WideMem interfaces. This module is defined in src/includes/MemUtil.bsv. To prevent mkSplitWideMem from taking action too early and exhibiting unexpected behavior, we set its first parameter to memReady && csrf.started to freeze it before the processor is started. This also avoids scheduling conflicts with the initialization of DRAM contents.

Migrating Code from Previous Lab

The provided code for this lab is very similar to that of the previous lab, but there are a few differences to note. Most of the differences are displayed in the provided example code src/DDR3Example.bsv.

Modified Proc Interface

The Proc interface no longer has an initialization interface. The initialization of the DRAM is done externally, so you do not need to worry about it.

Empty Files

The two processor implementations for this lab, src/WithoutCache.bsv and src/WithCache.bsv, are initially empty. You should copy over the code from either SixStageBHT.bsv or SixStageBonus.bsv as a starting point for these processors. src/includes/Bht.bsv is also empty, so you will have to copy over your BHT code from the previous lab too.

New Files

Here is the summary of new files provided under the src/includes folder:

Cache.bsv: An empty file in which you will implement cache modules in this lab.
CacheTypes.bsv: A collection of type and interface definitions about caches.
MemUtil.bsv: A collection of useful modules and functions for DDR3 and WideMem.
SimMem.bsv: DDR3 memory used in simulation. It has a 10-cycle pipelined access latency, but extra glue logic may add more to the total delay of accessing DRAM in simulation.
WideMemInit.bsv: A module to initialize the DDR3 memory.

There are also changes in MemTypes.bsv.

WithoutCache.bsv -- Using the DRAM Without a Cache

Exercise 1 (10 Points): Implement a module mkTranslator in Cache.bsv that takes in some interface related to DDR3 memory (WideMem for example) and returns a Cache interface (see CacheTypes.bsv).

This module should not do any caching, just translation from MemReq to DDR3 requests (WideMemReq if using the WideMem interface) and from DDR3 responses (CacheLine if using the WideMem interface) to MemResp. This will require some internal storage to keep track of which word you want from the cache line that comes back from main memory. Integrate mkTranslator into a six-stage pipeline in the file WithoutCache.bsv (i.e. you should no longer use mkFPGAMemory here). You can build this processor by running
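
A minimal sketch of such a translator, assuming the WideMem and Cache interfaces from CacheTypes.bsv; details such as the exact write-enable encoding may differ in your code:

```bsv
module mkTranslator(WideMem mem, Cache ifc);
    // remember which word of the returning line each outstanding load wants
    Fifo#(2, Bit#(TLog#(CacheLineWords))) offsetQ <- mkCFFifo;

    method Action req(MemReq r);
        Bit#(TLog#(CacheLineWords)) offset = truncate(r.addr >> 2);
        if (r.op == Ld) begin
            offsetQ.enq(offset);
            mem.req(WideMemReq { write_en: 0, addr: r.addr, data: ? });
        end else begin
            // store: enable only the addressed word within the line
            Bit#(CacheLineWords) en = 1 << offset;
            mem.req(WideMemReq { write_en: en, addr: r.addr,
                                 data: replicate(r.data) });
        end
    endmethod

    method ActionValue#(MemResp) resp;
        let line <- mem.resp;
        offsetQ.deq;
        return line[offsetQ.first];
    endmethod
endmodule
```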

$ make build.bluesim VPROC=WITHOUTCACHE

and you can test this processor by running

$ ./run_asm.sh

and

$ ./run_bmarks.sh

Discussion Question 1 (5 Points): Record the results for ./run_bmarks.sh withoutcache. What IPC do you see for each benchmark?

WithCache.bsv -- Using the DRAM With a Cache

By running the benchmarks with simulated DRAM, you should have noticed that your processor slows down a lot. You can speed up your processor again by remembering previous DRAM loads in a cache as described in class.

Exercise 2 (20 Points): Implement a module mkCache to be a direct mapped cache that allocates on write misses and writes back only when a cache line is replaced.

This module should take in a WideMem interface (or something similar) and expose a Cache interface. Use the typedefs in CacheTypes.bsv to size your cache and for the Cache interface definition. You can use either vectors of registers or register files to implement the arrays in the cache, but vectors of registers make it easier to specify initial values. Incorporate this cache into the same pipeline from WithoutCache.bsv and save the result in WithCache.bsv. You can build this processor by running
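
A partial sketch of the state elements and the hit path for such a cache. Miss handling (write-back of a dirty victim followed by a refill, typically a small state machine) is elided, and the helper functions getIndex, getTag, and getWordSelect are assumed to come from CacheTypes.bsv:

```bsv
module mkCache(WideMem mem, Cache ifc);
    // one entry per cache row; vectors of registers make initialization easy
    Vector#(CacheRows, Reg#(CacheLine))        dataArray  <- replicateM(mkRegU);
    Vector#(CacheRows, Reg#(Maybe#(CacheTag))) tagArray   <- replicateM(mkReg(tagged Invalid));
    Vector#(CacheRows, Reg#(Bool))             dirtyArray <- replicateM(mkReg(False));
    Fifo#(2, Data) hitQ <- mkCFFifo;  // queued load responses

    method Action req(MemReq r);
        let idx = getIndex(r.addr);
        let tag = getTag(r.addr);
        let sel = getWordSelect(r.addr);
        if (tagArray[idx] matches tagged Valid .t &&& t == tag) begin
            // hit: read or update the word in place
            if (r.op == Ld) hitQ.enq(dataArray[idx][sel]);
            else begin
                let line = dataArray[idx];
                line[sel] = r.data;
                dataArray[idx] <= line;
                dirtyArray[idx] <= True;
            end
        end
        // else: miss -- write back if dirty, refill from mem, retry (elided)
    endmethod

    method ActionValue#(MemResp) resp;
        hitQ.deq;
        return hitQ.first;
    endmethod
endmodule
```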

$ make build.bluesim VPROC=WITHCACHE

and you can test this processor by running

$ ./run_asm.sh

and

$ ./run_bmarks.sh

Discussion Question 2 (5 Points): Record the results for ./run_bmarks.sh withcache. What IPC do you see for each benchmark?

Running Large Programs

By adding support for DDR3 memory, your processor can now run larger programs than the small benchmarks we have been using. Unfortunately, these larger programs take longer to run, and in many cases, it will take too long for simulation to finish. Now is a great time to try FPGA synthesis. By implementing your processor on an FPGA, you will be able to run these large programs much faster since the design is running in hardware instead of software.

Exercise 3 (0 Points, but you should still totally do this): Before synthesizing for an FPGA, let's try looking at a program that takes a long time to run in simulation. The program ./run_mandelbrot.sh runs a benchmark that prints a square image of the Mandelbrot set using 1's and 0's. Run this benchmark to see how slow it runs in real time. Please don't wait for this benchmark to finish, just kill it early using Ctrl-C.

Now we are ready to try building for the FPGA. To be continued.