Lab 6: RISC-V Processor with Six-Stage Pipeline and Branch Prediction

Lab 6 due date: Monday, November 7, 2016, at 11:59:59 PM EDT.

Your deliverables for Lab 6 are:


This lab is your introduction to realistic RISC-V pipelines and branch prediction. At the end of this lab, you will have a six-stage RISC-V pipeline with multiple address and branch predictors working together.

Note: In this lab, we use one-bit global epochs (instead of unbounded distributed epochs) to kill wrong path instructions. Please study the Global Epoch slides: [pptx] [pdf] to understand the global epoch scheme. The content of the slides will also be covered in tutorial.

Additions to the lab infrastructure

New included files

The following files appear in src/includes/:

FPGAMemory.bsvA wrapper for block RAM commonly found on FPGAs. This has an identical interface as the DelayedMemory from the previous lab.
SFifo.bsvThree searchable FIFO implementations: one based off of a pipeline FIFO, one based off of a bypass FIFO, and the other based off of a conflict-free FIFO. All implementations assume search is done immediately before enq.
Scoreboard.bsvThree scoreboard implementations based off of searchable FIFOs. The pipeline scoreboard uses a pipeline searchable FIFO, the bypass scoreboard uses a bypass searchable FIFO, and the conflict-free scoreboard uses a conflict-free searchable FIFO.
Bht.bsvAn empty file in which you will implement a branch history table (BHT).

New assembly tests

The following file appears in programs/assembly/src:

bpred_j_noloop.SAn assembly test similar to bpred_j.S, but the outer loop is removed.

New source files

The following files appear in src/:

TwoStage.bsvAn initial two-stage pipelined RISC-V processor that uses a BTB for address prediction. Compile with twostage target.
SixStage.bsvAn empty file in which you will extend the two-stage pipeline into a six-stage pipeline. Compile with sixstage target.
SixStageBHT.bsvAn empty file in which you will integrate a BHT into the six-stage pipeline. Compile with sixstagebht target.
SixStageBonus.bsvAn empty file in which you can improve the previous processor for bonus credit. Compile with sixstagebonus target.

Testing improvements

In the previous lab, the command build -v <proc_name> (run from the scemi/sim/ directory) was used to build bsim_dut and tb. In this lab, this command builds <proc_name>_dut instead of bsim_dut so switching between processor types does not delete other processor builds.

Simulation scripts now require you to specify the target processor:

$ ./ <proc_name>
$ ./ <proc_name>

Simulating a single test requires you to run the correct simulation executable:

$ cp ../../programs/build/{assembly,benchmarks}/vmh/<test_name>.riscv.vmh mem.vmh
$ ./<proc_name>_dut > out.txt &
$ ./tb

Two-stage pipeline: TwoStage.bsv

TwoStage.bsv contains a two-stage pipelined RISC-V processor. This processor differs from the processor you built in the previous lab because it reads register values in the first stage and there is data hazard.

Discussion Question 1 (10 Points): Debugging practice!

If you replace the BTB with a simple pc + 4 address prediction, the processor still works, but it does not perform as well. If you replace it with a really bad predictor that predicts pc is the next instruction for each pc, it should still work but have even worse performance because each instruction would require redirection (unless the instruction loops back to itself). If you actually set the prediction to pc, you will get errors in the assembly tests; the first one will be from cache.riscv.vmh.

You do not actually have to fix this bug, just answer the questions. (Hint: look at the addr field of ExecInst structure.)

Six-stage pipeline: SixStage.bsv

The six-stage pipeline should be divided into the following stages:

IMemory and DMemory instances should be replaced with instances of FPGAMemory to enable later implementation on FPGA.

Exercise 1 (20 Points): Starting with the two-stage implementation in TwoStage.bsv, replace each memory with FPGAMemory and extend the pipeline into a six-stage pipeline in SixStage.bsv. In simulation, benchmark qsort may take longer time (21 seconds on TA's desktop, and it may take even longer on the vlsifarm machines).

Notice that the two-stage implementation uses a conflict-free register file and scoreboard. However, you could use the pipelined or bypassed versions of these components for better performance. Also, you may want to change the size of scoreboard.

Discussion Question 2 (5 Points): What evidence do you have that all pipeline stages can fire in the same cycle?

Discussion Question 3 (5 Points): In your six-stage pipelined processor, how many cycles does it take to correct a mispredicted instruction?

Discussion Question 4 (5 Points): If an instruction depends on the result of the instruction immediately before it in the pipeline, how many cycles is that instruction stalled?

Discussion Question 5 (5 Points): What IPC do you get for each benchmark?

Adding a branch history table: SixStageBHT.bsv

The branch history table (BHT) is a structure that keeps track of the history of branches and is used in direction prediction. Your BHT should be indexed by a parameterized number of bits taken from the program counter -- typically bit n+1 down to bit 2 since bits 1 and 0 will always be zero. Each index should have a two-bit saturating counter. Do not include any valid bits or tags in the BHT; we are not concerned about aliasing in our predictions.

Exercise 2 (20 Points): Implement a branch history table in Bht.bsv that uses a parameterizable number of bits as an index into the table.

Discussion Question 6 (10 Points): Planning!

One of the hardest things about this lab is properly training and integrating the BHT into the pipeline. There are many mistakes that can be made while still seeing decent results. By having a good plan based on the fundamentals of direction prediction, you will avoid many of those mistakes.

For this discussion question, state your plan for integrating the BHT into the pipeline. The following questions should help guide you:

Exercise 3 (20 Points): Integrate a 256-entry (8-bit index) BHT into the six-stage pipeline from SixStage.bsv, and put the results in SixStageBHT.bsv.

Discussion Question 7 (5 Points): How much improvement do you see in the bpred_bht.riscv.vmh test over the processor in SixStage.bsv?

Exercise 4 (10 Points): Move address calculation for JAL up to the decode stage and use the redirect logic used by the BHT to redirect for these instructions too.

Discussion Question 8 (5 Points): How much improvement do you see in the bpred_j.riscv.vmh and bpred_j_noloop.riscv.vmh tests over the processor in SixStage.bsv?

Discussion Question 9 (5 Points): What IPC do you get for each benchmark? How much improvement is this over the original six-stage pipeline?

Discussion Question 10 (Optional): How long did it take you to complete this lab?

Remember to push your code with git push when you're done.

Bonus improvements: SixStageBonus.bsv

This section looks at two ways to speed up indirect jumps to addresses stored in registers (JALR).

Exercise 5 (10 Bonus Points): JALR instructions have known target addresses in the register fetch stage. Add a redirection path for JALR instructions in the register fetch stage and put the results in SixStageBonus.bsv. The bpred_ras.riscv.vmh test should give slightly better results with this improvement.

Most JALR instructions found in programs are used as returns from function calls. This means the target address for such a return was written into the return address register x1 (also called ra) by a previous JAL or JALR instruction that initiates the function call.

To make better prediction of JALR instructions, we can introduce the return address stack (RAS) to our processor. According to RISC-V ISA, a JALR instruction with rd=x0 and rs1=x1 is commonly used as the return instruction from a function call. Besides, a JAL or JALR instruction with rd=x1 is commonly used as the jump to initiate a function call. Therefore, we should push the RAS for JAL/JALR instruction with rd=x1, and pop the RAS for JALR instruction with rd=x0 and rs1=x1.

Exercise 6 (10 Bonus Points): Implement a return address stack and integrate it into the Decode stage of your processor (SixStageBonus.bsv). An 8 element stack should be enough. If the stack fills up, you could simply discard the oldest data. The bpred_ras.riscv.vmh test should give an even better result with this improvement. If you implemented the RAS in a separate BSV file, make sure to add it to the git repository for grading.