#### Constructive Computer Architecture:

### **Pipelined Processors**

Arvind Computer Science & Artificial Intelligence Lab. Massachusetts Institute of Technology

October 6, 2014

#### Single-Cycle SMIPS: *Clock Speed*



 $t_{Clock} > t_{M} + t_{DEC} + t_{RF} + t_{ALU} + t_{M} + t_{WB}$ 

We can improve the clock speed if we execute each instruction in two clock cycles  $t_{Clock} > max \{t_{M}, (t_{DEC} + t_{RF} + t_{ALU} + t_{M} + t_{WB})\}$ However, this may not improve the performance because each instruction will now take two cycles to execute October 6, 2014 http://csg.csail.mit.edu/6.175

### Structural Hazards

- Sometimes multicycle implementations are necessary because of resource conflicts, aka, structural hazards
  - Princeton style architectures use the same memory for instruction and data and consequently, require at least two cycles to execute Load/Store instructions
  - If the register file supported less than 2 reads and one write concurrently then most instructions would take more than one cycle to execute

 Usually extra registers are required to hold values between cycles



October 6, 2014

http://csg.csail.mit.edu/6.175

L11-4

#### Two-Cycle SMIPS

```
module mkProc(Proc);
Reg#(Addr) pc <- mkRegU;
RFile rf <- mkRFile;
IMemory iMem <- mkIMemory;
DMemory dMem <- mkDMemory;
Reg#(Data) f2d <- mkRegU;
Reg#(State) state <- mkReg(Fetch);</pre>
```

```
rule doFetch (state == Fetch);
    let inst = iMem.req(pc);
    f2d <= inst;
    state <= Execute;
endrule</pre>
```

#### Two-Cycle SMIPS

rule doExecute(stage==Execute); let inst = f2d; let dInst = decode(inst); let rVal1 = rf.rd1(validRegValue(dInst.src1)); let rVal2 = rf.rd2(validRegValue(dInst.src2)); let eInst = exec(dInst, rVal1, rVal2, pc); if(eInst.iType == Ld) eInst.data <- dMem.reg(MemReg{op: Ld, addr: eInst.addr, data: ?}); **else if**(eInst.iType == St) let d <- dMem.req(MemReq{op: St, addr:</pre> eInst.addr, data: eInst.data}); if (isValid(eInst.dst)) rf.wr(validRegValue(eInst.dst), eInst.data); pc <= eInst.brTaken ? eInst.addr : pc + 4;</pre>

state <= Fetch; endrule endmodule no change from single-cycle





In any given clock cycle, lot of unused hardware !

October 6, 2014

# Problems in Instruction pipelining



- Control hazard: Inst<sub>i+1</sub> is not known until Inst<sub>i</sub> is at least decoded. So which instruction should be fetched?
- Structural hazard: Two instructions in the pipeline may require the same resource at the same time, e.g., contention for memory in Princeton-style architecture
- Data hazard: Inst<sub>i</sub> may affect the state of the machine (pc, rf, dMem) Inst<sub>i+1</sub>must be fully cognizant of this change none of these hazards were present in the FFT pipeline

## Arithmetic versus Instruction pipelining

 The data items in an arithmetic pipeline, e.g., FFT, are independent of each other



The entities in an instruction pipeline affect each other

 This causes pipeline stalls or requires other fancy tricks to avoid stalls

> Processor pipelines are significantly more complicated than arithmetic pipelines

October 6, 2014

#### The power of computers comes from the fact that the instructions in a program are *not* independent of each other

 $\Rightarrow$  must deal with hazard

October 6, 2014

#### **Control Hazards**



Inst<sub>i+1</sub> is not known until Inst<sub>i</sub> is at least decoded. So which instruction should be fetched?

- General solution speculate, i.e., predict the next instruction address
  - requires the next-instruction-address prediction machinery; can be as simple as pc+4
  - prediction machinery is usually elaborate because it dynamically learns from the past behavior of the program
- What if speculation goes wrong?
  - machinery to kill the wrong-path instructions, restore the correct processor state and restart the execution at the correct pc

#### **Two-stage Pipelined SMIPS**



October 6, 2014

http://csg.csail.mit.edu/6.175

L11-12

### Two-stage Pipelined SMIPS



- f2d must contain a Maybe type value because sometimes the fetched instruction is killed
- Fetch2Decode type captures all the information that needs to be passed from Fetch to Decode, i.e.
  - Fetch2Decode {pc:Addr, ppc: Addr, inst:Inst}

## Pipelining Two-Cycle SMIPS – single rule



### Inelastic versus Elastic pipeline

- The pipeline presented is inelastic, that is, it relies on executing Fetch and Execute together or atomically
- In a realistic machine, Fetch and Execute behave more asynchronously; for example memory latency or a functional unit may take variable number of cycles
- If we replace ir by a FIFO (f2d) then it is possible to make the machine more elastic, that is, Fetch keeps putting instructions into f2d and Execute keeps removing and executing instructions from f2d.

#### An elastic Two-Stage pipeline

```
rule doFetch ;
       let inst = iMem.req(pc);
       let ppc = nextAddr(pc); pc <= ppc;</pre>
       f2d.eng(Fetch2Decode{pc:pc, ppc:ppc,
                                               inst:inst});
    endrule
                                                 Can these rules
                                                 execute concurrently
    rule doExecute;
                                                 assuming the FIFO
        let x = f2d.first; let inpc = x.pc;
                                                 allows concurrent eng,
        let ppc = x.ppc; let inst = x.inst;
                                                 deg and clear?
       let dInst = decode(inst);
       ... register fetch ...;
       let eInst = exec(dInst, rVal1, rVal2, inpc, ppc);
       ...memory operation ...
       ... rf update ...
                                                     no -
       if (eInst.mispredict)
                                           begin
                                                     double writes in pc
            pc <= eInst.addr; f2d.clear; end</pre>
       else f2d.deq;
    endrule
October 6, 2014
                           http://csg.csail.mit.edu/6.175
                                                                    L11-16
```

#### An elastic Two-Stage pipeline: for concurrency make pc into an EHR

```
rule doFetch ;
       let inst = iMem.req(pc[0]);
       let ppc = nextAddr(pc[0]); pc[0] <= ppc;</pre>
       f2d.enq(Fetch2Decode{pc:pc[0], ppc:ppc, inst:inst});
    endrule
                                                 These rules can
                                                 execute concurrently
    rule doExecute;
                                                 assuming the FIFO has
        let x = f2d.first; let inpc = x.pc;
                                                 (eng CF deg) and
        let ppc = x.ppc; let inst = x.inst;
                                                 (eng < clear)
       let dInst = decode(inst);
       ... register fetch ...;
       let eInst = exec(dInst, rVal1, rVal2, inpc, ppc);
        ...memory operation ...
       ... rf update ...
       if (eInst.mispredict)
                                           begin
            pc[1] <= eInst.addr; f2d.clear; end</pre>
       else f2d.deq;
    endrule
October 6, 2014
                           http://csg.csail.mit.edu/6.175
                                                                    111 - 17
```

#### Correctness issue



Once Execute redirects the PC,

- no wrong path instruction should be executed
- the next instruction executed must be the redirected
  - one

#### This is true for the code shown because

- Execute changes the pc and clears the FIFO
- atomically
- Fetch reads the pc and enqueues the FIFO atomically

# Conflict-free FIFO with a Clear method

module mkCFFifo(Fifo#(2, t)) provisos(Bits#(t, tSz)); Ehr#(3, t) da <- mkEhr(?);If there is only one Ehr#(2, Bool) va <- mkEhr(False); element in the FIFO it Ehr#(2, t) db < - mkEhr(?);resides in da Ehr#(3, Bool) vb <- mkEhr(False); rule canonicalize if(vb[2] && !va[2]); da[2] <= db[2]; va[2] <= True; vb[2] <= False; endrule **method Action** enq(t x) **if**(!vb[0]); db[0] <= x; vb[0] <= True; endmethod **method Action** deq **if** (va[0]); first CF enq va[0] <= False; endmethod</pre> deq CF enq **method** t first **if**(va[0]); first < deq return da[0]; endmethod enq < clear **method Action** clear; va[1] <= False ; vb[1] <= False endmethod</pre> endmodule

Canonicalize must be the last rule to fire!

October 6, 2014

http://csg.csail.mit.edu/6.175

db da

## Why canonicalize must be last rule to fire



Consider rule foo. If p is false then canonicalize must fire after deq for proper concurrency.

If canonicalize uses EHR indices between deq and clear, then canonicalize won't fire when p is false

first CF enq deq CF enq first < deq enq < clear