

# **Debugging Techniques**

- Deficiency about \$display()
  - Everything shows up together
- Distinct log file for each module: write to file
  - Also see src/unit\_test/sc-test/Tb.bsv

```
Ehr#(2, File) file <- mkEhr(InvalidFile);
Reg#(Bool) opened <- mkReg(False);</pre>
```

```
rule doOpenFile(!opened);
let f <- $fopen("a.txt", "w");
if (f == InvalidFile) $finish;
file[0] <= f; opened <= True;
endrule
```

```
rule doPrint;
   $fwrite(file[1], "Hello world\n");
endrule
```

Dec 2. 2016

Writing to InvalidFile will cause segfault.

Use EHR if the logic will call \$fwrite() in the first cycle

## **Debugging Techniques**

- Deficiency about cycle counter
  - Rule for printing cycle may be scheduled before/after the rule we are interested in
  - Don't want to create a counter in each module
- Use simulation time

- \$display("%t: evict cache line", \$time);
- \$time() returns Bit#(64) representing time
- In SceMi simulation, \$time() outputs: 10, 30, ...

## **Debugging Techniques**

- Add sanity checks
- Example 1
  - Parent is handling upgrade request
  - No other child has incompatible state
  - Parent decides to send upgrade response
  - Check: parent is not waiting for any child (waitc)
- Example 2
  - D\$ receives upgrade response from memory
  - Check: must be in WaitFillResp state
  - Process the upgrade response
  - Check: if in I state, then data in response must be valid, otherwise data must be invalid (data field is Maybe type in the lab)

# Coherence Protocol: Differences From Lecture

- In lecture: address type for byte address
  - Implementation: only uses cache line address
  - (addr >> 6) for 64-byte cache line
- In lecture: parent reads data in zero cycles
  - Implementation: read from memory, long latency
- In lecture: voluntary downgrade rule
  - No need in implementation
- In lecture: Parent directory tracks states for all address
  - 32-bit address space  $\rightarrow$  huge directory
  - Implementation: usually parent is an L2 cache, so only track address in L2 cache
  - But we don't have an L2 cache



# Load-Reserve (Ir.w) and Store-Conditional (sc.w)

## New state in D\$

- - Reg#(Maybe#(CacheLineAddr)) la <- mkReg(Invalid);</pre>
  - Cache line address reserved by lr.w
- Load-reserved: lr.w rd, 0(rs1)
  - rd <= mem[rs1]</pre>
  - Make reservation: la <= Valid(getLineAddr(rs1));</p>
- Store-conditional: sc.w rd, rs2, 0(rs1)
  - Check la: la invalid or addresses don't match: rd <= 1</p>
  - Otherwise: get exclusive permission (upgrade to M)
    - Check la again
    - If address match: mem[rs1] <= rs2; rd <= 0
      </pre>
    - Otherwise: rd <= 1</p>
    - If cache hit, no need to check again (address already match)
  - Always clear reservation: la <= Invalid http://csg.csail.mit.edu/6.175

Dec 2, 2016

# Load-Reserve (Ir.w) and Store-Conditional (sc.w)

Cache line eviction

- Due to replacement, invalidation request ...
- May lose track of reserved cache line
  - Then clear reservation
- Compare evicted cache line with la
  - If match: la <= invalid</p>
- This is how an LR/SC pair ensures atomicity



#### **Reference Memory Model**

 Debug interface returned by reference model is passed into every D\$

interface RefDMem; method Action issue(MemReq req); method Action commit(MemReq req, Maybe#(CacheLine) line, Maybe#(MemResp) resp); endinterface

module mkDCache#(CoreID id)(

MessageGet fromMem, MessagePut toMem, RefDMem refDMem, DCache ifc);

D\$ calls the into debug interface refDMem

- Reference model will for coherence violations
- Reference model: src/ref

T08-9

#### **Reference Memory Model**

- issue(MemReq req)
  - Called when req issued to D\$
  - in req() method of D\$
  - Give program order to reference model
- commit(MemReq req, Maybe#(CacheLine) line, Maybe#(MemResp) resp);
  - Called when req() finishes processing (commit)
  - line: cache line accessed by req, set to Invalid if unknown
  - resp: response to the core, set to Invalid if no repsonse
- When commit() is called, reference model checks whether:
  - req can be committed
  - line value is correct (not checked if Invalid)
  - resp is correct

T08-10

## **Adding Store Queue**

- New behavior for memory requests
  - Ld: can start processing when store queue is not empty
  - St: enqueue to store queue
  - Lr, Sc: wait for store queue to be empty
  - Fence: wait for all previous requests to commit (i.e. store queue must be empty)
    - Ordering memory accesses

- Issuing stores from store queue to process
  - Only stall when there is a Ld request

#### **Multicore Programs**

- Run programs on 2-core system
- Single-thread programs
  - Found in programs/assembly, programs/benchmarks
  - core 1 starts looping forever at the very beginning
- Multithread programs
  - Find them in programs/mc\_bench
  - startup code (crt.S): allocate 128KB local stack for each core
  - main() function: fork based on core id

```
int main() {
    int coreid = getCoreId();
    if (coreid == 0) { return core0(); }
    else { return core1(); }
```

| Multico<br>mc_pri                                                  | ore Programs:                          |        |
|--------------------------------------------------------------------|----------------------------------------|--------|
| <ul> <li>Easiest o</li> <li>Two cores</li> <li>Sample o</li> </ul> | ne<br>s print "O" and "1" respectively |        |
|                                                                    | /build/mc_bench/vmh/mc_print.riscv.vmh |        |
|                                                                    |                                        |        |
|                                                                    |                                        |        |
| Dec 2, 2016                                                        | http://csg.csail.mit.edu/6.175         | T08-13 |

|                                         | /lulticore Programs:                                                         |        |  |  |  |
|-----------------------------------------|------------------------------------------------------------------------------|--------|--|--|--|
| M                                       | ic_hello                                                                     |        |  |  |  |
| •                                       | Core 0 passes each character of a string to core 1                           |        |  |  |  |
| •                                       | Core 1 prints each character it receives                                     |        |  |  |  |
| •                                       | <ul> <li>Sample output:</li> <li>(no cycle/inst count printed)</li> </ul>    |        |  |  |  |
| Hello W<br>This me<br>core 1.<br>PASSED | essage has been written to a software FIFO by core 0 and read and printed by | /      |  |  |  |
|                                         |                                                                              |        |  |  |  |
|                                         |                                                                              |        |  |  |  |
|                                         |                                                                              |        |  |  |  |
| Dec 2, 2016                             | http://csg.csail.mit.edu/6.175                                               | T08-14 |  |  |  |

| mc_produce_co                                                                                                    | onsume                                                                                                      |
|------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------|
| <ul> <li>Larger version of mc_he</li> </ul>                                                                      | ello                                                                                                        |
| Core 1 passes each ele                                                                                           | ment of an array to core 0                                                                                  |
| Core 0 checks the data                                                                                           |                                                                                                             |
| Sample output:                                                                                                   |                                                                                                             |
| //programs/build/mc_bench/vmh/r<br>Benchmark mc_produce_consume<br>Cycles (core 0) = xxx<br>Insts (core 0) = xxx | <pre>nc_produce_consume.riscv.vmh</pre>                                                                     |
| Cycles (core 1) = xxx<br>Insts (core 1) = xxx<br>Cycles (total) = xxx<br>Insts (total) = xxx                     | Instruction counts may vary due to variation in busy waiting time, so IPC is not a good performance metric. |

|                                                                                                                                                                                                              | e Programs:<br>ian,vvadd,multipl                                                       | _y}    |
|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------|--------|
| Core 0 calcu                                                                                                                                                                                                 | el: fork-join style<br>ulates first half results<br>ulates second half results<br>put: |        |
| //programs/bui<br>Benchmark mc_median<br>Cycles (core 0) = xxx<br>Insts (core 0) = xxx<br>Cycles (core 1) = xxx<br>Insts (core 1) = xxx<br>Cycles (total) = xxx<br>Insts (total) = xxx<br>Return 0<br>PASSED | .ld/mc_bench/vmh/mc_median.riscv.vmh                                                   |        |
| Dec 2, 2016                                                                                                                                                                                                  | http://csg.csail.mit.edu/6.175                                                         | T08-16 |

# Multicore Programs:

#### mc\_dekker

- Two cores contend for a mutex (Dekker's algorithm)
- After getting into critical section
  - increment/decrement shared counter, print core ID

#### Sample output:

|    | //programs/build/mc_bench/vmh/mc_dekker.riscv.vmh  |    |  |
|----|----------------------------------------------------|----|--|
|    | Benchm1ark mc_1dekker1                             |    |  |
|    | 100110000                                          |    |  |
|    | Core 0 decrements counter by 600                   |    |  |
|    | Core 1 increments counter by 900                   |    |  |
| ~~ | <sup>-</sup> inal counter value = 300              |    |  |
|    | Cycles (core 0) = xxx                              |    |  |
|    | Insts (core 0) = xxx                               |    |  |
|    | Cycles (core 1) = xxx                              |    |  |
| ~~ | Insts (core 1) = xxx                               |    |  |
|    | Cycles (total) = xxx For implementation with store |    |  |
|    | Insts (total) = xxx queue, a fence is inserted in  |    |  |
|    | Return 0                                           |    |  |
| ~~ | PASSED mc_dekker.                                  |    |  |
| D  | 2, 2016 http://csg.csail.mit.edu/6.1/5 108-        | 17 |  |

# Multicore Programs:

#### mc\_spin\_lock

 Similar to mc\_dekker, but use spin lock implemented by lr.w/sc.w

#### Sample output:

```
---- ../../programs/build/mc_bench/vmh/mc_spin_lock.riscv.vmh ----
 Bench1mark mc1_spin_l1ock
 10101...000
 Core 0 increments counter by 300
 Core 1 increments counter by 600
 Final counter value = 900
 Cycles (core 0) = xxx
 Insts (core 0) = xxx
 Cycles (core 1) = xxx
 Insts (core 1) = xxx
 Cycles (total) = xxx
 Insts (total) = xxx
 Return 0
 PASSED
                                    http://csg.csail.mit.edu/6.175
Dec 2, 2016
                                                                                        T08-18
```

# Multicore Programs:

#### mc\_incrementers

- Similar to mc\_dekker, but use atomic fetch-and-add implemented by lr.w/sc.w
- Core ID is not printed
- Sample output:

---- ../../programs/build/mc\_bench/vmh/mc\_incrementers.riscv.vmh ----Benchmark mc\_incrementers

```
core0 had 1000 successes out of xxx tries
core1 had 1000 successes out of xxx tries
shared_count = 2000
Cycles (core 0) = xxx
Insts (core 0) = xxx
Cycles (core 1) = xxx
Insts (core 1) = xxx
Cycles (total) = xxx
Insts (total) = xxx
Return 0
PASSED
```

Dec 2, 2016

108-19

#### **Some Reminders**

- Use CF regfile and scoreboard
  - Compiler creates a conflict in Sizhuo's implementation with bypass regfile and pipelined scoreboard
- Sign up for project meeting
- Project deadline: 3:00pm Dec 14
- Final presentation (10min)