$\qquad$

# Computer System Architecture <br> 6.823 Quiz \#1 March 6, 2015 <br> Professors Daniel Sanchez and Joel Emer 

## This is a closed book, closed notes exam. 80 Minutes <br> 18 Pages

Notes:

- Not all questions are of equal difficulty, so look over the entire exam and budget your time carefully.
- Please carefully state any assumptions you make.
- Please write your name on every page in the quiz.
- You must not discuss a quiz's contents with other students who have not yet taken the quiz.

| Part A | $\square$ | 27 Points |
| :--- | :--- | :--- |
| Part B | $=$ | 38 Points |
| Part C | $=$ | 35 Points |

$\qquad$

## Part A: Self-Modifying Code (27 Points)

In this problem we will use and extend the EDSACjr instruction set from Handout 1, shown in Table A-1.

| Opcode | Description |
| :--- | :--- |
| ADD $n$ | Accum $\leftarrow$ Accum $+\mathrm{M}[n]$ |
| SUB $n$ | Accum $\leftarrow$ Accum $-\mathrm{M}[n]$ |
| LD $n$ | Accum $\leftarrow \mathrm{M}[n]$ |
| ST $n$ | $\mathrm{M}[n] \leftarrow$ Accum |
| CLEAR | Accum $\leftarrow 0$ |
| OR $n$ | Accum $\leftarrow$ Accum $\mid \mathrm{M}[n]$ |
| AND $n$ | Accum $\leftarrow$ Accum $\& \mathrm{M}[n]$ |
| SHIFTR $n$ | Accum $\leftarrow$ Accum shiftr $n$ |
| SHIFTL $n$ | Accum $\leftarrow$ Accum shiftl $n$ |
| BGE $n$ | If Accum $\geq 0$ then PC $\leftarrow n$ |
| BLT $n$ | If Accum $<0$ then PC $\leftarrow n$ |
| END | Halt machine |

Table A-1. EDSACjr Instruction Set

## Question 1 (15 Points)

Write a program that loops over an n-item array and replaces each item with its absolute value, as shown in the following pseudo-code:

```
for (i = 0; i < n; i++)
    A[i] = |A[i]|
```

Part of the program is already written for you, and to simplify your job you can assume the loop will be executed only once. The memory map on the next page shows the memory contents before the program starts. Array A is stored in memory in a contiguous manner, starting from location A. Memory locations N, I, and ONE hold the values of $n, i$, and 1 , respectively. If you need to, you can use additional memory locations for your own variables. You should label each variable and define its initial value.
$\qquad$

Memory:


Program:

| loop: | LD | I |
| :---: | :---: | :---: |
|  | SUB | N |
|  | BGE | done |
| I1: | LD | A |
|  | BGE | cont |
|  | ST | TMP |
|  | CLEAR |  |
|  | SUB | TMP |
| I2: | ST | A |
| cont: | LD | I1 |
|  | ADD | ONE |
|  | ST | I1 |
|  | LD | I2 |
|  | ADD | ONE |
|  | ST | I2 |

Page 3 of 18
$\qquad$

## Question 2 (12 Points)

Tired of writing self-modifying code, Ben Bitdiddle decides to extend EDSACjr to support indirect addressing. However, because registers are expensive, Ben does not want to add an index register. Instead, he implements the indirect addressing instructions shown in Table A-2. To execute an indirect addressing instruction, the new architecture first reads the target address from memory and then loads/stores the data from/to memory.

| Opcode | Description |
| :--- | :--- |
| ADDind $n$ | Accum $\leftarrow$ Accum $+\mathrm{M}[\mathrm{M}[n]]$ |
| SUBind $n$ | Accum $\leftarrow$ Accum $-\mathrm{M}[\mathrm{M}[n]]$ |
| LDind $n$ | Accum $\leftarrow \mathrm{M}[\mathrm{M}[n]]$ |
| STind $n$ | $\mathrm{M}[\mathrm{M}[n]] \leftarrow$ Accum |

Table A-2. Additional Indirect Addressing Instructions
Using the instructions in Table A-1 and Table A-2, rewrite the program from Question 1 without using self-modifying code. As before, you can use additional memory locations for your own variables. You should label each variable and define its initial value.
$\qquad$


Page 5 of 18
$\qquad$

## Part B: Caches and Virtual Memory (38 pts)

## Question 1 (13 points)

Consider a reference stream that repetitively loops over four addresses, A, B, C, and D (ABCDABCDABCD....). We will study how different replacement policies perform on this reference stream, using a small, 2-entry, fully-associative cache.

1. Find out how the cache performs with LRU replacement. Fill the table below to show the cache contents over time, and note whether each access is a hit or a miss. Then, compute the long-term miss ratio (i.e., discounting cache warm-up). (3 points)

| Access | $\mathbf{0}$ | $\mathbf{1}$ | $\mathbf{2}$ | $\mathbf{3}$ | $\mathbf{4}$ | $\mathbf{5}$ | $\mathbf{6}$ | $\mathbf{7}$ | $\mathbf{8}$ | $\mathbf{9}$ | $\mathbf{1 0}$ | $\mathbf{1 1}$ |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Address | A | B | C | D | A | B | C | D | A | B | C | D |
| Entry 1 | - | A | A | C | C | A | A | C | C | A | A | C |
| Entry 2 | - | - | B | B | D | D | B | B | D | D | B | B |
| Hit? | N | N | N | N | N | N | N | N | N | N | N | N |

What is the long-term miss ratio under LRU? 100\%
2. Find out how the cache performs under optimal replacement. This cache cannot bypass accesses, i.e., on every miss, it must replace an existing block and insert the new block. Fill the time diagram below, and find the long-term hit rate. (5 points)

| Access | $\mathbf{0}$ | $\mathbf{1}$ | $\mathbf{2}$ | $\mathbf{3}$ | $\mathbf{4}$ | $\mathbf{5}$ | $\mathbf{6}$ | $\mathbf{7}$ | $\mathbf{8}$ | $\mathbf{9}$ | $\mathbf{1 0}$ | $\mathbf{1 1}$ |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Address | A | B | C | D | A | B | C | D | A | B | C | D |
| Entry 1 | - | A | A | A | A | A | B | C | C | C | C | C |
| Entry 2 | - | - | B | C | D | D | D | D | D | A | B | B |
| Hit? | N | N | N | N | Y | N | N | Y | N | N | Y | N |

What is the long-term miss ratio under optimal replacement? 66\% (2/3)
$\qquad$
3. In the example, is there a simple policy that, without knowing the future, performs as well as the optimal one? If so, which one? (5 points)

Yes, MRU (Most Recently Used)

## Question 2 (9 points)

Consider a byte addressing system with 16-bit virtual and physical addresses. The system has a cache with the following properties:

- 8 sets with 128 bytes per block
- 4-way set-associative organization
- Virtually-indexed, physically-tagged

1. Suppose we use 256-byte pages. Where in the cache can virtual address $0 x A B C D$ live? Please use crosses ( X ) to mark its possible locations in the diagram below. (The binary representation of $0 x A B C D$ is 101010111100 1101.) (3 points)

| Index | Cache Contents |  |  |  |
| :---: | :---: | :---: | :---: | :---: |
|  | Way 0 | Way 1 | Way 2 | Way 3 |
| 0 |  |  |  |  |
| 1 |  |  |  |  |
| 2 |  |  |  |  |
| 3 |  |  |  |  |
| 4 |  |  |  |  |
| 5 |  |  |  |  |
| 6 |  |  |  | X |
| 7 | X | X | X |  |

bit 150
Virtual address: 1010101111001101
Index: 111 (bit 9-7)
$\qquad$
2. As before, suppose we use 256-byte pages. Where in the cache can physical address 0xABCD live? Please use crosses ( X ) to mark its possible locations. (3 points)

| Index | Cache Contents |  |  |  |
| :---: | :---: | :---: | :---: | :---: |
|  | Way 0 | Way 1 | Way 2 | Way 3 |
| 0 |  |  |  |  |
| 1 | X | X | X | X |
| 2 |  |  |  |  |
| 3 | X | X | X | X |
| 4 |  |  |  |  |
| 5 | X | X | X | X |
| 6 |  |  |  |  |
| 7 | X | X | X | X |

bit 15
0
Physical address: 1010101111001101
Virtual address: xxxx xxxx 11001101
Index: xx1 (bit 9-7)
3. Suppose we use 1024-byte pages instead. Where in the cache can physical address 0xABCD live? Please use crosses ( X ) to mark its possible locations. (3 points)

| Index | Cache Contents |  |  |  |
| :---: | :---: | :---: | :---: | :---: |
|  | Way 0 | Way 1 | Way 2 | Way 3 |
| 0 |  |  |  |  |
| 1 |  |  |  |  |
| 2 |  |  |  |  |
| 3 |  |  |  |  |
| 4 |  |  |  |  |
| 5 |  |  |  |  |
| 6 |  |  |  | X |
| 7 | X | X | X | X |

bit 15
0
Physical address: 1010101111001101
Virtual address: xxxx xx11 11001101
Index: 111 (bit 9-7)
$\qquad$

## Question 3 (16 points)

We'd like our memory system to support two page sizes: 256-byte small pages and 1024-byte large pages. A common approach to support multiple page sizes is to use separate TLBs, one for each page size. Instead, to reduce area overheads, we will use a single TLB to cache translations of both small and large pages, shown in Figure B-1. The TLB has 8 sets and 2 ways. The L bit denotes whether the cached PTE is for a large page.
$\mathrm{V}=$ valid bit $\quad \mathrm{L}=$ large page bit (set to 1 when a large page is stored)
PPN = physical page number

| Way 0 |  |  |  |  |  |  |  |  |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Way 1 |  |  |  |  |  |  |  |  |
|  | V | L | Tag | PPN | V | L | Tag | PPN |
| 0 |  |  |  |  |  |  |  |  |
| 1 |  |  |  |  |  |  |  |  |
| 2 |  |  |  |  |  |  |  |  |
| 3 |  |  |  |  |  |  |  |  |
| 4 |  |  |  |  |  |  |  |  |
| 5 |  |  |  |  |  |  |  |  |
| 6 |  |  |  |  |  |  |  |  |
| 7 |  |  |  |  |  |  |  |  |

Figure B-1. TLB for multiple page sizes.
Each TLB access consists of three steps. First, the TLB checks for a small-page match, using the tag and index bits shown in Figure B-2. Second, if it does not find a small-page match, it checks for a large-page match, using the tag and index bits in Figure B-3. Third, if the second lookup misses as well, it results in a TLB miss and a page table walk.


Figure B-2. Tag and index bits for small (256-byte) pages.


Figure B-3. Tag and index bits for large (1024-byte) pages.
$\qquad$

Assume virtual address 0xABBA translates to physical address 0x47BA.

1. If virtual address $0 x A B B A$ belongs to a small (256-byte) page, fill in the fields of the TLB entry, and mark all possible TLB locations it can be in. (3 points)

TLB entry

| L | Tag | PPN |
| :---: | :---: | :---: |
| 0 | 10101 | $0 \times 47$ |

$0 x A B B A \quad=1010101110111010$
$0 x 47 B A \quad=0100011110111010$

VPN $=10101011$
PPN $=01000111$
Page offset $=10111010$

Index to TLB $=011$ (from VPN bit 8-10)
Tag $\quad=10101$ (from VPN bit 11-15)

Possible locations

|  | Way 0 | Way 1 |
| :---: | :---: | :---: |
| 0 |  |  |
| 1 |  |  |
| 2 |  |  |
| 3 | X | X |
| 4 |  |  |
| 5 |  |  |
| 6 |  |  |
| 7 |  |  |

2. If virtual address $0 x A B B A$ belongs to a large (1024-byte) page, fill in the fields of the TLB entry, and mark all possible TLB locations it can be in. (3 points)

TLB entry

| L | Tag | PPN |
| :---: | :---: | :---: |
| 1 | 10101 | 010001 |

0xABBA $=1010101110111010$
$0 x 47 B A=0100011110111010$
VPN $=101010$
PPN $\quad=010001$
Page offset $=1110111010$
Index to TLB $=000($ from VPN bit $10+00)$
Tag $\quad=10101$ (from VPN bit 11-15)

Possible locations

|  | Way 0 | Way 1 |
| :---: | :---: | :---: |
| 0 | X | X |
| 1 |  |  |
| 2 |  |  |
| 3 |  |  |
| 4 |  |  |
| 5 |  |  |
| 6 |  |  |
| 7 |  |  |

$\qquad$
3. What is the reach of this TLB? (TLB reach = maximum amount of memory accessible without TLB misses) (4 points)
$2 * 2 * 1 \mathrm{~K}$ page $+6 * 2 * 256$ page $=7 \mathrm{~K}$
4. This TLB has a utilization problem for large pages. Explain why it happens and how to solve it. (6 points)

Problem: Padding zeros limits large pages to locate only in entry 0 and 4.
Do not pad index bits with 00 for large pages but use the first 3 bits from the VPN of large pages.

Fixed reach: $8 * 2 * 1 \mathrm{~K}$ page $=16 \mathrm{~K}$
$\qquad$

## HAL 180 ISA and 6-Stage Pipelined Implementation

Inspired by how the IBM 360 uses condition codes, Ben Bitdiddle designs the HAL 180 architecture, which features two flag registers. Table C-1 describes these flags.

| Name | Description |
| :--- | :--- |
| Sign Flag (SF) | Stores 1 if the result of the last arithmetic or comparison <br> instruction was negative, 0 if it was non-negative |
| Zero Flag (ZF) | Stores 1 if the result of the last arithmetic, logical, or <br> comparison instruction was zero, and 0 if it was non-zero |

Table C-1. HAL 180 status flags.

Table C-2 summarizes the different instruction types and the flags they read or write. The SF and ZF columns have an " $R$ " when the instruction reads the status flag, a "W" if it writes the flag (and does not read it), or a blank if the instruction does not affect the status flag. For example, JL (jump if less than) reads SF; ADD writes all flags; and JMP (unconditional jump) does not affect any flag. Some instructions, like CMP, write the status flags but do not return any result.

| Instruction | Description | SF | ZF |
| :---: | :---: | :---: | :---: |
| Arithmetic Instructions |  |  |  |
| ADD $s 1, s 2$ | $s 1 \leftarrow s 1+s 2$ | W | W |
| SUB $s 1, s 2$ | $s 1 \leftarrow s 1-s 2$ | W | W |
| MUL $s 1, s 2$ | $s 1 \leftarrow s 1 \times s 2$ | W | W |
| Logical Instructions |  |  |  |
| AND $s 1, s 2$ | $s 1 \leftarrow s 1 \& s 2$ |  | W |
| OR s1, $s 2$ | $s 1 \leftarrow s 1 \mid s 2$ |  | W |
| XOR sl, s2 | $s 1 \leftarrow s l^{\wedge} s 2$ |  | W |
| Comparison Instructions |  |  |  |
| CMP s1, s2 | temp $\leftarrow s 1-s 2$ | W | W |
| Jump Instructions |  |  |  |
| JMP target | jump to the address specified by target |  |  |
| JL target | jump to target if $\mathrm{SF}==1$ | R |  |
| JG target | jump to target if SF $==0$ and $\mathrm{ZF}==0$ | R | R |
| Memory Instructions |  |  |  |
| LD s1, s2 | $s 1 \leftarrow \mathrm{M}[s 2]$ |  |  |
| ST $s 1, s 2$ | $\mathrm{M}[s 1] \leftarrow s 2$ |  |  |

Table C-2. HAL 180 instruction set.
$\qquad$

Ben also designs a 6 -stage pipelined implementation of the HAL 180. In this pipeline, the ALU takes three pipeline stages (E1, E2, and E3), and status flags are updated in stage E3. Table C-3 describes each stage, and Figure C-4 shows the datapath of this 6 -stage pipelined architecture, highlighting the differences with a conventional MIPS pipeline. Note that this implementation does not have any data bypass paths.

| Stage | Description |
| :---: | :--- |
| Fetch and Decode | Fetch an instruction from the instruction memory, decode the <br> instruction, and fetch the register values from the register file. <br> The status flag checking for conditional jumps is also done in <br> this stage. |
| Execute Stage 1 (E1) | The first stage of the execution phase. Generate partial results <br> and store them in the pipeline registers. |
| Execute Stage 2 (E2) | The second stage of the execution phase. Generate partial <br> results and store them in the pipeline registers. |
| Execute Stage 3 (E3) | The final stage of the execution phase. Final results are <br> generated and flag registers get updated if necessary. |
| Memory Stage (M) | Perform load/store from/to the data memory if necessary. |
| Writeback Stage (WB) | Write to the register file if necessary. |

Table C-3. HAL 180 pipeline stages.
$\qquad$


Figure C-4. HAL 180 6-Stage pipelined implementation.
$\qquad$

## Part C: Status Flags (35 Points)

## Question 1 (12 points)

Write the HAL 180 assembly for the following program. For maximum credit, use the minimum number of comparison and jump instructions.

```
if (a < b) {
    c = c XOR b;
} else if (a > b) {
        c = c XOR a;
} else {
        c = 0;
}
a = 0;
b = 0;
```

Assume variables $a, b$, and $c$ are stored in registers $\mathbf{R 1}, \mathbf{R 2}$, and $\mathbf{R} 3$ respectively.

|  | CMP | R1, R2 |
| :---: | :---: | :---: |
|  | JL | _L1 |
|  | JG | _L2 |
|  | XOR | R3, R3 |
|  | JMP | _L3 |
| _L1: | XOR | R3, R2 |
|  | JMP | _L3 |
| _L2: | XOR | R3, R1 |
| _L3: | XOR | R1, R1 |
|  | XOR | R2, R2 |

(You get full grades if you have better answer)
$\qquad$

## Question 2 (10 points)

Ben's HAL 180 6-stage pipeline (Figure C-4) stalls to avoid data hazards through registers, but does not yet handle hazards due to status flags. To illustrate why this is problematic, consider the following instruction sequence:

| I0: |  | ADD | R1, R2 | Set | SF =1 | ZF $=0$ |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| I1: |  | JG | _L2 | Not | Taken |  |
| I2: |  | XOR | R1, R3 | Set | ZF $=0$ |  |
| I3: |  | JL | _L2 | Taken |  |  |
| I4: | _L1: | SUB | R1, R2 |  |  |  |
| I5: | _L2: | ADD | R3, R1 |  |  |  |

Assume that when the program start, $\mathrm{R} 1=-1, \mathrm{R} 2=-2, \mathrm{R} 3=-3$, and all the status flags are zero. Fill out the following instruction flow diagram to incur the minimum amount of stalls while maintaining correct operation (i.e., use stalls to respect both data and status flag dependences). Use " X "s to denote pipeline bubbles.

|  | T 0 | T 1 | T 2 | T 3 | T 4 | T 5 | T | T 7 | T 8 | T 9 |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| FD | I 0 | I 1 | I 1 | I 1 | I 1 | I 2 | I 2 | I 3 | I 5 | I 5 |
| E 1 |  | I 0 | X | X | X | I 1 | X | I 2 | I 3 | X |
| E 2 |  |  | I 0 | X | X | X | I 1 | X | I 2 | I 3 |
| E 3 |  |  |  | I 0 | X | X | X | I 1 | X | I 2 |
| M |  |  |  |  | I 0 | X | X | X | I 1 | X |
| W |  |  |  |  |  | I 0 | X | X | X | I 1 |

Name $\qquad$

## Question 3 (6 points)

Let's fix Ben's implementation by extending the existing stall control signal, which already works for register hazards, to also stall on status flag hazards.

First, derive the stall conditions for the different jumps: $J M P_{\text {stall }}, J L_{\text {stall }}$, and $J G_{\text {stall }}$. Use Opcode $(Y)$ to indicate the condition when the instruction in $X$ stage is $Y$. $Y$ can be a specific instruction or an instruction class (see Table C-2). For example:

```
            Opcode FD (JG): if the instruction in the FD stage is a JG instruction.
        Opcode}\mp@subsup{E}{E1}{\prime}(Logic): if the instruction in the E1 stage belongs to the logical
        instruction class (e.g. OR).
Opcode e2 (CMP|Arith): if the instruction in the E2 stage is a CMP instruction or
                                    belongs to the arithmetic instruction class.
JMP
```



```
    Opcode&3(logic|Arith|CMP)
JLstall = Opcode E1 (Arith|CMP) | Opcode E2 (Arith|CMP) |
    OpcodeE\mp@code{EArith|CMP)}
```

Finally, write down the new stall signal (stall') by using the old stall signal (stall) and stall conditions you derive.

```
stall'= stall | (Opcode FD(JL) & JLstall) |
    (OpcodeFD}(JG) & JGstall)
```

Name

## Question 4 (7 points)

Does this 6-stage pipeline add more challenges to precise exception handling? If so, please explain.

Yes. Since the status flags are set in E3 stages, you will need some mechanism to roll back in order to handle exceptions after E3 stages.

