# 6.5930/1 Hardware Architectures for Deep Learning

# Kernel Computation - Impact of Memory Hierarchy

February 21, 2024

Joel Emer and Vivienne Sze

Massachusetts Institute of Technology Electrical Engineering & Computer Science



#### **Goals of Today's Lecture**

- Understand impact of memory hierarchy
  - Overview of caches
  - Structuring algorithms to work well in caches using tiling
  - Storage technologies

### Readings for this Week

- Efficient Processing of Deep Neural Networks
  - Chapter 4 of <a href="https://doi.org/10.1007/978-3-031-01766-7">https://doi.org/10.1007/978-3-031-01766-7</a>

.

#### Simple Pipelined µArchitecture



What are consequences of putting large memory (e.g., megabytes) directly in pipeline?

Long latency => dependency stalls Large energy consumption

#### Pipelined µArchitecture with Caches



Instruction cache (I\$) and data cache (D\$) hold memory data for reuse in small energy efficient buffer

#### **Direct Mapped Cache**



February 21, 2024 Sze and Emer

#### **Cache Operation**

Look at data address, search cache tags to find match.

Then if...



Metric: Hit Rate = #Hits / (#Hits + #Misses)

February 21, 2024

#### **Treatment of Writes**

- Cache hit:
  - write through: write both cache & memory
    - · generally higher traffic but simplifies cache in processor pipeline
  - write back: write cache only (memory is written only when the entry is evicted)
    - a dirty bit per block can further reduce the traffic
- Cache miss:
  - no write allocate: only write to main memory
  - write allocate (aka fetch on write): fetch into cache
- Common combinations:
  - write through and no write allocate
  - write back with write allocate

#### **Cache Locality**

Caches **implicitly** try to optimize data movement by trying to exploit two common properties of memory references:

- Spatial Locality: If a location is referenced it is likely that locations near it will be referenced in the near future.
  - Exploited by having block size larger than a word, which also amortizes fill overheads by getting more bytes with one access
- Temporal Locality: If a location is referenced it is likely to be referenced again in the near future.
  - Exploited by holding blocks for future access

### Fully Connected (FC) Computation

#### Predefined constant

#### Impact of spatial locality

- Typical in-pipeline cache size
  - 64K bytes => 16K FP32 words
  - 64 byte blocks => 16 FP32 words/block

Hit rate of long sequential reference streams due to spatial locality?

15/16 => ~94%

#### FC – Data Reference Pattern



February 21, 2024

Sze and Emer

#### FC – Data Reference Pattern



February 21, 2024 Sze and Emer

#### **Amount of temporal locality**

- Typical in-pipeline cache size
  - 64K bytes => 16K FP32 words
  - 64 byte blocks => 16 FP32 words/block
- Typical layer size:

$$- H, W = 256 C = 128$$

Size of input activations?

256x256x128x4 => 32MB

What does this imply for Input activation temporary locality?

No temporal locality since 32MB > 64K bytes

#### **Computational Intensity**

Computational Intensity = 
$$\frac{MACS}{Data\ Words}$$

$$O_m = I_{chw} \times F_{m,chw}$$

Number MACS: M\*CHW = M\*C\*H\*W

### **Computational Intensity – Ideal FC**

Computational Intensity = 
$$\frac{MACS}{Data\ Words}$$

$$O_m = I_{chw} \times F_{m,chw}$$

Number MACS: M\*C\*H\*W Filter weight accesses: M\*CHW = M\*C\*H\*W

Input activation accesses: CHW = C\*H\*W

Output activation accesses: M

Computational Intensity = 
$$\frac{M \times C \times H \times W}{M \times C \times H \times W + C \times H \times W + M} = \frac{1}{1 + \frac{1}{M} + \frac{1}{C \times H \times W}} \sim 1$$

If in one cycle a processor can deliver one word from memory and perform one MAC how well will the machine perform running this code?

#### Computational Intensity – Naïve FC

Computational Intensity = 
$$\frac{MACS}{Data\ Words}$$

```
CHWm = -CHW;
for m in [0, M):
    o[m] = 0;
    CHWm += CHW
    for chw in [0, CHW):
        o[m] += i[chw] * f[CHWm + chw]
```

Number MACS: N

M\*C\*H\*W

Filter weight accesses

M\*C\*H\*W

Input activation accesses

M\*C\*H\*W

Output activation accesses

M

$$\frac{M \times C \times H \times W}{M \times C \times H \times W + M \times C \times H \times W + M} = \frac{1}{2 + \frac{1}{C \times H \times W}} \sim \frac{1}{2}$$

#### **Einsum for strip mined FC**

$$O_m = I_{chw} \times F_{m,chw}$$

$$I_{chw/T,chw\%T} = I_{chw}$$

$$chw1 chw0$$

$$I_{chw/T,chw\%T} = I_{chw}$$
 $F_{m,chw/T,chw\%T} = F_{m,chw}$ 
 $chw1 chw0$ 
 $F_{m,chw/T,chw\%T} = F_{m,chw}$ 

$$O_m = I_{chw1,chw0} \times F_{m,chw1,chw0}$$

### Fully Connected – Strip Mined

```
for m in [0, M):
    for chw in [0, C*H*W):
    o[m] += i[chw] * f[CHW*m + chw]
```

```
// Level 1
for chw1 in [0, CHW1):
    for m in [0, M):

// Level 0
    for chw0 in [0, CHW0):
        chw = CHW0*chw1+chw0
        o[m] += i[chw] * f[CHW*m + chw]

Inner loop working
    set = CHW0
```

Just considering input activations, Less than cache size what value should CHW0 be?

#### FC - Strip Mined Data Reference Pattern



February 21, 2024

Sze and Emer

#### **Computational Intensity – Strip Mined**

```
// Level 1
for chw1 in [0, CHW1):
   for m in [0, M):
   // Level 0
   for chw0 in [0, CHW0):
        chw = CHW0*chw1+chw0
        o[m] += i[chw] * f[CHW*m + chw]
```

Number MACS: M\*C\*H\*W Filter weight accesses: M\*C\*H\*W

Input activation accesses: C\*H\*W

Output activation accesses M

Computational Intensity = 
$$\frac{M \times C \times H \times W}{M \times C \times H \times W + C \times H \times W + M} = \frac{1}{1 + \frac{1}{M} + \frac{1}{C \times H \times W}} \sim \frac{1}{1 + \frac{1}{M} + \frac{1}{C \times H \times W}}$$

### **Matrix-Vector Multiply – Strip Mined**



#### **Associative Cache**



February 21, 2024

Sze and Emer

### **Cache Miss Pipeline Diagram**



MISS





#### **Avoiding Cache Miss Stalls**

- Reorganize code so loads are far ahead of use
  - Requires huge amount of unrolling
  - Consumes lots of registers
- Add 'prefetch' instructions that just load cache
  - Consumes instruction issue slots
- Add hardware that automatically loads cache

#### **Hardware Data Prefetching**

- Prefetch-on-miss:
  - Prefetch b + 1 upon miss on b
- One Block Lookahead (OBL) scheme
  - Initiate prefetch for block b + 1 when block b is accessed
  - Can extend to N block lookahead
- Strided prefetch
  - If observe sequence of accesses to block b, b+N, b+2N, then prefetch b+3N etc.

Example: IBM Power 5 [2003] supports eight independent streams of strided prefetch per processor, prefetching 12 lines ahead of current access

#### **Multi-level Caches**

- A memory cannot be large and fast
- Add level of cache to reduce miss penalty
  - Each level can have longer latency than level above
  - So, increase sizes of cache at each level



#### Metrics:

Local miss rate = misses in cache/ accesses to cache

Global miss rate = misses in cache / CPU memory accesses

Misses per instruction = misses in cache / number of instructions

### **Contemporary CPU Cache Hierarchy**



(a) Memory hierarchy for server



(b) Memory hierarchy for a personal mobile device

# FC Layer – Multichannel



February 21, 2024

Sze and Emer





#### **FC Einsum Notation**

$$O_{n,m} = F_{m,chw} \times I_{n,chw}$$











 After flattening, having a batch size of N turns the matrix-vector operation into a matrix-matrix multiply

How much temporal locality for naïve implementation? None

# **Matrix-Matrix Multiply**



# **Matrix-Matrix Multiply Tiled**





Matrix multiply tiled to fit in cache and computation ordered to maximize reuse of data in cache



Matrix multiply tiled to fit in cache and computation ordered to maximize reuse of data in cache



Matrix multiply tiled to fit in cache and computation ordered to maximize reuse of data in cache

\*Dotted line means partial result



Matrix multiply tiled to fit in cache and computation ordered to maximize reuse of data in cache

### **Einsum for tiled FC**

$$O_m = I_{n,chw} \times F_{m,chw}$$

$$I_{n,chw} \rightarrow I_{n1,chw1,n0,chw1}$$

$$F_{m,chw} \rightarrow F_{m1,chw1,m0,chw0}$$

$$O_{m1,m0} = I_{n1,chw1,n0,chw0} \times F_{m1,chw1,m0,chw0}$$

### Fully-Connected (FC) Layer

Implementation: Matrix Multiplication (GEMM)

• CPU: OpenBLAS, Intel MKL, etc

• GPU: cuBLAS, cuDNN, etc

- Library will note shape of the matrix multiply and select implementation optimized for that shape.
- Optimization usually involves proper tiling to storage hierarchy

# **Tradeoffs in Memories**

#### **Overview of Memories**

Memory consist of arrays of cells that hold a value.

- Types of Memories/Storage
  - Latches/Flip Flops (Registers)
  - SRAM (Register File, Caches)
  - DRAM (Main Memory)
  - Flash (Storage)

### **Elements of Memory Operation**

#### Implementations vary based on:

- How a memory cell holds a value?
- How is a value obtained from a memory cell?
- How is a value set in a memory cell?
- How is array constructed out of individual cells?
- Results in tradeoffs between cost, density, speed, energy and power consumption

## Latches/Flip Flops

- Fast and low latency
- Located with logic





February 21, 2024

## Latches/Flip Flops (< 0.5 kB)

- Fast and low latency
- Located with logic
- Not very dense
  - 10+ transistors per bit
  - Usually use for arrays smaller than 0.5kB



Array of Flip flops



February 21, 2024

# Latches/Flip Flops (< 0.5 kB)

#### Array of Flip flops



February 21, 2024

### SRAM

- Higher density than register
  - Usually, 6 transistors per bit-cell
- Less robust and slower than latches/flip-flop





Bit cell size 0.75um<sup>2</sup> in 14nm



## SRAM (kB – MB)



### **SRAM**



### **SRAM Power Dominated by Bit Line**

#### Measured SRAM Power Breakdown



Larger array → Longer bit-lines

→ Higher capacitance → Higher power

### **DRAM**

- Higher density than SRAM
  - 1 transistor per bit-cell
  - Needs periodic refresh
- Special device process





### DRAM (GB)

- Higher density than SRAM
  - 1 transistor per bit-cell
  - Needs periodic refresh
- Special device process
  - Usually off-chip (except eDRAM which is pricey!)
  - Off-chip interconnect has much higher capacitance



February 21, 2024 Sze and Emer

# Flash (100GB to TB)

- More dense than DRAM
- Non-volatile
  - Needs high powered write (change V<sub>TH</sub> of transistor)



- (a) Avalanche injection.
- (b) Removing programming voltage leaves charge trapped.
- (c) Programming results in higher  $V_T$ .

### Flash Memory



Single Level Cell (SLC)

Multi-levels cell (MLC)

48 layer, Ternary level cell (TLC)

Aug 2015

256 Gb per die (for SSD)



**Threshold Voltage** 

#### Multi-levels cell (MLC)



**Threshold Voltage** 

February 21, 2024

### **Memory Tradeoffs**



Most attributes tend to improve with technology scaling, lower voltage and sometimes smaller capacitors

February 21, 2024

### **Summary**

- Reduce main memory access with caches
  - Main memory (i.e., DRAM) is slow and has high energy consumption
  - Exploits spatial and temporal locality
- Tiling to reduce cache misses
  - Possible since processing order does not affect result (MACs are commutative)
  - Add levels to loop nest to improve temporal locality
  - Size of tile depends on cache size and cache associativity
- Tradeoffs in storage technology
  - Various tradeoffs in cost, speed, energy, capacity...
  - Different technologies appropriate at different spots in the design

#### **Next Lecture: Vectorization**

February 21, 2024

# Thank you!