6.5930/1
Hardware Architectures for Deep Learning

## Accelerator Architecture (continued)

March 11, 2024

Joel Emer and Vivienne Sze

Massachusetts Institute of Technology Electrical Engineering & Computer Science



#### **Operation Sequencing**



#### Multiprocessor



#### **Highly-Parallel Compute Paradigms**

## Temporal Architecture (SIMD/SIMT)



## Spatial Architecture (Dataflow Processing)



#### **Spatial Architecture for DNN**



Accelerator Architecture Temporally Spatially Programmed Programmed CPU **FPGA RAW** GPU AsAP **TRIPS** PicoChip WaveScalar Triggered **DySER** Instructions TTA



#### Field Programmable Gate Arrays



#### Microsoft Project Catapult

#### Configurable Cloud (MICRO 2016) for Azure



Accelerate and reduce latency for

- Bing search
- Software defined network
- Encryption and Decryption

#### Microsoft Brainwave Neural Processor



Source: Microsoft

#### **Heterogeneous Blocks**

- Add specific purpose logic on FPGA
  - Efficient if used (better area, speed, power), wasted if not
- Soft fabric
  - LUT, flops, addition, subtraction, carry logic
  - Convert LUT to memories or shift registers
- Memory block (BRAM)
  - Configure word and address size (aspect ratio)
  - Combine memory blocks to large blocks
  - Significant part for FPGA area
  - Dual port memories (FIFO)
- Multipliers /MACs → DSP
- CPUs and processing elements

| SOFT<br>LOGIC | SOFT<br>LOGIC | Memory<br>Block | MULT | SOFT<br>LOGIC | SOFT<br>LOGIC |
|---------------|---------------|-----------------|------|---------------|---------------|
| SOFT<br>LOGIC | SOFT<br>LOGIC |                 | MULT | SOFT<br>LOGIC | SOFT<br>LOGIC |
| SOFT<br>LOGIC | SOFT<br>LOGIC | Memory<br>Block | MULT | SOFT<br>LOGIC | SOFT<br>LOGIC |
| SOFT<br>LOGIC | SOFT<br>LOGIC |                 | MULT | SOFT<br>LOGIC | SOFT<br>LOGIC |
| SOFT<br>LOGIC | SOFT<br>LOGIC | Memory<br>Block | MULT | SOFT<br>LOGIC | SOFT<br>LOGIC |
| SOFT<br>LOGIC | SOFT<br>LOGIC |                 | MULT | SOFT<br>LOGIC | SOFT<br>LOGIC |



#### **Programmable Accelerators**

Processing Element



Many Programmable Accelerators look like an array of PEs, but have dramatically different architectures, programming models and capabilities



#### **Fixed Operation PEs**

- Each PE hard-wired to one operation
- Purely pipelined operation
  - no backpressure in pipeline

- Attributes
  - High-concurrency
  - Regular design, but
  - Regular parallelism only!
  - Allows for systolic communication

### **Configurable Systolic Array - WARP**



March 11, 2024

Source: WARP Architecture and Implementation, ISCA 1986

#### **Fixed Operation - Google TPU**



Systolic array does 8-bit 256x256 matrix-multiply accumulate

Source: Google



#### **Single Configured Operation - Dyser**



Source: Dynamically Specialized Datapaths for Energy Efficient Computing. HPCA11



March 11, 2024

#### PC-based Control – Wave Computing



Source: Wave Computing, Hot Chips '17



March 11, 2024



March 11, 2024

#### **Guarded Actions**



- Program consists of rules that may perform computations and read/write state
- Each rule specifies conditions (guard) under which it is allowed to fire
- Separates description and execution of data (rule body) from control (guards)
- A scheduler is generated (or provided by hardware) that evaluates the guards and schedules rule execution
- Sources of Parallelism
  - Intra-Rule parallelism
  - Inter-Rule parallelism
  - Scheduler overlap with Rule execution
  - Parallel access to state

#### **Triggered Instructions (TI)**

Restrict guarded actions down to efficient ISA core:



No program counter or branch instructions

#### **Triggered Instruction Scheduler**



- Use combinational logic to evaluate triggers in parallel
- Decide winners if more than one instruction is ready
  - Based on architectural fairness policy
  - Could pick multiple non-conflicting instructions to issue (superscalar)
- Note: no wires toggle unless status changes

6.5930/1
Hardware Architectures for Deep Learning

## Dataflow for DNN Accelerator Architectures (Part 1)

March 11, 2024

Joel Emer and Vivienne Sze

Massachusetts Institute of Technology Electrical Engineering & Computer Science



#### **Goals of Today's Lecture**

- Impact of data movement and memory hierarchy on energy consumption
- Taxonomy of dataflows for CNNs
  - Output Stationary
  - Weight Stationary
  - Input Stationary

#### **Background Reading**

#### DNN Accelerators

- Efficient Processing of Deep Neural Networks
  - Chapter 5 thru 5.7.1
  - Chapter 5 5.8

All these books and their online/e-book versions are available through MIT libraries.

# Dataflow and Memory Hierarchy

#### **Spatial Compute Paradigm**

Spatial Architecture (Dataflow Processing)







Worst Case: all memory R/W are **DRAM** accesses

Example: AlexNet [NeurIPS 2012] has 724M MACs

→ 2896M DRAM accesses required



Under what circumstances will these extra levels help?

Computational intensity > 1



Opportunities: 1 data reuse

## **Types of Data Reuse in DNN**

### **Convolutional Reuse**

CONV layers only (sliding window)



Reuse: Activations
Filter weights

## **Types of Data Reuse in DNN**

### **Convolutional Reuse**

CONV layers only (sliding window)

### **Fmap Reuse**

CONV and FC layers





Reuse: Activations
Filter weights

Reuse: Activations

## **Types of Data Reuse in DNN**

### **Convolutional Reuse**

CONV layers only (sliding window)



Reuse: Activations
Filter weights

### **Fmap Reuse**

CONV and FC layers



Reuse: Activations

### **Filter Reuse**

CONV and FC layers (batch size > 1)

Input Fmaps
Filter

Reuse: Filter weights

### Memory Access is the Bottleneck



Opportunities: 1 data reuse

1 Can reduce DRAM reads of filter/fmap by up to 500×\*\*

\*\* AlexNet CONV layers

### Memory Access is the Bottleneck



- Opportunities: 1 data reuse 2 local accumulation
  - 1 Can reduce DRAM reads of filter/fmap by up to 500×
  - Partial sum accumulation does NOT have to access DRAM

### Memory Access is the Bottleneck



- Opportunities: 1 data reuse 2 local accumulation
  - 1 Can reduce DRAM reads of filter/fmap by up to 500×
  - Partial sum accumulation does NOT have to access DRAM
    - Example: DRAM access in AlexNet can be reduced from **2896M** to **61M** (best case)

### Leverage Parallelism for Higher Performance



### Leverage Parallelism for Spatial Data Reuse



### **Spatial Architecture for DNN**



### **Low-Cost Local Data Access**





Sze and Emer

### **Low-Cost Local Data Access**

How to exploit 1 data reuse and 2 local accumulation with *limited* low-cost local storage?



### **Low-Cost Local Data Access**

How to exploit 1 data reuse and 2 local accumulation with *limited* low-cost local storage?

specialized processing dataflow required!



ШiТ

### How to Map the Dataflow?

### **CNN Convolution**



Goal: Increase reuse of input data (input activations and weights) and local partial sums accumulation

# Spatial Architecture (Dataflow Processing)



# **Dataflow Taxonomy**

- Output Stationary (OS)
- Weight Stationary (WS)
- Input Stationary (IS)

[Chen et al., ISCA 2016]

# **Output Stationary (OS)**



- Minimize partial sum R/W energy consumption
  - maximize local accumulation
- Broadcast/Multicast filter weights and reuse activations spatially across the PE array

### OS Example: ShiDianNao

### **Top-Level Architecture**

#### ShiDianNao: IB: Decoder Inst. NBin: Bank #0 NFU: Py Input **Buffer Controller** Bank #2Py-1 (Column) **NBout:** Px Bank #0 Px\*Py Input Bank #2Py-1 (Row) SB: Bank #0 Kernel Bank #Py-1 Px\*Py ALU Output

### **PE Architecture**



- Inputs streamed through array
- Weights broadcast
- Partial sums accumulated in PE and streamed out

[Du et al., ISCA 2015]

# **OS Example: KU Leuven**



[Moons et al., VLSI 2016, ISSCC 2017]

### 1-D Convolution Einsum

$$O_q = I_{q+s} \times F_s$$

Operational definition of Einsum says traverse all valid values of "q" and "s"... but in what order....

Traversal order (fastest to slowest): S, Q

Which "for" loop is outermost?

### 1-D Convolution



What dataflow is this?

Is it easy to tell dataflow from the loop nest?

**Output stationary** 

Yes, from outermost loop index

† Assuming: 'valid' style convolution

# **Output Stationary - Movie**

```
Tensor: F[S]
        Rank: S
Tensor: I[W]
Rank: W
   Tensor: O[Q]
   Rank: Q
```

## **Output Stationary – Spacetime View**



# **CONV-layer Einsum**

$$O_{m,p,q} = I_{c,p+r,q+s} \times F_{m,c,r,s}$$

Traversal order (fastest to slowest): S, R, Q, P

Parallel Ranks: C, M

Can you write the loop nest?

I hope so 😊

### **CONV Layer OS Loop Nest**

```
int i[C,H,W]; # Input activations
int f[M,C,R,S]; # Filter weights
int o[M,P,Q]; # Output activations
for p in [0, P):
 for q in [0, Q):
   for r in [0, R):
      for s in [0, S):
        parallel-for c in [0, C):
          parallel-for m in [0, M):
            o[m,p,q] += i[c,p+r,q+s]*f[m,c,r,s]
```







M=8 C=3 R=2 S=2 H=3 W=3 P=2

Q=2

output fmap



Filter overlay











output fmap

























































Start processing next output feature activations

















































### Next:

### More dataflows