Vector Processors

Joel Emer
Computer Science & Artificial Intelligence Lab
M.I.T.
Supercomputers

Definition of a supercomputer:
• Fastest machine in the world at given task
• A device to turn a compute-bound problem into an I/O bound problem
• Any machine costing $30M+
• Any machine designed by Seymour Cray

CDC6600 (Cray, 1964) regarded as first supercomputer
Supercomputer Applications

Typical application areas:
• Military research (nuclear weapons, cryptography)
• Scientific research
• Weather forecasting
• Oil exploration
• Industrial design (car crash simulation)
• Bioinformatics

All involve huge computations on large data sets

*In 70s-80s, Supercomputer ≡ Vector Machine*
Loop Unrolled Code Schedule

for (i=0; i<N; i++)

<table>
<thead>
<tr>
<th>Loop</th>
<th>Int1</th>
<th>Int 2</th>
<th>M1</th>
<th>M2</th>
<th>FP+</th>
<th>FPx</th>
</tr>
</thead>
<tbody>
<tr>
<td>B</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

```
for (i=0; i<N; i++)
```

```
loop:   ld f1, 0(r1)
        ld f2, 8(r1)
        ld f3, 16(r1)
        ld f4, 24(r1)
        add r1, 32
        fadd f5, f0, f1
        fadd f6, f0, f2
        fadd f7, f0, f3
        fadd f8, f0, f4
        sd f5, 0(r2)
        sd f6, 8(r2)
        sd f7, 16(r2)
        sd f8, 24(r2)
        add r2, 32
        bne r1, r3, loop
```
Vector Supercomputers

Epitomized by Cray-1, 1976:

• Scalar Unit
  – Load/Store Architecture

• Vector Extension
  – Vector Registers
  – Vector Instructions

• Implementation
  – Hardwired Control
  – Highly Pipelined Functional Units
  – No Data Caches
  – Interleaved Memory System
  – No Virtual Memory
Cray-1 (1976)
Cray-1 (1976)

Single Port Memory
- 16 banks of 64-bit words + 8-bit SECDED
- 80MW/sec data load/store
- 320MW/sec instruction buffer refill
- memory bank cycle 50 ns

64 Element Vector Registers

80MW/sec data load/store

320MW/sec instruction buffer refill

processor cycle 12.5ns (80MHz)
Vector Programming Model

**Scalar Registers**
- r15
- r0

**Vector Registers**
- v15
- v0

**Vector Length Register (VLR)**

**Vector Arithmetic Instructions**
- ADDV v3, v1, v2

- v1
- v2
- v3

- [0] [1] [2] [VLRMAX-1]
Vector Programming Model

Scalar Registers
- r0
- r15

Vector Registers
- v0
- v1
- v2
- [0] to [VLRMAX-1]

Vector Length Register (VLR)

Vector Load and Store Instructions

LV v1, r1, r2

Base, r1
Stride, r2
Compiler-based Vectorization

Scalar code

Vector code

for (i=0; i<N; i++)
    C[i] = A[i] + B[i];

Compiler recognizes independent operations with loop dependence analysis
## Vector Code Example

<table>
<thead>
<tr>
<th># C code</th>
<th># Scalar code</th>
<th># Vector code</th>
</tr>
</thead>
</table>
| for (i=0; i<64; i++)
C[i] = A[i] + B[i]; | LI R4, 64
loop:
L.D F0, 0(R1)
L.D F2, 0(R2)
ADD.D F4, F2, F0
S.D F4, 0(R3)
DADDIU R1, 8
DADDIU R2, 8
DADDIU R3, 8
DSUBIU R4, 1
BNEZ R4, loop | LI VLR, 64
LV V1, R1
LV V2, R2
ADDV.D V3, V1, V2
SV V3, R3 |

How many instructions?
Vector ISA Attributes

- **Compact**
  - One short instruction encodes N operations

- **Expressive, tells hardware that these N operations:**
  - Are independent
  - Use the same functional unit
  - Access disjoint elements in vector registers
  - Access registers in same pattern as previous instructions
  - Access a contiguous block of memory (unit-stride load/store)
  - Access memory in a known pattern (strided load/store)
Vector ISA Hardware Implications

• Large amount of work per instruction
  → Less instruction fetch bandwidth requirements
  → Allows simplified instruction fetch design

• No data dependence within a vector
  → Amenable to deeply pipelined/parallel designs

• Disjoint vector element accesses
  → Banked rather than multi-ported register files

• Known regular memory access pattern
  → Allows for banked memory for higher bandwidth
Vector Arithmetic Execution

- Use deep pipeline (=> fast clock) to execute element operations
- Simplifies control of deep pipeline because elements in vector are independent (=> no hazards!)

Given 64-element registers, how long does it take to compute V3?

V3 ← V1 * V2

Six-stage multiply pipeline
Vector Instruction Execution

ADDV C, A, B

**Execution using one pipelined functional unit**


**Execution using four pipelined functional units**

Vector Unit Structure

- **Function Unit (Adder)**
- **Vector Registers**
- **Function Unit (Mult)**
- **Lane**

Elements
- 0, 4, 8, ...
- 1, 5, 9, ...
- 2, 6, 10, ...
- 3, 7, 11, ...

Memory Subsystem
Vector Instruction Parallelism

Can overlap execution of multiple vector instructions
- example machine has 32 elements per vector register and 8 lanes

Complete 24 operations/cycle while issuing 1 short instruction/cycle
Vector Chaining

Problem: Long latency for RAW register dependencies

- Vector version of register bypassing
  - introduced with Cray-1

\[
\text{LV } v1 \\
\text{MULV } v3, v1, v2 \\
\text{ADDV } v5, v3, v4
\]
Vector Chaining Advantage

- Without chaining, must wait for last element of result to be written before starting dependent instruction.

- With chaining, can start dependent instruction as soon as first result appears.
Vector Memory System

Cray-1: 16 banks, 4 cycle bank busy time, 12 cycle latency

- Bank busy time: Cycles between accesses to same bank
- Allows 16 parallel accesses (if data in different banks)
Vector Stripmining

Problem: Vector registers have finite length
Solution: Break loops into pieces that fit in registers, "Strip mining"

for (i = 0; i < N; i++)
    C[i] = A[i] + B[i];

A   B   C

Remainder

64 elements

\[
\begin{align*}
\text{ANDI} & \quad \text{R1, N, 63} & \# \ N \ mod \ 64 \\
\text{MTC1} & \quad \text{VLR, R1} & \# \ Do \ remainder \\
\text{loop:} & \quad \\
\text{LV} & \quad \text{V1, RA} \\
\text{DSLL} & \quad \text{R2, R1, 3} & \# \ Multiply \ by \ 8 \\
\text{DADDU} & \quad \text{RA, RA, R2} & \# \ Bump \ pointer \\
\text{LV} & \quad \text{V2, RB} \\
\text{DADDU} & \quad \text{RB, RB, R2} \\
\text{ADDV.D} & \quad \text{V3, V1, V2} \\
\text{SV} & \quad \text{V3, RC} \\
\text{DADDU} & \quad \text{RC, RC, R2} \\
\text{DSUBU} & \quad \text{N, N, R1} & \# \ Subtract \ elements \\
\text{LI} & \quad \text{R1, 64} \\
\text{MTC1} & \quad \text{VLR, R1} & \# \ Reset \ full \ length \\
\text{BGTZ} & \quad \text{N, loop} & \# \ Any \ more \ to \ do? \\
\end{align*}
\]
Vector Conditional Execution

Problem: Want to vectorize loops with conditional code:

```c
for (i = 0; i < N; i++)
    if (A[i] > 0) then
        A[i] = B[i];
```

Solution: Add vector mask (or flag) registers
- vector version of predicate registers, 1 bit per element

...and maskable vector instructions
- vector operation becomes NOP at elements where mask bit is clear

Code example:

```c
CVM # Turn on all elements
LV vA, rA # Load entire A vector
SGTVS.D vA, F0 # Set bits in mask register where A>0
LV vA, rB # Load B vector into A under mask
SV vA, rA # Store A back to memory under mask
```
Masked Vector Instructions

Simple implementation
- execute all N operations, turn off result writeback according to mask

Density-time implementation
- scan mask vector and only execute elements with non-zero masks

\[C[i] = A[i] + B[i]\]
Vector Scatter/Gather

Want to vectorize loops with indirect accesses:

```c
for (i = 0; i < N; i++)
    A[i] = B[i] + C[D[i]]
```

Indexed load instruction (Gather)

```plaintext
LV vD, rD       # Load indices in D vector
LVI vC, rC, vD  # Load indirect from rC base
LV vB, rB       # Load B vector
ADDV.D vA, vB, vC # Do add
SV vA, rA       # Store result
```

*Is this a correct translation?*
Vector Scatter/Gather

Scatter example:

```
for (i = 0; i < N; i++)
    A[B[i]]++;
```

Is the following a correct translation?

```
LV vB, rB       # Load indices in B vector
LVI vA, rA, vB  # Gather initial A values
ADDV vA, vA, 1 # Increment
SVI vA, rA, vB  # Scatter incremented values
```
A Later-Generation Vector Super:
NEC SX-6 (2003)

- CMOS Technology
  - 500 MHz CPU, fits on single chip
  - SDRAM main memory (up to 64GB)

- Scalar unit
  - 4-way superscalar
  - with out-of-order and speculative execution
  - 64KB I-cache and 64KB data cache

- Vector unit
  - 8 foreground VRegs + 64 background VRegs (256x64-bit elements/vReg)
  - 1 multiply unit, 1 divide unit, 1 add/shift unit, 1 logical unit, 1 mask unit
  - 8 lanes (8 GFLOPS peak, 16 FLOPS/cycle)
  - 1 load & store unit (32x8 byte accesses/cycle)
  - 32 GB/s memory bandwidth per processor

- SMP structure
  - 8 CPUs connected to memory through crossbar
  - 256 GB/s shared memory bandwidth (4096 interleaved banks)
NEC Vector Machines

<table>
<thead>
<tr>
<th>System</th>
<th>Introduction</th>
<th>Max. CPUs</th>
<th>Peak CPU double precision GFLOPS</th>
<th>Peak system GFLOPS</th>
<th>Max. main memory</th>
<th>System memory B/W (GB/s)</th>
<th>Memory B/W per CPU (GB/s)</th>
</tr>
</thead>
<tbody>
<tr>
<td>SX-7</td>
<td>2002</td>
<td>32</td>
<td>8.83</td>
<td>282</td>
<td>256 GB</td>
<td>1129</td>
<td>35.3</td>
</tr>
<tr>
<td>SX-8[17][11]</td>
<td>2005</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>SX-8R[17][11]</td>
<td>2006</td>
<td>8</td>
<td>35.2</td>
<td>281.6</td>
<td>256 GB</td>
<td>563.2</td>
<td>70.4</td>
</tr>
<tr>
<td>SX-ACE[17][11]</td>
<td>2013</td>
<td>1</td>
<td>256</td>
<td>256</td>
<td>1 TB</td>
<td>256</td>
<td>256</td>
</tr>
<tr>
<td>SX-Aurora TSUBASA[17][11]</td>
<td>2017</td>
<td>8</td>
<td>2450</td>
<td>19600</td>
<td>8×48GB</td>
<td>8×1200</td>
<td>1200</td>
</tr>
<tr>
<td>SX-Aurora TSUBASA Type 20[17][11]</td>
<td>2021</td>
<td>8</td>
<td>3070</td>
<td>24560</td>
<td>8×48GB</td>
<td>8×1530</td>
<td>1530</td>
</tr>
</tbody>
</table>

Multimedia/SIMD Extensions

- Short vectors added to existing general-purpose ISAs
- Idea first used on Lincoln Labs TX-2 computer in 1957, with 36b datapath split into 2x18b or 4x9b
- Recent incarnations initially reused existing registers
  - e.g., 64-bit registers split into 2x32bits or 4x16bits or 8x8bits
- Trend towards larger vector support in microprocessors
  - e.g. x86:
    - MMX (64 bits)
    - SSEEx (128 bits)
    - AVX (256 bits)
    - AVX-512 (512 bits/masks)

Figure 1. MMX technology data types: packed byte (a), packed word (b), packed doubleword (c), and quadword (d).
Intel SIMD Evolution

Implementations:

- **Intel MMX (1996) – 64bits**
  - Eight 8-bit integer ops, or
  - Four 16-bit integer ops
  - Two 32-bit integer ops

- **Streaming SIMD Extensions (SSE) (1999) – 128bits**
  - Four 32-bit integer ops (and smaller integer types)
  - Four 32-bit integer/fp ops, or
  - Two 64-bit integer/fp ops

- **Advanced Vector Extensions (2010) – 256bits**
  - Four 64-bit integer/fp ops (and smaller fp types)

- **AVX-512 (2017) – 512bits**
  - New instructions: scatter/gather, mask registers
Multimedia Extensions vs Vectors

• Limited instruction set
  – No vector length control
  – Usually no masks
  – Up until recently, no strided load/store or scatter/gather
  – Unit-stride loads must be aligned to 64/128-bits

• Limited vector register length
  – requires superscalar dispatch to keep units busy
  – loop unrolling to hide latencies increases register pressure

• Trend towards fuller vector support
  – Better support for misaligned memory accesses
  – Support of double-precision (64-bit floating-point)
  – Support for masked operations
Knights Landing (KNL) CPU

- 2-wide decode/retire
- 6-wide execute
- 72-entry ROB
- 64B cache ports
- 2 load/1 store ports
- Fast unaligned access
- Fast scatter/gather
- OoO int/fp RS
- In-order mem RS
- 4 thread SMT
- Many shared resources
  - ROB, rename buffer, RS: dynamically partitioned
- Several thread choosers

Knights Landing (KNL) Mesh

Mesh of Rings
- Rows/columns (half) ring
- YX routing
- Message arbitration on:
  - Injection
  - Turns

Cache Coherent Interconnect
- MESIF protocol
- Distributed directory
  - to filter snoops

Partitioning modes
- All-to-all
- Quadrant
- Sub-NUMA

Thank you!

Next Lecture: GPUs