### 6.5930/1

Hardware Architectures for Deep Learning

# Vectorized Kernel Computation 

February 26, 2024

Joel Emer and Vivienne Sze

Massachusetts Institute of Technology, Electrical Engineering & Computer Science

## Goals of Today's Lecture

- Understand parallelism and improved efficiency through:
  - loop unrolling, and
  - vectorization


## Background Reading

- Vector architectures
  - Computer Architecture: A Quantitative Approach, 6th edition, by Hennessy and Patterson
    - Ch 4: p282-310, App G
    - Ch 4: p262-288, App G

These books and their online/e-book versions are available through MIT libraries.

## Fully Connected Computation



## Fully-Connected (FC) Layer - Flattened

- Matrix-Vector Multiply:
  - Multiply all inputs in all channels by a weight and sum
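
As a minimal sketch of the flattened computation (plain Python, with toy sizes chosen here purely for illustration), each output `o[m]` is the dot product of the flattened input with row `m` of the flattened filter array. The index expression `CHW*m + chw` follows the loop nests used later in the lecture; the function name `fc_flattened` is hypothetical.

```python
def fc_flattened(i, f, M, CHW):
    """Fully-connected layer as a matrix-vector multiply over flattened data.

    i: flattened input activations, length CHW
    f: flattened filter weights, length M*CHW (row m starts at CHW*m)
    """
    o = [0] * M
    for m in range(M):
        for chw in range(CHW):
            o[m] += i[chw] * f[CHW * m + chw]
    return o


# Toy sizes (assumptions for illustration): C=1, H=2, W=2, M=2.
i = [1, 2, 3, 4]
f = [1, 0, 0, 0,   # weights for output 0
     1, 1, 1, 1]   # weights for output 1
print(fc_flattened(i, f, M=2, CHW=4))  # [1, 10]
```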



## Filter Memory Layout



## Flattened FC Loops



## Fully-Connected (FC) Layer

[Figure: animation of the flattened FC computation. Filters (width CHW) are multiplied with the input fmaps (width 1) and accumulated into the output fmaps (width 1). The frames step through chw = 0, chw = 1, ..., chw = C\*H\*W - 1, then start again at chw = 0 and repeat the sweep.]

## Loop Iteration Overhead



How many MACs/cycle (ignoring stalls)?
Where is a major source of overhead?

## FC scalar computation



## Loop Unrolling (2chw)



## Fully Connected - Unrolled

```
        mv  r1, 0              # r1 holds m
mloop:  mul r3, r1, C*H*W      # r3 holds m*CHW
        mv  r2, 0              # r2 holds chw
        mv  r8, 0              # r8 holds psum (o[m])
xloop:  ld  r4, i(r2)          # r4 = i[chw]
        add r5, r2, r3         # r5 = CHW*m + chw
        ld  r6, f(r5)          # r6 = f[CHW*m + chw]
        mac r8, r4, r6         # r8 += i[chw] * f[CHW*m + chw]
        ld  r7, i+1(r2)        # r7 = i[chw + 1]
        ld  r9, f+1(r5)        # r9 = f[CHW*m + chw + 1]
        mac r8, r7, r9         # r8 += i[chw + 1] * f[CHW*m + chw + 1]
        add r2, r2, 2          # r2 = chw + 2
        blt r2, C*H*W, xloop
        st  r8, o(r1)
        add r1, r1, 1
        blt r1, M, mloop
```

How many MACs/cycle (ignoring stalls)?
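
One way to reason about this question is to count instructions in the inner loop. The sketch below assumes one instruction issues per cycle with no stalls, which is an idealization rather than a property of any particular machine. The 9-instruction count comes from the `xloop` body of the unrolled listing; the 6-instruction non-unrolled baseline is an assumption, since that listing is not reproduced here.

```python
def macs_per_cycle(instructions_per_iteration, macs_per_iteration):
    """MACs/cycle for a loop body, assuming one instruction issues per cycle
    and there are no stalls (an idealization, not a measured number)."""
    return macs_per_iteration / instructions_per_iteration


# Inner loop (xloop) of the unrolled code:
# ld, add, ld, mac, ld, ld, mac, add, blt -> 9 instructions, 2 MACs.
unrolled = macs_per_cycle(9, 2)

# Assumed non-unrolled baseline: ld, add, ld, mac, add, blt -> 6 instructions, 1 MAC.
baseline = macs_per_cycle(6, 1)

print(f"baseline: {baseline:.3f} MACs/cycle, unrolled: {unrolled:.3f} MACs/cycle")
```

Under these assumptions, unrolling helps because the index update and branch are shared by two MACs, but the loads still dominate the instruction count.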

## Fully-Connected (FC) Layer

[Figure: animation of the unrolled FC computation. Filters (width CHW) are multiplied with the input fmaps (width 1) and accumulated into the output fmaps (width 1), two positions per step: the frames show the pairs chw = 0,1; chw = 2,3; ...; chw = C\*H\*W - 2, C\*H\*W - 1.]

Can we incorporate this "pairing" into the architecture?

## Vector Programming Model

[Figure: register state of the vector programming model, showing scalar registers (up to r15) and vector registers (up to v15).]

- VLRMAX: number of elements in a vector register
- VLR: number of elements to use in an instruction

Vector Arithmetic Instructions

ADDV v3, v1, v2
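
The semantics of a vector arithmetic instruction can be sketched in a few lines of Python. This is an illustrative model only: the function name `addv` is hypothetical, and leaving elements beyond VLR at zero is an assumption of the model, since the slide does not say what happens to them.

```python
def addv(v1, v2, vlr):
    """Model of ADDV v3, v1, v2: elementwise add over the first vlr elements."""
    assert vlr <= len(v1) == len(v2)
    v3 = [0] * len(v1)  # elements beyond VLR stay 0 in this model only
    for k in range(vlr):
        v3[k] = v1[k] + v2[k]
    return v3


VLRMAX = 8                      # assumed register length for the example
v1 = list(range(VLRMAX))        # [0, 1, ..., 7]
v2 = [10] * VLRMAX
print(addv(v1, v2, vlr=4))      # [10, 11, 12, 13, 0, 0, 0, 0]
```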


## Compiler-based Vectorization



## Loop Unrolled



## Parallel with animation



## Fully Connected - Loop Permutation

```
int i[C*H*W]; # Input activations
int f[M*C*H*W]; # Filter Weights
int o[M]; # Output activations
for m in [0, M):
    for chw in [0, C*H*W, 2):
        o[m] += i[chw] * f[CHW*m + chw]
        o[m] += i[chw + 1] * f[CHW*m + chw + 1]
```


## FC - Permuted/Unrolled

```
// Loops permuted
for chw in [0, C*H*W):
    for m in [0, M):
        o[m] += i[chw] * f[CHW*m + chw]
```

```
// Unrolled inner loop
for chw in [0, C*H*W):
    for m in [0, M, 2):
        o[m] += i[chw] * f[CHW*m + chw]
        o[m+1] += i[chw] * f[CHW*(m+1) + chw]
```

Unrolled calculation
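
A quick way to convince yourself that permuting and unrolling the loops leaves the result unchanged is to run both orders on the same data. The sketch below uses plain Python with small integer inputs chosen for illustration; the function names are hypothetical, and the unrolled version assumes M is even.

```python
def fc_m_outer(i, f, M, CHW):
    """Original order: m outermost, chw innermost."""
    o = [0] * M
    for m in range(M):
        for chw in range(CHW):
            o[m] += i[chw] * f[CHW * m + chw]
    return o


def fc_permuted_unrolled(i, f, M, CHW):
    """chw outermost; inner m loop unrolled by 2 (assumes M is even)."""
    assert M % 2 == 0
    o = [0] * M
    for chw in range(CHW):
        for m in range(0, M, 2):
            o[m] += i[chw] * f[CHW * m + chw]
            o[m + 1] += i[chw] * f[CHW * (m + 1) + chw]
    return o


M, CHW = 4, 6
i = list(range(1, CHW + 1))
f = [(3 * k) % 7 - 3 for k in range(M * CHW)]
assert fc_m_outer(i, f, M, CHW) == fc_permuted_unrolled(i, f, M, CHW)
```

The permutation is legal because each `o[m]` is a sum whose terms can be accumulated in any order. With integers the results match exactly; with floating point the same accumulation order per `o[m]` is preserved here, so the results also match.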

## Parallel m animation



## FC - Permuted/Unrolled/Hoisted



## Fully Connected Computation

```
// Loop invariant hoisting of i[chw]
for chw in [0, C*H*W):
    i_chw = i[chw];
    for m in [0, M, 2):
        o[m] += i_chw * f[CHW*m + chw]
        o[m+1] += i_chw * f[CHW*(m+1) + chw]
```

Memory layout:

$$
\begin{array}{lll}
I[C_0 H_0 W_0] & I[C_0 H_0 W_1] & \ldots \\
I[C_0 H_1 W_0] & I[C_0 H_1 W_1] & \ldots \\
I[C_0 H_2 W_0] & I[C_0 H_2 W_1] & \ldots \\
I[C_1 H_0 W_0] & I[C_1 H_0 W_1] & \ldots \\
I[C_1 H_1 W_0] & I[C_1 H_1 W_1] & \ldots \\
I[C_1 H_2 W_0] & I[C_1 H_2 W_1] & \ldots
\end{array}
$$

Weights needed together are far apart.
What can we do?

$$
\begin{array}{lll}
F[M_1 C_0 H_0 W_0] & F[M_1 C_0 H_0 W_1] & \ldots \\
F[M_1 C_0 H_1 W_0] & F[M_1 C_0 H_1 W_1] & \ldots \\
F[M_1 C_0 H_2 W_0] & F[M_1 C_0 H_2 W_1] & \ldots
\end{array}
$$

## Einsum Rank Splitting

$$
\begin{gathered}
O_{m}=I_{chw} \times F_{m, chw} \\
O_{m} \rightarrow O_{m1 \times VL+m0} \rightarrow O_{m1, m0} \\
F_{m, chw} \rightarrow F_{m1 \times VL+m0, chw} \rightarrow F_{m1, m0, chw} \\
O_{m1, m0}=I_{chw} \times F_{m1, m0, chw}
\end{gathered}
$$
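
The rank split can be sketched in Python as an index mapping, m = m1 × VL + m0 with 0 ≤ m0 < VL. This is an illustration under the assumption that VL divides M; the names `split_rank` and `fc_split` are hypothetical, and the nested lists stand in for the split-rank tensors.

```python
def split_rank(m, VL):
    """Map flat index m to (m1, m0) with m = m1*VL + m0 and 0 <= m0 < VL."""
    return divmod(m, VL)


def fc_split(i, f, M, CHW, VL):
    """FC layer with rank m split into (m1, m0); assumes VL divides M."""
    assert M % VL == 0
    # f3[m1][m0][chw] and o2[m1][m0] are the split-rank views of f and o.
    f3 = [[[f[CHW * (m1 * VL + m0) + chw] for chw in range(CHW)]
           for m0 in range(VL)] for m1 in range(M // VL)]
    o2 = [[0] * VL for _ in range(M // VL)]
    for chw in range(CHW):
        for m1 in range(M // VL):
            for m0 in range(VL):
                o2[m1][m0] += i[chw] * f3[m1][m0][chw]
    # Flatten back: o[m1*VL + m0] = o2[m1][m0]
    return [o2[m1][m0] for m1 in range(M // VL) for m0 in range(VL)]


M, CHW, VL = 4, 3, 2
i = [1, 2, 3]
f = list(range(M * CHW))
print(split_rank(3, VL))            # (1, 1)
print(fc_split(i, f, M, CHW, VL))   # [8, 26, 44, 62]
```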

## FC - Layered Loops

```
// Level 2 loops
for chw in [0, C*H*W):
    for m1 in [0, M/VL):
        // Level 1 loops
        parallel_for m0 in [0, VL):
            o[m1][m0] += i[chw] * f[m1][m0][chw]
```

Flatten data structures:

```
// Level 2 loops
for chw in [0, C*H*W):
    for m1 in [0, M/VL):
        // Level 1 loops
        parallel_for m0 in [0, VL):
            o[m1*VL+m0] += i[chw] * f[VL*CHW*m1 + CHW*m0 + chw]
```


## FC - Layered Loops

```
// Level 2 loops
for chw in [0, C*H*W):
    i_chw = i[chw]                      // hoisted loop invariant
    for m1 in [0, M/VL):
        // Level 1 loops
        parallel_for m0 in [0, VL):
            o[m1*VL+m0] += i_chw * f[VL*CHW*m1 + CHW*m0 + chw]

// m1*VL and VL*CHW*m1 + chw are invariant in the inner loop,
// so the Level 1 loop becomes:
        // Level 1 loop
        m1VL = m1 * VL
        CHWVLm1_chw = CHW * VL * m1 + chw
        parallel_for m0 in [0, VL):
            o[m1VL+m0] += i_chw * f[CHWVLm1_chw + CHW*m0]
```
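
To check that the hoisted index arithmetic still addresses the same elements, here is a small Python sketch of the layered loops. It is illustrative only: the sequential `for m0` loop stands in for `parallel_for`, the function names are hypothetical, and VL is assumed to divide M.

```python
def fc_reference(i, f, M, CHW):
    """Straightforward flattened FC layer used as the reference."""
    return [sum(i[chw] * f[CHW * m + chw] for chw in range(CHW))
            for m in range(M)]


def fc_layered_hoisted(i, f, M, CHW, VL):
    """Layered loops with loop invariants hoisted (assumes VL divides M)."""
    assert M % VL == 0
    o = [0] * M
    for chw in range(CHW):
        i_chw = i[chw]                          # invariant in both inner loops
        for m1 in range(M // VL):
            m1VL = m1 * VL                      # invariant in the m0 loop
            CHWVLm1_chw = CHW * VL * m1 + chw   # invariant in the m0 loop
            for m0 in range(VL):                # stands in for parallel_for
                o[m1VL + m0] += i_chw * f[CHWVLm1_chw + CHW * m0]
    return o


M, CHW, VL = 4, 3, 2
i = [1, 2, 3]
f = list(range(M * CHW))
assert fc_layered_hoisted(i, f, M, CHW, VL) == fc_reference(i, f, M, CHW)
```

After hoisting, the only per-element work left in the innermost loop is one multiply-accumulate and an address that advances by a fixed stride of CHW, which is the access pattern a strided vector load expects.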


## Fully Connected - Vectorized

```
        mv   r1, 0             # r1 holds chw
        mv   r4, 0             # r4 holds CHWVLm1_chw
xloop:  ldv  v1, i(r1), 0      # fill v1 with i[chw] (stride 0)
        mv   r2, 0             # r2 holds m1VL
mloop:  ldv  v3, f(r4), CHW    # v3 holds f[] (stride CHW)
        ldv  v5, o(r2), 1      # v5 holds o[]
        macv v5, v1, v3        # o[] += i[] * f[]
        stv  v5, o(r2), 1      # store o
        add  r2, r2, VL        # update m1VL
        add  r4, r4, CHWVL     # update CHWVLm1_chw (strength reduced)
        blt  r2, M, mloop
        add  r1, r1, 1         # update chw
        mv   r4, r1            # reset CHWVLm1_chw to chw (m1 = 0)
        blt  r1, CHW, xloop
```

How many MACs/cycle (ignoring stalls)?
Can we unroll this to get even more?
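
The vectorized listing can be emulated in Python to check its addressing. The sketch below is a hypothetical model, not the ISA definition: `ldv`, `stv`, and `macv` are modelled as plain list operations on a word-addressed memory, a stride of 0 is taken to replicate one element across the vector, and VL is assumed to divide M.

```python
def ldv(mem, base, stride, VL):
    """Model of `ldv`: load VL elements starting at base, `stride` apart.
    Stride 0 replicates a single element across the vector."""
    return [mem[base + k * stride] for k in range(VL)]


def stv(vec, mem, base, stride):
    """Model of `stv`: store the vector elements `stride` apart."""
    for k, x in enumerate(vec):
        mem[base + k * stride] = x


def macv(acc, a, b):
    """Model of `macv`: elementwise acc + a * b."""
    return [c + x * y for c, x, y in zip(acc, a, b)]


def fc_vectorized(i, f, M, CHW, VL):
    assert M % VL == 0
    o = [0] * M
    for chw in range(CHW):                  # xloop
        v1 = ldv(i, chw, 0, VL)             # fill v1 with i[chw]
        r4 = chw                            # CHWVLm1_chw for m1 = 0
        for m1VL in range(0, M, VL):        # mloop
            v3 = ldv(f, r4, CHW, VL)        # VL weights, CHW apart
            v5 = ldv(o, m1VL, 1, VL)        # VL partial sums, unit stride
            v5 = macv(v5, v1, v3)
            stv(v5, o, m1VL, 1)
            r4 += CHW * VL                  # strength-reduced index update
    return o


print(fc_vectorized([1, 2, 3], list(range(12)), M=4, CHW=3, VL=2))  # [8, 26, 44, 62]
```

Each pass through the modelled `mloop` performs VL multiply-accumulates, but it also reloads and stores the partial sums, which is one place further unrolling or register blocking could reduce overhead.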

## Vector ISA Attributes

- Compact
  - one short instruction encodes $\mathbf{N}$ operations
  - many implicit bookkeeping/control operations
- Expressive, hardware knows the $\mathbf{N}$ operations:
  - are independent
  - use the same functional unit
  - access disjoint registers
  - access registers in same pattern as previous instructions
  - access a contiguous block of memory (unit-stride load/store)
  - access memory in a known pattern (strided load/store)

Vector instructions make "explicit" many things that are "implicit" with standard instructions

## Vector ISA Hardware Implications

- Large amount of work per instruction
  - → Less instruction fetch bandwidth requirements
  - → Allows simplified instruction fetch design
- Architecturally defined bookkeeping operations
  - → Bookkeeping can run in parallel with main compute
- Disjoint vector element accesses
  - → Banked rather than multi-ported register files
- No data dependence within a vector
  - → Amenable to deeply pipelined/parallel designs
- Known regular memory access pattern
  - → Allows for banked memory for higher bandwidth


## Vector Arithmetic Execution

- Use deep pipeline (=> fast clock) to execute element operations
- Simplifies control of deep pipeline because elements in vector are independent (=> no hazards!)


V3 <- V1 * V2

## Vector Instruction Execution

ADDV $C, A, B$, where $A, B, C$ are registers, e.g., $V 3, V 1$ and $V 2$


## Vector Unit Structure



## Vector Instruction Parallelism

Can overlap execution of multiple vector instructions

- example machine has 32 elements per vector register and 8 lanes

[Figure: overlapped execution of vector instructions on the Load Unit, Multiply Unit, and Add Unit.]


Complete 24 operations/cycle while issuing 1 short instruction/cycle
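
The 24 operations/cycle figure follows from simple arithmetic, sketched below in Python. The model assumes each functional unit processes one element per lane per cycle and that all three units (load, multiply, add) are busy at once; the function name is hypothetical.

```python
def vector_throughput(elements_per_register, lanes, busy_units):
    """Steady-state numbers for overlapped vector instructions.

    Assumes each functional unit processes `lanes` elements per cycle.
    Returns (cycles one instruction occupies a unit, operations per cycle).
    """
    cycles_per_instruction = elements_per_register // lanes
    ops_per_cycle = lanes * busy_units
    return cycles_per_instruction, ops_per_cycle


cycles, ops = vector_throughput(elements_per_register=32, lanes=8, busy_units=3)
print(cycles, ops)  # 4 24
```

Because each instruction keeps its unit busy for 4 cycles in this model, issuing at most one instruction per cycle is enough to keep all three units occupied.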

## ISA Datatypes

| Type | Sign bits (S) | Exponent bits (E) | Mantissa bits (M) | Range | Accuracy |
| :---: | :---: | :---: | :---: | :---: | :---: |
| FP32 | 1 | 8 | 23 | $10^{-38}-10^{38}$ | .000006\% |
| FP16 | 1 | 5 | 10 | $6 \times 10^{-5}-6 \times 10^{4}$ | .05\% |
| Int32 | 1 |  | 31 | $0-2 \times 10^{9}$ | 1/2 |
| Int16 | 1 |  | 15 | $0-3 \times 10^{4}$ | 1/2 |
| Int8 | 1 |  | 7 | 0-127 | 1/2 |
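
The range and accuracy entries can be reproduced from the bit widths. The sketch below assumes IEEE-754-style formats (standard bias, normalized values, round-to-nearest) with 8 exponent and 23 mantissa bits for FP32 and 5 and 10 for FP16, and sign-magnitude-style counting for the integers; the function names are hypothetical.

```python
def fp_properties(exp_bits, man_bits):
    """Largest finite value, smallest normal value, and relative rounding
    error for an IEEE-754-style binary format with round-to-nearest."""
    bias = 2 ** (exp_bits - 1) - 1
    largest = (2 - 2.0 ** -man_bits) * 2.0 ** bias
    smallest_normal = 2.0 ** (1 - bias)
    rel_error = 2.0 ** -(man_bits + 1)
    return largest, smallest_normal, rel_error


def int_max(total_bits):
    """Largest magnitude with one sign bit and total_bits - 1 magnitude bits."""
    return 2 ** (total_bits - 1) - 1


for name, e, m in [("FP32", 8, 23), ("FP16", 5, 10)]:
    largest, smallest, err = fp_properties(e, m)
    print(f"{name}: {smallest:.2e} to {largest:.2e}, accuracy {100 * err:.6f}%")

for bits in (32, 16, 8):
    print(f"Int{bits}: 0 to {int_max(bits)}")
```

For integers, rounding a real value to the nearest representable integer gives an absolute error of at most 1/2, which is how the Accuracy column reads for those rows.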

## Intel - MMX/SSE/AVX

| Extension | Width (bits) | Int8 | Int16 | Int32 | Int64 | FP16 | FP32 | FP64 | Features |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| MMX | 64 | 8 | 4 | 2 | 1 |  |  |  |  |
| SSE | 128 |  |  |  |  |  | 4 |  |  |
| SSE2 | 128 | 16 | 8 | 4 | 2 |  | 4 | 2 |  |
| SSE3 | 128 | 16 | 8 | 4 | 2 |  | 4 | 2 | R |
| AVX | 256 | 32 | 16 | 8 | 4 | 16 | 8 | 4 |  |
| AVX2 | 256 | 32 | 16 | 8 | 4 | 16 | 8 | 4 | GUMR |
| AVX3 | 512 | 64 | 32 | 16 | 8 | $?$ | 16 | 8 | GUMRP |


- G: gather
- U: unaligned
- M: MAC
- R: reductions/permutations
- P: predicate masks

Source: Myriad non-authoritative sources on web

## Python to C++ Chart

| Version | Implementation | Running time (s) | GFLOPS | Absolute speedup | Relative speedup | Fraction of peak |
| :--- | :--- | ---: | ---: | ---: | ---: | ---: |
| 1 | Python | 25,552.48 | 0.005 | 1 | - | 0.00% |
| 2 | Java | 2,372.68 | 0.058 | 11 | 10.8 | 0.01% |
| 3 | C | 542.67 | 0.253 | 47 | 4.4 | 0.03% |
| 4 | Parallel loops | 69.80 | 1.969 | 366 | 7.8 | 0.24% |
| 5 | Parallel divide-and-conquer | 3.80 | 36.180 | 6,727 | 18.4 | 4.33% |
| 6 | + vectorization | 1.10 | 124.914 | 23,224 | 3.5 | 14.96% |
| 7 | + AVX intrinsics | 0.41 | 337.812 | 62,806 | 2.7 | 40.45% |
| 8 | Strassen | 0.38 | 361.177 | 67,150 | 1.1 | 43.24% |

[Leiserson, There's plenty of room at the top, Science, 2020]

# Next Lecture: Roofline Analysis and Transforms 

## Thank you!

