#### VLIW/EPIC: Statically Scheduled ILP Joel Emer Computer Science & Artificial Intelligence Lab M.I.T. http://www.csg.csail.mit.edu/6.823 #### Little's Law Parallelism = Throughput \* Latency or $\overline{N} = \overline{T} \times \overline{L}$ #### Example Pipelined ILP Machine How much instruction-level parallelism (ILP) required to keep machine pipelines busy? $$\overline{T} = 6$$ $\overline{L} = \frac{(2x1 + 2x3 + 2x4)}{6} = 2\frac{2}{3}$ $\overline{N} = 6 \times 2\frac{2}{3} = 16$ ## Superscalar Control Logic Scaling - For in-order machines, L is related to pipeline latencies - For out-of-order machines, L also includes time spent in instruction buffers (instruction window or ROB) - As W increases, larger instruction window is needed to find enough parallelism to keep machine busy => greater L - $=> Out-of-order\ control\ logic\ grows\ faster\ than\ W^2\ (\sim W^3)$ # Out-of-Order Control Complexity: MIPS R10000 External Interface EXI Instruction Data Cache TLBData Data Cache addr Free Grad Address Unit List Queue Intege Register Datapa Integer Rename Queue FP FPMult [ SGI/MIPS Technologies Inc., 1995 ] Control Logic #### Sequential ISA Bottleneck ## VLIW: Very Long Instruction Word - Multiple operations packed into one instruction - Each operation slot is for a fixed function - Constant operation latencies are specified - Architecture requires guarantee of: - Parallelism within an instruction => no x-operation RAW check - No data use before data ready => no data interlocks #### VLIW Compiler Responsibilities #### The compiler: - Schedules to maximize parallel execution - Guarantees intra-instruction parallelism - Schedules to avoid data hazards (no interlocks) - Typically separates operations with explicit NOPs ## Early VLIW Machines #### FPS AP120B (1976) - scientific attached array processor - first commercial wide instruction machine - hand-coded vector math libraries using software pipelining and loop unrolling #### Multiflow Trace (1987) - commercialization of ideas from Fisher's Yale group including "trace scheduling" - available in configurations with 7, 14, or 28 operations/instruction - 28 operations packed into a 1024-bit instruction word #### Cydrome Cydra-5 (1987) - 7 operations encoded in 256-bit instruction word - rotating register file ## Loop Execution How many FP ops/cycle? 1 fadd / 8 cycles = 0.125 #### Loop Unrolling ``` for (i=0; i<N; i++) B[i] = A[i] + C; Unroll inner loop to perform 4 iterations at once B[i] = A[i] + C; B[i] = A[i] + C; B[i+1] = A[i+1] + C; B[i+2] = A[i+2] + C; B[i+3] = A[i+3] + C; ``` Is this code correct? No, need to handle values of N that are not multiples of unrolling factor with final cleanup loop # Scheduling Loop Unrolled Code Unroll 4 ways How many FLOPS/cycle? 4 fadds / 11 cycles = 0.36 # Software Pipelining 4 fadds / 4 cycles = 1 http://www.csg.csail.mit.edu/6.823 # Software Pipelining vs. Loop Unrolling Software pipelining pays startup/wind-down costs only once per loop, not once per iteration #### What if there are no loops? - Branches limit basic block size in control-flow intensive irregular code - Difficult to find ILP in individual basic blocks #### Trace Scheduling [ Fisher, Ellis] - Pick string of basic blocks, a trace, that represents most frequent branch path - Schedule whole "trace" at once - Add fixup code to cope with branches jumping out of trace How do we know which trace to pick? Use profiling feedback or compiler heuristics to find common branch paths #### Problems with "Classic" VLIW - Knowing branch probabilities - Profiling requires an significant extra step in build process - Object code size - instruction padding wastes instruction memory/cache - loop unrolling/software pipelining replicates code - Scheduling variable latency memory operations - caches and/or memory bank conflicts impose statically unpredictable variability - Scheduling for statically unpredictable branches - optimal schedule varies with branch path - Object-code compatibility - have to recompile all code for every machine, even for two machines in same generation ## **VLIW Instruction Encoding** - Schemes to reduce effect of unused fields - Compressed format in memory, expand on I-cache refill - used in Multiflow Trace - introduces instruction addressing challenge - Provide a single-op VLIW instruction - Cydra-5 UniOp instructions - Mark parallel groups - used in TMS320C6x DSPs, Intel IA-64 # Rotating Register Files Problems: Scheduled loops require lots of register names, Lots of duplicated code in prolog, epilog | | ld r1, () | | _ | |----------|----------------|----------------|----------------| | | add r2, r1, #1 | ld r1, () | | | | st r2, () | add r2, r1, #1 | ld r1, () | | | | st r2, () | add r2, r1, #1 | | | | | st r2, () | | | | | | | Prolog < | ld r1, () | | | | | ld r1, () | add r2, r1, #1 | | | Loop { | ld r1, () | add r2, r1, #1 | st r2, () | | Epilog < | | add r2, r1, #1 | st r2, () | | | | | st r2, () | | | | | | Solution: Allocate new set of registers for each loop iteration ## Rotating Register File Rotating Register Base (RRB) register points to base of current register set. Value added on to logical register specifier to give physical register number. Usually, split into rotating and non-rotating registers. #### Rotating Register File (Previous Loop Example) Four cycle fadd latency Three cycle load latency encoded as difference of 4 encoded as difference of 3 in register specifier in register specifier number (f9 - f5 = 4)number (f4 - f1 = 3) fadd f5, f4, ... Id f1, () sd f9, () bloop | Id P9, () | fadd P13, P12, | sd P17, () | bloop | |-----------|----------------|------------|-------| | ld P8, () | fodd P12, P11, | sd P16, () | bloop | | ld P7, () | fadd P11, P10, | sd P15, () | bloop | | Id P6, () | fadd P10, P9, | sd P14, () | bloop | | ld P5, () | fadd P9, P8, | sd P13, () | bloop | | ld P4, () | fadd P8, P7, | sd P12, () | bloop | | ld P3, () | fadd P7, P6, | 9d P11, () | bloop | | ld P2, () | fadd P6, P5, | sd P10, () | bloop | RRB=8 RRB=7 RRB=6 RRB=5 RRB=4 RRB=3 RRB=2 RRB=1 #### Cydra-5: Memory Latency Register (MLR) Problem: Loads have variable latency Solution: Let software choose desired memory latency - Compiler schedules code for maximum load-use distance - Software sets MLR to latency that matches code schedule - Hardware ensures that loads take exactly MLR cycles to return values into processor pipeline - Hardware buffers loads that return early - Hardware stalls processor if loads return late #### Intel EPIC IA-64 - EPIC is the style of architecture (cf. CISC, RISC) - Explicitly Parallel Instruction Computing - IA-64 is Intel's chosen ISA (cf. x86, MIPS) - IA-64 = Intel Architecture 64-bit - An object-code compatible VLIW - Itanium (aka Merced) is first implementation (cf. 8086) - First customer shipment expected 1997 (actually 2001) - McKinley, second implementation shipped in 2002 #### IA-64 Instruction Format Instruction 2 Instruction 1 Instruction 0 Template 128-bit instruction bundle - Template bits describe grouping of these instructions with others in adjacent bundles - Each group contains instructions that can execute in parallel ## IA-64 Registers - 128 General Purpose 64-bit Integer Registers - 128 General Purpose 64/80-bit Floating Point Registers - 64 1-bit Predicate Registers - GPRs rotate to reduce code size for software pipelined loops #### IA-64 Predicated Execution Problem: Mispredicted branches limit ILP Solution: Eliminate hard to predict branches with predicated execution - Almost all IA-64 instructions can be executed conditionally under predicate - Instruction becomes NOP if predicate register false March 7, 2014 Our basic block Sttp://www.csg.csail.mit.edu/6.823 Sanchez & Emer # Predicate Software Pipeline Stages #### Single VLIW Instruction | (p1) ld r1 | (p2) add r3 | (p3) st r4 | (p1) bloop | |------------|-------------|------------|------------| |------------|-------------|------------|------------| #### Dynamic Execution | (p1) ld r1 | | | (p1) bloop | |------------|-------------|------------|------------| | (p1) ld r1 | (p2) add r3 | | (p1) bloop | | (p1) ld r1 | (p2) add r3 | (p3) st r4 | (p1) bloop | | (p1) ld r1 | (p2) add r3 | (p3) st r4 | (p1) bloop | | (p1) ld r1 | (p2) add r3 | (p3) st r4 | (p1) bloop | | | (p2) add r3 | (p3) st r4 | (p1) bloop | | | | (p3) st r4 | (p1) bloop | Software pipeline stages turned on by rotating predicate registers → Much denser encoding of loops ## Fully Bypassed Datapath Where does predication fit in? ## IA-64 Speculative Execution Problem: Branches restrict compiler code motion Solution: Speculative operations that don't cause exceptions Can't move load above branch because might cause spurious exception Particularly useful for scheduling long latency loads early ## IA-64 Data Speculation Problem: Possible memory hazards limit code scheduling Solution: Instruction-based speculation with hardware monitor to check for pointer hazards Can't move load above store because store might be to same address Requires associative hardware in address check table #### Clustered VLIW Cache/Memory Banks - Divide machine into clusters of local register files and local functional units - Lower bandwidth/higher latency interconnect between clusters - Software responsible for mapping computations to minimize communication overhead - Exists in some superscalar processors, .e.g., Alpha 21264 - Common in commercial embedded processors, examples include TI C6x series DSPs, and HP Lx processor # Limits of Static Scheduling - Unpredictable branches - Variable memory latency (unpredictable cache misses) - Code size explosion - Compiler complexity #### Question: How applicable are the VLIW-inspired techniques to traditional RISC/CISC processor architectures? ## Thank you!