

# 6.823 Pin Optimizations

Adapted from: Prior 6.823 offerings, and Intel's Tutorial at CGO 2010

# From the Video tutorial... What is Instrumentation?



 Instrumentation is a technique that inserts extra code into a program to collect runtime information

PIN does dynamic binary instrumentation

Runtime

No need to re-compile or re-link

Instrumentation: Instruction Count

Let's increment counter by one before every instruction!



**Analysis routine** 

Instrumentation routine

```
counter++;
sub $0xff, %edx
counter++;
cmp %esi, %edx
counter++;
jle <L1>
counter++;
mov $0x1, %edi
counter++;
add $0x10, %eax
```



#### Instrumentation vs. Analysis

- Instrumentation routines define where instrumentation is inserted
  - Occurs immediately before an instruction is executed for the first time.

- Analysis routines define what to do when instrumentation is activated
  - Occurs every time an instruction is executed



#### **How to Write Efficient Pintools**

#### Reducing Instrumentation Overhead



Total Overhead = Pin's Overhead + Pintool's Overhead

- The job of Pin developers to minimize this
- ~5% for SPECfp and ~20% for SPECint

Pintool writers can help minimize this!

# Reducing Pintool's Overhead



Pintool's Overhead

Instrumentation Routines Overhead + Analysis Routines Overhead

Frequency of calling an Analysis Routine x Work required in the Analysis Routine



- Instrumentation with Pin can be done at 3 different granularities:
  - Instruction
  - Basic block
    - A sequence of instructions terminated at a (conditional or unconditional) control-flow changing instruction
    - Single entrance, single exit
  - Trace
    - A sequence of basic blocks terminated at an unconditional control-flow changing instruction
    - Single entrance, multiple exits



Instrumentation with Pin can be done at 3 different

granularities:

- Instruction
- Basic block
  - A sequence of instruct unconditional) control
  - Single entrance, single
- Trace
  - A sequence of basic block
     changing instruction
  - Single entrance, multiple exits

```
sub $0xff, %edx
cmp %esi, %edx
jle <L1>
mov $0x1, %edi
add $0x10, %eax
jmp <L2>
```

W



Instrumentation with Pin can be done at 3 different

6 insts

granularities:

Instruction

- Basic block
  - A sequence of instruct unconditional) control
  - Single entrance, single
- Trace
  - A sequence of basic block
     changing instruction
  - Single entrance, multiple exits

sub \$0xff, %edx
cmp %esi, %edx
jle <L1>
mov \$0x1, %edi
add \$0x10, %eax
jmp <L2>

W



Instrumentation with Pin can be done at 3 different

granularities:

- Instruction
- Basic block
  - A sequence of instruct unconditional) control
  - Single entrance, single
- Trace
  - A sequence of basic block
     changing instruction
  - Single entrance, multiple exits

```
6 insts, 2 basic blocks
sub $0xff, %edx
cmp %esi, %edx
jle <L1>
```

```
mov $0x1, %edi
add $0x10, %eax
jmp <L2>
```

W



Instrumentation with Pin can be done at 3 different

granularities:

- Instruction
- Basic block
  - A sequence of instruct unconditional) control
  - Single entrance, single
- Trace
  - A sequence of basic block
     changing instruction
  - Single entrance, multiple exits

```
6 insts, 2 basic blocks, 1 trace
sub $0xff, %edx
cmp %esi, %edx
jle <L1>
```

```
mov $0x1, %edi
add $0x10, %eax
jmp <L2>
```





```
counter++;
sub $0xff, %edx
counter++;
cmp %esi, %edx
counter++;
jle <L1>
counter++;
mov $0x1, %edi
counter++;
add $0x10, %eax
```





```
counter++;
sub $0xff, %edx
```

Straightforward, but the counting can be more efficient

```
counter++;
mov $0x1, %edi
counter++;
add $0x10, %eax
```

#### **Faster Instruction Count**



```
counter += 3
sub $0xff, %edx
     %esi, %edx
cmp
jle
     <L1>
                        basic blocks (bbl)
counter += 2
      $0x1, %edi
mov
     $0x10, %eax
add
```

```
#include <stdio.h>
#include "pin.H"
UINT64 icount = 0;
                                                        analysis routine
void docount(INT32 c) { icount += c; }
void Trace(TRACE trace, void *v) {
   for (BBL bbl = TRACE BblHead(trace);
        BBL Valid(bbl); bbl = BBL Next(bbl)) {
        BBL InsertCall(bbl, IPOINT BEFORE, (AFUNPTR) docount,
                       IARG UINT32, BBL NumIns(bbl), IARG END);
                                                  instrumentation routine
void Fini(INT32 code, void *v) {
   fprintf(stderr, "Count %lld\n", icount);
int main(int argc, char * argv[]) {
   PIN Init(argc, argv);
   TRACE AddInstrumentFunction(Trace, 0);
   PIN AddFiniFunction(Fini, 0);
   PIN StartProgram();
   return 0;
 9/24/2021
                                 6.823 Fall 2021
                                                                         16
```





- Key:
  - Instrument at the largest granularity whenever possible:
    - Trace > Basic Block > Instruction

# Reducing Pintool's Overhead



Pintool's Overhead

Instrumentation Routines Overhead + Analysis Routines Overhead

Frequency of calling an Analysis Routine X Work required in the Analysis Routine

# Reducing Pintool's Overhead



Pintool's Overhead

Instrumentation Routines Overhead + Analysis Routines Overhead

Frequency of calling an Analysis Routine X Work required in the Analysis Routine

Work required for transiting to Analysis Routine + Work done inside Analysis Routine

#### Example: Counting Control Flow Edges



9/24/2021

# Example: Counting Control Flow Edges



#### Edge Counting: a Slower Version



```
void docount2 (ADDRINT src, ADDRINT dst, INT32 taken)
  COUNTER *pedg = Lookup(src, dst);
  pedg->count += taken;
void Instruction(INS ins, void *v) {
   if (INS IsBranchOrCall(ins)) {
       INS InsertCall(ins, IPOINT BEFORE, (AFUNPTR) docount2,
                       IARG INST PTR LARG BRANCH TARGET ADDR
                       IARG BRANCH TAKEN, IARG END);
                          1 if taken, 0 if not taken
```



# Inefficiency in Program

- About every 5th instruction executed in a typical application is a branch.
- Edge lookup will be called whenever these instruction are executed
  - significant application slowdown
- Direct vs. Indirect Branches
  - Branch Address in instruction vs. Branch Address in Register
  - Static vs. Dynamic

#### Edge Counting: a Faster Version

```
void docount(COUNTER* pedge, INT32 taken) {
  pedg->count += taken;
void docount2(ADDRINT src, ADDRINT dst, INT32 taken) {
  COUNTER *pedg = Lookup(src, dst);
  pedg->count += taken;
void Instruction(INS ins, void *v) {
  if (INS IsDirectBranchOrCall(ins)) {
       COUNTER *pedg = Lookup(INS Address(ins),
                          INS DirectBranchOrCallTargetAddress(ins));
       INS InsertCall(ins, IPOINT BEFORE, (AFUNPTR) docount,
                   IARG ADDRINT, pedg, IARG BRANCH TAKEN, IARG END);
   } else if (INS IsBranchOrCall(ins))
       INS InsertCall(ins, IPOINT BEFORE, (AFUNPTR) docount2,
                       IARG INST PTR, IARG BRANCH TARGET ADDR,
                       IARG BRANCH TAKEN, IARG END);
                               6.823 Fall 2021
                                                                    24
```





```
void docount(COUNTER* pedge, INT32 taken)
{
   if (!taken)
      return;
   pedg->count++;
}
```

*VS.* 

```
void docount(COUNTER* pedge, INT32 taken)
{
    pedg->count += taken;
}
Can be inlined by Pin
```





- Key:
  - Shifting computation from Analysis Routines to Instrumentation Routines whenever possible



#### Some other optimizations...

- Reduce the number of arguments to analysis routine.
  - For example, instead of passing TRUE/FALSE, create 2 analysis functions.
- If an instrumentation can be inserted anywhere in a basic block:
  - Let Pin know via IPOINT\_ANYWHERE (used in BBL\_InsertCall())
  - Pin will find the best point to insert the instrumentation to minimize register spilling



28

#### Takeaways..

 Reduce frequency of calling analysis routines by instrumenting at the largest granularity whenever possible

 Reduce the amount of work done in analysis routines by shifting computation from Analysis Routines to Instrumentation Routines whenever possible



#### Lab 1 due in a week

- Design 3 different types of caches
  - Virtually Indexed, Virtually Tagged
  - Physically Indexed, Physically Tagged
  - Virtually Indexed, Physically Tagged
- Caches and Virtual Memory covered in Lectures 2-4

Remember to start early!





- Processor works with virtual addresses
  - If we grab the index and tag from the physical address, need address translation -> TLB access
  - Avoid this: virtually-addressed cache



Alternative: place the cache before the TLB



- Now, cache hits are fast
- Problem 1: Consider VA<sub>1</sub>(from process 1) and VA<sub>2</sub>(from process 2)
  - What if we context switch, and  $VA_1 == VA_2$ ?





- Problem 2: Consider VA<sub>1</sub>, VA<sub>2</sub> (not necessarily from different processes)
  - What if  $VA_1 = VA_2$ , but they map to the same physical address?



- Intuition: Physical tags solve Problem 1
- Solves Problem 2 as long as the index bits are the same between VA<sub>1</sub> and VA<sub>2</sub>





#### Tips

- Ask questions on Piazza.
- ssh <athenausername>@vlsifarm-0X.mit.edu or
- ssh <athenausername>@eecs-ath-4X.mit.edu
  - eecs-ath-4X machines are much more powerful
- Suggested reading on caches and virtual memory on the course website.