**Constructive Computer Architecture** 

Symmetric Multiprocessors: Synchronization and Sequential Consistency

Arvind

Computer Science & Artificial Intelligence Lab.

L24-1

Massachusetts Institute of Technology



## Synchronization

needed even in single-processor systems

- The need for synchronization arises whenever there are parallel processes in a system
  - Forks and Joins: A parallel process may want to wait until several events have occurred
  - Producer-Consumer: A consumer process must wait until the producer process has produced data
  - Mutual Exclusion: Operating system has to ensure that a resource is used by only one process at a given time

P2

join

**P1** 

producer

consumer



# A Producer-Consumer

#### Example continued

Producer posting Item x: Load R<sub>tail</sub>, (tail) 1 Store (R<sub>tail</sub>), x R<sub>tail</sub>=R<sub>tail</sub>+1 2 Store tail, R<sub>tail</sub>

Can the tail pointer get updated before the item x is stored?

Consumer: Load  $R_{head}$ , head spin: Load  $R_{tail}$ , tail 3 if  $R_{head} = = R_{tail}$  goto spin Load R,  $(R_{head})$  4  $R_{head} = R_{head} + 1$ Store head,  $R_{head}$ process(R)

Programmer assumes that if 3 happens after 2, then 4 happens after 1.

Problem sequences are:

## Sequential Consistency

A Memory Model



M

"A system is *sequentially consistent* if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in the order specified by the program"

Leslie Lamport

Sequential Consistency = arbitrary order-preserving interleaving of memory references of sequential programs

#### Sequential Consistency

Sequential concurrent tasks: T1, T2 Shared variables: X, Y (initially X = 0, Y = 0)



what are the legitimate answers for X' and Y'?

 $(X',Y') \in \{(1,2), (0,0), (1,0), (0,2)\}$ ?

If y is 2 then x cannot be 1

November 23, 2015 http://www.csg.csail.mit.edu/6.175

L24-7

#### Sequential Consistency



## **Store Buffers**



# Violations of SC

Example 1

| Process 1                                 | Process 2                                 |
|-------------------------------------------|-------------------------------------------|
| Store flag <sub>1</sub> ,1;               | Store flag <sub>2</sub> ,1;               |
| Load r <sub>1</sub> , flag <sub>2</sub> ; | Load r <sub>2</sub> , flag <sub>1</sub> ; |

*Question:* Is it possible that  $r_1=0$  and  $r_2=0$ ? • Sequential consistency: No

 Suppose Stores don't leave the store buffers before the Loads are executed: Yes !

Total Store Order (TSO): IBM 370, Sparc's TSO memory model, x86

Initially, all memory locations contain zeros



## Violations of SC

Example 3: Non-Blocking Caches

| Process 1      | Process 2                   |
|----------------|-----------------------------|
| Store a, 1;    | Load r <sub>1</sub> , flag; |
| Store flag, 1; | Load r <sub>2</sub> , a;    |

*Question:* Is it possible that  $r_1=1$  but  $r_2=0$ ? • Sequential consistency: No

- Assuming stores are ordered:
  - Yes because Loads can be reordered

Sparc's RMO, PowerPC's WO, Alpha

L24-12

## Memory Model Issue

- Architectural optimizations that are correct for uniprocessors, often violate sequential consistency and result in a new memory model for multiprocessors
- Memory model issues are subtle and contentious because most ISA specifications (X86, ARM, PowerPC, Sparc, MIPS) are ambiguous

For the rest of the lecture we will assume the architecture is SC and focus on synchronization issues

L24-13



## Locks or Semaphores

E. W. Dijkstra, 1965

Process i lock(s) <critical section> unlock(s) The execution of the critical section is protected by lock s. Only one process can hold the lock.

L24-15

Suppose the lock s can have only two values:

- s=0 means that no process has the lock
- s=1 means that exactly one process has the lock and therefore can access the critical section
- Once a process successfully acquires a lock, it executes the critical section and then sets s to zero by executing unlock(s)
- Implementation of locks is quite difficult using just Loads and Stores. ISAs provide special atomic instructions to implement locks

http://www.csg.csail.mit.edu/6.175

November 23, 2015

# atomic read-modify-write

### instructions

m is a memory location, R is a register

Test&Set m, R: R ← M[m]; *if* R==0 *then* M[m] ← 1;

Location m can be set to one only if it contains a zero



Location m is first read and then set to the new value; the old value is returned in a register

L24-16



# Nonblocking Synchronization

Load-reserve & Store-conditional

Special register(s) to hold reservation flag and address, and the outcome of store-conditional

Load-reserve R, m: <flag, adr>  $\leftarrow$  <1, m>; R  $\leftarrow$  M[m];

Store-conditional m, R: *if* <flag, adr> == <1, m> *then* cancel other procs' reservation on m;  $M[m] \leftarrow R;$ status  $\leftarrow$  succeed; *else* status  $\leftarrow$  fail;

| try:<br>spin:     | Load-reserve $R_{head}$ , head<br>Load $R_{tail}$ , tail<br>if $R_{head} = = R_{tail}$ goto spin<br>Load R, ( $R_{head}$ )<br>$R_{head} = R_{head} + 1$<br>Store-conditional head, $R_{head}$<br>if (status==fail) goto try | The corresponding<br>instructions in RISC<br>V are called Ir and<br>sc, respectively |
|-------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------|
| November 23, 2015 | process(R)<br>http://www.csg.csail.mit.edu/6.175                                                                                                                                                                            | L24-18                                                                               |

#### Nonblocking Synchronization

Load-reserve R, (m): <flag, adr>  $\leftarrow$  <1, m>; R  $\leftarrow$  M[m];

Store-conditional (m), R: *if* <flag, adr> == <1, m> *then* cancel other procs' reservation on m;  $M[m] \leftarrow R;$ status  $\leftarrow$  succeed; *else* status  $\leftarrow$  fail;

The flag is cleared in other processors on a Store using the CC protocol's invalidation mechanism

- Usually address m is not remembered by Load-reserve; the flag is cleared on any invalidation
  - works as long as the Load-reserve instructions are not used in a nested manner
- These instructions won't work properly if Loads and Stores can be dynamically reordered

## Memory Fences

Instructions to sequentialize memory accesses

Processors with *weak* or non-sequentially-consistent *memory models* need to provide *memory fence* instructions to force the serialization of memory accesses

Producer posting Item x: Load R<sub>tail</sub>, (tail) Store (R<sub>tail</sub>), x Membar<sub>SS</sub> R<sub>tail</sub>=R<sub>tail</sub>+1 Store tail, R<sub>tail</sub>

*ensures that tail ptr is not updated before x has been stored* 

November 23, 2015

Consumer:

Load  $R_{head}$ , (head) spin: Load  $R_{tail}$ , (tail) if  $R_{head} = = R_{tail}$  goto spin Membar<sub>LL</sub> Load R, ( $R_{head}$ )  $R_{head} = R_{head} + 1$ Store head,  $R_{head}$ process(R)

RISC-V has one instruction called "fence"

http://www.csg.csail.mit.edu/6.175