# Sequential Consistency and Cache Coherence Protocols

1

### Joel Emer Computer Science and Artificial Intelligence Lab M.I.T.

## Synchronization

The need for synchronization arises whenever there are parallel processes in a system *(even in a uniprocessor system)* 

- Forks and Joins: A parallel process may want to wait until several events have occurred
- Producer-Consumer: A consumer process must wait until the producer process has produced data
- Exclusive use of a resource: Operating system has to ensure that only one process uses a resource at a given time



## A Producer-Consumer Example



Producer posting Item x: Load  $R_{tail}$ , (tail) Store  $(R_{tail})$ , x  $R_{tail}=R_{tail}+1$ Store tail,  $R_{tail}$ 

The program is written assuming instructions are executed in order.

Consumer: Load  $R_{head}$ , (head) spin: Load  $R_{tail}$ , (tail) if  $R_{head} = = R_{tail}$  goto spin Load R, ( $R_{head}$ )  $R_{head} = R_{head} + 1$ Store head,  $R_{head}$ process(R)

Problems?

5/7/2014

http://www.csg.csail.mit.edu/6.823

# A Producer-Consumer Example

Producer posting Item x:Load  $R_{tail}$ , (tail)1Store  $(R_{tail})$ , x $R_{tail} = R_{tail} + 1$ 2Store tail,  $R_{tail}$ 

*Can the tail pointer get updated before the item x is stored?* 

Consumer: Load  $R_{head}$ , (head) spin: Load  $R_{tail}$ , (tail) 3 if  $R_{head} = R_{tail}$  goto spin Load R, ( $R_{head}$ ) 4  $R_{head} = R_{head} + 1$ Store head,  $R_{head}$ process(R)

Programmer assumes that if 3 happens after 2, then 4 happens after 1.

Problem sequences are: 2, 3, 4, 1 4, 1, 2, 3

### Sequential Consistency A Memory Model



" A system is *sequentially consistent* if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in the order specified by the program" *Leslie Lamport* 

Sequential Consistency = arbitrary order-preserving interleaving of memory references of sequential programs

5/7/2014

http://www.csg.csail.mit.edu/6.823

## Memory Consistency in SMPs



Suppose CPU-1 updates A to 200. *write-back:* memory and cache-2 have stale values *write-through:* cache-2 has a stale value

### *Do these stale values matter? What is the view of shared memory for programming?*

April 30, 2014

http://www.csg.csail.mit.edu/6.823

## Write-back Caches & SC



## Write-through Caches & SC



Write-through caches don't preserve sequential consistency either

April 30, 2014

http://www.csg.csail.mit.edu/6.823

## Maintaining Sequential Consistency

*Motivation:* We can do without locks -- SC is sufficient for writing producer-consumer and mutual exclusion codes (e.g., Dekker)

*Problem:* SC requires all processors to see writes occur in the same order, but multiple copies of a location in various caches can cause this to be violated.

To meet the ordering requirement it is sufficient for hardware to ensure:

- Only one processor at a time has write permission for a location
- No processor can load a stale copy of the location after a write

#### $\Rightarrow$ cache coherence protocols

## A System with Multiple Caches



- Modern systems often have hierarchical caches
- Each cache has exactly one parent but can have zero or more children
- Only a parent and its children can communicate directly
- *Inclusion property* is maintained between a parent and its children, i.e.,

$$a \in L_i \qquad \Rightarrow \qquad a \in L_{i+1}$$

## Cache Coherence Protocols for SC

#### write request:

the address is *invalidated* in all other caches *before* the write is performed, *or* 

the address is *updated* in all other caches *after* the write is performed

#### read request:

if a dirty copy is found in some cache, that is the value that must be used, e.g., by doing a write-back and reading the memory or forwarding that dirty value directly to the reader.

> *We will focus on Invalidation protocols as opposed to Update protocols*

## Shared Memory Multiprocessor



# Watch (snoop on) bus to keep all processors' view of memory coherent

http://www.csg.csail.mit.edu/6.823

## Snoopy Cache Goodman 1983

 Idea: Have the cache watch (or snoop upon) data transfers, and then "do the right thing". Thus, memory operations are atomic with respect to all the caches.



Note: Snoopy cache tags have increased demand – often they are dual-ported

## Intervention



When a read-miss for A occurs in cache-2, a read request for A is placed on the bus

Cache-1 needs to supply data

• The memory may respond to the request also! *Does memory know it has stale data?* No, Cache-1 needs to intervene through memory controller to supply correct data to cache-2

## **Snoopy Cache Actions**

| <b>Observed Bus Cycle</b> | Cache State                                                  | Cache Action                               |
|---------------------------|--------------------------------------------------------------|--------------------------------------------|
| Remote Read               | Address not cached<br>Cached, unmodified<br>Cached, modified | No action<br>No action<br>Cache Intervenes |
| Remote Write              | Address not cached<br>Cached, unmodified<br>Cached, modified | No action<br>Cache Purges Copy<br>??????   |

### Cache State Transition Diagram The MSI protocol



April 30, 2014

L22-16

## 2 Processor Example





April 30, 2014

http://www.csg.csail.mit.edu/6.823

## Observation



- If a line is in the M state then no other cache can have a copy of the line!
  - Memory stays coherent,
  - multiple differing copies cannot exist

April 30, 2014

# MESI: An Enhanced MSI protocol increased performance for private data



## 2 Processor Example



state blk addr data0 data1 ... dataN

A cache block contains more than one word and cache-coherence is done at the block-level and not word-level

Suppose  $P_1$  writes word<sub>i</sub> and  $P_2$  writes word<sub>k</sub> and both words have the same block address.

What can happen? The block may be invalidated (ping pong) many times unnecessarily because the addresses are in same block.

## CC and Synchronization Performance Issue - 2



Cache-coherence protocols will cause mutex to *ping-pong* between P1's and P2's caches.

Ping-ponging can be reduced by first reading the mutex location (non-atomically) and executing a swap only if it is found to be zero.

April 30, 2014

http://www.csg.csail.mit.edu/6.823

### CC and Bus Occupancy Performance Issue - 3

In general, an atomic *read-modify-write* instruction requires two memory (bus) operations without intervening memory operations by other processors

In a multiprocessor setting, bus needs to be locked for the entire duration of the atomic read and write operation

- $\Rightarrow$  expensive for simple buses
- $\Rightarrow$  very expensive for split-transaction buses

modern processors use *load-reserve store-conditional* 

## Load-reserve & Store-conditional

Special register(s) to hold reservation flag and address, and the outcome of store-conditional

Load-reserve R, (a): <flag, adr>  $\leftarrow <1$ , a>; R  $\leftarrow$  M[a];

Store-conditional (a), R: *if* <flag, adr> == <1, a> *then* cancel other procs' reservation on a;  $M[a] \leftarrow <R>;$ status  $\leftarrow$  succeed; *else* status  $\leftarrow$  fail;

If the snooper sees a store transaction to the address in the reserve register, the reserve bit is set to 0

- Several processors may reserve `a' simultaneously
- These instructions are like ordinary loads and stores with respect to the bus traffic

The total number of memory (bus) transactions is not necessarily reduced, but splitting an atomic instruction into load-reserve & storeconditional:

- increases bus utilization (and reduces processor stall time), especially in splittransaction buses
- reduces cache ping-pong effect because processors trying to acquire a semaphore do not have to perform stores each time

## 2-Level On-chip Caches



- Inclusion property: entries in L1 must be in L2 invalidation in L2  $\Rightarrow$  invalidation in L1
- Does snooping on L2 affect CPU-L1 bandwidth?
  - yes -- to check if a dirty copy is stored in L1
- How can this be avoided?
  - Write-through L1 cache

April 30, 2014

http://www.csg.csail.mit.edu/6.823

- 1. The memory operations of each individual processor appear to all processors in the order the requests are made to the memory.
  - Provided by cache coherence, which ensures that all processors observe the same order of loads and stores to an address
- 2. Any execution is the same as if the operations of all the processors were executed in some sequential order
  - Provided by enforcing a dependence between each memory operation and the following one.

## SC Data Dependence

- Stall
  - Use in-order execution with blocking cache
    - Cache coherence plus allowing a processor to have only one request in flight at a time will provide SC
- Change architecture ⇒ Relaxed memory models
  - Use OOO and non-blocking caches
    - Cache coherence and allowing multiple requests (different addresses) concurrently gives high performance, then add fence operations to force ordering when needed
- Speculate...

## Sequential Consistency Speculation

- Local load-store ordering uses standard OOO mechanism
- Globally <u>non-speculative</u> stores
  - Stores execute at commit -> stores are in-order!
- Globally <u>speculative</u> loads
  - Guess at issue that the memory location used by a load will not change between issue and commit of the instruction
    - this is equivalent to loads happening in-order at commit
  - **Check** at commit by remembering all loads addresses starting at issue and watching for writes to that location.
  - Data Management for rollback relies on the basic out-of-order speculative data management used for uni-processor rollback and instruction re-execution.

L22-29



Next lecture:

How to design a cache coherence protocol