Issues in Building a Cache-Coherent Distributed Shared Memory Machine using Commercial SMPs

tbd

December 7, 1994

Abstract

Many research groups and companies, past and present, have expended a significant amount of effort designing cache-coherent shared memory systems for distributed memory machines. Cache-coherent shared memory is a desirable feature, giving the programmer, operating system, and compiler a uniform view of memory. The effort involved in designing and implementing this feature, however, has traditionally been very large and seems to have only grown as time goes on. Many of the recent efforts involve building large custom chips, some of which are almost as large as the microprocessors they support.

The StarT-ng project is an attempt at building a parallel machine that supports both fine-grained user-level message passing and cache-coherent distributed shared memory (CCDSM) using commercial symmetric multiprocessors (SMPs) as the basic building block. The goal is to leverage as much as possible from commercial systems, allowing us to concentrate on the parallel aspects of the machine. Since we wish to minimize the effort of implementing custom hardware, we naturally looked for ways to support global cache-coherent memory with as little custom hardware as possible. We achieve this by dedicating a stock processor, called the service processor or sP, on each SMP to maintaining coherence between caches. The service processor works in conjunction with a small custom device, called the address capture device (ACD), which sits on the memory bus and "captures" accesses to global memory. By having a stock processor do most of the work, we drastically reduce the amount of custom hardware needed to support CCDSM.

This paper describes the implementation of CCDSM on StarT-ng. We discuss some of the problems we encountered during the design; many of these problems are common to all CCDSM implementations. We also estimate the CCDSM performance of StarT-ng.

1 Introduction

Researchers generally agree that a global address space simplifies the task of parallel programming. Having a global address space (shared memory) simply means that a uniform addressing mechanism can be used by all processors to access both local and non-local data. For programming purposes, however, one also needs to specify a memory model, i.e., how memory accesses are viewed and ordered by processors in the system. The best-known memory model is Lamport's strongly consistent sequential model, in which only an interleaving of the sequential memory reference streams from the processors is permitted. For improved performance, people have designed "weaker" consistency models. There is yet another class of memory models, such as I-structures, which are motivated by concerns for parallelism. Though there is no general agreement on what is the best memory model for parallel programming, strong sequential consistency is a requirement for running most operating systems (including Unix) on Symmetric Multiprocessors (SMPs).

The StarT-ng project is about building an efficient message passing machine which is also suitable for experimenting with different memory models. An important goal of the project is to maximize the use of commercially available hardware and operating systems. For various technical and non-technical reasons we have chosen to connect PowerPC-based SMPs together by a fast network of our own design. Following the idea presented in [14], one processor in each SMP is dedicated to handling memory requests from other SMPs.
Though we expect to implement additional features required by different memory models in software, each SMP is augmented with a simple address capture device (ACD) to make the implementation of a global address space possible. In this paper we will discuss some of the interesting issues that had to be resolved to design such a machine.

The most important issue in the implementation of a memory model is whether several copies of a memory location are permitted or not. Currently, by far the most popular approach to reducing memory latency and bus bandwidth is to rely on caches. Since in a parallel system there are potentially multiple readers and writers of a cache-line, it is possible for a cache to have an out-of-date or "stale" copy of that cache-line. Protocols to keep the copies of replicated data consistent with each other, as dictated by a memory model, are a major concern in the implementation of any parallel machine built using commercial microprocessors.

However, there are machines without caches where each memory reference always goes to the physical memory location. In such systems, memory references to a location are automatically serialized in a fixed manner for all processors. The most notable example of such a machine is the Cray X-MP. Most dataflow and multi-threaded machines also do not cache global data (see, for example, HEP [19], Tera [2], EM-4 [17], and Monsoon [15]). Such machines rely on split-phase transactions to absorb memory latency, and provide efficient mechanisms to support word-level synchronization throughout the global memory. The major criticism of machines without caches is that they demand higher-bandwidth memory systems than machines with local caches. Some machines, such as the Cray T3D[], pass the burden of keeping the caches coherent on to the user-level software. This position, however, is not always desirable.

The oldest way to keep caches coherent comes from bus-based multiprocessors (dual-processor IBM 370 systems from the early seventies). The solution, which is also used in all current SMPs, is to "snoop" on the address bus during a cache miss. On a write operation, the cache-line is automatically invalidated in all other caches to maintain consistency. Bus-based shared memory machines, however, are not scalable because the bus bandwidth is shared among all the processors.

Cache-coherent distributed shared memory (CCDSM) systems have been advanced as a solution to the scalability problem. CCDSM systems consist of multiple snoopy buses connected via an interconnection network. Some form of directory-based cache-coherence protocol is used to maintain cache coherence across multiple sites. Machines in this class include the Stanford DASH[9, 10], the Kendall Square Research KSR[5], and the MIT Alewife[1], which are all operational, as well as machines under design such as the Stanford FLASH[8] and the University of Wisconsin Typhoon (a paper architecture). All these machines are characterized by a substantial amount of custom hardware to support directory-based cache coherence protocols.

The amount of hardware required to implement shared memory depends to a large degree on the desired performance on a class of applications. At one extreme, the underlying system supports only message passing. Some have written compilers or libraries that can give the illusion of shared memory. Those schemes, however, use global memory in a highly stylized manner and generally avoid coherence issues.
Others[11, 13, 12] have implemented coherent shared memory at page granularity by modifying the virtual memory manager to allow demand paging from remote nodes. These schemes work well in limited contexts and are not suitable for building general-purpose parallel programming models.

The StarT-ng project does provide a minimal amount of custom hardware (i.e., the ACD) to support directory-based cache-coherent shared memory on a network of SMPs. We hope to have extreme flexibility in experimenting with cache coherence protocols by using one of the processors in each SMP for protocol processing. Our approach will also allow us to explore the virtual memory management (VMM) aspects of CCDSM, and to pursue language-driven memory models such as I-structures. Since almost all of the CCDSM support is in software, performance is invariably going to be compromised. However, at this stage it is not clear how much improvement in overall performance can be expected from a more aggressive implementation.

Unlike traditional shared memory machines, where the entire machine runs one copy of the operating system (OS) which views the whole machine as an SMP, StarT-ng will run a separate copy of the OS on each SMP site, allowing us to use a stock operating system. We believe that the hardware resources at each SMP should be managed separately for fault tolerance(1) and scalability reasons. Coordination across multiple SMPs, such as for work scheduling or virtual memory management (VMM), will be done as a layer on top of the per-site OS.

Although this paper is about CCDSM on StarT-ng, the message passing capability of the machine is still of fundamental importance. The performance of the message passing system will have a direct impact on the performance of the CCDSM system. User-level message passing also provides additional capability: the arrival of messages can be used as a means of activating computation or performing synchronization. Other researchers[7] have pointed out the synergy between shared memory and message passing. StarT-ng, like Alewife and FLASH, will support both.

The rest of this paper is organized as follows. The next section provides some background for the rest of the paper. Section 3 describes StarT-ng and outlines how we plan to implement CCDSM; estimates for the performance of StarT-ng's CCDSM are included in that section. In Section 4, we discuss the issues that we encountered while implementing CCDSM, along with approaches for dealing with them. Most of these problems are not specific to StarT-ng. The main contributions of this paper are the exploration of very simple hardware for supporting cache-line-granularity cache coherence, the identification of the problems of this and other approaches, and the development of solutions for those problems.

2 CCDSM Concerns

Cache coherence on distributed systems is usually maintained with directory-based schemes. "Directory" information on every cache-line is kept at a "home" location. Typical directory states are Uncached (not cached by any site); Shared (cached by one or more sites and thus read-only at that moment); or Modified (temporarily owned by one site, which alone can read and write that cache-line). When a memory operation misses in a processor's local cache, the request is forwarded to the home location, which checks its directory. If the cache-line is in the Uncached state, the request is satisfied and the directory updated. The home site can also satisfy a read request when the cache-line is in the Shared state. A Write(2) request will, however, require all the current readers to be invalidated. Invalidation means that all processors caching that cache-line must have that cache-line removed from their caches. Regardless of the request type, if the cache-line is in the Modified state, the home must invalidate the cache-line at the current owner's site, forcing the up-to-date cache-line to be returned to home, which then sends it to the requester. The directory is updated appropriately during this process. Though all directory-based cache-coherence schemes follow this basic pattern, there are many variations and optimizations. For simplicity of discussion, this paper will assume a sequentially consistent implementation of memory operations when they appear on the memory bus. The rest of this section gives an overview of the building blocks of a CCDSM system and the issues related to each component.

________________________________
(1) Crashes due to the operating system occur more frequently than crashes due to the hardware!
(2) Write requests generally come out of a processor as a "read-with-intent-to-modify" request rather than a "write" because the write may just be to one word in a cache-line.
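Before turning to the individual components, the home-site bookkeeping just described can be made concrete. The following is a minimal C sketch, assuming a bitmask of sharer sites and two hypothetical messaging helpers (send_line_to, invalidate_site); it is our illustration of the general pattern, not the actual sP protocol code.

    #include <stdint.h>

    typedef enum { UNCACHED, SHARED, MODIFIED } DirState;

    typedef struct {
        DirState state;
        uint32_t sharers;   /* bitmask of sites holding a read-only copy */
        int      owner;     /* owning site, valid only in MODIFIED state */
    } DirEntry;

    extern void send_line_to(int site);     /* messaging layer (hypothetical) */
    extern void invalidate_site(int site);  /* recall/invalidate (hypothetical) */

    /* Home-site handling of a Read request, following the description above. */
    void home_read(DirEntry *e, int requester) {
        if (e->state == MODIFIED) {
            invalidate_site(e->owner);  /* forces write-back of the dirty line */
            e->sharers = 0;
        }
        send_line_to(requester);
        e->sharers |= 1u << requester;
        e->state = SHARED;
    }

    /* Home-site handling of a Write request: all current readers (or the
       current owner) must be invalidated before the requester gets the line. */
    void home_write(DirEntry *e, int requester) {
        if (e->state == MODIFIED) {
            invalidate_site(e->owner);
        } else {
            for (int s = 0; s < 32; s++)
                if (e->sharers & (1u << s)) invalidate_site(s);
        }
        send_line_to(requester);
        e->sharers = 0;
        e->owner = requester;
        e->state = MODIFIED;
    }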
2.1 Memory Model

A specific processor implements a particular memory consistency model. Most current microprocessors, including the PowerPC 620, will reorder memory operations to improve performance; for example, giving loads higher priority than stores, since the instruction stream will stall for want of a loaded value. To preserve the semantics of sequential execution on a processor, memory transactions from a processor to the same location proceed in order. However, memory transactions to different locations do not necessarily have any fixed ordering. This consistency model is commonly called weak consistency.

Processors that implement weak consistency invariably provide sequentializing instructions, such as the sync instruction in the PowerPC, that ensure that memory transactions go out in a sequentially consistent order. In order to implement sequential consistency, sequentializing instructions could be inserted at appropriate places in the instruction stream.

Weak consistency is sufficient for correct execution most of the time. An instance where weak consistency is insufficient is when dealing with memory-mapped I/O devices. Generally, I/O devices require that writes to their address space are performed in a very specific order. In that case, it may be important that operations appear on the bus in the same order as in the instruction stream. Considering the performance hit taken if sequential consistency is maintained, especially in fast superscalar processors, sequentializing instructions are used only when absolutely necessary.
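As a concrete illustration of this last point, consider a driver that must write a device command before setting a "go" flag. On a PowerPC compiler that accepts inline assembly, a sketch might look like the following; the device addresses and register layout are invented for the example.

    #include <stdint.h>

    #define DEV_CMD ((volatile uint32_t *)0xF0000000)  /* hypothetical registers */
    #define DEV_GO  ((volatile uint32_t *)0xF0000004)

    void start_device(uint32_t cmd) {
        *DEV_CMD = cmd;                          /* must reach the device first */
        __asm__ volatile ("sync" ::: "memory");  /* PowerPC sequentializing op  */
        *DEV_GO = 1;                             /* ...before the go bit is set */
    }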
2.2 Memory Instruction Semantics

We are using commercial processors to implement a CCDSM system. Because these processors are designed for the uniprocessor or SMP setting, the semantics of some of their instructions are not necessarily "scalable". For example, the PowerPC 620 used in StarT-ng has a data cache-block flush (dcbf) instruction which flushes a specified address out of all processors' caches. It may be possible for the exact dcbf semantics to carry over to a parallel machine built from multiple indirectly connected buses, essentially broadcasting every flush to all processors, but it is certain that such a scheme would be inefficient. The basic reason why directory information is kept is so that invalidations do not have to be broadcast. The alternative is to limit dcbf to flushing only processors on the bus where the dcbf is issued. This semantics allows us to use the operation to effect the single-site invalidations needed in CCDSM. These and related issues require the designers of a CCDSM system to examine the semantics of the instruction set.

2.3 Memory Bus Implementation

Like processor cache design, memory buses have gone through refinements to improve performance. Early memory buses allowed bus transactions to occur one at a time: if a processor issued a read, the memory bus would block until that read was satisfied. Later memory buses allowed pipelining, which increased bus throughput. A pipelined bus, however, cannot tolerate memory operations that are satisfied out of order; if one operation takes longer to perform, all of the operations that follow it must wait until it finishes.

A more general approach is provided by a split-phase bus, which allows replies to return in an order different from that in which their requests were issued. This approach requires some way of associating a reply with its request, making implementations more complicated, but promises the best performance. Most processors of the future will have split-phase buses. As we shall see later in this paper, split-phase buses are needed in the CCDSM setting not only to improve performance, but also to avoid certain types of deadlocks.

It is important to note that when a processor issues a memory request on the bus, it can be retried by the memory system multiple times. The processor has to keep trying until the request is accepted. During this time, the processor has to perform snooping operations and react as though the request had not occurred. Once a request has been accepted by the memory system, however, the request is considered to be "committed" and the processor, in general, does not allow cancellation of the request. The need for such cancellations does not arise in SMPs. On a split-phase bus, the time between when a request is accepted and when the reply comes back opens up a window during which the semantics and interaction between new bus operations and the outstanding operation are not obvious. As an example, after a processor's Read x request has been accepted, how should the processor respond to an Invalidate x bus operation? Should it allow the Invalidate to complete because it does not yet have a copy of x in its cache, or should it prevent the Invalidate from completing until it gets data x back, and then remove x from its cache? Both positions make sense in the SMP setting. The choice, however, makes a very big difference to a CCDSM system, as we will see in Section 4.
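The bookkeeping a split-phase bus interface needs is essentially a small table of outstanding transactions keyed by a tag, so that a reply is matched by tag rather than by arrival order. The sketch below is illustrative only; the tag count and structure are assumptions, not details of the 620's actual bus interface.

    #include <stdint.h>
    #include <stdbool.h>

    #define NUM_TAGS 8   /* assumed limit on outstanding bus transactions */

    typedef struct {
        bool     busy;
        uint64_t addr;   /* address of the outstanding request */
    } BusTag;

    static BusTag tags[NUM_TAGS];

    /* Issue: allocate a tag; if none is free, the request must be retried. */
    int issue_request(uint64_t addr) {
        for (int t = 0; t < NUM_TAGS; t++)
            if (!tags[t].busy) {
                tags[t].busy = true;
                tags[t].addr = addr;
                return t;            /* the tag travels with the request */
            }
        return -1;                   /* retry later */
    }

    /* Reply: the tag, not arrival order, identifies the original request. */
    uint64_t complete_reply(int tag) {
        tags[tag].busy = false;
        return tags[tag].addr;
    }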
2.4 Buffers and Networks

Buffer management is important in a CCDSM system, as the amount of buffer space in the processor, network, and coherence engine for storing the state of in-progress requests is finite. If one is not careful in the design of the system, concurrent processing of multiple requests can lead to deadlocks due to insufficient buffers.

Another major concern for CCDSM is whether the network delivers messages in FIFO order. If the network is not FIFO, it is important for the coherence protocol to know that and account for it, or for some lower-level software to ensure that FIFO ordering is observed by the coherence protocol. Generally speaking, non-FIFO networks add significant complexity to cache coherence protocols.

3 The Design of StarT-ng

A block diagram of a StarT-ng site is shown in Figure 1. The commercial building block of StarT-ng (unshaded in the diagram) is an SMP built from PowerPC 620 processors. The processor card is modified to include a Network Interface Unit (NIU). An address capture device (ACD) is added to the coherent interconnect to help support CCDSM. When the machine is running in the mode that supports global cache coherence, a 620 from each SMP, which we call the service processor (sP), is dedicated to this task. The rest of the processors, which are used for computation, are called data processors (dP). Parallel programs running on StarT-ng can communicate through both message passing and shared memory, whether they are executing on one site or multiple sites.

Each 620 in StarT-ng is connected to an NIU through its L2 cache interface. The NIU is being designed and implemented by Motorola and its partners and was heavily influenced by the MSU functional unit of the 88110MP[16]. The NIUs provide access to a fat-tree network built out of Arctic[4] routing chips. Currently under development at MIT, the Arctic router is a 4x4 router that provides 200 megabytes/sec/link of bandwidth.

Figure 1: StarT-ng block diagram

3.1 ACD

The ACD has two basic functions: (i) to "capture" global bus transactions and pass them to the sP, and (ii) to allow the sP to return data to the requesting processors. Capturing means to recognize a global memory transaction, store away a complete copy of the transaction (including the address, transaction type, and bus tag to return a value if necessary), and respond accordingly to the processor that initiated that transaction. If the ACD buffers are full, i.e., have not been emptied by the sP, it forces the processor to retry the transaction. Because of limited ACD buffering, the sP should remove captured transactions as quickly as possible.

The ACD can also be instructed by the sP to selectively flow-control (retry) certain types of transactions. This is needed for buffer management because the ACD hardware buffers are backed by software buffers under the sP's control. The ACD also performs Invalidations (the FLUSH bus operation). It has hardware support that allows it to interleave a Flush operation with the other functions it performs. We will examine the reasons for having the ACD do the Flush bus operation in Section 4.5.

It is possible to build the sP into the ACD or to provide some dedicated path between the two. Either design requires substantially more hardware than our design, in which the sP communicates with and commands the ACD over the SMP bus. The ACD also has a one-bit line going to the sP's NIU, allowing the sP to poll its NIU for ACD events.

3.2 Supporting CCDSM with the ACD and sP

The address space in StarT-ng is divided into four parts: (i) dP local; (ii) dP global; (iii) sP local; and (iv) ACD command spaces. The dP local space is private to each site, directly accessible by load/store operations only from the dPs of that site. The dP global space is the address space supported by CCDSM. Bus operations to this address space are captured by the ACD. The global physical address that is seen by the ACD is not really a physical address but encodes information that will allow the sP to get to the real physical address of the cache-line. We call this address a virsical address. Part of the virsical address indicates the current home site, and the rest of the virsical address is a virtual address in the sP local address space of the home site. Finally, the ACD command space is used by the sP to communicate with the ACD.
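The virsical encoding can be pictured as a simple split of the captured address bits. The field widths below are assumptions made purely for illustration; the text above specifies only that one part names the home site and the rest is a virtual address in that site's sP local space.

    #include <stdint.h>

    #define SITE_BITS   8    /* assumed width of the home-site field */
    #define OFFSET_BITS 32   /* assumed width of the sP-local virtual address */

    typedef struct {
        unsigned home_site;  /* site whose sP owns the directory entry */
        uint64_t sp_vaddr;   /* virtual address in that sP's local space */
    } Virsical;

    Virsical decode_virsical(uint64_t captured_addr) {
        Virsical v;
        v.sp_vaddr  = captured_addr & (((uint64_t)1 << OFFSET_BITS) - 1);
        v.home_site = (unsigned)(captured_addr >> OFFSET_BITS)
                      & ((1u << SITE_BITS) - 1);
        return v;
    }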
The ACD and the sP support CCDSM as follows. When a processor accesses a global location and misses in its cache, the request propagates to the memory bus and is captured by the ACD. The ACD then signals the sP by setting a bit in its status word as well as asserting a line connected to the sP's NIU. The sP either polls the ACD directly or through the NIU. The sP then reads out the captured transactions, clearing the transactions from the buffers. The sP will inform the ACD of bus transactions to be retried, if necessary. We plan to implement an inclusive L3 cache in software for performance and correctness. The sP will take appropriate action to satisfy the request by checking its own L3 cache and sending messages to the home site if necessary. The sP will also serve as a protocol processor, receiving messages from other sites requesting cache-lines whose home is its site. It is important, therefore, for an sP to continue to serve incoming requests even though it has outstanding requests. When the response to the remote request returns, the sP updates whatever state it needs to, including L3 cache state, and returns the data through the ACD to the requesting processor.

3.3 Planned Features

We plan to try the following ideas in the implementation of CCDSM on StarT-ng.

Software L3 Cache: An essential idea is a large software L3 cache, implemented in main memory. The sP can implement an inclusive software cache which stores all remote data brought to the site. Any sort of caching strategy can be implemented at this level, including automatic prefetching, presending, and update protocols.

Split-phase Cache Accesses: During a split-phase cache access, a processor switches threads when an access to global memory misses in its cache. The difficult part is how to indicate that there was a cache miss. We plan to implement this idea by having every load of a global location checked against a "miss pattern" (a sketch follows this list). The miss pattern would be returned by the sP to the requesting processor after not finding the desired cache-line in its L3 cache. If a load returns the miss pattern, a cache miss is assumed and the current task is swapped out. Part of the swap-out would be a request to fetch the cache-line. A microthread descriptor (an instruction pointer/frame pointer pair), which would allow the task to restart, would be included in that request. The cache-line could be delivered to the requesting processor's sP for insertion into the L3 cache, while the desired value is sent directly to the requesting processor along with the microthread descriptor. When the desired value returns to the requesting processor, the thread wakes up and continues with the value it requested. If another processor requests the same cache-line in the near future, the sP can satisfy the request directly from the L3 cache.

Supporting Coherence from Page Level to Cache-line Level: During the time when a global page is accessed by only one site, "localizing" the page temporarily allows cache miss processing to bypass the ACD and sP, resulting in a lower cache miss penalty. "Localization" is achieved by incorporating Kai Li-style page-level coherence through the VMM. A global page can have its coherence managed either at the page level or at the cache-line level, with dynamic, and possibly automatic, switching between the two schemes based on access patterns.

Integrating I-structure Semantics into Cache Coherence Protocols: Our research language relies heavily on I-structures[3], data structures with synchronizing semantics. These are write-once, read-many-times memory locations that allow reads to be issued before the location is written. Since they never need to be invalidated, coherence protocols for these locations may become significantly simpler. We plan on investigating these and other custom protocols tailored for specific applications.
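To make the miss-pattern mechanism in the Split-phase Cache Accesses item concrete, the fragment below shows what a global load might look like on a dP. This is a sketch only: the pattern value, descriptor layout, and helper routines are all invented for illustration, not part of the planned implementation.

    #include <stdint.h>

    #define MISS_PATTERN 0xDEADBEEFDEADBEEFull  /* assumed reserved bit pattern */

    typedef struct {       /* microthread descriptor from Section 3.3 */
        void *ip;          /* instruction pointer to restart at */
        void *fp;          /* frame pointer of the suspended task */
    } MicroThread;

    extern void request_cache_line(uint64_t addr, MicroThread mt); /* hypothetical */
    extern void switch_to_next_thread(void);                       /* hypothetical */

    uint64_t global_load(volatile uint64_t *addr, MicroThread self) {
        uint64_t v = *addr;          /* the sP returns MISS_PATTERN on an L3 miss */
        if (v == MISS_PATTERN) {
            request_cache_line((uint64_t)addr, self); /* fetch + restart info */
            switch_to_next_thread();                  /* swap the task out    */
            /* on wake-up, execution resumes with the fetched value */
        }
        return v;
    }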
3.4 Penalty of a Cache Miss

Since StarT-ng's support for CCDSM is mostly in software, a cache miss on global data incurs a significant penalty. The penalties of cache misses are shown in Figure 2. The times are given in processor cycles and are our current best estimates. The corresponding numbers for FLASH, as reported in [6], are given for comparison. To the first order, the global memory miss penalties for StarT-ng are roughly three times those of FLASH, with the penalty for the case of global memory with a "local" home and Clean directory state approaching four times that of FLASH. Our miss penalty for the local address space appears to be lower than FLASH's partly because our machine will run at a slower clock rate, making the memory latency a smaller number of processor cycles.

Though our miss penalties are significantly higher than those of FLASH, it is important to remember that we only pay these penalties when we miss in the L1 and L2 caches. We can express the relative efficiency of a particular implementation by the average clocks-per-instruction (CPI), which is computed using the following equations:

    CPI_global-memory-ops = CPI_memory-op-cache-hit + miss-rate x miss-penalty

    CPI_average = %non-global-ops x CPI_non-global-ops
                + %global-memory-ops x CPI_global-memory-ops

A low miss rate and/or a low percentage of global memory operations will reduce the impact of StarT-ng's disadvantage. A recent FLASH paper[6], which measured the performance of FLASH using a simulator, gives some idea of the kind of miss rates that one may expect. For the seven programs (mostly SPLASH programs) that they ran on the simulator, three have miss rates below 0.1%, three have miss rates between 0.1% and 1.0%, and only one had the rather high miss rate of 6.0%. A quick back-of-the-envelope calculation of the average CPI, using some reasonable estimated numbers(5) for the missing components of the average CPI formula, shows that for a miss rate of under 0.1%, the difference in miss penalty between StarT-ng and FLASH does not make any significant difference. At a 1.0% miss rate, the difference becomes noticeable, with FLASH about 50% faster. Around the 1% miss rate, however, the exact miss rate that each machine encounters will have a greater impact on performance. StarT-ng should encounter lower miss rates than FLASH: each site is an SMP, so more data has a local home; our caches will potentially be bigger; and the software L3 cache should increase the sharing of remote data within a site. Miss rates in the range of 6% are not of great interest because both machines would suffer from such high "parallel slowdown" that neither could handle the program very efficiently.

Thus, for programs that run well on an aggressive implementation like FLASH, StarT-ng has the potential of running reasonably. Clearly "embarrassingly local" programs will run equally well on either machine. Programs that do not run as well on StarT-ng as on FLASH do exist, but the range of such programs is not unlimited. It will be interesting to see just what kind of performance we can get out of StarT-ng's CCDSM.

________________________________
(5) We used these estimated numbers: CPI for other ops is 0.8; cache hit CPI = 0.8; % global memory ops = 10%; miss penalty = 800 for StarT-ng and 266 for FLASH.
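For concreteness, the back-of-the-envelope calculation can be reproduced directly from the formulas above and the estimates of footnote 5. The short program below prints the average CPI for both machines at the two interesting miss rates; it adds nothing beyond the stated numbers.

    #include <stdio.h>

    int main(void) {
        double cpi_hit = 0.8, cpi_other = 0.8;   /* footnote 5 estimates */
        double frac_global = 0.10;               /* 10% global memory ops */
        double penalty_ng = 800.0, penalty_flash = 266.0;
        double rates[] = { 0.001, 0.01 };        /* 0.1% and 1.0% miss rates */

        for (int i = 0; i < 2; i++) {
            double ng    = (1 - frac_global) * cpi_other
                         + frac_global * (cpi_hit + rates[i] * penalty_ng);
            double flash = (1 - frac_global) * cpi_other
                         + frac_global * (cpi_hit + rates[i] * penalty_flash);
            printf("miss rate %.1f%%: StarT-ng CPI %.3f, FLASH CPI %.3f\n",
                   rates[i] * 100, ng, flash);
        }
        return 0;  /* at 1.0%: 1.60 vs 1.07, i.e. FLASH about 50% faster */
    }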
4 Issues in Designing CCDSM

In designing StarT-ng, we encountered a surprisingly large number of complications from using a processor in the SMP as the sP. The exact nature of these problems depends on the microprocessor bus interface and the design of the snoopy bus itself. Other problems deal with buffer management and allocation. In the following subsections, we outline the major problems we encountered and their solutions. We believe that these problems, or some solutions to them, are encountered in most CCDSM implementations today.

Before we discuss these examples, it should be noted that cache coherence protocols tend to be quite complex in their actual implementations. That complexity is a compelling reason for doing them in software instead of hardwiring them. In this paper, we are not presenting any specific protocols. However, what the following examples will illustrate is that without some cooperation from the underlying hardware, implementing CCDSM will be either impossible, or so inefficient as to be practically uninteresting. The following examples will not describe any complete protocols, but will just show protocol fragments to illustrate the problems encountered. Except for Section 4.7, all the other sections assume a FIFO network.

4.1 Need for a Split-phase Bus in an SMP

A blocking bus, which locks up the bus until an operation is complete, can cause deadlocks in a CCDSM system. For example, consider processors PA and PB at sites A and B respectively.

1. PA has x in Modified state and now wants to read y. The Read y request goes on the bus and is accepted.

2. At the same time, PB, which has y in Modified state, wants to read x. It issues a Read x request, which is accepted on its bus.

3. Neither can proceed any further because each request needs to get the cache-line from the other bus.

This problem is due to concurrent events that have no immediate knowledge of each other. A snoopy bus protocol used in a CCDSM system should therefore be split-phase, allowing processors to snoop the bus and service bus-initiated operations while waiting for their own requests to be fulfilled. It is true that as long as it is possible to retry bus operations, it is possible to implement CCDSM using either a blocking or a pipelined snoopy bus protocol. However, the implementation complexity and the performance penalty make either an unreasonable engineering choice. This was the main reason for us to select an SMP based on the PowerPC 620 rather than the PowerPC 601.

4.2 Interaction of Invalidates and Outstanding Reads

Suppose a processor retries an Invalidation to x if there is an outstanding bus operation to x. This processor could deadlock a CCDSM implementation that uses it. For an example of the problem, consider the simple scenario, also shown in Figure 3, where an Invalidate x request arrives at A when there is an outstanding Read x.

1. PA has x in Shared state but replaces it in its cache.

2. PB wants to write x. An Invalidation is sent to A.

3. PA reads x again.

4. PA receives the Invalidation.
If the processor does not allow the Invalidation to complete until the Read completes, the Invalidation of x will deadlock: the Invalidation waits for the outstanding Read x to complete, which in turn depends on the Invalidation completing through B's Write x. In this case, the CCDSM implementation must track outstanding operations, intercept invalidation requests, and send back a message indicating that an invalidation to a cache-line accessed by outstanding operations has completed, without putting the invalidation onto the bus. Skipping the actual invalidation is legal, since we know that the cache-line does not exist at the requesting site.

This problem is solved if the snoopy bus protocol allows an Invalidate x bus operation to complete even if a processor on the bus has an outstanding Read x. When the read cache-line returns, the read request is satisfied.

4.3 Complications from Multi-processor Sites

Multiple processors at a site create problem situations in CCDSM systems that an inclusive cache shared by all of a site's processors would eliminate. A variation of the example in Section 4.2, with two processors PA1 and PA2 at site A, explains why.

1. PA1 reads x; B is the home site of x. PA1 obtains x and caches it.

2. Processor PB tries to write x. The directory indicates site A has a copy. An invalidation is sent to site A.

3. Meanwhile, PA2 tries to read x, misses in its cache, and sends a Read x request to the home site.

4. When the Invalidate x request arrives at A, there is an outstanding read request to cache-line x.

Unlike the example in Section 4.2, there is a cached copy of x in PA1 which must be invalidated by an invalidation bus operation. If the snoopy bus protocol allows the invalidation to complete despite PA2's outstanding operation, there is no problem. Otherwise, an inclusive shared cache avoids the problem altogether by eliminating the situation where there is an outstanding Read x from a site while x is in some cache of that site. An extra cache is not necessary if the snoopy bus protocol supports cache-to-cache transfer of all "external" data. Cache-to-cache transfer when a Read x occurs while x is in the Modified state in some processor's cache is currently supported on many snoopy buses. Extending it to happen when x is in the Shared state will allow all the caches to function as a combined per-site cache.
4.4 Need to Separate Buffers by Functionality

Assuming our protocol is logically correct, another source of deadlock is a finite number of buffers. The sharing of hardware resources can introduce dependencies between operations that are logically unrelated. No system requires unbounded buffering; it is possible to estimate the number of buffers needed given the size of the machine and the protocol. This limit may even be acceptable in certain cases. The limit may not be satisfactory, however, if the buffer requirement is directly proportional to the number of processors per site or the number of sites per machine. We will show that unless we can bound the buffer sizes acceptably, we must functionally separate the buffers.

Consider an implementation with four types of buffers: (i) Capture Buffers, used for storing captured requests until they can be forwarded to the home site; (ii) Home Request Buffers, which hold requests at a home site until they are processed; (iii) Invalidation Buffers, which hold invalidation requests until they complete; and (iv) Reply Buffers, which allow the protocol to return a cache-line to the requester. Our example will deal with three sites, A, B, and C, with processors PA and PB on sites A and B respectively. Assume each site has two Capture Buffers, four Home Request Buffers, one Invalidation Buffer, and one Reply Buffer.

1. PA writes cache-lines x1, x2, ..., x10 (all of whose home is site C), caching them in the Modified state.

2. Similarly, PB writes cache-lines y1, y2, ..., y10 (whose home is also site C), caching them in the Modified state in PB.

3. PA and PB "exchange" the data that they work on, i.e., PA reads cache-lines y1, y2, ..., y10, while PB reads cache-lines x1, x2, ..., x10. If the bus is split-phase or if there are many processors per site, many requests can be outstanding from sites A and B.

4. All the read requests are buffered in Capture Buffers at the respective requesting sites, before being buffered in site C's Home Request Buffers. Invalidation requests are issued to sites A and B.

5. When the invalidation requests, Inval x1 and Inval y1, arrive at sites A and B respectively, invalidation bus operations are performed. However, in order for the Modified data to be invalidated and written back, there must be Capture Buffers to accept it. Unfortunately, all the Capture Buffers have been used up by outstanding read requests, so the invalidations cannot complete.

6. Worse yet, no Capture Buffers will free up at A or B until there is space in the Home Buffers of C. But the Home Buffers in C are full, and will not free up until the invalidations are done. A deadlock has now occurred.

This situation is illustrated by Figure 4, where the arcs indicate dependence. The above deadlock scenario arises purely from the constraints of finite buffering. If the home site C or the Capture Buffers had infinite buffering, this problem would not occur. But infinite buffering is not a realistic solution unless one can bound the maximum amount of buffering required, which is sometimes feasible but neither elegant nor scalable.

This deadlock is easy to see with a static dependence graph, which, for the example, is shown in Figure 5(6). Each node of this graph is a pool of hardware resources used by the CCDSM implementation in a blocking fashion. An arc, such as the one from the Capture Buffers node to the Home Buffers node, indicates that a Capture Buffer in use is not released until a Home Buffer is obtained. If this graph does not contain any cycles, no deadlock due to sharing of hardware resources can occur dynamically. Unfortunately, a cycle exists in the graph for the example. This indicates that deadlock can occur if too many concurrent events are outstanding(7).

A solution for the example requires not only separating write-back capture buffers from read capture buffers, but also separating the home read request buffers from the home write-back request buffers. The deadlock-free dependence graph is shown in Figure 6. Intuitively, this means that write-backs of cache-lines have a dedicated path to memory which read requests cannot block. Reads are only blocked as long as write-backs are blocked, so avoiding write-back blocking avoids deadlock altogether. Splitting buffers into separate categories requires us to be able to retry bus accesses that use certain buffers that are full while allowing others to proceed. This selective retry capability is provided in our ACD.

________________________________
(6) We have developed this technique for illuminating deadlock possibilities. It is very similar to techniques used to show networks are deadlock free. An article will be written on the technique in the near future.
(7) It is possible for a particular implementation with cycles to avoid deadlock by constraining the number of concurrent actions.
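The fix amounts to giving write-backs their own pools, with the ACD's selective retry applied per pool. The sketch below shows the split that the deadlock-free graph of Figure 6 calls for; the structure names are ours, and pool sizes are assumed to be set at startup.

    #include <stdbool.h>

    typedef struct { int free; int total; } Pool;

    /* Separate pools break the Figure 5 cycle: write-backs can always
       drain even when every read buffer is occupied. */
    struct {
        Pool read_capture;        /* captured read requests           */
        Pool writeback_capture;   /* captured write-backs (dedicated) */
        Pool home_read;           /* read requests queued at home     */
        Pool home_writeback;      /* write-backs queued at home       */
    } bufs;

    /* Selective retry, as supported by the ACD: retry reads when their
       pool is full, but keep accepting write-backs from their own pool. */
    bool accept_bus_op(bool is_writeback) {
        Pool *p = is_writeback ? &bufs.writeback_capture : &bufs.read_capture;
        if (p->free == 0) return false;  /* ACD retries this transaction */
        p->free--;
        return true;
    }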
4.5 Why Our sP Cannot Issue Invalidates

Blocking shared resources include more than just buffers. In StarT-ng, the sP is capable of executing invalidations (using a Flush command). Unfortunately, a Flush is not allowed to complete until any write-back it triggers completes. An sP performing a Flush operation can thus be blocked. This would make the sP a blocking resource, leading to deadlocks because of the sP's other roles.

This problem can be seen with the static dependence graph of Figure 7. Because the sP is used in a blocking fashion to perform the invalidation, it has to be modeled in the graph. The sP is part of almost every step of the processing, creating all the arcs entering it. Using the sP to perform invalidations produces the arc to the write-back capture buffers node. That arc produces cycles, indicating possible deadlocks.

This problem is solved by providing extra hardware that performs the invalidation, giving rise to the static dependence graph shown in Figure 8, which has no cycles. In StarT-ng, we have chosen to implement this within the ACD hardware, though it functions independently of all the other functionality of the ACD.

4.6 Difficulty in Sharing Cacheable Data between dPs and sP

The snoopy bus protocol should allow cache-to-cache transfer of Modified data to a processor performing a Read without requiring a write-back to main memory. Without this capability, the sP cannot share cacheable, writable data with the dPs. The sP cannot read data that is Modified in a dP because it will block until the dP is able to push the Modified data out with a write-back. Although this write-back is to local memory, it may be waiting behind write-backs that are going to global memory. Hence, the write-back may be blocked because all the write-back capture buffer resources are in use, and those can only be freed by the sP. In the static dependence graph, sharing of writable, cacheable data between the sP and the dPs introduces an arc from the sP to the write-back capture buffers resource, resulting in a cycle. Thus, data sharing between the sP and the dPs must be through uncached memory.

It is important to note that since operating system data structures on SMPs are usually cacheable and writable by any processor, this prohibition on sharing cached, writable memory between the dPs and the sP means that the sP cannot use the same operating system services used by the dPs. For this reason, the sP will in all likelihood have to run firmware code without virtual memory, although it is conceivable that the sP could run a simple operating system of its own that does not share any resources with the main operating system.

4.7 Complications due to a Non-FIFO Network

A non-FIFO network introduces considerable complexity since messages can get arbitrarily out of order. Although the protocol can use acknowledgements to serialize its actions and hence avoid the effects of a non-FIFO network, such a solution is unattractive because of the additional overhead and latency. Better solutions use more states in the protocol to handle possible out-of-order situations, sending extra messages only when necessary. Keeping track of outstanding operations, which was required in previous examples, is also necessary here. Let us consider the scenario below, also shown in Figure 9, which is related to the example in Section 4.2.

1. PA sends a Read x to site B, the home of x.

2. B sends x back.

3. B writes x and sends A an Invalidation.

4. The Invalidation arrives before x!!!
5. x arrives.

When the arrivals of Data x and Invalidate x at A are reversed like this, it is not correct to process them in their arrival order, since the home site might then believe that site A does not have a copy of the cache-line. Simply having the invalidation wait while there is an outstanding operation will potentially deadlock, since it is possible that the Invalidate really should come before the Data (the example in Section 4.2) and the Data won't come until the Invalidate completes.

If there is an outstanding operation to the same cache-line on site A, an invalidate can wait at site A. The home site B, upon receiving a Read x request from a site to which it has already sent an invalidation, must reply with a Retry Read x. If this retry gets back to site A and finds a pending invalidation, the invalidation can then be allowed to complete without being put on the bus. If a cache-line returns to find a waiting invalidation, the cache-line can be returned to satisfy the outstanding request, and then the invalidation can be put on the bus. Proving that this solution works is involved because other messages can also overtake each other; for example, the Retry Read x can overtake the Invalidate x. Another document[18] describes the cache coherence protocols we intend to implement on StarT-ng and proofs of their logical correctness.
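The protocol fragment just described can be summarized as three handlers, one per incoming message type. The message names mirror the scenario, and all helpers are hypothetical stand-ins; this is a sketch of the idea, not the proved protocols of [18].

    #include <stdint.h>
    #include <stdbool.h>

    extern bool pending_invalidate(uint64_t line);   /* held at this site? */
    extern void set_pending_invalidate(uint64_t line, bool v);
    extern bool has_outstanding_read(uint64_t line);
    extern void put_invalidate_on_bus(uint64_t line);
    extern void deliver_data_to_requester(uint64_t line);
    extern void ack_invalidation_to_home(uint64_t line);

    /* Invalidate x arrives while Read x is outstanding: hold it. */
    void on_invalidate(uint64_t line) {
        if (has_outstanding_read(line)) set_pending_invalidate(line, true);
        else put_invalidate_on_bus(line);
    }

    /* Data x finally arrives: satisfy the read, then let the held
       invalidation proceed onto the bus. */
    void on_data(uint64_t line) {
        deliver_data_to_requester(line);
        if (pending_invalidate(line)) {
            set_pending_invalidate(line, false);
            put_invalidate_on_bus(line);
        }
    }

    /* Retry Read x from home finds a held invalidation: the site has no
       copy, so complete the invalidation without a bus operation. */
    void on_retry_read(uint64_t line) {
        if (pending_invalidate(line)) {
            set_pending_invalidate(line, false);
            ack_invalidation_to_home(line);
        }
        /* ...then reissue the Read x */
    }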
5 Conclusion

The design of a CCDSM system using highly integrated commercial microprocessors presents many challenges. Because the snoopy bus protocol of the processor forms the substrate on which CCDSM protocols are implemented, the exact semantics and detailed implementation of the snoopy bus can have a profound influence on the design. This paper explores the challenges facing a designer of a CCDSM system built from commercial SMPs which uses a processor from each SMP as the sP and a minimal piece of custom hardware. Some of the problems we faced were particular to our system, while others were general and apply to all CCDSM systems. Our particular problems can be attributed to the processors used, the snoopy bus protocol used, and the network used. Many of the problems are due to oversights in the processor/bus design and could be solved by modifying the processor or switching processors. Luckily, there are software solutions to all the problems we discovered. Other problems, notably the buffering problem, require hardware to be designed in a particular way regardless of what the software does.

The performance of coherent shared memory on this machine, like that of all CCDSM machines, depends mostly on the global memory miss ratios. Though our global cache miss penalties are a factor of three away from those of more aggressive CCDSM machines, cache misses generally account for a very small fraction of the entire instruction stream, giving us competitive performance for many programs. Considering the cost of implementation and the experimental nature of coherent caches on our machine, our expected performance is surprisingly good.

In the following, we summarize the problems and solutions presented in this paper.

Snoopy Bus Protocols: In order to facilitate CCDSM implementations, snoopy bus protocols should have the following properties.

- They should not block invalidations because of outstanding operations.

- They should support cache-to-cache transfer of Shared as well as Modified data. Cache-to-cache transfer of Shared data can be avoided if some form of L3 cache is provided. Depending on the implementation, either can perform better than the other.

- They should support a Shared Modified state, which allows cache-to-cache transfers for Reads of modified cache-lines without write-backs to memory.

Non-FIFO Networks: If the network is non-FIFO, the CCDSM protocol should track outstanding operations and implement a protocol which handles the non-FIFO network correctly and efficiently.

Buffering and Blocking Hardware: Buffering and blocking hardware should be divided in a manner that avoids deadlocks. Using dependence graphs, it is easy to see where deadlocks may occur. Then, either by splitting up groups of buffers or by preallocating buffers when required, deadlocks can be avoided.
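The dependence-graph check recommended above is, mechanically, an ordinary cycle detection. A sketch follows, with the graph encoded as an adjacency matrix over resource pools; the encoding is our illustration of the technique from Section 4.4.

    #include <stdbool.h>

    #define MAX_NODES 8   /* one node per blocking resource pool */

    /* edge[i][j] = true: a resource of pool i is held until one of pool j
       is obtained (e.g., Capture Buffers -> Home Buffers in Figure 5). */
    static bool edge[MAX_NODES][MAX_NODES];

    static bool visiting[MAX_NODES], done[MAX_NODES];

    /* Depth-first search: a back edge to a node on the current path is a
       cycle, i.e. a potential deadlock in the buffer/blocking structure. */
    static bool has_cycle_from(int u, int n) {
        visiting[u] = true;
        for (int v = 0; v < n; v++) {
            if (!edge[u][v]) continue;
            if (visiting[v]) return true;
            if (!done[v] && has_cycle_from(v, n)) return true;
        }
        visiting[u] = false;
        done[u] = true;
        return false;
    }

    bool deadlock_possible(int n) {
        for (int u = 0; u < n; u++)
            if (!done[u] && has_cycle_from(u, n)) return true;
        return false;
    }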