Issues in Building a Cache-Coherent Distributed Shared Memory Machine using Commercial SMPs

tbd

December 7, 1994

Abstract

Many research groups and companies, past and present, have expended a significant amount of effort designing cache-coherent shared memory systems for distributed memory machines. Cache-coherent shared memory is a desirable feature, giving the programmer, operating system, and compiler a uniform view of memory. The effort involved in designing and implementing this feature, however, has traditionally been very large and seems to have only grown as time goes on. Many of the recent efforts involve building large custom chips, some of which are almost as large as the microprocessors they support.

The StarT-ng project is an attempt at building a parallel machine that supports both fine-grained user-level message passing and cache-coherent distributed shared memory (CCDSM) using commercial symmetric multiprocessors (SMPs) as the basic building block. The goal is to leverage as much as possible from commercial systems, allowing us to concentrate on the parallel aspects of the machine. Since we wish to minimize the effort of implementing custom hardware, we naturally looked for ways to support global cache-coherent memory with as little custom hardware as possible. We achieve this by dedicating a stock processor, called the service processor or sP, on each SMP to maintaining coherence between caches. The service processor works in conjunction with a small custom device, called the address capture device (ACD), which sits on the memory bus and "captures" accesses to global memory. By having a stock processor do most of the work, we drastically reduce the amount of custom hardware needed to support CCDSM.

This paper describes the implementation of CCDSM on StarT-ng. We discuss some of the problems we encountered during the design; many of these problems are common to all CCDSM implementations. We also estimate the CCDSM performance of StarT-ng.

1 Introduction

Researchers generally agree that a global address space simplifies the task of parallel programming. Having a global address space (shared memory) simply means that a uniform addressing mechanism can be used by all processors to access both local and non-local data. For programming purposes, however, one also needs to specify a memory model, i.e., how memory accesses are viewed and ordered by processors in the system. The best-known memory model is Lamport's strongly consistent sequential model, in which only an interleaving of the sequential memory reference streams from the processors is permitted. For improved performance, people have designed "weaker" consistency models. There is yet another class of memory models, such as I-structures, which are motivated by concerns for parallelism. Though there is no general agreement on what is the best memory model for parallel programming, strong sequential consistency is a requirement for running most operating systems (including Unix) on Symmetric Multiprocessors (SMPs).

The StarT-ng project is about building an efficient message passing machine which is also suitable for experimenting with different memory models. An important goal of the project is to maximize the use of commercially available hardware and operating systems. For various technical and non-technical reasons we have chosen to connect PowerPC-based SMPs together by a fast network of our own design. Following the idea presented in [14], one processor in each SMP is dedicated to handling memory requests from other SMPs.
Though we expect to implement additional features required by different memory models in software, each SMP is augmented with a simple address capture device (ACD) to make the implementation of a global address space possible. In this paper we will discuss some of the interesting issues that had to be resolved to design such a machine.

The most important issue in the implementation of a memory model is whether several copies of a memory location are permitted or not. Currently, by far the most popular approach to reducing memory latency and bus bandwidth is to rely on caches. Since in a parallel system there are potentially multiple readers and writers of a cache-line, it is possible for a cache to have an out-of-date or "stale" copy of that cache-line. Protocols to keep the copies of replicated data consistent with each other, as dictated by a memory model, are a major concern in the implementation of any parallel machine built using commercial microprocessors.

However, there are machines without caches where each memory reference always goes to the physical memory location. In such systems, memory references to a location are automatically serialized in a fixed manner for all processors. The most notable example of such a machine is the Cray X-MP. Most dataflow and multi-threaded machines also do not cache global data (see, for example, HEP [19], Tera [2], EM-4 [17], and Monsoon [15]). Such machines rely on split-phase transactions to absorb memory latency, and provide efficient mechanisms to support word-level synchronization throughout the global memory. The major criticism of machines without caches is that they demand higher-bandwidth memory systems than machines with local caches. Some machines, such as the Cray T3D[], pass the burden of keeping the caches coherent on to the user-level software. This position, however, is not always desirable.

The oldest way to keep caches coherent comes from bus-based multiprocessors (dual-processor IBM 370 systems from the early seventies). The solution, which is also used in all current SMPs, is to "snoop" on the address bus during a cache miss. On a write operation, the cache-line is automatically invalidated in all other caches to maintain consistency. Bus-based shared memory machines, however, are not scalable because the bus bandwidth is shared among all the processors.

Cache-coherent distributed shared memory (CCDSM) systems have been advanced as a solution to the scalability problem. CCDSM systems consist of multiple snoopy buses connected via an interconnection network. Some form of directory-based cache-coherence protocol is used to maintain cache coherence across multiple sites. Machines in this class include the Stanford DASH[9, 10], the Kendall Square Research KSR[5], and the MIT Alewife[1], which are all operational, as well as machines under design such as the Stanford FLASH[8] and the University of Wisconsin Typhoon (a paper architecture). All these machines are characterized by a substantial amount of custom hardware to support directory-based cache coherence protocols.

The amount of hardware required to implement shared memory depends to a large degree on the desired performance on a class of applications. At one extreme, the underlying system supports only message passing. Some have written compilers or libraries that can give the illusion of shared memory. Those schemes, however, use global memory in a highly stylized manner and generally avoid coherence issues.
Others[11, 13, 12] have implemented coherent shared memory at page granularity by modifying the virtual memory manager to allow demand paging from remote nodes. These schemes work well in limited contexts and are not suitable for building general-purpose parallel programming models.

The StarT-ng project does provide a minimal amount of custom hardware (i.e., the ACD) to support directory-based cache-coherent shared memory on a network of SMPs. We hope to have extreme flexibility in experimenting with cache coherence protocols by using one of the processors in each SMP for protocol processing. Our approach will also allow us to explore the virtual memory management (VMM) aspects of CCDSM, and to pursue language-driven memory models such as I-structures. Since almost all of the CCDSM support is in software, performance is invariably going to be compromised. However, at this stage it is not clear how much improvement in overall performance can be expected from a more aggressive implementation.

Unlike traditional shared memory machines, where the entire machine runs one copy of the operating system (OS) which views the whole machine as an SMP, StarT-ng will run a separate copy of the OS on each SMP site, allowing us to use a stock operating system. We believe that the hardware resources at each SMP should be managed separately for fault tolerance(1) and scalability reasons. Coordination across multiple SMPs, such as for work scheduling or virtual memory management (VMM), will be done as a layer on top of the per-site OS.

Although this paper is about CCDSM on StarT-ng, the message passing capability of the machine is still of fundamental importance. The performance of the message passing system will have a direct impact on the performance of the CCDSM system. User-level message passing also provides additional capability: the arrival of messages can be used as a means of activating computation or performing synchronization. Other researchers[7] have pointed out the synergy between shared memory and message passing. StarT-ng, like Alewife and FLASH, will support both.

The rest of this paper is organized as follows. The next section provides some background for the rest of the paper. Section 3 describes StarT-ng and outlines how we plan to implement CCDSM; estimates for the performance of StarT-ng's CCDSM are included in that section. In Section 4, we discuss the issues that we encountered while implementing CCDSM, along with approaches for dealing with them. Most of these problems are not specific to StarT-ng. The main contributions of this paper are the exploration of very simple hardware for supporting cache-line-granularity cache coherence, the identification of the problems of this and other approaches, and the development of solutions for those problems.

2 CCDSM Concerns

Cache coherence on distributed systems is usually maintained with directory-based schemes. "Directory" information on every cache-line is kept at a "home" location. Typical directory states are Uncached (not cached by any site); Shared (cached by one or more sites and thus read-only at that moment); or Modified (temporarily owned by one site, which alone can read and write that cache-line). When a memory operation misses in a processor's local cache, the request is forwarded to the home location, which checks its directory. If the cache-line is in the Uncached state, the request is satisfied and the directory updated. The home site can also satisfy a read request when the cache-line is in the Shared state. A Write(2) request will, however, require all the current readers to be invalidated. Invalidation means that all processors caching that cache-line must have that cache-line removed from their caches. Regardless of the request type, if the cache-line is in the Modified state, the home must invalidate the cache-line at the current owner's site, forcing the up-to-date cache-line to be returned to home, which then sends it to the requester. The directory is updated appropriately during this process. Though all directory-based cache-coherence schemes follow this basic pattern, there are many variations and optimizations. For simplicity of discussion, this paper will assume a sequentially consistent implementation of memory operations when they appear on the memory bus. The rest of this section gives an overview of the building blocks of a CCDSM system and the issues related to each component.

________________________________
(1) Crashes due to the operating system occur more frequently than crashes due to the hardware!
(2) Write requests generally come out of a processor as a "read-with-intent-to-modify" request rather than a "write" because the write may just be to one word in a cache-line.
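Before turning to the individual components, the home-site bookkeeping just described can be made concrete. The following is a minimal C sketch, assuming a bitmask of sharer sites and two hypothetical messaging helpers (send_line_to, invalidate_site); it is our illustration of the general pattern, not the actual sP protocol code.

    #include <stdint.h>

    typedef enum { UNCACHED, SHARED, MODIFIED } DirState;

    typedef struct {
        DirState state;
        uint32_t sharers;   /* bitmask of sites holding a read-only copy */
        int      owner;     /* owning site, valid only in MODIFIED state */
    } DirEntry;

    extern void send_line_to(int site);     /* messaging layer (hypothetical) */
    extern void invalidate_site(int site);  /* recall/invalidate (hypothetical) */

    /* Home-site handling of a Read request, following the description above. */
    void home_read(DirEntry *e, int requester) {
        if (e->state == MODIFIED) {
            invalidate_site(e->owner);  /* forces write-back of the dirty line */
            e->sharers = 0;
        }
        send_line_to(requester);
        e->sharers |= 1u << requester;
        e->state = SHARED;
    }

    /* Home-site handling of a Write request: all current readers (or the
       current owner) must be invalidated before the requester gets the line. */
    void home_write(DirEntry *e, int requester) {
        if (e->state == MODIFIED) {
            invalidate_site(e->owner);
        } else {
            for (int s = 0; s < 32; s++)
                if (e->sharers & (1u << s)) invalidate_site(s);
        }
        send_line_to(requester);
        e->sharers = 0;
        e->owner = requester;
        e->state = MODIFIED;
    }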
2.1 Memory Model

A specific processor implements a particular memory consistency model. Most current microprocessors, including the PowerPC 620, will reorder memory operations to improve performance; for example, giving loads higher priority than stores, since the instruction stream will stall for want of a loaded value. To preserve the semantics of sequential execution on a processor, memory transactions from a processor to the same location proceed in order. However, memory transactions to different locations do not necessarily have any fixed ordering. This consistency model is commonly called weak consistency.

Processors that implement weak consistency invariably provide sequentializing instructions, such as the sync instruction in the PowerPC, that ensure that memory transactions go out in a sequentially consistent order. In order to implement sequential consistency, sequentializing instructions could be inserted at appropriate places in the instruction stream.

Weak consistency is sufficient for correct execution most of the time. An instance where weak consistency is insufficient is when dealing with memory-mapped I/O devices. Generally, I/O devices require that writes to their address space are performed in a very specific order. In that case, it may be important that operations appear on the bus in the same order as in the instruction stream. Considering the performance hit taken if sequential consistency is maintained, especially in fast superscalar processors, sequentializing instructions are used only when absolutely necessary.
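As a concrete illustration of this last point, consider a driver that must write a device command before setting a "go" flag. On a PowerPC compiler that accepts inline assembly, a sketch might look like the following; the device addresses and register layout are invented for the example.

    #include <stdint.h>

    #define DEV_CMD ((volatile uint32_t *)0xF0000000)  /* hypothetical registers */
    #define DEV_GO  ((volatile uint32_t *)0xF0000004)

    void start_device(uint32_t cmd) {
        *DEV_CMD = cmd;                          /* must reach the device first */
        __asm__ volatile ("sync" ::: "memory");  /* PowerPC sequentializing op  */
        *DEV_GO = 1;                             /* ...before the go bit is set */
    }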
2.2 Memory Instruction Semantics

We are using commercial processors to implement a CCDSM system. Because these processors are designed for the uniprocessor or SMP setting, the semantics of some of their instructions are not necessarily "scalable". For example, the PowerPC 620 used in StarT-ng has a data cache-block flush (dcbf) instruction which flushes a specified address out of all processors' caches. It may be possible for the exact dcbf semantics to carry over to a parallel machine built from multiple indirectly connected buses, essentially broadcasting every flush to all processors, but it is certain that such a scheme would be inefficient. The basic reason why directory information is kept is so that invalidations do not have to be broadcast. The alternative is to limit dcbf to flushing only processors on the bus where the dcbf is issued. This semantics allows us to use the operation to effect the single-site invalidations needed in CCDSM. These and related issues require the designers of a CCDSM system to examine the semantics of the instruction set.

2.3 Memory Bus Implementation

Like processor cache design, memory buses have gone through refinements to improve performance. Early memory buses allowed bus transactions to occur one at a time: if a processor issued a read, the memory bus would block until that read was satisfied. Later memory buses allowed pipelining, which increased bus throughput. A pipelined bus, however, cannot tolerate memory operations that are satisfied out of order; if one operation takes longer to perform, all of the operations that follow it must wait until it finishes.

A more general approach is provided by a split-phase bus, which allows replies to return in an order different from that in which their requests were issued. This approach requires some way of associating a reply with its request, making implementations more complicated, but promises the best performance. Most processors of the future will have split-phase buses. As we shall see later in this paper, split-phase buses are needed in the CCDSM setting not only to improve performance, but also to avoid certain types of deadlocks.

It is important to note that when a processor issues a memory request on the bus, it can be retried by the memory system multiple times. The processor has to keep trying until the request is accepted. During this time, the processor has to perform snooping operations and react as though the request had not occurred. Once a request has been accepted by the memory system, however, the request is considered to be "committed" and the processor, in general, does not allow cancellation of the request. The need for such cancellations does not arise in SMPs. On a split-phase bus, the time between when a request is accepted and when the reply comes back opens up a window during which the semantics and interaction between new bus operations and the outstanding operation are not obvious. As an example, after a processor's Read x request has been accepted, how should the processor respond to an Invalidate x bus operation? Should it allow the Invalidate to complete because it does not yet have a copy of x in its cache, or should it prevent the Invalidate from completing until it gets data x back, and then remove x from its cache? Both positions make sense in the SMP setting. The choice, however, makes a very big difference to a CCDSM system, as we will see in Section 4.
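The bookkeeping a split-phase bus interface needs is essentially a small table of outstanding transactions keyed by a tag, so that a reply is matched by tag rather than by arrival order. The sketch below is illustrative only; the tag count and structure are assumptions, not details of the 620's actual bus interface.

    #include <stdint.h>
    #include <stdbool.h>

    #define NUM_TAGS 8   /* assumed limit on outstanding bus transactions */

    typedef struct {
        bool     busy;
        uint64_t addr;   /* address of the outstanding request */
    } BusTag;

    static BusTag tags[NUM_TAGS];

    /* Issue: allocate a tag; if none is free, the request must be retried. */
    int issue_request(uint64_t addr) {
        for (int t = 0; t < NUM_TAGS; t++)
            if (!tags[t].busy) {
                tags[t].busy = true;
                tags[t].addr = addr;
                return t;            /* the tag travels with the request */
            }
        return -1;                   /* retry later */
    }

    /* Reply: the tag, not arrival order, identifies the original request. */
    uint64_t complete_reply(int tag) {
        tags[tag].busy = false;
        return tags[tag].addr;
    }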
2.4 Buffers and Networks

Buffer management is important in a CCDSM system, as the amount of buffer space in the processor, network, and coherence engine for storing the state of in-progress requests is finite. If one is not careful in the design of the system, concurrent processing of multiple requests can lead to deadlocks due to insufficient buffers.

Another major concern for CCDSM is whether the network delivers messages in FIFO order. If the network is not FIFO, it is important for the coherence protocol to know that and account for it, or for some lower-level software to ensure that FIFO ordering is observed by the coherence protocol. Generally speaking, non-FIFO networks add significant complexity to cache coherence protocols.

3 The Design of StarT-ng

A block diagram of a StarT-ng site is shown in Figure 1. The commercial building block of StarT-ng (unshaded in the diagram) is an SMP built from PowerPC 620 processors. The processor card is modified to include a Network Interface Unit (NIU). An address capture device (ACD) is added to the coherent interconnect to help support CCDSM. When the machine is running in the mode that supports global cache coherence, a 620 from each SMP, which we call the service processor (sP), is dedicated to this task. The rest of the processors, which are used for computation, are called data processors (dP). Parallel programs running on StarT-ng can communicate through both message passing and shared memory, whether they are executing on one site or multiple sites.

Each 620 in StarT-ng is connected to an NIU through its L2 cache interface. The NIU is being designed and implemented by Motorola and its partners and was heavily influenced by the MSU functional unit of the 88110MP[16]. The NIUs provide access to a fat-tree network built out of Arctic[4] routing chips. Currently under development at MIT, the Arctic router is a 4x4 router that provides 200 megabytes/sec/link of bandwidth.

Figure 1: StarT-ng block diagram

3.1 ACD

The ACD has two basic functions: (i) to "capture" global bus transactions and pass them to the sP, and (ii) to allow the sP to return data to the requesting processors. Capturing means to recognize a global memory transaction, store away a complete copy of the transaction (including the address, transaction type, and bus tag to return a value if necessary), and respond accordingly to the processor that initiated that transaction. If the ACD buffers are full, i.e., have not been emptied by the sP, it forces the processor to retry the transaction. Because of limited ACD buffering, the sP should remove captured transactions as quickly as possible.

The ACD can also be instructed by the sP to selectively flow-control (retry) certain types of transactions. This is needed for buffer management because the ACD hardware buffers are backed by software buffers under the sP's control. The ACD also performs Invalidations (the FLUSH bus operation). It has hardware support that allows it to interleave a Flush operation with the other functions it performs. We will examine the reasons for having the ACD do the Flush bus operation in Section 4.5.

It is possible to build the sP into the ACD or to provide some dedicated path between the two. Either design requires substantially more hardware than our design, in which the sP communicates with and commands the ACD over the SMP bus. The ACD also has a one-bit line going to the sP's NIU, allowing the sP to poll its NIU for ACD events.

3.2 Supporting CCDSM with the ACD and sP

The address space in StarT-ng is divided into four parts: (i) dP local; (ii) dP global; (iii) sP local; and (iv) ACD command spaces. The dP local space is private to each site, directly accessible by load/store operations only from the dPs of that site. The dP global space is the address space supported by CCDSM. Bus operations to this address space are captured by the ACD. The global physical address that is seen by the ACD is not really a physical address but encodes information that will allow the sP to get to the real physical address of the cache-line. We call this address a virsical address. Part of the virsical address indicates the current home site, and the rest of the virsical address is a virtual address in the sP local address space of the home site. Finally, the ACD command space is used by the sP to communicate with the ACD.
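The virsical encoding can be pictured as a simple split of the captured address bits. The field widths below are assumptions made purely for illustration; the text above specifies only that one part names the home site and the rest is a virtual address in that site's sP local space.

    #include <stdint.h>

    #define SITE_BITS   8    /* assumed width of the home-site field */
    #define OFFSET_BITS 32   /* assumed width of the sP-local virtual address */

    typedef struct {
        unsigned home_site;  /* site whose sP owns the directory entry */
        uint64_t sp_vaddr;   /* virtual address in that sP's local space */
    } Virsical;

    Virsical decode_virsical(uint64_t captured_addr) {
        Virsical v;
        v.sp_vaddr  = captured_addr & (((uint64_t)1 << OFFSET_BITS) - 1);
        v.home_site = (unsigned)(captured_addr >> OFFSET_BITS)
                      & ((1u << SITE_BITS) - 1);
        return v;
    }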
The ACD and the sP support CCDSM as follows. When a processor accesses a global location and misses in its cache, the request propagates to the memory bus and is captured by the ACD. The ACD then signals the sP by setting a bit in its status word as well as asserting a line connected to the sP's NIU. The sP either polls the ACD directly or through the NIU. The sP then reads out the captured transactions, clearing the transactions from the buffers. The sP will inform the ACD of bus transactions to be retried, if necessary. We plan to implement an inclusive L3 cache in software for performance and correctness. The sP will take appropriate action to satisfy the request by checking its own L3 cache and sending messages to the home site if necessary. The sP will also serve as a protocol processor, receiving messages from other sites requesting cache-lines whose home is its site. It is important, therefore, for an sP to continue to serve incoming requests even though it has outstanding requests. When the response to the remote request returns, the sP updates whatever state it needs to, including L3 cache state, and returns the data through the ACD to the requesting processor.

3.3 Planned Features

We plan to try the following ideas in the implementation of CCDSM on StarT-ng.

Software L3 Cache: An essential idea is a large software L3 cache, implemented in main memory. The sP can implement an inclusive software cache which stores all remote data brought to the site. Any sort of caching strategy can be implemented at this level, including automatic prefetching, presending, and update protocols.

Split-phase Cache Accesses: During a split-phase cache access, a processor switches threads when an access to global memory misses in its cache. The difficult part is how to indicate that there was a cache miss. We plan to implement this idea by having every load of a global location checked against a "miss pattern" (a sketch follows this list). The miss pattern would be returned by the sP to the requesting processor after not finding the desired cache-line in its L3 cache. If a load returns the miss pattern, a cache miss is assumed and the current task is swapped out. Part of the swap-out would be a request to fetch the cache-line. A microthread descriptor (an instruction pointer/frame pointer pair), which would allow the task to restart, would be included in that request. The cache-line could be delivered to the requesting processor's sP for insertion into the L3 cache, while the desired value is sent directly to the requesting processor along with the microthread descriptor. When the desired value returns to the requesting processor, the thread wakes up and continues with the value it requested. If another processor requests the same cache-line in the near future, the sP can satisfy the request directly from the L3 cache.

Supporting Coherence from Page Level to Cache-line Level: During the time when a global page is accessed by only one site, "localizing" the page temporarily allows cache miss processing to bypass the ACD and sP, resulting in a lower cache miss penalty. "Localization" is achieved by incorporating Kai Li-style page-level coherence through the VMM. A global page can have its coherence managed either at the page level or at the cache-line level, with dynamic, and possibly automatic, switching between the two schemes based on access patterns.

Integrating I-structure Semantics into Cache Coherence Protocols: Our research language relies heavily on I-structures[3], data structures with synchronizing semantics. These are write-once, read-many-times memory locations that allow reads to be issued before the location is written. Since they never need to be invalidated, coherence protocols for these locations may become significantly simpler. We plan on investigating these and other custom protocols tailored for specific applications.
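To make the miss-pattern mechanism in the Split-phase Cache Accesses item concrete, the fragment below shows what a global load might look like on a dP. This is a sketch only: the pattern value, descriptor layout, and helper routines are all invented for illustration, not part of the planned implementation.

    #include <stdint.h>

    #define MISS_PATTERN 0xDEADBEEFDEADBEEFull  /* assumed reserved bit pattern */

    typedef struct {       /* microthread descriptor from Section 3.3 */
        void *ip;          /* instruction pointer to restart at */
        void *fp;          /* frame pointer of the suspended task */
    } MicroThread;

    extern void request_cache_line(uint64_t addr, MicroThread mt); /* hypothetical */
    extern void switch_to_next_thread(void);                       /* hypothetical */

    uint64_t global_load(volatile uint64_t *addr, MicroThread self) {
        uint64_t v = *addr;          /* the sP returns MISS_PATTERN on an L3 miss */
        if (v == MISS_PATTERN) {
            request_cache_line((uint64_t)addr, self); /* fetch + restart info */
            switch_to_next_thread();                  /* swap the task out    */
            /* on wake-up, execution resumes with the fetched value */
        }
        return v;
    }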
3.4 Penalty of a Cache Miss

Since StarT-ng's support for CCDSM is mostly in software, a cache miss on global data incurs a significant penalty. The penalties of cache misses are shown in Figure 2. The times are given in processor cycles and are our current best estimates. The corresponding numbers for FLASH, as reported in [6], are given for comparison. To the first order, the global memory miss penalties for StarT-ng are roughly three times those of FLASH, with the penalty for the case of global memory with a "local" home and Clean directory state approaching four times that of FLASH. Our miss penalty for the local address space appears to be lower than FLASH's partly because our machine will run at a slower clock rate, making the memory latency a smaller number of processor cycles.

Though our miss penalties are significantly higher than those of FLASH, it is important to remember that we only pay these penalties when we miss in the L1 and L2 caches. We can express the relative efficiency of a particular implementation by the average clocks-per-instruction (CPI), which is computed using the following equations:

    CPI_global-memory-ops = CPI_memory-op-cache-hit + miss-rate x miss-penalty

    CPI_average = %non-global-ops x CPI_non-global-ops
                + %global-memory-ops x CPI_global-memory-ops

A low miss rate and/or a low percentage of global memory operations will reduce the impact of StarT-ng's disadvantage. A recent FLASH paper[6], which measured the performance of FLASH using a simulator, gives some idea of the kind of miss rates that one may expect. For the seven programs (mostly SPLASH programs) that they ran on the simulator, three have miss rates below 0.1%, three have miss rates between 0.1% and 1.0%, and only one had the rather high miss rate of 6.0%. A quick back-of-the-envelope calculation of the average CPI, using some reasonable estimated numbers(5) for the missing components of the average CPI formula, shows that for a miss rate of under 0.1%, the difference in miss penalty between StarT-ng and FLASH does not make any significant difference. At a 1.0% miss rate, the difference becomes noticeable, with FLASH about 50% faster. Around the 1% miss rate, however, the exact miss rate that each machine encounters will have a greater impact on performance. StarT-ng should encounter lower miss rates than FLASH: each site is an SMP, so more data has a local home; our caches will potentially be bigger; and the software L3 cache should increase the sharing of remote data within a site. Miss rates in the range of 6% are not of great interest because both machines would suffer from such high "parallel slowdown" that neither could handle the program very efficiently.

Thus, for programs that run well on an aggressive implementation like FLASH, StarT-ng has the potential of running reasonably. Clearly "embarrassingly local" programs will run equally well on either machine. Programs that do not run as well on StarT-ng as on FLASH do exist, but the range of such programs is not unlimited. It will be interesting to see just what kind of performance we can get out of StarT-ng's CCDSM.

________________________________
(5) We used these estimated numbers: CPI for other ops is 0.8; cache hit CPI = 0.8; % global memory ops = 10%; miss penalty = 800 for StarT-ng and 266 for FLASH.
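For concreteness, the back-of-the-envelope calculation can be reproduced directly from the formulas above and the estimates of footnote 5. The short program below prints the average CPI for both machines at the two interesting miss rates; it adds nothing beyond the stated numbers.

    #include <stdio.h>

    int main(void) {
        double cpi_hit = 0.8, cpi_other = 0.8;   /* footnote 5 estimates */
        double frac_global = 0.10;               /* 10% global memory ops */
        double penalty_ng = 800.0, penalty_flash = 266.0;
        double rates[] = { 0.001, 0.01 };        /* 0.1% and 1.0% miss rates */

        for (int i = 0; i < 2; i++) {
            double ng    = (1 - frac_global) * cpi_other
                         + frac_global * (cpi_hit + rates[i] * penalty_ng);
            double flash = (1 - frac_global) * cpi_other
                         + frac_global * (cpi_hit + rates[i] * penalty_flash);
            printf("miss rate %.1f%%: StarT-ng CPI %.3f, FLASH CPI %.3f\n",
                   rates[i] * 100, ng, flash);
        }
        return 0;  /* at 1.0%: 1.60 vs 1.07, i.e. FLASH about 50% faster */
    }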
4 Issues in Designing CCDSM

In designing StarT-ng, we encountered a surprisingly large number of complications from using a processor in the SMP as the sP. The exact nature of these problems depends on the microprocessor bus interface and the design of the snoopy bus itself. Other problems deal with buffer management and allocation. In the following subsections, we outline the major problems we encountered and their solutions. We believe that these problems, or some solutions to them, are encountered in most CCDSM implementations today.

Before we discuss these examples, it should be noted that cache coherence protocols tend to be quite complex in their actual implementations. That complexity is a compelling reason for doing them in software instead of hardwiring them. In this paper, we are not presenting any specific protocols. However, what the following examples will illustrate is that without some cooperation from the underlying hardware, implementing CCDSM will be either impossible, or so inefficient as to be practically uninteresting. The following examples will not describe any complete protocols, but will just show protocol fragments to illustrate the problems encountered. Except for Section 4.7, all the other sections assume a FIFO network.

4.1 Need for a Split-phase Bus in an SMP

A blocking bus, which locks up the bus until an operation is complete, can cause deadlocks in a CCDSM system. For example, consider processors PA and PB at sites A and B respectively.

1. PA has x in Modified state and now wants to read y. The Read y request goes on the bus and is accepted.

2. At the same time, PB, which has y in Modified state, wants to read x. It issues a Read x request, which is accepted on its bus.

3. Neither can proceed any further because each request needs to get the cache-line from the other bus.

This problem is due to concurrent events that have no immediate knowledge of each other. A snoopy bus protocol used in a CCDSM system should therefore be split-phase, allowing processors to snoop the bus and service bus-initiated operations while waiting for their own requests to be fulfilled. It is true that as long as it is possible to retry bus operations, it is possible to implement CCDSM using either a blocking or a pipelined snoopy bus protocol. However, the implementation complexity and the performance penalty make either an unreasonable engineering choice. This was the main reason for us to select an SMP based on the PowerPC 620 rather than the PowerPC 601.

4.2 Interaction of Invalidates and Outstanding Reads

Suppose a processor retries an Invalidation to x if there is an outstanding bus operation to x. This processor could deadlock a CCDSM implementation that uses it. For an example of the problem, consider the simple scenario, also shown in Figure 3, where an Invalidate x request arrives at A when there is an outstanding Read x.

1. PA has x in Shared state but replaces it in its cache.

2. PB wants to write x. An Invalidation is sent to A.

3. PA reads x again.

4. PA receives the Invalidation.
If the processor does not allow the Invalidation to complete until the Read completes, the Invalidation of x will deadlock: the Invalidation waits for the outstanding Read x to complete, which in turn depends on the Invalidation completing through B's Write x. In this case, the CCDSM implementation must track outstanding operations, intercept invalidation requests, and send back a message indicating that an invalidation to a cache-line accessed by outstanding operations has completed, without putting the invalidation onto the bus. Skipping the actual invalidation is legal, since we know that the cache-line does not exist at the requesting site.

This problem is solved if the snoopy bus protocol allows an Invalidate x bus operation to complete even if a processor on the bus has an outstanding Read x. When the read cache-line returns, the read request is satisfied.

4.3 Complications from Multi-processor Sites

Multiple processors at a site create problem situations in CCDSM systems that an inclusive cache shared by all of a site's processors would eliminate. A variation of the example in Section 4.2, with two processors PA1 and PA2 at site A, explains why.

1. PA1 reads x; B is the home site of x. PA1 obtains x and caches it.

2. Processor PB tries to write x. The directory indicates site A has a copy. An invalidation is sent to site A.

3. Meanwhile, PA2 tries to read x, misses in its cache, and sends a Read x request to the home site.

4. When the Invalidate x request arrives at A, there is an outstanding read request to cache-line x.

Unlike the example in Section 4.2, there is a cached copy of x in PA1 which must be invalidated by an invalidation bus operation. If the snoopy bus protocol allows the invalidation to complete despite PA2's outstanding operation, there is no problem. Otherwise, an inclusive shared cache avoids the problem altogether by eliminating the situation where there is an outstanding Read x from a site while x is in some cache of that site. An extra cache is not necessary if the snoopy bus protocol supports cache-to-cache transfer of all "external" data. Cache-to-cache transfer when a Read x occurs while x is in the Modified state in some processor's cache is currently supported on many snoopy buses. Extending it to happen when x is in the Shared state will allow all the caches to function as a combined per-site cache.
4.4 Need to Separate Buffers by Functionality

Assuming our protocol is logically correct, another source of deadlock is a finite number of buffers. The sharing of hardware resources can introduce dependencies between operations that are logically unrelated. No system requires unbounded buffering; it is possible to estimate the number of buffers needed given the size of the machine and the protocol. This limit may even be acceptable in certain cases. The limit may not be satisfactory, however, if the buffer requirement is directly proportional to the number of processors per site or the number of sites per machine. We will show that unless we can bound the buffer sizes acceptably, we must functionally separate the buffers.

Consider an implementation with four types of buffers: (i) Capture Buffers, used for storing captured requests until they can be forwarded to the home site; (ii) Home Request Buffers, which hold requests at a home site until they are processed; (iii) Invalidation Buffers, which hold invalidation requests until they complete; and (iv) Reply Buffers, which allow the protocol to return a cache-line to the requester. Our example will deal with three sites, A, B, and C, with processors PA and PB on sites A and B respectively. Assume each site has two Capture Buffers, four Home Request Buffers, one Invalidation Buffer, and one Reply Buffer.

1. PA writes cache-lines x1, x2, ..., x10 (all of whose home is site C), caching them in the Modified state.

2. Similarly, PB writes cache-lines y1, y2, ..., y10 (whose home is also site C), caching them in the Modified state in PB.

3. PA and PB "exchange" the data that they work on, i.e., PA reads cache-lines y1, y2, ..., y10, while PB reads cache-lines x1, x2, ..., x10. If the bus is split-phase or if there are many processors per site, many requests can be outstanding from sites A and B.

4. All the read requests are buffered in Capture Buffers at the respective requesting sites, before being buffered in site C's Home Request Buffers. Invalidation requests are issued to sites A and B.

5. When the invalidation requests, Inval x1 and Inval y1, arrive at sites A and B respectively, invalidation bus operations are performed. However, in order for the Modified data to be invalidated and written back, there must be Capture Buffers to accept it. Unfortunately, all the Capture Buffers have been used up by outstanding read requests, so the invalidations cannot complete.

6. Worse yet, no Capture Buffers will free up at A or B until there is space in the Home Buffers of C. But the Home Buffers in C are full, and will not free up until the invalidations are done. A deadlock has now occurred.

This situation is illustrated by Figure 4, where the arcs indicate dependence. The above deadlock scenario arises purely from the constraints of finite buffering. If the home site C or the Capture Buffers had infinite buffering, this problem would not occur. But infinite buffering is not a realistic solution unless one can bound the maximum amount of buffering required, which is sometimes feasible but neither elegant nor scalable.

This deadlock is easy to see with a static dependence graph, which, for the example, is shown in Figure 5(6). Each node of this graph is a pool of hardware resources used by the CCDSM implementation in a blocking fashion. An arc, such as the one from the Capture Buffers node to the Home Buffers node, indicates that a Capture Buffer in use is not released until a Home Buffer is obtained. If this graph does not contain any cycles, no deadlock due to sharing of hardware resources can occur dynamically. Unfortunately, a cycle exists in the graph for the example. This indicates that deadlock can occur if too many concurrent events are outstanding(7).

A solution for the example requires not only separating write-back capture buffers from read capture buffers, but also separating the home read request buffers from the home write-back request buffers. The deadlock-free dependence graph is shown in Figure 6. Intuitively, this means that write-backs of cache-lines have a dedicated path to memory which read requests cannot block. Reads are only blocked as long as write-backs are blocked, so avoiding write-back blocking avoids deadlock altogether. Splitting buffers into separate categories requires us to be able to retry bus accesses that use certain buffers that are full while allowing others to proceed. This selective retry capability is provided in our ACD.

________________________________
(6) We have developed this technique for illuminating deadlock possibilities. It is very similar to techniques used to show networks are deadlock free. An article will be written on the technique in the near future.
(7) It is possible for a particular implementation with cycles to avoid deadlock by constraining the number of concurrent actions.
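The fix amounts to giving write-backs their own pools, with the ACD's selective retry applied per pool. The sketch below shows the split that the deadlock-free graph of Figure 6 calls for; the structure names are ours, and pool sizes are assumed to be set at startup.

    #include <stdbool.h>

    typedef struct { int free; int total; } Pool;

    /* Separate pools break the Figure 5 cycle: write-backs can always
       drain even when every read buffer is occupied. */
    struct {
        Pool read_capture;        /* captured read requests           */
        Pool writeback_capture;   /* captured write-backs (dedicated) */
        Pool home_read;           /* read requests queued at home     */
        Pool home_writeback;      /* write-backs queued at home       */
    } bufs;

    /* Selective retry, as supported by the ACD: retry reads when their
       pool is full, but keep accepting write-backs from their own pool. */
    bool accept_bus_op(bool is_writeback) {
        Pool *p = is_writeback ? &bufs.writeback_capture : &bufs.read_capture;
        if (p->free == 0) return false;  /* ACD retries this transaction */
        p->free--;
        return true;
    }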
4.5 Why Our sP Cannot Issue Invalidates

Blocking shared resources include more than just buffers. In StarT-ng, the sP is capable of executing invalidations (using a Flush command). Unfortunately, a Flush is not allowed to complete until any write-back it triggers completes. An sP performing a Flush operation can thus be blocked. This would make the sP a blocking resource, leading to deadlocks because of the sP's other roles.

This problem can be seen with the static dependence graph of Figure 7. Because the sP is used in a blocking fashion to perform the invalidation, it has to be modeled in the graph. The sP is part of almost every step of the processing, creating all the arcs entering it. Using the sP to perform invalidations produces the arc to the write-back capture buffers node. That arc produces cycles, indicating possible deadlocks.

This problem is solved by providing extra hardware that performs the invalidation, giving rise to the static dependence graph shown in Figure 8, which has no cycles. In StarT-ng, we have chosen to implement this within the ACD hardware, though it functions independently of all the other functionality of the ACD.

4.6 Difficulty in Sharing Cacheable Data between dPs and sP

The snoopy bus protocol should allow cache-to-cache transfer of Modified data to a processor performing a Read without requiring a write-back to main memory. Without this capability, the sP cannot share cacheable, writable data with the dPs. The sP cannot read data that is Modified in a dP because it will block until the dP is able to push the Modified data out with a write-back. Although this write-back is to local memory, it may be waiting behind write-backs that are going to global memory. Hence, the write-back may be blocked because all the write-back capture buffer resources are in use, and those can only be freed by the sP. In the static dependence graph, sharing of writable, cacheable data between the sP and the dPs introduces an arc from the sP to the write-back capture buffers resource, resulting in a cycle. Thus, data sharing between the sP and the dPs must be through uncached memory.

It is important to note that since operating system data structures on SMPs are usually cacheable and writable by any processor, this prohibition on sharing cached, writable memory between the dPs and the sP means that the sP cannot use the same operating system services used by the dPs. For this reason, the sP will in all likelihood have to run firmware code without virtual memory, although it is conceivable that the sP could run a simple operating system of its own that does not share any resources with the main operating system.

4.7 Complications due to a Non-FIFO Network

A non-FIFO network introduces considerable complexity since messages can get arbitrarily out of order. Although the protocol can use acknowledgements to serialize its actions and hence avoid the effects of a non-FIFO network, such a solution is unattractive because of the additional overhead and latency. Better solutions use more states in the protocol to handle possible out-of-order situations, sending extra messages only when necessary. Keeping track of outstanding operations, which was required in previous examples, is also necessary here. Let us consider the scenario below, also shown in Figure 9, which is related to the example in Section 4.2.

1. PA sends a Read x to site B, the home of x.

2. B sends x back.

3. B writes x and sends A an Invalidation.

4. The Invalidation arrives before x!!!
5. x arrives.

When the arrivals of Data x and Invalidate x at A are reversed like this, it is not correct to process them in their arrival order, since the home site might then believe that site A does not have a copy of the cache-line. Simply having the invalidation wait while there is an outstanding operation will potentially deadlock, since it is possible that the Invalidate really should come before the Data (the example in Section 4.2) and the Data won't come until the Invalidate completes.

If there is an outstanding operation to the same cache-line on site A, an invalidate can wait at site A. The home site B, upon receiving a Read x request from a site to which it has already sent an invalidation, must reply with a Retry Read x. If this retry gets back to site A and finds a pending invalidation, the invalidation can then be allowed to complete without being put on the bus. If a cache-line returns to find a waiting invalidation, the cache-line can be returned to satisfy the outstanding request, and then the invalidation can be put on the bus. Proving that this solution works is involved because other messages can also overtake each other; for example, the Retry Read x can overtake the Invalidate x. Another document[18] describes the cache coherence protocols we intend to implement on StarT-ng and proofs of their logical correctness.
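The protocol fragment just described can be summarized as three handlers, one per incoming message type. The message names mirror the scenario, and all helpers are hypothetical stand-ins; this is a sketch of the idea, not the proved protocols of [18].

    #include <stdint.h>
    #include <stdbool.h>

    extern bool pending_invalidate(uint64_t line);   /* held at this site? */
    extern void set_pending_invalidate(uint64_t line, bool v);
    extern bool has_outstanding_read(uint64_t line);
    extern void put_invalidate_on_bus(uint64_t line);
    extern void deliver_data_to_requester(uint64_t line);
    extern void ack_invalidation_to_home(uint64_t line);

    /* Invalidate x arrives while Read x is outstanding: hold it. */
    void on_invalidate(uint64_t line) {
        if (has_outstanding_read(line)) set_pending_invalidate(line, true);
        else put_invalidate_on_bus(line);
    }

    /* Data x finally arrives: satisfy the read, then let the held
       invalidation proceed onto the bus. */
    void on_data(uint64_t line) {
        deliver_data_to_requester(line);
        if (pending_invalidate(line)) {
            set_pending_invalidate(line, false);
            put_invalidate_on_bus(line);
        }
    }

    /* Retry Read x from home finds a held invalidation: the site has no
       copy, so complete the invalidation without a bus operation. */
    void on_retry_read(uint64_t line) {
        if (pending_invalidate(line)) {
            set_pending_invalidate(line, false);
            ack_invalidation_to_home(line);
        }
        /* ...then reissue the Read x */
    }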
5 Conclusion

The design of a CCDSM system using highly integrated commercial microprocessors presents many challenges. Because the snoopy bus protocol of the processor forms the substrate on which CCDSM protocols are implemented, the exact semantics and detailed implementation of the snoopy bus can have a profound influence on the design. This paper explores the challenges facing a designer of a CCDSM system built from commercial SMPs which uses a processor from each SMP as the sP and a minimal piece of custom hardware. Some of the problems we faced were particular to our system, while others were general and apply to all CCDSM systems. Our particular problems can be attributed to the processors used, the snoopy bus protocol used, and the network used. Many of the problems are due to oversights in the processor/bus design and could be solved by modifying the processor or switching processors. Luckily, there are software solutions to all the problems we discovered. Other problems, notably the buffering problem, require hardware to be designed in a particular way regardless of what the software does.

The performance of coherent shared memory on this machine, like that of all CCDSM machines, depends mostly on the global memory miss ratios. Though our global cache miss penalties are a factor of three away from those of more aggressive CCDSM machines, cache misses generally account for a very small fraction of the entire instruction stream, giving us competitive performance for many programs. Considering the cost of implementation and the experimental nature of coherent caches on our machine, our expected performance is surprisingly good.

In the following, we summarize the problems and solutions presented in this paper.

Snoopy Bus Protocols: In order to facilitate CCDSM implementations, snoopy bus protocols should have the following properties.

- They should not block invalidations because of outstanding operations.

- They should support cache-to-cache transfer of Shared as well as Modified data. Cache-to-cache transfer of Shared data can be avoided if some form of L3 cache is provided. Depending on the implementation, either can perform better than the other.

- They should support a Shared Modified state, which allows cache-to-cache transfers for Reads of modified cache-lines without write-backs to memory.

Non-FIFO Networks: If the network is non-FIFO, the CCDSM protocol should track outstanding operations and implement a protocol which handles the non-FIFO network correctly and efficiently.

Buffering and Blocking Hardware: Buffering and blocking hardware should be divided in a manner that avoids deadlocks. Using dependence graphs, it is easy to see where deadlocks may occur. Then, either by splitting up groups of buffers or by preallocating buffers when required, deadlocks can be avoided.
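The dependence-graph check recommended above is, mechanically, an ordinary cycle detection. A sketch follows, with the graph encoded as an adjacency matrix over resource pools; the encoding is our illustration of the technique from Section 4.4.

    #include <stdbool.h>

    #define MAX_NODES 8   /* one node per blocking resource pool */

    /* edge[i][j] = true: a resource of pool i is held until one of pool j
       is obtained (e.g., Capture Buffers -> Home Buffers in Figure 5). */
    static bool edge[MAX_NODES][MAX_NODES];

    static bool visiting[MAX_NODES], done[MAX_NODES];

    /* Depth-first search: a back edge to a node on the current path is a
       cycle, i.e. a potential deadlock in the buffer/blocking structure. */
    static bool has_cycle_from(int u, int n) {
        visiting[u] = true;
        for (int v = 0; v < n; v++) {
            if (!edge[u][v]) continue;
            if (visiting[v]) return true;
            if (!done[v] && has_cycle_from(v, n)) return true;
        }
        visiting[u] = false;
        done[u] = true;
        return false;
    }

    bool deadlock_possible(int n) {
        for (int u = 0; u < n; u++)
            if (!done[u] && has_cycle_from(u, n)) return true;
        return false;
    }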