6.375 Complex Digital Systems

The most difficult part of any digital system design is having a complete and accurate specification of the intended functionality. We are providing two standard projects (an SMIPS microprocessor and a non-blocking memory system) with associated testbenches that you can choose for your class project. For these two standard projects, there are a number of directions you can choose to take. Information on non-standard projects is listed at the bottom of this page.

SMIPS Microprocessor

The SMIPS microprocessor is based on the infrastructure you used in the labs. We will compare your different designs for performance, power, and area. Here are some example ideas you might consider for your course project:

Superscalar Processor
You can try to implement a 2-4 way issue superscalar processor. An in-order superscalar with branch prediction should be possible within the time constraints of the course. A version of Tomasulo's algorithm for out-of-order execution with register renaming through reservation stations is another possible option. Schemes for unified physical register files and separated instruction window and reorder buffer are probably too complex for the time available. See 6.823 lecture notes for more information.
Superpipelined Processor
This project would focus on producing the highest performance SMIPS processor through aggressive pipelining. Extreme pipelining will introduce lots of pipeline hazards, and you will want to investigate microarchitectural approaches to reduce the impact of these pipeline hazards including some form of branch prediction, and possibly some form of load/store queue.
SMIPS DSP Extensions
Use the SMIPS coprocessor interface to add a DSP accelerator to a basic SMIPS processor. You will need to extend the SMIPS ISA and write appropriate test/benchmark codes. Compare the area, power, and delay with respect to the baseline SMIPS processor.
Decoupled Access-Execute Processor
Decoupled architectures provide many of the same advantages as register-renamed out-of-order machines but with a much simpler microarchitecture. There is some resurgence of interest in these simpler high-performance architectures given the power problems of more complex machines. You can try to build an SMIPS two-issue access-execute decoupled architecture similar to the Astronautics ZS-1. See J. Smith, "Dynamic Instruction Scheduling and the Astronautics ZS-1", IEEE Computer, Jul 1989, 22(7), pp 21-35. To find other citations on access-execute architectures, see M. Sung, R. Krashinsky, and K. Asanovic, "Multithreading Decoupled Architectures for Complexity-Effective General Purpose Computing", Workshop on Memory Access Decoupled Architectures, Sep 2001.
Runahead Processor
A relatively new simple architecture to hide memory latency is the runahead processor. When a processor hits a cache miss, rather than stalling, the processor keeps running ahead speculatively using a duplicate set of registers to try to prefetch subsequent cache misses early. Once the initial cache miss returns, the processor swaps back to the architectural registers and keeps running. J. Dundas and T. Mudge, "Improving data cache performance by pre-executing instructions under a cache miss", Int'l Conf. on Supercomputing, Jul 1997, pp 68-75.
Multithreaded SMIPS
Implement an SMIPS processor that interleaves the execution of multiple threads in hardware. You can experiment with cores supporting 2-8 threads, and can try to implement either fine-grain multithreading where threads switch every cycle, or coarse-grain multithreading where threads only switch on a cache miss.
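The difference between the two thread-selection policies can be sketched in a few lines of Python. This is only an illustrative model, not part of the lab infrastructure: the function names and the `miss_cycles` input (the set of cycles on which the running thread takes a cache miss) are assumptions made for the example, and thread-switch penalties are ignored.

```python
def fine_grain_schedule(num_threads, num_cycles):
    """Return the thread id issued on each cycle, rotating every cycle."""
    return [cycle % num_threads for cycle in range(num_cycles)]


def coarse_grain_schedule(num_threads, miss_cycles, num_cycles):
    """Return the thread id issued on each cycle, switching only after a miss.

    miss_cycles is a hypothetical set of cycle numbers on which the running
    thread misses in the cache; the next thread starts on the following cycle.
    """
    schedule = []
    current = 0
    for cycle in range(num_cycles):
        schedule.append(current)
        if cycle in miss_cycles:
            # Switch-on-miss: hand the pipeline to the next thread.
            current = (current + 1) % num_threads
    return schedule


fine = fine_grain_schedule(2, 4)
coarse = coarse_grain_schedule(2, {2}, 6)
```

With two threads, the fine-grain schedule alternates every cycle, while the coarse-grain schedule stays on thread 0 until the miss at cycle 2 and then runs thread 1.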
Significance Compression
Some authors have proposed saving power by only activating the bit positions within a datapath that are actually required to compute a result. For example, when you add 1 to a register, in most cases only a few bits change. The following paper describes this idea and suggests a few alternative implementations. R. Canal, A. Gonzalez, and J. Smith, "Very low power pipelines using significance compression", Int'l Symp. on Microarchitecture, Dec 2000, pp 181-190.
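One way to get a feel for how much of the datapath a value needs is to count the low-order bytes that carry information, treating the upper bytes as pure sign extension. The sketch below assumes byte granularity and a 32-bit two's-complement datapath; both are assumptions for illustration, and the helper name `significant_bytes` is hypothetical rather than taken from the paper.

```python
def significant_bytes(value, width_bytes=4):
    """Count the low-order bytes needed to represent a two's-complement value.

    The remaining upper bytes are pure sign extension of those low bytes, so
    a datapath could in principle leave them inactive.
    """
    full_mask = (1 << (8 * width_bytes)) - 1
    value &= full_mask
    for n in range(1, width_bytes):
        low_mask = (1 << (8 * n)) - 1
        low = value & low_mask
        sign = (low >> (8 * n - 1)) & 1
        # Rebuild the full-width value by sign-extending the low n bytes.
        upper = (full_mask & ~low_mask) if sign else 0
        if (low | upper) == value:
            return n
    return width_bytes


small = significant_bytes(1)
minus_one = significant_bytes(-1)
needs_two = significant_bytes(0x80)
needs_all = significant_bytes(0x12345678)
```

Small positive and negative values need a single byte, 0x80 needs two bytes because its top bit would otherwise be read as a sign bit, and a value such as 0x12345678 needs all four.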
Operand Network for Multicore SMIPS
Implement a multicore SMIPS processor with two to four cores. Consider using incoherent local memories and focus instead on tightly integrating core-to-core communication to enable very low-latency operand passing. Experiment with area, power, and delay tradeoffs in the operand network routers. See M. Taylor, W. Lee, S. Amarasinghe, and A. Agarwal, "Scalar Operand Networks", IEEE Trans. on Parallel and Distributed Systems, Feb 2005, and M. Taylor, et al., "Evaluation of the Raw Microprocessor: An Exposed-Wire-Delay Architecture for ILP and Streams", Int'l Symp. on Computer Architecture, Jun 2004.
Heads-and-Tails Variable-Length Instruction Encoding
The heads-and-tails format is a way to pack variable-length instructions in memory such that it is still easy to fetch and decode instructions in a pipelined or parallel fashion. Note, this project will require a different binary format than the regular SMIPS. H. Pan and K. Asanovic, "Heads and Tails: A Variable-Length Instruction Format Supporting Parallel Fetch and Decode", Int'l Conf. on Compilers, Architecture, and Synthesis for Embedded Systems, Nov 2001.

Non-blocking Memory System

The second standard project is a non-blocking memory system. The test rig will supply a stream of memory requests, each tagged with an identifier, and the memory system can return the responses to these requests out of order (this is also the interface the SMIPS project uses to talk to memory). Your memory system will include some form of cache and will drive a DRAM main memory.
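The tagged request/response idea can be illustrated with a toy software model. This is a sketch under stated assumptions, not the course test rig: the `MemReq`/`MemResp` records, the `ToyNonBlockingMemory` class, and the fixed hit and miss latencies are all hypothetical names and numbers chosen for the example.

```python
import heapq
from collections import namedtuple

MemReq = namedtuple("MemReq", ["tag", "addr"])
MemResp = namedtuple("MemResp", ["tag", "data"])


class ToyNonBlockingMemory:
    """Toy model: hits complete after hit_latency cycles, misses after miss_latency."""

    def __init__(self, contents, cached_addrs, hit_latency=1, miss_latency=5):
        self.contents = contents              # addr -> data
        self.cached_addrs = set(cached_addrs)
        self.hit_latency = hit_latency
        self.miss_latency = miss_latency
        self.pending = []                     # min-heap of (ready_cycle, seq, MemResp)
        self.seq = 0                          # tie-breaker for equal ready cycles

    def request(self, req, now):
        """Accept a request at cycle `now` without blocking later requests."""
        hit = req.addr in self.cached_addrs
        latency = self.hit_latency if hit else self.miss_latency
        resp = MemResp(req.tag, self.contents[req.addr])
        heapq.heappush(self.pending, (now + latency, self.seq, resp))
        self.seq += 1

    def drain(self):
        """Return all responses in the order they would complete."""
        order = []
        while self.pending:
            _, _, resp = heapq.heappop(self.pending)
            order.append(resp)
        return order


mem = ToyNonBlockingMemory({0x10: 111, 0x20: 222}, cached_addrs=[0x20])
mem.request(MemReq(tag=0, addr=0x10), now=0)   # miss, ready at cycle 5
mem.request(MemReq(tag=1, addr=0x20), now=1)   # hit, ready at cycle 2
responses = mem.drain()
```

Here the later request (tag 1) hits and completes before the earlier miss (tag 0), so the requester has to use the tag, not arrival order, to match each response to its request.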

NUCA L2 Cache
Non-Uniform Cache Architectures (NUCA) seek to exploit the difference between best-case and worst-case access latencies in large, heavily banked L2 caches. By embedding a network in the L2 cache, NUCA caches can return data much faster if it is located in a bank close to the processor. Various schemes can be used to exploit this intra-cache locality. See C. Kim, D. Burger, and S. Keckler, "An Adaptive, Non-Uniform Cache Structure for Wire-Dominated On-Chip Caches", Int'l Conf. on Architectural Support for Programming Languages and Operating Systems, Oct 2002.
Memory System for Multicore SMIPS
Implement a multicore SMIPS processor with two to four cores. Your multicore processor might include a shared L2 with coherent L1 caches. Alternatively, you might experiment with a tiled CMP approach and investigate different on-chip network implementations. For one approach to flexible shared L2 caches in multicore processors, see M. Zhang and K. Asanovic, "Victim Replication: Maximizing Capacity while Hiding Wire Delay in Tiled CMPs", Int'l Symp. on Computer Architecture, Jun 2005.
Prefetching
You can try implementing a hardware prefetcher to bring values into the cache before the processor requests them. Stream buffers are one such technique: they prefetch successive cache lines after a miss, and can be extended to detect the stride of regular accesses. See N. Jouppi, "Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers", Int'l Symp. on Computer Architecture, Jun 1990, pp 364-373, and S. Palacharla, R. Kessler, "Evaluating Stream Buffers as a Secondary Cache Replacement", Int'l Symp. on Computer Architecture, Apr 1994, pp 24-33.
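A minimal sketch of a single sequential stream buffer follows, assuming unit-stride (next-line) prefetching, a FIFO that compares only its head entry, and line-granularity addresses. The class and method names are hypothetical, and a real design would also model prefetch latency and multiple buffers.

```python
from collections import deque


class StreamBuffer:
    """Single FIFO stream buffer that prefetches successive cache lines."""

    def __init__(self, depth=4):
        self.depth = depth
        self.fifo = deque()      # line addresses currently prefetched
        self.next_line = None    # next line address to prefetch

    def access_on_cache_miss(self, line):
        """Handle a cache miss to `line`; return True if the buffer supplies it."""
        if self.fifo and self.fifo[0] == line:
            # Head hit: hand the line to the cache and prefetch one more.
            self.fifo.popleft()
            self.fifo.append(self.next_line)
            self.next_line += 1
            return True
        # Buffer miss: flush and restart the stream after the missing line.
        self.fifo = deque(range(line + 1, line + 1 + self.depth))
        self.next_line = line + 1 + self.depth
        return False


sb = StreamBuffer(depth=4)
hits = [sb.access_on_cache_miss(line) for line in [100, 101, 102, 200, 201]]
```

In this trace the first miss to line 100 starts a stream, lines 101 and 102 are then supplied by the buffer, the jump to line 200 flushes and restarts it, and line 201 hits again.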
Memory Access Scheduler
Your project will implement a memory access scheduler that reorders memory requests to improve DRAM performance by performing all outstanding accesses to each DRAM row before moving to the next row. S. Rixner, W. Dally, U. Kapasi, P. Mattson, J. Owens, "Memory access scheduling", Int'l Symp. on Computer Architecture, Jun 2000, pp 128-138.
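The row-first reordering described above can be sketched as a simple software model. This is an illustration under simplifying assumptions, not the policy from the cited paper: it assumes a single bank, a hypothetical 1024-byte row, and no ordering constraints between reads and writes to the same address.

```python
def row_of(addr, row_bytes=1024):
    """Map a byte address to its DRAM row (assumed single bank)."""
    return addr // row_bytes


def schedule_row_first(pending, row_bytes=1024):
    """Reorder addresses so each row's requests are issued back to back.

    The oldest outstanding request chooses the next row to open; arrival
    order is preserved among requests to the same row.
    """
    remaining = list(pending)
    ordered = []
    while remaining:
        open_row = row_of(remaining[0], row_bytes)
        ordered += [a for a in remaining if row_of(a, row_bytes) == open_row]
        remaining = [a for a in remaining if row_of(a, row_bytes) != open_row]
    return ordered


def count_row_activations(addrs, row_bytes=1024):
    """Count how many times the open row changes while servicing addrs in order."""
    activations = 0
    open_row = None
    for addr in addrs:
        row = row_of(addr, row_bytes)
        if row != open_row:
            activations += 1
            open_row = row
    return activations


pending = [0, 2048, 8, 2056, 16]
reordered = schedule_row_first(pending)
before = count_row_activations(pending)
after = count_row_activations(reordered)
```

For this request stream, in-order service alternates between two rows and opens a row five times, while the reordered stream opens each row once.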
Compressed Memory Systems
This project would implement a compressed memory system, where cache lines are decompressed when loaded into the cache and compressed again when evicted to main memory. R. Tremaine, T. Smith, M. Wazlowski, D. Har, K. Mak, S. Arramreddy, "Pinnacle: IBM MXT in a Memory Controller Chip", IEEE Micro, Mar/Apr 2001 (Vol. 21, No. 2).
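The compress-on-evict, decompress-on-fill flow can be modeled in software as a rough sketch. Python's `zlib` is used here purely as a stand-in for whatever compressor the hardware would implement, and the `CompressedMainMemory` class, its method names, and the 64-byte line size are assumptions for illustration.

```python
import zlib


class CompressedMainMemory:
    """Toy main memory that stores each cache line in compressed form."""

    def __init__(self):
        self.store = {}   # line address -> compressed bytes

    def evict(self, line_addr, data):
        """Compress a line as it is written back from the cache."""
        self.store[line_addr] = zlib.compress(data)

    def fill(self, line_addr):
        """Decompress a line as it is loaded into the cache."""
        return zlib.decompress(self.store[line_addr])

    def stored_bytes(self):
        """Total bytes occupied by the compressed lines."""
        return sum(len(blob) for blob in self.store.values())


memory = CompressedMainMemory()
line = bytes(64)                 # a highly compressible all-zero line
memory.evict(0x1000, line)
restored = memory.fill(0x1000)
```

The round trip returns the original line, and for compressible data the stored form is smaller than the 64-byte line. Incompressible lines can grow slightly, which a real design has to account for.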

Non-Standard Projects

Students who would like to work on a non-standard project must develop their own specification and testbench. Students working on non-standard projects should schedule time to meet with the instructors several days before the preliminary proposal is due. During this meeting, the group should be prepared to discuss the algorithm or specification for the non-standard project, as well as the existing reference code which will be used to build an appropriate test rig. The instructors will review the non-standard project and determine if it is appropriate for the course. Students should not submit non-standard project proposals unless they have met with the instructors and received approval.

6.375 Final Project Ideas

Spring 2006
