The most difficult part of any digital system design is having a complete and accurate specification of the intended functionality. We are providing two standard projects (an SMIPS microprocessor and a non-blocking memory system) with associated testbenches that you can choose for your class project. For these two standard projects, there are a number of directions you can choose to take.
SMIPS Microprocessor

The SMIPS microprocessor is based on the infrastructure you used for labs 1 and 2. We will compare your designs for performance, power, and area. Here are some example ideas you might consider for your course project:
Superscalar Processor

You can try to implement a two- to four-way issue superscalar processor. An in-order superscalar with branch prediction should be feasible within the time constraints of the course. A version of Tomasulo's algorithm for out-of-order execution, with register renaming through reservation stations, is another possible option. Schemes with a unified physical register file and a separate instruction window and reorder buffer are probably too complex for the time available. See the 6.823 lecture notes for more information.
Superpipelined Processor

This project would focus on producing the highest-performance SMIPS processor through aggressive pipelining. Extreme pipelining introduces many pipeline hazards, so you will want to investigate microarchitectural techniques that reduce their impact, including some form of branch prediction and possibly some form of load/store queue.
Decoupled Access-Execute Processor
Decoupled architectures provide many of the same advantages as register-renamed out-of-order machines but with a much simpler microarchitecture. There is some resurgence of interest in these simpler high-performance architectures given the power problems of more complex machines. You can try to build an SMIPS two-issue access-execute decoupled architecture similar to the Astronautics ZS-1. J. Smith, "Dynamic Instruction Scheduling and the Astronautics ZS-1", IEEE Computer, Jul 1989, 22(7), pp. 21-35. For further citations on access-execute architectures, see: M. Sung, R. Krashinsky, and K. Asanovic, "Multithreading Decoupled Architectures for Complexity-Effective General Purpose Computing", Workshop on Memory Access Decoupled Architectures, Sep 2001.
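As a rough illustration of the idea (a toy Python model, not the ZS-1's actual design; the program encoding here is entirely made up), the access stream performs loads and deposits data into a queue that the execute stream later consumes:

```python
from collections import deque

def decoupled_run(access_prog, execute_prog, memory):
    """Toy decoupled model: the access stream performs loads and
    pushes data into a queue; the execute stream pops operands from
    that queue. Real hardware runs the two streams concurrently,
    letting the access stream slip ahead to hide memory latency."""
    load_q = deque()
    for op, addr in access_prog:
        if op == 'load':
            load_q.append(memory[addr])
    acc = 0
    for op in execute_prog:
        if op == 'add':
            acc += load_q.popleft()  # in hardware this stalls on empty
    return acc
```

The queue is what decouples the two streams: as long as it is non-empty, execute never sees memory latency.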
Runahead Processor

A relatively new, simple technique for hiding memory latency is the runahead processor. When the processor hits a cache miss, rather than stalling it keeps running ahead speculatively, using a duplicate set of registers, to try to trigger subsequent cache misses early. Once the initial cache miss returns, the processor swaps back to the architectural registers and resumes normal execution. J. Dundas and T. Mudge, "Improving data cache performance by pre-executing instructions under a cache miss", Int'l Conf. on Supercomputing, Jul 1997, pp. 68-75.
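A toy Python cycle count (illustrative only; the parameters and the simplification of "running ahead" to scanning the address stream are my own, not the paper's model) shows why prefetching under a miss helps:

```python
def run(addresses, cache, miss_latency=100, runahead_window=8):
    """Count cycles for a stream of loads. With a non-zero runahead
    window, the time spent waiting on a miss is also used to prefetch
    the next few missing addresses, so later misses become hits."""
    cycles = 0
    for i, a in enumerate(addresses):
        if a in cache:
            cycles += 1
        else:
            cycles += miss_latency
            cache.add(a)
            # Runahead: while the miss is outstanding, speculatively
            # scan ahead and prefetch upcoming addresses.
            for b in addresses[i + 1 : i + 1 + runahead_window]:
                cache.add(b)  # prefetch overlaps with the stalled miss
    return cycles
```

With `runahead_window=0` every cold miss pays the full latency; with runahead, only the first miss in a window does.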
Multithreaded Processor

Implement an SMIPS processor that interleaves the execution of multiple threads in hardware. You can experiment with cores supporting two to eight threads, and can try to implement either fine-grain multithreading, where the active thread can switch every cycle, or coarse-grain multithreading, where threads switch only on a cache miss.
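The heart of fine-grain multithreading is the per-cycle thread-select logic. A minimal sketch in Python (the round-robin policy is one common choice, not the only one):

```python
def fine_grain_select(ready, last):
    """Round-robin thread select: pick the first ready thread after
    `last`, wrapping around. Returns None if no thread is ready,
    corresponding to a bubble in the pipeline."""
    n = len(ready)
    for off in range(1, n + 1):
        t = (last + off) % n
        if ready[t]:
            return t
    return None
```

A coarse-grain design would instead keep returning `last` until that thread misses in the cache.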
Significance Compression

Some authors have proposed saving power by activating only the bit positions within a datapath that are actually required to compute a result. For example, when you add 1 to a register, in most cases only a few low-order bits change. The following paper describes this idea and suggests several alternative implementations. R. Canal, A. Gonzalez, and J. Smith, "Very low power pipelines using significance compression", Int'l Symp. on Microarchitecture, Dec 2000, pp. 181-190.
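The key piece of such a design is detecting how much of an operand is significant. A sketch of byte-granularity significance detection (byte granularity is an assumption; the paper evaluates several granularities and encodings):

```python
def significant_bytes(x, width=32):
    """Number of low-order bytes needed to represent signed value x;
    the remaining upper bytes are pure sign extension, so a datapath
    could clock-gate those byte lanes and supply the sign bit instead."""
    nbytes = width // 8
    for n in range(1, nbytes + 1):
        lo = -(1 << (8 * n - 1))
        hi = (1 << (8 * n - 1)) - 1
        if lo <= x <= hi:
            return n
    return nbytes
```

Values like loop counters and small offsets usually need only one byte, which is where the power savings come from.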
Heads-and-Tails Variable-Length Instruction Encoding
The heads-and-tails format is a way to pack variable-length instructions in memory such that it remains easy to fetch and decode them in a pipelined or parallel fashion. Note that this project will require a different binary format from regular SMIPS. H. Pan and K. Asanovic, "Heads and Tails: A Variable-Length Instruction Format Supporting Parallel Fetch and Decode", Int'l Conf. on Compilers, Architecture, and Synthesis for Embedded Systems, Nov 2001.
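A toy Python packer conveys the layout (this works in bytes for clarity, whereas the real format works in bits and records a per-bundle instruction count; the sizes below are made up):

```python
def pack_bundle(insts, bundle_size=8, head_size=1):
    """Toy heads-and-tails packer: each instruction splits into a
    fixed-size head, packed forward from the start of the bundle, and
    a variable-length tail, packed backward from the end (the first
    instruction's tail sits nearest the end). Fixed-position heads let
    a parallel decoder locate every instruction without scanning."""
    heads = b''.join(i[:head_size] for i in insts)
    tails = b''.join(i[head_size:] for i in reversed(insts))
    pad = bundle_size - len(heads) - len(tails)
    assert pad >= 0, "bundle overflow"
    return heads + bytes(pad) + tails
```

Because head positions are fixed, fetch bandwidth and decode parallelism look like a fixed-length ISA even though total instruction lengths vary.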
Non-blocking Memory System
The second standard project is a non-blocking memory system. The test rig will supply a stream of memory requests, each tagged with an identifier, and the memory system may return responses out of order (this is also the interface the project SMIPS processor uses to talk to memory). Your memory system will include some form of cache and will drive a DRAM main memory.
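A minimal Python model of the tagged, out-of-order interface (class and method names are my own; the real test-rig interface will differ):

```python
from collections import deque

class NonBlockingMemory:
    """Toy non-blocking memory: hits respond immediately, while misses
    wait in an MSHR-like queue and respond when main memory delivers
    the data. Responses carry the request's tag so the client can
    match them up even when they arrive out of order."""
    def __init__(self, cache):
        self.cache = cache       # dict: addr -> data
        self.pending = deque()   # outstanding misses
        self.responses = []

    def request(self, tag, addr):
        if addr in self.cache:
            self.responses.append((tag, self.cache[addr]))
        else:
            self.pending.append((tag, addr))

    def drain_one_miss(self, data):
        """Model main memory delivering one outstanding miss."""
        tag, addr = self.pending.popleft()
        self.cache[addr] = data
        self.responses.append((tag, data))
```

Note that a hit issued after a miss completes first, which is exactly the reordering the tags exist to disambiguate.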
Hardware Prefetcher

You can try implementing a hardware prefetcher to bring values into the cache before the processor requests them. Stream buffers are one such technique, predicting the stride of regular access patterns. N. Jouppi, "Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers", Int'l Symp. on Computer Architecture, Jun 1990, pp. 364-373, and S. Palacharla and R. Kessler, "Evaluating Stream Buffers as a Secondary Cache Replacement", Int'l Symp. on Computer Architecture, Apr 1994, pp. 24-33.
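A sketch of the stride-detection core in Python (a single stream with a confirm-then-prefetch policy; real stream buffers track several concurrent streams and allocate on misses):

```python
class StridePrefetcher:
    """Toy stride detector: once two consecutive accesses exhibit the
    same non-zero stride, prefetch `depth` addresses further along the
    predicted stream."""
    def __init__(self, depth=2):
        self.last = None
        self.stride = None
        self.depth = depth

    def access(self, addr):
        """Record an access; return the list of addresses to prefetch."""
        prefetches = []
        if self.last is not None:
            s = addr - self.last
            if s == self.stride and s != 0:
                prefetches = [addr + s * i for i in range(1, self.depth + 1)]
            self.stride = s
        self.last = addr
        return prefetches
```

Requiring the stride to repeat once before prefetching avoids polluting the cache on irregular access patterns.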
Memory Access Scheduler
This project would implement a memory access scheduler that reorders memory requests to improve DRAM performance, performing all outstanding accesses to each DRAM row before moving to the next row. S. Rixner, W. Dally, U. Kapasi, P. Mattson, and J. Owens, "Memory access scheduling", Int'l Symp. on Computer Architecture, Jun 2000, pp. 128-138.
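The row-grouping idea can be sketched in a few lines of Python (this is an offline simplification of one policy the paper calls open-row scheduling; real schedulers reorder a bounded in-flight window, and the row-bit split is an assumption):

```python
def schedule(requests, row_bits=10):
    """Reorder request addresses so all accesses to one DRAM row issue
    together, visiting rows in first-appearance order and preserving
    request order within each row. Each group needs only one row
    activation instead of one per row switch."""
    order = []    # rows in first-appearance order
    by_row = {}
    for addr in requests:
        row = addr >> row_bits
        if row not in by_row:
            by_row[row] = []
            order.append(row)
        by_row[row].append(addr)
    return [addr for row in order for addr in by_row[row]]
```

In the test below, the unscheduled stream would activate a row five times; the scheduled stream activates each of the two rows once.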
Compressed Memory Systems
This project would implement a compressed memory system, where cache lines are decompressed when loaded into the cache and compressed again when evicted to main memory. R. Tremaine, T. Smith, M. Wazlowski, D. Har, K. Mak, and S. Arramreddy, "Pinnacle: IBM MXT in a Memory Controller Chip", IEEE Micro, Mar/Apr 2001, 21(2).
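The evict/fill flow looks like this in a toy Python model (zlib stands in for the hardware compressor here; MXT uses a parallel dictionary scheme, not zlib, and real designs must also manage variable-size storage in DRAM, which this sketch ignores):

```python
import zlib

class CompressedMainMemory:
    """Toy model: lines live compressed in 'main memory' and are
    decompressed on a cache fill, trading compression latency on the
    miss path for effective memory capacity."""
    def __init__(self):
        self.mem = {}   # addr -> compressed line

    def evict(self, addr, line):
        self.mem[addr] = zlib.compress(line)

    def fill(self, addr):
        return zlib.decompress(self.mem[addr])
```

A highly compressible line (e.g. all zeros) shrinks dramatically, which is where the capacity win comes from.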