For the 752 course project, you have two options: 1) implement, at the logic level, some important and relatively complex subsystem of a modern processor; or 2) evaluate a new idea or modest extension to prior work, or re-implement and re-evaluate a prior proposal, possibly in the context of new workloads or different simulators.
· You should work in teams of 2-3 students; smaller or larger teams must adjust the scope of their work to match the size of the team.
· You must submit a written project proposal (via the canvas dropbox) of up to two pages by midnight on Monday, October 30, 2017. The proposal must include the names of all team members, a summary of the proposed topic and a research plan that outlines how you will accomplish your goals. For a hardware implementation project, you must also describe your proposed testbench and validation methodology.
· Submit a 2-3 page progress report by midnight on November 22, 2017 using the canvas dropbox. This must report your current progress toward the goals you set in your initial proposal, and explain how you will catch up or revise your goals if you are behind.
· Prepare an oral presentation of approximately 25 min. (20 minutes presentation + 5 minutes for questions) for a scheduled slot during the last week of class. This should motivate the problem, describe what you did, and present some interesting results. Place your presentation slides in the canvas dropbox.
· Submit a detailed final report by midnight on December 13, 2017 using the canvas dropbox. This should document all relevant information about your project at an appropriate level of detail. Neatness, organization, and quality of writing will all count toward your grade.
· The project report must also include a statement of work that identifies the contributions of each individual on the team. This statement of work must reflect a team consensus and must be signed by all team members. I recommend that you structure this statement as a table with a row for each project milestone, a column for each team member, and the percentage contribution of each team member to each milestone in the entries in the table.
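For example, a statement of work might look like the following (the milestones, member labels, and percentages here are purely illustrative):

```
Milestone              Member A   Member B   Member C
Proposal                  40%        30%        30%
RTL implementation        20%        60%        20%
Testbench/validation      30%        30%        40%
Final report              35%        35%        30%
```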
For hardware implementation options, here are some possible suggestions. Note that you are largely on your own for tool support with hardware projects, so you should probably rely on pre-existing familiarity with simulation and synthesis tools from prior research or coursework (e.g. ECE 551):
· Design a superscalar decoder for the x86 ISA (or a reasonable subset) with variable-length ops. You can start with an existing parallel decoder for the Z80 ISA that maps instructions of 1, 2, 3, or 4 bytes to 1-12 uops.
· Implement an instruction fetch unit, including an aggressive, pipelined (or overriding) branch predictor.
· Implement an advanced branch prediction unit, either neural or TAGE, and evaluate its area, power, and delay. Start with proposals from the most recent Championship Branch Prediction workshop (CBP-5).
· Implement tightly-coupled accelerators, e.g. hash map, heap manager, string, regex (see [Gope et al., ISCA 2017])
· Memory ordering buffer or load/store queue architecture that resolves memory RAW dependences, preferably using speculation. You can start with an existing store queue implementation in Verilog, and extend it with a load queue and control logic.
· Implement a realistic issue queue, with wide issue, support for realistic instruction latencies, and recovery from speculative scheduling. Here too, we can provide an existing Verilog implementation as a starting point.
· Implement a non-blocking interface to the L2 cache, including multiple MSHRs, a replacement policy, load merging, etc.
· Implement a state-of-the-art prefetcher. Start with winning designs from the Second Data Prefetching Championship (DPC-2), and evaluate area, power, and delay.
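For the superscalar decoder option above, the core difficulty is finding instruction boundaries in parallel. A minimal software sketch of the usual trick (speculate a length at every byte offset, then scan for the true start points) might look like this; the opcode-to-length table is a made-up toy, not real x86 or Z80 encodings:

```python
# Hypothetical toy length table: opcode byte -> instruction length (NOT real x86).
LENGTHS = {0x00: 1, 0x10: 2, 0x20: 3, 0x30: 4}

def decode_starts(fetch_bytes, first_start=0):
    """Return the byte offsets at which instructions begin in this fetch group."""
    # Step 1: speculate a length at EVERY byte position (done in parallel in HW).
    spec_len = [LENGTHS.get(b, 1) for b in fetch_bytes]
    # Step 2: scan to mark the real start points (log-depth prefix logic in HW);
    # only the lengths speculated at true start points are actually used.
    starts, pos = [], first_start
    while pos < len(fetch_bytes):
        starts.append(pos)
        pos += spec_len[pos]
    return starts

print(decode_starts([0x20, 0xAA, 0xBB, 0x10, 0xCC, 0x00]))  # -> [0, 3, 5]
```

A hardware decoder computes step 2 with a parallel prefix network rather than a serial loop, which is where most of the design effort goes.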
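For the memory ordering buffer option, a behavioral model of the RAW check is a useful reference point before writing any Verilog. This hypothetical Python sketch returns the value a load should forward from the youngest older store to the same address; the entry fields and ages are assumptions, not a prescribed interface:

```python
# Behavioral sketch of store-to-load RAW resolution (not synthesizable hardware).
# Each store queue entry is a dict; lower "age" = older in program order.

def resolve_raw(store_queue, load_age, load_addr):
    """Return data from the youngest store that is older than the load and
    matches its address, or None if the load must read from the cache."""
    older_matches = [s for s in store_queue
                     if s["age"] < load_age and s["addr"] == load_addr]
    if not older_matches:
        return None
    return max(older_matches, key=lambda s: s["age"])["data"]

sq = [{"age": 1, "addr": 0x40, "data": 11},
      {"age": 3, "addr": 0x40, "data": 22},
      {"age": 5, "addr": 0x80, "data": 33}]
print(resolve_raw(sq, load_age=4, load_addr=0x40))  # -> 22
```

In hardware this becomes a CAM match on addresses plus an age-based priority pick; the speculative variant additionally predicts the outcome before store addresses are known and recovers on mispredictions.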
Alternatively, you can work on any computer architecture-related topic pertinent to the course. I prefer that you come up with your own ideas about what interests you. Below are some ideas in case you get stuck; note that many of these ideas are publishable.
· Set up and characterize SPEC CPU2017 using gem5 and SimPoints. Compare and contrast with SPEC CPU2006.
· Investigate temporal coding for interconnects (see [Mishkin, ISLPED 2017]) vs. compression of data. Compression increases entropy, which increases toggle rates, so temporal coding may have a bigger payoff.
· Investigate some approximate computing technique, including concise loads and stores.
· Investigate low-precision matrix inverse in software and/or hardware.
· Investigate IoT workloads and benchmarks.
· Investigate accelerators for NFAs (see Cache Automaton, MICRO 2017).
· Investigate critical path prediction for load-to-store sequences. Start with the Fields/Bodik/Hill work from the early 2000s, and consider applications in timing speculation.
· Investigate unconventional uses for architectural techniques (branch prediction for EEG – Abhishek Bhattacharjee, MICRO 2017).
· Investigate quantization and sparsity for efficient neural network inference. See (many) recent papers from MICRO, ISCA, HPCA.
· Investigate the use of power-gated core resources for prefetching. See CHARSTAR (ISCA 2017) as a starting point.
· Explore the ARM machine learning libraries (ARM ML and NE10). Examine the viability of using many small/simple ARM cores to execute inference workloads, and compare against accelerators. What about load balancing? Tail latency issues? Modify gem5 to model simple ARM cores (e.g., Cortex-M7).
· Deprioritize loads that are likely to reference a block that has already missed (to save energy). Please see Prof. Lipasti if you are interested in this idea.
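For the temporal-coding idea above, the toggle-rate argument is easy to make concrete: count bit flips between successive words on a bus. The word values below are made up, chosen only to contrast low-entropy and high-entropy traffic:

```python
# Toy illustration of why compressed (high-entropy) data toggles more wires.
def toggles(words, width=8):
    """Total number of bit transitions across successive bus words."""
    mask = (1 << width) - 1
    return sum(bin((a ^ b) & mask).count("1")
               for a, b in zip(words, words[1:]))

raw        = [0x0F, 0x0F, 0x0F, 0x0F]   # redundant data: successive words match
compressed = [0x0F, 0xF0, 0x3C, 0xC3]   # high-entropy encoding: many flips

print(toggles(raw), toggles(compressed))  # -> 0 20
```

Since dynamic interconnect energy scales with toggle count, a coding scheme that lowers this number can win even if it sends more bits than a compressed stream.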