CS152 Computer Architecture and Engineering

Lab #3: Pipelining Your Processor

John Lazzaro

Fall 2006

In Lab 3, your group will build a pipelined processor, like the one described in Chapter 6 of COD. The processor will run in simulation (ModelSim) and on real hardware (Xilinx). Your processor will be able to update its memory over the network (using the Trivial File Transfer Protocol, or TFTP), so that you will no longer need to resynthesize your CPU in order to execute a new program.

Lab 3 has several "checkoffs" and a final deadline:

  1. By 11:59 PM on Thursday 9/28, your group will submit a preliminary design document (Problem 0a) to cs152-staff@cory. Team Evaluations (Problem 0b) are also due by 11:59 PM on Thursday 9/28 to the same address. Each team member emails a separate confidential team evaluation report. Be careful to send team evaluations to the staff list, NOT the class list!
  2. On Friday 9/29 in lab section, your TA will review your design document with you and suggest changes. A final version of the document, incorporating these changes, is due via email to cs152-staff@cory by 11:59 PM on Wednesday 10/4 (due to Mid-term I being Tuesday night). However, we recommend you complete and submit the design document well before the deadline.
  3. On Friday 10/6 in lab section, you will demonstrate the processor running on the Calinx board. During the demo, the TAs will provide you with test code whose instruction sequences do not have data hazards (thus, you will not need to implement forwarding muxes) and assume that load instructions require one load delay slot for correct operation (thus, you will not need to implement stalling). These programs will test basic pipelining operation, but will not test the "hard" pipelining features: how to stall and how to forward. Apart from these exceptions, the semantics of all of the instructions in the table shown in Problem 1a must be supported for this checkoff. Referring to Figure 6.36 of COD/3e, the "hazard detection unit" and "forwarding unit" need not be implemented for this checkoff.
  4. On Friday 10/13 in lab section, you will demonstrate the processor running on the Calinx board. Unlike the 10/6 checkoff, the demo requires that your processor correctly handles all hazards (see Problem 3a), and expects your processor to implement load instructions that do not require a load-delay slot (see Problem 3b). During this demo, the TA will provide you with secret test code. If you are able to pass these tests on your first try, you will receive bonus points. You can also receive bonus points if you fix your processor to pass the tests within your section time. If your processor is not fixed by the end of section, your TA will provide source for the secret test code, for use in your weekend debugging sessions.
  5. On Monday 10/16 at 11:59 PM the lab (including the lab report) is due.

On checkoff days, be prepared to demo by the time your section begins! The sooner you begin, the more time you will have to recover from problems. Also, be sure to follow the specifications in this report (and in the MIPS ISA manual in the Resources section). A working processor that doesn't correctly implement all of the instructions will fail the checkoff.

Lab Report Submission Policies: To submit your lab report, run m:\bin\submit-fall2006.exe, or at command prompt type "submit-fall2006.exe" then follow the instructions. The required format for lab reports is shown on the resources page, as is the required format for your design notebook.

Lab 3 Document History:

  1. 8/29 Lab 3 posted on website.

Problem 0: Pre-Flight

Before your group begins the design, you will perform several preparatory tasks.

Problem 0a: Preliminary Design Document

Your group will prepare a preliminary design document. The design document will be 2-4 pages in length, and will contain:

  1. The identity of the spokesperson for the lab, and a roster of group members. The responsibility of the spokesperson is to communicate questions to the TA and reply to questions from the TA. Choose a different spokesperson than the one you had for Lab 2.
  2. A short description of the structure of the design. The description will be accompanied by preliminary high-level schematics of your datapath, and a preliminary discussion of the controller.
  3. A description of the unit test benches and multi-unit test benches you intend to create for your processor, and a description of the machine language programs you intend to write for complete processor testing. Also, a test plan, using the epoch charting method shown in the 9/7 lecture, that shows when you plan to run each type of test.
  4. A tentative division of labor, showing the tasks each group member intends to do.
  5. The "paranoia" section: discuss potential areas of difficulty in the lab. An early guess of critical timing paths for the design should be a part of this section.

See the start of this document for the deadlines associated with the preliminary design document (preliminary submission and TA review). See Problem 0c for a description of the final design document.

Problem 0b: Team Evaluation for Lab 2

Each team member will submit an evaluation of how the team performed during the lab (see top of the lab for the due date). Each team member sends in a separate evaluation email. Write your evaluations in private (we are looking for your independent opinion) and send them directly to the staff mailing list (not the class list!).

The staff uses the evaluations to help us help groups make it through "team dynamics" problems -- it clues us in on why the group is not working well together. Evaluations are also used in grading, as we describe later in this section.

Begin your evaluation email by describing the work you did for the lab. Also list the work you originally committed to do in your group's design document (we realize that early plans may change, and you may have ended up doing different work).

Next, evaluate the performance of the other members of your group (do not evaluate yourself). Express your evaluation as a percentage between 0% and 100%. A score of 100% indicates that the person met your expectations for a good group member. Note the phrasing: the numerical scores are not meant for differentiating between "good" and "great" by giving out 90% to the "good" member and the 100% to the "great" member. The "good" and the "great" group member should both get 100%.

A score of less than 100% indicates that you feel that the group member did not fully live up to the "social contract" that comes along with enrolling in a team project course in EECS at Cal: to show up to scheduled meetings, to participate fully and act professionally in the meetings, to commit to doing a fair share of the work and to come through on that commitment, and to be participating as much as possible in the final part of the project when the pieces are put together and the final bugs are found.

If a member hasn't met your standard of being a "good group member", the score you give should reflect what percent of the way the person got to meeting your expectations. For example, someone who missed most meetings and acted badly when he or she showed up, who committed to designing several modules but never did (without telling anyone until it was too late), and who was missing in action on the final days of the project, deserves a 0%. Whereas, a group member who handed in their modules in working order and on time, but felt like it wasn't his or her job to help put the project together down the stretch, deserves a score somewhere in the middle -- he or she was responsible in some ways but irresponsible in others.

In addition to a numerical score, describe each group member's performance in a short summary. For scores < 100%, the text should justify why the person didn't live up to your expectations. For 100% scores, the text should describe the contributions of the group member to the success of the project -- this text serves as a way to bring to our attention group members who went "beyond the call of duty".

You will be evaluating your teammates in this way after each lab. At the start of each lab, everyone starts with a "clean slate" -- your evaluation for Lab 3 should only reflect how a team member performed during Lab 3, and should not reflect Lab 2 issues.

The peer evaluations are used in grading. The evaluations from all members of a team are evaluated for consistency, and cross-checked against staff observations of the group in action in the lab. If we believe the peer evaluations are consistent and reflect reality, they form the basis of about 20% of the final 152 grade of a student. Thus, you should be careful to be sure that your judgements about your team members are fair and based in fact.

Reviews are confidential, in that we do not tell group member A that "group member B gave you 30% and said this about you". However, if the situation warrants, we will feed anonymized data back to group member A (example: your average score was 20%, and the text comments cited problems X, Y, Z with your contributions to the group). Without this feedback, a group member might not realize how much he or she has let down the team -- and thus won't make the effort to do better, which does no one in the group any good.

Below, we show an example peer evaluation. Pat is on a 4-person team, and sent in this evaluation:

In the design document, I contracted to do the regfile, and to do half of the controller logic. I completed both on time, but unfortunately, there was a bug in the regfile that we didn't find until we put the processor together (Sue actually found it). I was present for most of the final debugging sessions, and helped with the writeup.

Name  Evaluation  Justification
Bill  30%   Bill did design the modules he was assigned to do, I'll give him that much. But he was missing in action during the 20 hours we spent in the lab doing integration and bug-fixes. This may be related to his behavior during our first meeting. Bill started yelling and screaming after we decided not to use the gated-clock reset scheme he proposed for the processor -- and then he stomped out of the room and slammed the door.
Sue   100%  5 hours into our final debugging session, Sue spotted the bug that kept our processor from working -- a bug that was my mistake. She saved our group. She also designed her own bug-free modules on time, and set up our CVS system.
Joe   100%  A solid contribution from Joe -- he did everything I would expect from a good team member. He did many of the testing and integration tasks that Bill was assigned to do but never did.

Problem 0c: Design Notebook

As part of this lab, your group will keep an on-line notebook. See the lecture notes for the Teamwork lecture (and also the Lab 2 writeup) for detailed information about the notebook.

Problem 1: Pipelining Your Design

Problem 1a: Pipelining.

Implement the following five-stage pipeline for your processor:

IF Instruction Fetch:    Access the instruction memory for the instruction to be executed.

ID Instruction Decode:    Decode the instruction and read the operands from the register file. For branch instructions, calculate the branch target instruction address and compare the registers.

EX Execute:    Bypass operands from other pipeline stages. One of the following activities can occur:

MEM Data Memory Access:    Access the data memory for load or store instructions.

WB Write Back:    Write result to register file.

Add the appropriate pipeline registers to your single cycle design.

For this design, you must use new RAM files for your instruction and data memories. Note that the Verilog files that we refer to in this lab are all in the m:\lab3\ directory. Start by reading the Readme in m:\lab3\Lab3Help. The new RAMs (called sdatamem.v and sinstrmem.v) are fully synchronous. This means that you must set up the address and any data to be written before the edge of the clock. Both reads and writes are synchronous in this way. Keep this in mind when you work on your pipeline. One way to view the result is that some of the registers that you see in the pipeline diagrams that we show you (for instance, the PC or the S and D registers) are partially duplicated in the RAM block. This means, for instance, that you still need to keep a separate PC register, but that you also need to pipe the value of the address before the PC register to the actual RAM block; on a clock edge, the new address will be clocked into both the PC register and the internal address registers of the RAM.

A common error students make when using the synchronous RAMs is that they design a 5-stage pipeline that has 6 stages (oops). Examine your design carefully to make sure you have not made this mistake.

Implement the following instructions in your processor:

Type           Instructions
arithmetic     addu, subu, addiu
logical        and, andi, or, ori, xor, xori, lui
shift          sll, sra, srl
compare        slt, slti, sltu, sltiu
control        beq, bne, bgez, bltz, j, jr, jal
data transfer  lw, sw
other          break

Note that unlike commercial implementations, your processor does not implement exception handling. So, if an instruction other than the ones listed above appears in the instruction stream, what your processor does is undefined by this spec (a practical option is to treat undefined instructions as no-ops).

Implement your data path in Verilog. Do not write your datapath as one giant Verilog module. Instead, first implement simpler modules as building block components (register, comparator, etc), and assemble these components to form your data path in a clean way. Use enough intermediate levels of modules to model the structure of the datapath, but don't use too many intermediate levels.

All clock inputs should be triggered on the rising clock edge.

All control instructions should have a single delay slot (i.e. the following instruction is always executed after the control instruction).

Make sure to verify the bitfields of instructions such as bgez in the MIPS Instruction Set Reference. Note that the "rt" field is actually used to distinguish bgez and bltz.
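As a cross-check for your decoder test benches, the rt-field distinction can be written as a small Python reference model. This is only a sketch and not part of the required Verilog: it assumes the standard MIPS32 REGIMM encoding (opcode 000001, rt = 00000 for bltz, rt = 00001 for bgez), and the function names decode_regimm and encode_regimm are hypothetical. Confirm the field values against the MIPS Instruction Set Reference before relying on them.

```python
# Reference model (sketch): telling bgez and bltz apart by the rt field.
# Field positions assume the standard MIPS32 encoding; names are hypothetical.

REGIMM_OPCODE = 0b000001  # opcode shared by bltz and bgez
RT_BLTZ = 0b00000         # rt value that selects bltz
RT_BGEZ = 0b00001         # rt value that selects bgez


def decode_regimm(instr):
    """Return 'bltz', 'bgez', or None for a 32-bit instruction word."""
    opcode = (instr >> 26) & 0x3F
    rt = (instr >> 16) & 0x1F
    if opcode != REGIMM_OPCODE:
        return None
    if rt == RT_BLTZ:
        return "bltz"
    if rt == RT_BGEZ:
        return "bgez"
    return None  # other REGIMM encodings are outside this lab's table


def encode_regimm(rt_code, rs, offset):
    """Hypothetical helper: build a REGIMM branch word for use in tests."""
    return (REGIMM_OPCODE << 26) | (rs << 21) | (rt_code << 16) | (offset & 0xFFFF)


if __name__ == "__main__":
    word = encode_regimm(RT_BGEZ, rs=4, offset=3)
    print(hex(word), decode_regimm(word))
```

Both branches share one opcode, so a decoder that looks only at bits 31-26 cannot tell them apart.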

Note that your processor should not implement all ALU instructions in the MIPS instruction set -- only the ALU instructions listed in the table above. When pipelining your processor, be sure to think through the sign-extension logic -- it is common for groups to introduce bugs into this logic when pipelining their processors.

As in Lab 2, there should be an arithmetic/logical multibit (31) shifter external to the ALU. Instructions such as SLT should be handled outside the ALU as well. SLT should subtract the two operands, use the ALU status flags (Zero, Neg, Ovf) to compute the output, and then put the correct value back in the destination register.
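One way to convince yourself of the flag logic for signed SLT is a short Python reference model. The sketch below assumes 32-bit two's-complement operands and the usual definition of signed overflow for subtraction; the names sub_with_flags and slt are hypothetical. Under those assumptions the result is Neg XOR Ovf. The model does not cover sltu, which needs an unsigned comparison.

```python
# Reference model (sketch): signed SLT from the flags of a 32-bit subtraction.
# Function names are hypothetical; this is a checking aid, not lab code.

MASK32 = 0xFFFFFFFF


def sub_with_flags(a, b):
    """Compute a - b in 32 bits and return (result, zero, neg, ovf)."""
    a &= MASK32
    b &= MASK32
    result = (a - b) & MASK32
    zero = result == 0
    neg = (result >> 31) == 1
    sign_a = a >> 31
    sign_b = b >> 31
    # Signed overflow on subtraction: the operands have different signs
    # and the sign of the result differs from the sign of a.
    ovf = (sign_a != sign_b) and ((result >> 31) != sign_a)
    return result, zero, neg, ovf


def slt(a, b):
    """Return 1 if a < b as signed 32-bit values, computed as Neg XOR Ovf."""
    _, _, neg, ovf = sub_with_flags(a, b)
    return int(neg != ovf)


if __name__ == "__main__":
    # Most negative value compared with 1: the subtraction overflows,
    # and the XOR with Ovf still yields the right answer.
    print(slt(0x80000000, 1))
```

The overflow cases (operands of opposite sign with a large magnitude) are the ones worth adding to your test benches, since Neg alone gives the wrong answer there.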

The break instruction is special.  See the MIPS Instruction Set Reference for its bitfield. Although this is normally an exception-causing instruction, you should treat it more like a halt instruction.  After being decoded, the break instruction should freeze the pipeline from advancing further.  This means that the PC will not advance further, the break instruction will stay in the decode stage, and later instructions will drain from the pipeline as they complete.  The proper terminology for this is that the break instruction will "stall" in the decode stage.  Assume that there will be a single input signal called "release" that comes from outside.  When it is high, you should release a blocked break instruction exactly once (you need to build a small circuit that generates a single, one-clock-cycle pulse when release is high, then ignores release until it goes low again -- however, be sure the circuit also works correctly with SINGLE_CLOCK, described in Problem 1d).  When we map our pipeline to the board (Problem 2), the break instruction will stop the pipeline and potentially display its code on the LEDs.  Further, we will have the option to "unfreeze" the pipeline with a debounced switch.
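The one-shot behavior of the release circuit can be modeled cycle by cycle in Python before you write it in Verilog. This is a sketch under one assumption: release is sampled once per processor clock edge, whichever clock source is selected. The class name ReleasePulse and the helper run are hypothetical.

```python
# Cycle-level model (sketch) of the one-shot release pulse.
# Assumes release is sampled once per processor clock edge.


class ReleasePulse:
    """Emit 1 for exactly one clock when release goes from 0 to 1."""

    def __init__(self):
        self.prev = 0  # value of release seen at the previous clock edge

    def clock(self, release):
        # Pulse only on a 0 -> 1 transition; a held-high release is ignored
        # until it has been seen low again.
        pulse = int(release == 1 and self.prev == 0)
        self.prev = release
        return pulse


def run(samples):
    """Feed one release sample per clock and collect the pulse outputs."""
    gen = ReleasePulse()
    return [gen.clock(s) for s in samples]


if __name__ == "__main__":
    print(run([0, 1, 1, 1, 0, 0, 1, 1]))
```

Because the model counts clock edges rather than time, the same sequence applies when the processor is clocked by SINGLE_CLOCK; holding release high across several single steps should still free only one break.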

Your processor must produce an 8-bit STAT output signal: if the processor is not stopped, STAT=0.  Otherwise, the high bit (bit 7) of STAT = 1 and the low 7-bits = low 7 bits of break code (which is in bits 6-25 of the break instruction).  Whenever this signal changes, make sure that you have a monitor output that prints "STATUS Changed: 0xvalue" on the console.
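The STAT encoding can be checked against a small Python reference model. This sketch assumes the break instruction uses the standard MIPS encoding (opcode 0, function field 0x0D, code in bits 6-25). The function names are hypothetical, and the two-digit hex formatting of the monitor message is an assumption, since the text above only gives "0xvalue".

```python
# Reference model (sketch) of the 8-bit STAT output.
# Function names and the message formatting are assumptions.


def break_code(instr):
    """Extract the 20-bit code field (bits 6-25) of a break instruction."""
    return (instr >> 6) & 0xFFFFF


def stat_output(stopped, instr=0):
    """Return STAT: 0 when running, else bit 7 set plus low 7 bits of the code."""
    if not stopped:
        return 0x00
    return 0x80 | (break_code(instr) & 0x7F)


def status_message(stat):
    """Format the monitor line printed when STAT changes."""
    return "STATUS Changed: 0x%02x" % stat


if __name__ == "__main__":
    break_5 = (5 << 6) | 0x0D  # break with code 5
    print(status_message(stat_output(True, break_5)))
```

Note that only the low 7 bits of the 20-bit code are visible on STAT, so break codes that differ only in higher bits look identical on the LEDs.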

Problem 1b: Memory-Mapped Input/Output

Any processor is useless without I/O, and your processor is no exception. Therefore you must build a memory-mapped I/O module. All accesses to addresses 0x80000000-0xFFFFFFFC are considered accesses to I/O space. Reads and writes to I/O space should not be mapped to your data memory. Instead, these operations should be handled by your memory I/O module.

The specifications of the I/O module are as follows. It should have a 32-bit address, a 32-bit data input, and a 32-bit data output for the processor, just like memory. It should also have 2 I/O buses (a 32-bit input data bus and a 32-bit output data bus) and a 1-bit output selector. Other control signals are probably necessary as well. Internally, this I/O module should have two 32-bit I/O registers.

Behavior is as follows: Reads and writes to and from 0xFFFFFFF0 go to one 32-bit register (call it DP0).  Reads and writes to and from 0xFFFFFFF4 go to the other (call it DP1).  Reads from 0xFFFFFFF8 will come from the input I/O bus. Writes to 0xFFFFFFF8 are ignored. Reads and writes to and from addresses 0x80000000-0xFFFFFFEC and 0xFFFFFFFC are ignored.  The output I/O bus will be DP0 if the output selector is 0 and DP1 if the selector is 1.

The input I/O bus will be connected to the dipswitches on the board.  The output I/O bus will be connected to the hexadecimal LEDs on the board.

Address                Reads                    Writes
0x80000000-0xFFFFFFEC  Reserved for future use  Reserved for future use
0xFFFFFFF0             DP0                      DP0
0xFFFFFFF4             DP1                      DP1
0xFFFFFFF8             Input switches           Nothing
0xFFFFFFFC             Reserved for future use  Reserved for future use
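The address map above can be captured in a short Python reference model, which is handy for predicting what your I/O test benches should print. This is a sketch, not the required Verilog module. The class name MemoryMappedIO is hypothetical, and two behaviors are assumptions because the text does not specify them: the registers start at 0, and an ignored read returns 0.

```python
# Reference model (sketch) of the memory-mapped I/O address map.
# Assumptions: registers reset to 0; ignored reads return 0.


class MemoryMappedIO:
    DP0_ADDR = 0xFFFFFFF0
    DP1_ADDR = 0xFFFFFFF4
    INPUT_ADDR = 0xFFFFFFF8

    def __init__(self):
        self.dp0 = 0
        self.dp1 = 0

    @staticmethod
    def is_io(addr):
        """True when the address falls in I/O space (top bit set)."""
        return (addr & 0xFFFFFFFF) >= 0x80000000

    def write(self, addr, data):
        """Store to DP0 or DP1; every other I/O address ignores writes."""
        if addr == self.DP0_ADDR:
            self.dp0 = data & 0xFFFFFFFF
        elif addr == self.DP1_ADDR:
            self.dp1 = data & 0xFFFFFFFF

    def read(self, addr, input_bus):
        """Return DP0, DP1, or the input bus; ignored reads return 0 (assumption)."""
        if addr == self.DP0_ADDR:
            return self.dp0
        if addr == self.DP1_ADDR:
            return self.dp1
        if addr == self.INPUT_ADDR:
            return input_bus & 0xFFFFFFFF
        return 0

    def output_bus(self, select):
        """Output bus shows DP0 when select is 0 and DP1 when select is 1."""
        return self.dp1 if select else self.dp0


if __name__ == "__main__":
    io = MemoryMappedIO()
    io.write(MemoryMappedIO.DP0_ADDR, 0x44455523)
    print("I/O Write to DP0: 0x%08x" % io.output_bus(0))
```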

Note that you can read/write I/O space with normal loads and stores with negative offsets:

        lw $1, -8($0)     ; Read input => $1
        sw $7, -16($0)    ; Write $7 to DP0
        sw $8, -12($0)    ; Write $8 to DP1

This works because offsets are sign-extended.  Thus, for instance, -12($0) means address 0xFFFFFFF4.
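The address arithmetic can be checked with a few lines of Python. This is a sketch with hypothetical function names; it assumes the usual load/store address calculation of base register plus sign-extended 16-bit offset, truncated to 32 bits.

```python
# Worked check (sketch): sign-extended offsets from $0 reach the I/O addresses.
# Function names are hypothetical.


def sign_extend16(imm):
    """Interpret the low 16 bits of imm as a signed two's-complement value."""
    imm &= 0xFFFF
    return imm - 0x10000 if imm & 0x8000 else imm


def effective_address(base, offset16):
    """Base register value plus sign-extended offset, wrapped to 32 bits."""
    return (base + sign_extend16(offset16)) & 0xFFFFFFFF


if __name__ == "__main__":
    for offset in (-8, -16, -12):
        # $0 always reads as zero, so the base is 0.
        print(offset, hex(effective_address(0, offset & 0xFFFF)))
```

If your sign-extension logic zero-extends instead, the same instructions would access 0x0000FFF8, 0x0000FFF0, and 0x0000FFF4 in data memory, which is an easy bug to spot with these three cases.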

Finally, within this module, add non-synthesized code (see Synplify manual) that outputs a message to the console whenever a change is written (i.e. something like: "I/O Write to DP0: 0x44455523"). This message should also be written to a file called "iooutput.trace". Further, whenever the module inputs a value, arrange for that value to be the next entry read from an input file called "ioinput.trace".

Problem 1c: Update the Monitor Module

Make sure to update your disassembly monitoring module from Lab 2:

To do the latter, don't output instructions until they have reached the memory stage (since you won't be able to print out load instructions until the memory stage where you finally know the value of the destination register).  In order to do this, introduce a number of signal arrays in your monitor which hold on to values until they are needed.  For instance, to hold onto the instruction itself, you would have a series of statements like:

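        // inside a clocked always block
        EXCinstruction <= instruction;       // one clock behind instruction
        MEMinstruction <= EXCinstruction;    // two clocks behind instruction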

In this way, you have the value of the instruction word when it is at the end of the memory stage.  This is like a mini pipeline.  Values of input registers wouldn't have to be kept as long, etc.  Think through this very carefully...

Problem 1d: Top-level Module Integration: Chip Mapping

As with Lab 2, you will have a top-level schematic module that ties everything together.  Now, however, you will have several I/O pins left over: a clock net, 1 reset signal, 1 release signal (for break instructions), 1 output select signal (from the I/O module), 1 8-bit output (from the break logic), 1 32-bit output (I/O) and 32-bit input (I/O).

Use the FPGA_TOP2.v module in m:\lab3 as the top-level integration for your design. Note that FPGA_TOP2.v is a testbench created by the TAs for the TFTP module; you will need to replace some of their code with your processor code before pushing to board.

You may want to briefly read the description of the Calinx boards (see the Resources page) to see what the pins of FPGA_TOP2.v mean. You will be modifying the Verilog for this top-level module to integrate all the pieces you need.

You should assume that the following is true:

There are 2 sets of 4 pushbuttons. We will only use group 1 (although you are free to use the others if you wish). 

Since these are buttons, they will be naturally bouncy, so you should include the debouncer module from m:\lab3, just as you did in Lab 2.

The break instruction outputs the 8-bit STAT signal.  This should be mapped to the 8 individual LEDs (on the side of the board).

The output from your memory-mapped I/O module should go to the HEX LEDs. To drive these properly, you need to use the bin2Hex or the ledtool modules.

The input of the I/O module should come from the first group of 8-bit dip switches (switch9).  Assume that the value on these switches goes to the lowest 8-bits of the input bus and that the top 24-bits are set to zero.  If you like (possibly a good idea), you can consider switching in the current PC or the instruction being executed to the HEX leds when a switch is set.  Consider the second switch of the second set of 8 dipswitches as controlling this.  You can use other switches to indicate what is going to be displayed other than the normal I/O.

Finally, the CLOCK net for your pipeline should be connected either to the clk from the DLL off of the XILINX board or to your debounced SINGLE_CLOCK signal.  Let the first switch of the second set of 8 dipswitches (switch10) be the choice (call this signal "CLK_SOURCE"):

    processor_clock = CLK_SOURCE ? LAB_CLK: SINGLE_CLOCK;

Please read the documentation on DLLs, available in the M:\lab3\DLL Examples directory, for more information on using DLLs. FPGA_TOP2.v defaults to using a 9 MHz clock. This may be too fast.

Problem 1e: Test Plan

Write test benches to test your processor. These test benches include unit tests (for the components in your processor), multi-unit tests (for the datapath), and a machine language program suite for complete-processor testing.

For complete-processor testing, the programs you write should be similar to the "Broken SPIM" programs you wrote in Lab 1. A subset of these programs should be written so that they work on a processor that does not handle hazards correctly (we add logic for hazards in Problem 3 below). Thus, these programs should not use values too soon after they are generated.

In addition to processor testing, a few test benches should also test the top-level module, to debug the interface between the processor and the Calinx board. These benches will simplify debugging the first time you "go to board" with your pipeline processor.

Build these test-bench modules around the FPGA top-level module; the test bench should provide a clock (as in Lab 2), print output to the console when I/O changes, and perhaps "push" the buttons for testing.

Note that debouncing of the switches is tricky when interacting with the clock.  Think carefully before trying to test the single-stepping clock feature.

Also think carefully about the I/O features. How will you test these in ModelSim? For demonstrating your pipeline in simulation, create a test module that recognizes when break has been asserted, waits 10 cycles, then asserts the release line -- printing something to the console in the process. This will let you run programs on ModelSim that utilize the I/O features to output results.

Problem 2: Map to Calinx

Map your processor design onto the Calinx board. You will do this (at least) twice: once for the simple Xilinx processor checkoff, and once for the complete Xilinx processor checkoff. Some of the information in this problem only applies to the complete checkoff.

Make sure to read the information in Lab3Help about changes to get versions of the RAMs that map to Xilinx, and on how to use the TFTP interface. You should be using the same design flow as you did in Lab 2.

Note that you should be able to put the processor in single-step mode (first dipswitch of second set put to zero).  Then you should be able to use push-button #4 as a single-stepping clock.  You should also be able to spread break instructions in your code and debug code this way. 

You should be able to use a loop at the very end of your execution with a combination of break instructions and writes to the I/O (address to 0xFFFFFFF0, data to 0xFFFFFFF4) to dump the contents of your memory to the hexadecimal display when you are done.  Make sure that this works! For the first Xilinx checkoff, you will need to write a loop that does not require the unimplemented parts of the controller to be present (forwarding and hazard detection).

Make sure that the RESET line causes important processor state to be reset!  Remember that "initial" blocks in Verilog will be ignored by the synthesizer.  Many bugs can be introduced when registers contain random initial state!  One obvious thing that must be reset is the PC.  Are there other things?

Don't try to debug everything at once. Start with extremely simple examples. Possibly divert the hexadecimal display to show PC information during debugging (feel free to divert other things as well). For instance, what about a simple program with break as the first instruction and a bunch of nops? Can you get that to work? What about simple I/O examples? Once you make sure that your simplest tests work, you can move on to more complicated tests.

Here are common problems that groups often miss:

Please include information in your writeup about the total number of FPGA slices used for your design and the fraction of the Xilinx part that has been used for your design.  This information should be available in the log files post place-and-route.

Problem 3: Hazards

Problem 3a: Handling Hazards

Here is a list of the hazards you must handle:

Write or modify programs to test all the different hazard cases. Remember that hazards do not necessarily occur between two adjacent instructions. They can happen between two instructions that are separated by another instruction (or two?). Consider the following lines of code:

    ADD    $1, $2, $3
    SLL    $5, $6, 2
    SUB    $6, $1, $7

The ADD and SUB instructions have a data hazard, yet there is an SLL between them. Be sure to check these kinds of cases.
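When writing hazard test programs, it can help to list the read-after-write pairs mechanically. The Python sketch below does this for a program given as (name, destination, sources) tuples; the representation and the function name find_raw_hazards are hypothetical. The default window of three following instructions is an assumption chosen to cover the "or two?" case; whether the widest distance is a real hazard in your design depends on how your register file handles a read and a write of the same register in one cycle.

```python
# Test-planning aid (sketch): find read-after-write pairs within a window.
# The tuple format and the default window are assumptions.


def find_raw_hazards(program, max_distance=3):
    """Return (producer_index, consumer_index, register) for each RAW pair.

    program is a list of (name, dest, sources) tuples, where dest is a
    register number or None and sources is a tuple of register numbers.
    """
    hazards = []
    for i, (_, dest, _) in enumerate(program):
        if dest is None or dest == 0:  # $0 is never a real destination
            continue
        last = min(i + 1 + max_distance, len(program))
        for j in range(i + 1, last):
            _, dest_j, sources_j = program[j]
            if dest in sources_j:
                hazards.append((i, j, dest))
            if dest_j == dest:
                break  # register rewritten; later readers depend on j instead
    return hazards


if __name__ == "__main__":
    program = [
        ("ADD", 1, (2, 3)),
        ("SLL", 5, (6,)),
        ("SUB", 6, (1, 7)),
    ]
    print(find_raw_hazards(program))
```

For the three-instruction example above, the only pair reported is ADD (index 0) feeding SUB (index 2) through register $1.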

Problem 3b: Pipeline Interlocks/Interlocking Loads

Now that you have basic hazards dealt with, you should figure out how to handle pipeline stalls. Your current processor deals with the load delay slot in the same way as the original version of MIPS: If the compiler generates a code sequence in which a value is loaded from memory and used by the next instruction, the following instruction gets the wrong value. Of course, the instruction specification explicitly disallowed such code sequences; if no other options were available, the compiler would have to introduce a nop in the load delay slot to avoid getting the wrong answer.

As your final exercise, introduce a pipeline stall so that a value can be used by the very next instruction after it is loaded from memory. This feature was added to later versions of the MIPS instruction set. To be clear, we want the following code sequence to do the "obvious" thing, i.e. the add should use the value loaded from memory:

    LW     $1, 4($2)
    ADD    $2, $1, $3

Make sure to rerun your ModelSim tests to verify the processor still works correctly. Your test bench suite should include tests that try several different distances between loads and the instructions that use the loaded values. Hint: the mechanism for this single-cycle stall is very similar to what you need for the break instruction...
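To predict how many stall cycles a test program should incur, you can count load-use pairs with a small Python model. This is a sketch under stated assumptions: a stall is needed only when the instruction immediately after a lw reads the loaded register, and forwarding covers every greater distance. The tuple format and the function name count_load_use_stalls are hypothetical.

```python
# Test-planning aid (sketch): count single-cycle load-use stalls.
# Assumes forwarding handles every case except load followed directly by a use.


def count_load_use_stalls(program):
    """Count loads whose result is read by the very next instruction.

    program is a list of (name, dest, sources) tuples, where dest is a
    register number or None and sources is a tuple of register numbers.
    """
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        name, dest, _ = prev
        _, _, cur_sources = cur
        if name == "LW" and dest not in (None, 0) and dest in cur_sources:
            stalls += 1
    return stalls


if __name__ == "__main__":
    program = [
        ("LW", 1, (2,)),     # LW  $1, 4($2)
        ("ADD", 2, (1, 3)),  # ADD $2, $1, $3
    ]
    print(count_load_use_stalls(program))
```

Comparing this count with the number of bubbles your simulation actually inserts is a quick way to catch both missing stalls and unnecessary ones.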

Problem 4: Pipelining Gain

Calculate the cycle time for your pipelined processor. To do this, you will need to understand the Xilinx timing analysis tools. In your writeup, discuss what your critical path is, and what steps you could take to reduce it. If we just took our Lab 2 single-cycle processor, and added pipeline registers at key points, we would expect the cycle time to equal the delay through the longest block (ALU? Next PC? Memory?), and the clock rate to be its inverse. Is this the performance that you were able to achieve? Why or why not?
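The idealized comparison can be worked through with made-up numbers. In the Python sketch below, the five block delays and the 1 ns pipeline register overhead are purely hypothetical values chosen for illustration; substitute the figures reported by the timing analysis tools for your own design. The function names are also hypothetical.

```python
# Worked example (sketch): ideal pipelined cycle time from block delays.
# All delay values below are hypothetical, not measurements.


def single_cycle_time(block_delays_ns):
    """A single-cycle design must fit every block into one clock period."""
    return sum(block_delays_ns)


def pipelined_cycle_time(block_delays_ns, register_overhead_ns=0):
    """An ideal pipeline is limited by its slowest block plus register overhead."""
    return max(block_delays_ns) + register_overhead_ns


def clock_mhz(cycle_ns):
    """Convert a cycle time in nanoseconds to a clock rate in MHz."""
    return 1000.0 / cycle_ns


if __name__ == "__main__":
    delays = [10, 6, 12, 10, 4]  # hypothetical IF, ID, EX, MEM, WB delays in ns
    before = single_cycle_time(delays)
    after = pipelined_cycle_time(delays, register_overhead_ns=1)
    print("single-cycle: %d ns, pipelined: %d ns" % (before, after))
    print("ideal speedup: %.2f" % (before / after))
```

The gap between this ideal figure and your measured cycle time is what the writeup should explain, for example unbalanced stages, forwarding muxes on the critical path, or routing delay.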

Final Step: Lab Report

Turn in a copy of your Verilog code (including test benches), schematics, testing suites, updated versions of your design document module and processor specifications, and your on-line logs. Explain any changes that were needed in the specifications you presented in your 10/4 design document.

Also turn in simulation logs that show correct operation of the processor. These logs should show the operations that were performed, and then the contents of memory with the correct values in it. Also turn in logs from your test benches.

As part of your writeup, do a post-mortem for your test plan. Show bug curves, and give examples of the type of bugs you found early on because of your test plan (as well as "escaped" bugs you found later than you would have hoped).

How much time did your team spend on this assignment?