In Lab 3, your group will build a pipelined processor, like the one described in Chapter 6 of COD. You will use a combination of schematics and Verilog to represent the design. The processor will run in simulation (ModelSim) and on real hardware (Xilinx). You will create a test plan for your processor. You will use the test plan for bug diagnosis, and to verify that your processor executes a subset of the MIPS instruction set.
Lab 3 has several "checkoffs" and a final deadline:
Lab Report Submission Policies: To submit your lab report, run m:\bin\submit-fall2004.exe, or at command prompt type "submit-fall2004.exe" then follow the instructions. The required format for lab reports is shown on the resources page, as is the required format for your design notebook.
Lab 3 Document History:
Before your group begins the design, you will perform several preparatory tasks.
Your group will prepare a design document. The design document will be 2-4 pages in length, and will contain:
See the start of this document for the deadlines associated with the design document (preliminary submission, TA review, and final submission). You are encouraged to submit the preliminary document early, to speed the TA review process.
To help us understand how your team is functioning we require you to evaluate yourself and each of your team members individually.
To evaluate yourself, give us a list of the portions of Lab 2 that you were originally assigned to work on, as well as the set of things that you eventually ended up doing (these are not necessarily the same, and we realize this).
Next, based on your own observations, evaluate the performance of the other members of your team during the last lab assignment. Do not evaluate yourself. Assume an average team member would receive a score of 20 points. Top performers would receive more points, poor performers would receive fewer points.
The maximum score for a person is 40 points. Each evaluation should have a one- or two-sentence justification.
For example, suppose you were on a 5-person team with the following other members: Sue Superstar, Teddy Tryhard, Annie Average, and Ned Neverthere. Your distribution might look like the following:
|Sue Superstar||30||Sue really helped the group along. Sue figured out how to handle interlocks in the pipeline, and stayed extra long last week to fix our last bug.|
|Teddy Tryhard||13||Teddy was always at our meetings, had a very positive attitude, and did everything the group asked him to. However, he often made mistakes and needed help.|
|Annie Average||20||Annie did a good job.|
|Ned Neverthere||5||Ned never showed up to group meetings. We ended up reimplementing the one piece that he did give us.|
You should reevaluate your team members after every lab assignment and base your evaluation only on their performance during that lab assignment. These scores will be used for grading. Be honest and fair, as you would hope others will be.
See the schedule at the front of the lab for the due date for the evaluations. Note that each team member emails a separate evaluation report.
As part of this lab, your group will keep an on-line notebook. See the Lab 2 writeup for detailed information about the notebook.
Implement the following five-stage pipeline for your processor:
IF Instruction Fetch: Access the instruction memory for the instruction to be executed.
ID Instruction Decode: Decode the instruction and read the operands from the register file. For branch instructions, calculate the branch target instruction address and compare the registers.
EX Execute: Bypass operands from other pipeline stages. One of the following activities can occur:
WB Write Back: Write result to register file.
Add the appropriate pipeline registers to your single cycle design.
For this design, you must use a new RAM file for your instruction and data memories. Note that the Verilog files we refer to in this lab are all in the m:\lab3\ directory. Start by reading the Readme in m:\lab3\Lab3Help. The new RAM (called sramblock2048.v) is fully synchronous. This means that you must set up the address and any data to be written before the edge of the clock. Both reads and writes are synchronous in this way. Keep this in mind when you work on your pipeline. One way to view the result is that some of the registers you see in the pipeline diagrams we show you (for instance, the PC or the S and D registers) are partially duplicated in the RAM block. For instance, you still need to keep a separate PC register, but you also need to route the address value at the input of the PC register to the RAM block; on a clock edge, the new address will be clocked into both the PC register and the internal address registers of the RAM.
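The following is a minimal sketch of that idea, not the actual interface of sramblock2048.v (see the Readme for that). The modules `sync_ram_model` and `fetch_sketch` and all of their port names are hypothetical, chosen only to show the PC register and the RAM address register capturing the same value on the same edge.

```verilog
// Hypothetical stand-in for a fully synchronous RAM (NOT sramblock2048.v).
module sync_ram_model (
    input             clk,
    input             we,
    input      [10:0] addr,
    input      [31:0] din,
    output reg [31:0] dout
);
    reg [31:0] mem [0:2047];

    always @(posedge clk) begin
        if (we)
            mem[addr] <= din;
        dout <= mem[addr];       // read data becomes valid after the clock edge
    end
endmodule

// Fetch logic: next_pc (the value at the INPUT of the PC register) drives the
// RAM address, so after each rising edge 'instruction' is the word at 'pc'.
module fetch_sketch (
    input             clk,
    input             reset,
    input      [31:0] next_pc,   // assumed to be forced to 0 while reset is high
    output reg [31:0] pc,
    output     [31:0] instruction
);
    always @(posedge clk) begin
        if (reset) pc <= 32'b0;
        else       pc <= next_pc;
    end

    sync_ram_model imem (
        .clk  (clk),
        .we   (1'b0),
        .addr (next_pc[12:2]),   // byte address -> word address
        .din  (32'b0),
        .dout (instruction)
    );
endmodule
```

If the RAM were addressed from `pc` instead of `next_pc`, the instruction would arrive one cycle later than the PC that selected it.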
Implement the following instructions in your processor:
|arithmetic||addu, subu, addiu|
|logical||and, andi, or, ori, xor, xori, lui|
|shift||sll, sra, srl|
|compare||slt, slti, sltu, sltiu|
|control||beq, bne, bgez, bltz, j, jr, jal|
|data transfer||lw, sw|
Implement the components you need in Verilog. All Verilog models should correspond to realistic components (e.g. register, comparator, etc.). Do not build super-composite components, such as a branch unit that takes in the opcode, operands, and PC and outputs a new PC.
Assemble the datapath using the schematic editor. Use the Verilog datapath module created by the schematic editor as a component in the hand-coded Verilog module that describes your processor.
All modules should be rising-edge triggered.
All control instructions should have a single delay slot (i.e. the following instruction is always executed after the control instruction).
Make sure to verify the bitfields of instructions such as bgez in Appendix A of COD. Note that the "rt" field is actually used to distinguish bgez and bltz.
The processor should only support the ALU instructions listed in the table above. As in Lab 2, there should be an arithmetic/logical multibit shifter (shifting by up to 31 bits) external to the ALU. Instructions such as SLT should be handled outside the ALU as well. SLT should subtract the two operands, use the ALU status flags (Zero, Neg, Ovf) to compute the output, and then put the correct value back in the destination register.
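As a sketch of the flag-based SLT computation: a signed "less than" holds exactly when Neg and Ovf disagree after the subtraction. The module `slt_sketch` below is hypothetical; it computes the flags locally as a stand-in for your ALU's status outputs so that the example is self-contained.

```verilog
// Hypothetical sketch: signed set-on-less-than derived from subtractor flags.
module slt_sketch (
    input  [31:0] a,
    input  [31:0] b,
    output [31:0] result
);
    // Stand-ins for the ALU's subtract result and status flags.
    wire [31:0] diff = a - b;
    wire        neg  = diff[31];
    // Signed overflow on a - b: operand signs differ and the result's sign
    // differs from the sign of a.
    wire        ovf  = (a[31] != b[31]) && (diff[31] != a[31]);

    // a < b (signed) exactly when Neg and Ovf disagree.
    assign result = {31'b0, neg ^ ovf};
endmodule
```

Unsigned comparisons (sltu and its immediate form) depend on the borrow out of the subtraction, which is not one of the three flags listed above, so they are not covered by this sketch.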
The break instruction is special. See COD for its bitfield. Although this is normally an exception-causing instruction, you should treat it more like a halt instruction. After being decoded, the break instruction should freeze the pipeline from advancing further. This means that the PC will not advance further, the break instruction will stay in the decode stage, and later instructions will drain from the pipeline. The proper terminology for this is that the break instruction will "stall" in the decode stage. Assume that there will be a single input signal called "release" that comes from outside. When it is high, you should release a blocked break instruction exactly once (you need to build a small circuit that generates a single, one-clock-cycle pulse when release is high, then ignores release until it goes low again). When we map our pipeline to the board (Problem 3), the break instruction will stop the pipeline and potentially display its code on the LEDs. Further, we will have the option to "unfreeze" the pipeline with a debounced switch.
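One possible shape for the single-pulse circuit is sketched below. The module and port names are hypothetical. Note that `release` is a reserved word in Verilog, so the signal needs a different name in your code (here `release_in`). The sketch assumes the input has already been debounced and synchronized to the clock.

```verilog
// Hypothetical one-shot: emits a single one-clock-cycle pulse each time
// release_in goes high, then ignores it until it returns low.
module release_pulse (
    input  clk,
    input  reset,
    input  release_in,   // debounced and synchronous to clk (assumption)
    output pulse
);
    reg seen;            // value of release_in at the previous clock edge

    always @(posedge clk) begin
        if (reset) seen <= 1'b0;
        else       seen <= release_in;
    end

    // High only during the first cycle in which release_in is high.
    assign pulse = release_in & ~seen;
endmodule
```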
Make sure to produce an 8-bit output signal (STAT), defined as follows: if the processor is not stopped, STAT = 0. Otherwise, the high bit (bit 7) of STAT is 1 and the low 7 bits are the low 7 bits of the break code (which is in bits 6-25 of the break instruction). Whenever this signal changes, make sure that you have a monitor output that prints "STATUS Changed: 0xvalue" on the console.
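A minimal sketch of the STAT logic and its monitor follows. The module name and the `stopped` input are assumptions; how you derive "stopped" depends on your break/stall control. The monitor is wrapped in the translate_off/translate_on directives so that the synthesizer skips it.

```verilog
// Hypothetical sketch of STAT generation plus a simulation-only monitor.
module stat_sketch (
    input  [31:0] break_instr,   // instruction word held in the decode stage
    input         stopped,       // high while a break is stalled in decode
    output [7:0]  stat
);
    // The break code occupies instruction bits 25:6, so its low 7 bits
    // are instruction bits 12:6.
    assign stat = stopped ? {1'b1, break_instr[12:6]} : 8'h00;

    // synthesis translate_off
    always @(stat)
        $display("STATUS Changed: 0x%h", stat);
    // synthesis translate_on
endmodule
```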
Any processor is useless without I/O, and your processor is no exception. Therefore you must build a memory-mapped I/O module. All accesses to addresses 0x80000000-0xFFFFFFFC will be considered accesses to I/O space. Reads and writes to I/O space must not be mapped to your data memory. Instead these operations should be handled by your memory I/O module.
The specifications of the I/O module are as follows. It should have a 32-bit address, a 32-bit data input, and a 32-bit data output for the processor, just like memory. It should also have two I/O buses (a 32-bit input data bus and a 32-bit output data bus) and a 1-bit output selector. Other control signals are probably necessary as well. Internally, this I/O module should have two 32-bit I/O registers.
Behavior is as follows: Reads and writes to and from 0xFFFFFFF0 go to one 32-bit register (call it DP0). Reads and writes to and from 0xFFFFFFF4 go to the other (call it DP1). Reads from 0xFFFFFFF8 will come from the input I/O bus. Writes to 0xFFFFFFF8 are ignored. Reads and writes to and from addresses 0x80000000-0xFFFFFFEC and 0xFFFFFFFC are ignored. The output I/O bus will be DP0 if the output selector is 0 and DP1 if the selector is 1.
The input I/O bus will be connected to the dipswitches on the board. The output I/O bus will be connected to the hexadecimal LEDs on the board.
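|Address||Reads||Writes|
|0x80000000-0xFFFFFFEC||Reserved for future use.||Reserved for future use.|
|0xFFFFFFF0||DP0||DP0|
|0xFFFFFFF4||DP1||DP1|
|0xFFFFFFF8||Input I/O bus||Ignored|
|0xFFFFFFFC||Reserved for future use.||Reserved for future use.|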
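The address map can be sketched as a small Verilog module. Everything here beyond DP0, DP1, and the addresses given above is an assumption: the module name, the port names, the `we` write enable, the `is_io` decode output, and the choice to return zero for reserved reads. The read path is shown as combinational for brevity; you will need to align its timing with your synchronous data memory.

```verilog
// Hypothetical sketch of the memory-mapped I/O address decode.
module mmio_sketch (
    input             clk,
    input             reset,
    input      [31:0] addr,
    input      [31:0] din,       // store data from the processor
    input             we,        // write enable (assumed control signal)
    input      [31:0] io_in,     // input I/O bus
    input             out_sel,   // output selector
    output reg [31:0] dout,      // load data to the processor
    output     [31:0] io_out,    // output I/O bus
    output            is_io      // high for any address in I/O space
);
    reg [31:0] dp0, dp1;

    assign is_io  = addr[31];            // 0x80000000 and above
    assign io_out = out_sel ? dp1 : dp0;

    always @(posedge clk) begin
        if (reset) begin
            dp0 <= 32'b0;
            dp1 <= 32'b0;
        end else if (we) begin
            case (addr)
                32'hFFFFFFF0: dp0 <= din;
                32'hFFFFFFF4: dp1 <= din;
                default: ;               // 0xFFFFFFF8 and reserved: ignored
            endcase
        end
    end

    always @(*) begin
        case (addr)
            32'hFFFFFFF0: dout = dp0;
            32'hFFFFFFF4: dout = dp1;
            32'hFFFFFFF8: dout = io_in;
            default:      dout = 32'b0;  // reserved: returned value is an assumption
        endcase
    end
endmodule
```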
Note that you can read/write I/O space with normal loads and stores with negative offsets:
lw $1, -8($0) ; Read input => $1
sw $7, -16($0) ; Write $7 to DP0
sw $8, -12($0) ; Write $8 to DP1
This works because offsets are sign-extended. Thus, for instance, -12($0) means address 0xFFFFFFF4.
Finally, within this module, add non-synthesized code (see the Synplify manual) that outputs a message to the console whenever a value is written (i.e. something like: "I/O Write to DP0: 0x44455523"). This message should also be written to a file called "iooutput.trace". Further, whenever the module inputs a value, arrange for that value to come from the next entry in an input file called "ioinput.trace".
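A rough sketch of such simulation-only tracing is shown below. The module, its strobe inputs, and the format of "ioinput.trace" (one hexadecimal value per entry) are all assumptions. The file tasks used (`$fopen` with a mode string, `$fdisplay`, `$fscanf`) are Verilog-2001 system tasks; check that your simulator version supports them.

```verilog
// Hypothetical simulation-only trace helper; not intended for synthesis.
module io_trace_sketch (
    input             clk,
    input             dp0_write,   // assumed strobe: DP0 is written this cycle
    input      [31:0] wdata,
    input             io_read,     // assumed strobe: input bus is read this cycle
    output reg [31:0] sim_io_in    // value to drive onto the input I/O bus
);
    // synthesis translate_off
    integer    out_fd, in_fd, got;
    reg [31:0] next_val;

    initial begin
        out_fd    = $fopen("iooutput.trace", "w");
        in_fd     = $fopen("ioinput.trace", "r");
        sim_io_in = 32'b0;
    end

    always @(posedge clk) begin
        if (dp0_write) begin
            $display("I/O Write to DP0: 0x%h", wdata);
            $fdisplay(out_fd, "I/O Write to DP0: 0x%h", wdata);
        end
        if (io_read && in_fd != 0) begin
            // Fetch the next value from the input trace, if one remains.
            got = $fscanf(in_fd, "%h", next_val);
            if (got == 1)
                sim_io_in <= next_val;
        end
    end
    // synthesis translate_on
endmodule
```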
Make sure to update your disassembly monitoring module from Lab 2:
MEMinstruction = instruction;
In this way, you have the value of the instruction word when it is at the end of the memory stage. This is like a mini pipeline. Values of input registers wouldn't have to be kept as long, etc. Think through this very carefully...
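One way to picture this "mini pipeline" is a chain of registers that carries the instruction word alongside the real pipeline. The sketch below is an assumption about structure, not the Lab 2 monitor itself: the names `IDinstruction` and `EXinstruction` are hypothetical, and stall or bubble handling is deliberately left out.

```verilog
// Hypothetical sketch: carry the instruction word down the pipeline so the
// disassembly monitor can report it once it reaches the end of the memory stage.
module instr_trail_sketch (
    input             clk,
    input      [31:0] IDinstruction,   // instruction currently in decode
    output reg [31:0] EXinstruction,
    output reg [31:0] MEMinstruction
);
    always @(posedge clk) begin
        EXinstruction  <= IDinstruction;
        MEMinstruction <= EXinstruction;
    end
endmodule
```

When a stage stalls, the corresponding monitor register has to hold or receive a bubble in the same way as the real pipeline register, or the trace will drift out of step.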
As with Lab 2, you will have a top-level schematic module that ties everything together. Now, however, you will have several I/O pins left over: a clock net, 1 reset signal, 1 release signal (for break instructions), 1 output select signal (from the I/O module), 1 8-bit output (from the break logic), 1 32-bit output (I/O) and 32-bit input (I/O).
Use the TopLevel.v module in m:\lab3 as the top-level integration for your design. You may want to briefly read the description of the Calinx boards (manual off the handouts page) to see what the various pins mean. You will be modifying the Verilog for this top-level module to integrate all the pieces you need.
You should assume that the following is true:
There are 2 sets of 4 pushbuttons. We will only use group 1 (although you are free to use the others if you wish).
Since these are buttons, they will be naturally bouncy, so you should include the debouncer module from m:\lab3, just as you did in Lab 2.
The break instruction outputs an 8-bit signal called STAT. This should be mapped to the 8 individual LEDs (on the side of the board).
The output from your memory-mapped I/O module should go to the HEX LEDs. To drive these properly, you need to use the bin2Hex or the ledtool modules.
The input of the I/O module should come from the first group of 8-bit dip switches (switch9). Assume that the value on these switches goes to the lowest 8 bits of the input bus and that the top 24 bits are set to zero. If you like (possibly a good idea), you can consider switching in the current PC or the instruction being executed to the HEX LEDs when a switch is set. Consider the second switch of the second set of 8 dipswitches as controlling this. You can use other switches to select what is displayed instead of the normal I/O output.
Finally, the CLOCK net for your pipeline should be connected either to the clk from the DLL off of the XILINX board or to your debounced SINGLE_CLOCK signal. Let the first switch of the second set of 8 dipswitches (switch10) be the choice (call this signal "CLK_SOURCE"):
assign processor_clock = CLK_SOURCE ? LAB_CLK : SINGLE_CLOCK;
Please read the documentation on DLLs, available in the M:\lab3\DLL Examples directory, for more information on using DLLs. FPGA_TOP.v is currently set up so that you will receive a 9 MHz clock. This may be too fast.
Execute the test plan you created in your design document. As in Lab 2, the plan should include unit testing (for the components in your processor), multi-unit testing (for the datapath), and complete-processor testing.
For complete-processor testing, you should write diagnostic programs, similar to those that you wrote in Lab 1 (Broken Spim). Feel free to reuse your Lab 1 test code in Lab 3. Keep in mind that you haven't handled hazards yet (Problem 2), so you must be careful that your test programs don't try to use values too soon after they are generated.
In addition to processor testing, you should also test the top-level module, to debug the interface between the processor and the Calinx board.
To do this, build one or more test-bench modules around the FPGA top-level module above that provide a clock (as in Lab 2), print output to the console when I/O changes, and perhaps "push" the buttons for testing.
Note that debouncing of the switches is tricky when interacting with the clock. Think carefully before trying to test the single-stepping clock feature.
Also think carefully about the I/O features. How will you test these? For demonstrating your pipeline in simulation, create a test module that recognizes when break has been asserted, waits 10 cycles, then asserts the release line -- printing something to the console in the process. This will allow us to run programs that utilize the I/O features to output results.
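A sketch of such a test helper is shown below. The module and port names are hypothetical; `break_asserted` could, for example, be driven from bit 7 of STAT. It counts ten clock edges with break asserted, prints a message, and raises its release output for one cycle.

```verilog
// Hypothetical simulation helper: waits 10 cycles after break is asserted,
// then pulses release_out for one clock cycle.
module break_release_helper (
    input      clk,
    input      break_asserted,   // e.g. STAT[7] (assumption)
    output reg release_out
);
    integer count;

    initial begin
        release_out = 1'b0;
        count       = 0;
    end

    always @(posedge clk) begin
        release_out <= 1'b0;                 // default: not releasing
        if (break_asserted) begin
            if (count == 10) begin
                $display("%0t: break seen for 10 cycles, asserting release", $time);
                release_out <= 1'b1;
                count       <= 0;
            end else begin
                count <= count + 1;
            end
        end else begin
            count <= 0;
        end
    end
endmodule
```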
Here is a list of the hazards you must handle:
ADD $1, $2, $3
SLL $5, $6
SUB $6, $1, $7
The ADD and SUB instructions have a data hazard, even though there is an SLL between them. Be sure to check these kinds of cases.
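Cases like this are usually handled by a bypass (forwarding) select for each operand in the EX stage. The sketch below follows the textbook-style scheme and is only one possible arrangement; the module name, port names, and select encoding are assumptions.

```verilog
// Hypothetical forwarding select for ONE source operand in the EX stage.
// Instantiate once per operand (rs and rt).
module forward_sketch (
    input      [4:0] ex_rs,          // source register of the instruction in EX
    input      [4:0] mem_rd,         // destination of the instruction in MEM
    input            mem_regwrite,   // instruction in MEM writes a register
    input      [4:0] wb_rd,          // destination of the instruction in WB
    input            wb_regwrite,    // instruction in WB writes a register
    output reg [1:0] sel             // 00: register file, 01: MEM, 10: WB
);
    always @(*) begin
        if (mem_regwrite && mem_rd != 5'd0 && mem_rd == ex_rs)
            sel = 2'b01;             // most recent producer takes priority
        else if (wb_regwrite && wb_rd != 5'd0 && wb_rd == ex_rs)
            sel = 2'b10;
        else
            sel = 2'b00;
    end
endmodule
```

In the sequence above, when SUB is in EX the ADD has reached WB, so the select for `$1` would be `10`. Register `$0` is excluded because it always reads as zero. Branches that compare registers in the decode stage need their own bypass or interlock handling, which this sketch does not cover.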
Now that you have basic hazards dealt with, you should figure out how to handle pipeline stalls. Your current processor deals with the load delay slot in the same way as the original version of MIPS: If the compiler generates a code sequence in which a value is loaded from memory and used by the next instruction, the following instruction gets the wrong value. Of course, the instruction specification explicitly disallowed such code sequences; if no other options were available, the compiler would have to introduce a noop in the load delay slot to avoid getting the wrong answer.
As your final exercise, introduce a pipeline stall so that a loaded value can be used by the very next instruction after the load. This feature was added to later versions of the MIPS instruction set. To be clear, we want the following code sequence to do the "obvious" thing, i.e. the add should use the value loaded from memory:
LW $1, 4($2)
ADD $2, $1, $3
Make sure to rerun your tests to verify the processor still works correctly. Write tests that try several different distances between loads and the instructions that use the loaded values.
Hint: the mechanism for this single-cycle stall is very similar to what you need for the break instruction...
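The detection side of the stall can be sketched as follows. The module and signal names are hypothetical, and the check is the conservative textbook form: it compares both source fields of the instruction in decode, even for instructions that do not actually read rt.

```verilog
// Hypothetical load-use hazard detector.
module load_use_sketch (
    input        ex_memread,   // instruction in EX is a load
    input  [4:0] ex_rt,        // destination register of that load
    input  [4:0] id_rs,        // source fields of the instruction in ID
    input  [4:0] id_rt,
    output       stall
);
    assign stall = ex_memread && (ex_rt != 5'd0) &&
                   ((ex_rt == id_rs) || (ex_rt == id_rt));
endmodule
```

For the LW/ADD pair above, `ex_rt` and `id_rs` are both 1 while the ADD sits in decode, so `stall` goes high for that cycle. While it is high, the PC and the fetched instruction are held and a bubble is sent into EX; the loaded value then reaches the ADD through the normal bypass path.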
Map your processor design onto the Calinx board. Make sure to read the information in Lab3Help about the changes needed to get versions of the RAMs that map to Xilinx, and about how to use the TFTP interface. You should be using the same design flow as you did in Lab 2.
Note that you should be able to put the processor in single-step mode (first dipswitch of second set put to zero). Then you should be able to use push-button #4 as a single-stepping clock. You should also be able to spread break instructions in your code and debug code this way. Note also that you should be able to use a loop at the very end of your execution with a combination of break instructions and writes to the I/O (address to 0xFFFFFFF0, data to 0xFFFFFFF4) to dump the contents of your memory to the hexadecimal display when you are done. Make sure that this works!
Make sure that the RESET line causes important processor state to be reset! Remember that "initial" blocks in Verilog will be ignored by the synthesizer. Many bugs can be introduced when registers contain random initial state! One obvious thing that must be reset is the PC. Are there other things?
Don't try to debug everything at once. Start with extremely simple examples. Possibly divert the hexadecimal display to show PC information during debugging (feel free to divert other things as well). For instance, what about a simple program with break as the first instruction followed by a bunch of nops? Can you get that to work? What about simple I/O examples? Once you make sure that your simplest tests work, you can move on to more complicated tests.
Please include information in your writeup about the total number of FPGA slices used for your design and the fraction of the Xilinx part that has been used for your design. This should be available in the log files post place-and-route.
Calculate the cycle time for your pipelined processor. To do this, you will need to understand the Xilinx timing analysis tools. In your writeup, discuss what your critical path is, and what steps you could take to shorten it. If we just took our Lab 2 single-cycle processor and added pipeline registers at key points, we would expect the cycle time to be set by the delay through the longest block (ALU? Next PC? Memory?), with the clock rate being its inverse. Is this the performance that you were able to achieve? Why or why not?
Turn in a copy of your Verilog code (including test benches), processor schematic, diagnostic program(s) and your on-line logs. Also turn in simulation logs that show correct operation of the processor. These logs should show the operations that were performed, and then the contents of memory with the correct values in it. Also turn in logs from your test benches.
As part of your writeup, do a post-mortem for your test plan. Show bug curves, and give examples of the type of bugs you found early on because of your test plan (as well as "escaped" bugs you found later than you would have hoped).
How much time did your team spend on this assignment?