Recall the project diagram from the previous document:
In this checkpoint we will be concerned with the red-colored blocks. The Micron MT4HTF3264HY is the SODIMM (small-outline dual in-line memory module) on the XUPv5 development board in the Cory 125 lab. To see this SODIMM, flip the board over (carefully); it can be seen under the protective cover. The following gives some information about the SODIMM:
As can be seen in the diagram, the interface to this SODIMM will leverage the work done by Xilinx to develop the MIG (Memory Interface Generator) tool. This tool is run through Coregen and supports a handful of DRAM types and timings. The staff have generated the correct MIG files for use with the XUPv5 and the SODIMMs installed in the XUPv5 boards available in Cory 125. The task for this checkpoint is to implement data and instruction caches by interfacing with the given MIG cores.
The MIG module of most immediate interest is the mig_v3_61 module. This is the highest-level MIG module and provides the easiest interface for the user. This interface consists of the following system signals,
Name | Width | Direction | Description |
---|---|---|---|
clk0 | 1 | input | Clock input the MIG core runs at. |
clk90 | 1 | input | Clock input that is clk0 phase-shifted by 90 degrees. |
clkdiv0 | 1 | input | Clock input at half the speed of clk0. |
clk200 | 1 | input | Clock input at 200 MHz used to drive IDELAYCTRL. |
locked | 1 | input | Indicates that the driving PLL has locked. |
sys_rst_n | 1 | input | Indicates that the MIG should reset; synchronized to clk0. |
rst0_tb | 1 | output | Indicates that circuits interfacing with the MIG should reset. |
clk0_tb | 1 | output | Clock generated to interface with the MIG; same frequency as clk0. |
phy_init_done | 1 | output | Indicates that memory initialization is complete. |
and the following data interface signals,
Name | Width | Direction | Description |
---|---|---|---|
app_af_afull | 1 | output | Address FIFO almost full. |
app_af_wren | 1 | input | Address FIFO write enable. |
app_af_addr | 31 | input | Address FIFO address. |
app_af_cmd | 3 | input | Address FIFO command (000 = write, 001 = read). |
rd_data_valid | 1 | output | Read data FIFO output valid. |
rd_data_fifo_out | 128 | output | Read data FIFO output. |
app_wdf_afull | 1 | output | Write data FIFO almost full. |
app_wdf_wren | 1 | input | Write data FIFO write enable. |
app_wdf_data | 128 | input | Write data FIFO input. |
app_wdf_mask_data | 16 | input | Active-low write data mask FIFO input. |
For low-level documentation of the MIG core, consult ug086.pdf; the important parts start in Chapter 3, page 123. Documentation specific to the user interface starts on page 141.
The memory port of our MT4HTF3264HY is 64 bits wide. The MIG core generated by the staff has a burst length of 4, meaning that reads and writes optimally occur in sequences of 4 consecutive addresses (hint: this gives a natural size for a cache block). The timing diagram on page 146 gives examples of write timing, and the timing diagram on page 149 gives the same for reads. Each A_{0-3} is a distinct address, each D_{0-3} is a distinct 64-bit data value, and each M_{0-3} is a write mask for the corresponding D_{0-3}. As can be seen in the diagram, the DRAM interface is DDR (Double Data Rate), meaning that it does a memory transfer on each clock edge (positive and negative); therefore the memory controller appears to have a data port width of 128 bits. In reality the port is 64 bits wide and clocked twice as fast as the circuit doing the access.
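As a sanity check on these widths, the arithmetic can be sketched as follows. Treat this as back-of-the-envelope reasoning, not a spec; the constants come from the text above.

```python
# Back-of-the-envelope arithmetic for the MIG data path described above.
DRAM_PORT_BITS = 64   # physical width of the MT4HTF3264HY data port
BURST_LEN = 4         # beats per burst in the staff-generated MIG core
APPDATA_BITS = 128    # user-side width: 64-bit DDR port moves 2x64 bits per cycle

bits_per_burst = DRAM_PORT_BITS * BURST_LEN              # 256 bits
bytes_per_burst = bits_per_burst // 8                    # 32 bytes
fifo_entries_per_burst = bits_per_burst // APPDATA_BITS  # 2 x 128-bit FIFO words

print(bytes_per_burst)          # 32 bytes: one natural cache block size
print(fifo_entries_per_burst)   # 2 app_wdf_data entries per burst
```

So one burst moves 32 bytes through two 128-bit user-side FIFO entries, which is why a 32-byte cache block lines up naturally with the controller.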
To execute a write:

1. Put the target address on app_af_addr; only the low 25 bits matter, while the upper 6 should be zero.
2. Set app_af_cmd to 3'b000.
3. Assert app_af_wren.
4. Put the write data on app_wdf_data.
5. Put the write mask on app_wdf_mask_data; remember this signal is active low.
6. Assert app_wdf_wren.

To execute a read:

1. Put the target address on app_af_addr; only the low 25 bits matter, while the upper 6 should be zero.
2. Set app_af_cmd to 3'b001.
3. Assert app_af_wren.
4. Wait for rd_data_valid to be asserted.

Notice that the address and write FIFOs can fill up, so your cache should be able to stall in that event.
The design space for caches is huge! We recommend choosing a simple direct-mapped design and getting it to work prior to exploring different designs. We suggest you develop the data cache first, while using your current instruction memory; this at least allows you to execute instructions while testing your data cache. There are two main components to consider in the design of the cache. The first is the memory that stores the data and bookkeeping bits: you will need to calculate the size of these memories and generate them with coregen. A decent starting size to consider would be around 16KB of data per cache. Once you have generated these memories, you must begin the design of the memory controller. It will consist of one or more fairly complicated FSMs; we suggest initially attempting to merely get a design working, rather than trying to cut every possible cycle. A working, but non-optimal, cache is preferable to a broken cache. Lastly, we suggest using a write-back/write-allocate scheme for handling writes to the cache.
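To make the bookkeeping concrete, here is how the address might break down for one possible configuration: 16KB of data, 32-byte blocks, direct mapped. All of these numbers are assumptions you are free to change.

```python
# Address breakdown for a hypothetical 16KB direct-mapped cache
# with 32-byte blocks; adjust the constants to match your design.
CACHE_BYTES = 16 * 1024
BLOCK_BYTES = 32
ADDR_BITS = 32

num_blocks = CACHE_BYTES // BLOCK_BYTES          # 512 blocks
offset_bits = BLOCK_BYTES.bit_length() - 1       # 5 bits of byte offset
index_bits = num_blocks.bit_length() - 1         # 9 bits of index
tag_bits = ADDR_BITS - index_bits - offset_bits  # 18 bits of tag

def split(addr):
    """Split a byte address into (tag, index, offset)."""
    offset = addr & (BLOCK_BYTES - 1)
    index = (addr >> offset_bits) & (num_blocks - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset

print(tag_bits, index_bits, offset_bits)  # 18 9 5
```

The tag store then needs num_blocks entries of tag_bits plus a valid bit and, for write-back, a dirty bit per block.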
As you get farther into the design you will come to realize that the MIG core runs at a clock frequency (200 MHz) that differs from the cache clock frequency. Because of this we will need to deal with clock-domain crossing issues. These issues come up with the three FIFOs and the reset signal. More concretely, the MIG core expects signals to be clocked using the clk0_tb clock. As stated earlier, this clock is most likely quite different from the cache's clock. This means that all signals interfacing with the MIG must first run through clock-crossing FIFOs: one for the address, one for the write data, and one for the read data. As you can see, the MIG already has FIFOs related to these signals; they are just in the wrong clock domain. To solve this, send all addresses and write data going from the cache to the MIG through a FIFO with a read clock of clk0_tb and a write clock of cpu_clk_g (mig_af and mig_wdf). Likewise, send all read data from the MIG core through a FIFO with a read clock of cpu_clk_g and a write clock of clk0_tb (mig_rdf). The coregen files for creating these FIFOs are available in the skeleton files with the names mig_af, mig_wdf, and mig_rdf. The last signal to synchronize is rst0_tb; again, this signal is clocked with the clk0_tb clock and will need a synchronizer to bring it into the cpu_clk_g domain. This can be done with the synchronizer discussed in lecture, merely a 2-bit shift register clocked at the receiving domain's clock.
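The two-flip-flop synchronizer can be modeled behaviorally as follows. This is a Python sketch of the shift register for intuition only, not the Verilog you would write.

```python
class TwoFFSync:
    """Models the 2-bit shift register synchronizer: two flip-flops
    clocked in the receiving domain, output lagging by two cycles."""

    def __init__(self):
        self.ff = [0, 0]  # ff[0] is the first (metastability) stage

    def tick(self, async_in):
        # On each receiving-domain clock edge, shift the asynchronous
        # input one stage deeper and present the old second stage.
        out = self.ff[1]
        self.ff[1] = self.ff[0]
        self.ff[0] = async_in
        return out

sync = TwoFFSync()
print([sync.tick(1) for _ in range(3)])  # [0, 0, 1]: two cycles of latency
```

The first flip-flop may go metastable when the input changes near a clock edge; the second gives it a full cycle to resolve before the rest of the cpu_clk_g logic sees the value.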
We have built a simulation model from the provided Xilinx models; it can be found in the mt4htf3264hy module located in the mig_v3_61 directory. Please instantiate this model in your testbench and connect it to the ml505top module in order to simulate your DRAM interface. Furthermore, you may want to set the parameter SIM_ONLY to 1 for the MIG core; this allows it to initialize faster during simulation, but remember to remove this when synthesizing! Lastly, we recommend the liberal use of $display statements; this is the easiest way to debug, and simple scripts can help you sort through and interpret the messages.
The staff have written a block RAM memory model generator in Python. This script will allow you to generate Verilog modules that mimic the block RAM in simulation. Warning: these modules will not synthesize; they are merely for increased design visibility. To run this script use the following command,
brammodelgen [module-name] [bytes-wide] [address-width] > [module-name].v
Feel free to edit the module that is produced in any way, for instance by adding $display statements to see what is going on inside the model.
A testbench exists in the module ddr2_tb_top; this can be used to test the MIG module. Instantiate it and connect it to the correct ports on the MIG core. When running, it will generate a series of addresses and data values, write them to memory, then read them back out and compare. If any errors are detected during this process it will assert error or error_cmp. The parameters for the module should be configured as follows,
BANK_WIDTH=2
COL_WIDTH=10
DM_WIDTH=8
DQ_WIDTH=64
ROW_WIDTH=13
APPDATA_WIDTH=128
ECC_ENABLE=0
BURST_LEN=4
Simulating this testbench can give a good understanding of how to perform burst reads and writes, and furthermore verify that the MIG core is operating correctly in simulation and on board.
To begin running programs, we suggest taking the following steps.
First, you want to attempt to run something that does not have a static data section; a good example is a vector-vector add program that initializes two regions of memory with a vector of values and then adds them together. To guarantee that no static data section is created, we suggest writing an assembly file similar to the following,
li $t0, 0x7000 # some memory address, location of first vector
li $t1, 0x0000
sw $t1, 0($t0)
li $t1, 0x0001
sw $t1, 4($t0)
.
.
.
li $t0, 0x8000 # some memory address, location of second vector
li $t1, 0x0200
sw $t1, 0($t0)
li $t1, 0x0020
sw $t1, 4($t0)
.
.
.
This type of test should at least exercise your cache and cause evictions to happen; if you use the simulation model provided and a small cache (around 4 or 8 entries), it should be fairly easy to verify correct behaviour. The easiest way to run this program is to map the block RAM (or ROM) storing this program directly into the instruction fetch address space, and the data cache directly into the dmem address space.
Second, once the previous example works, the next step is to execute programs that have a static data section. This takes some modification of the address space, leading to a simplification of the decoding. Since the DRAM is 256 MB, we know that we need the low 28 bits of our 32-bit address space to address every byte of the DRAM; this leaves us the upper 4 bits to select between different devices. This leads to the following address space allocation for dmem accesses,
Device | R/W | Address Pattern |
---|---|---|
D$ | R/W | xxx1_xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_xxxx |
BIOS | R | x1xx_xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_xxxx |
IO | R/W | 1xxx_xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_xxxx |
and for the imem accesses we leave a few devices off,
Device | R/W | Address Pattern |
---|---|---|
BIOS | R | x1xx_xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_xxxx |
What this means is that, from the dmem access point of view (sw, lw, etc.), any access to an address with the 29th bit set should go to the dcache, any access with the 31st bit set should go to the BIOS (read-only), and any access with the 32nd bit set should go to the IO subsystem. This means that the new addresses for the IO registers are as follows,
Address | R/W | Function |
---|---|---|
0x80000000 | R | UART Receiver control |
0x80000004 | R | UART Receiver data |
0x80000008 | R | UART Transmitter control |
0x8000000C | W | UART Transmitter data |
0x80000010 | R | ENET Receiver control |
0x80000014 | R | ENET Receiver data |
0x80000018 | R | ENET Transmitter control |
0x8000001C | W | ENET Transmitter data |
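The bitwise decode implied by these tables can be sketched as follows; the bit positions come straight from the address patterns above, while the function name is our own.

```python
def dmem_targets(addr):
    """Return the set of devices a dmem access at addr selects,
    per the bit patterns in the dmem address map above."""
    targets = set()
    if addr & (1 << 28):   # xxx1_... pattern: data cache
        targets.add("D$")
    if addr & (1 << 30):   # x1xx_... pattern: BIOS (read only)
        targets.add("BIOS")
    if addr & (1 << 31):   # 1xxx_... pattern: IO subsystem
        targets.add("IO")
    return targets

print(dmem_targets(0x10000000))  # {'D$'}
print(dmem_targets(0x40000000))  # {'BIOS'}
print(dmem_targets(0x80000000))  # {'IO'}
```

Because each device gets its own bit, the decode is a handful of AND gates per device rather than a full address comparison, which is the simplification mentioned above.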
To accommodate this change, the linker will need to know how to offset the BIOS program, and the stack pointer will need to be set to the top of the DRAM. To do this, edit the bios150v2.ld file; there is a line
. = 0x0
This tells the GNU linker that it should offset the code (jump addresses and static memory accesses) to the address 0x0; with the new layout we want the address 0x40000000. Lastly, we will need to modify the immediate loaded into the $sp register in the start.s file. This should be set to the top of the DRAM, or 0x1FFFFFFC.
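The value 0x1FFFFFFC follows from the layout: the 256 MB DRAM is based at 0x10000000 in the D$ region of the address map, and the stack pointer should point at the last word-aligned address of that region. A quick check:

```python
# Where the top of DRAM lands in the D$ region of the address map.
DRAM_BYTES = 256 * 1024 * 1024   # 256 MB = 0x10000000 bytes
DCACHE_BASE = 0x10000000         # base of the xxx1_... region

top_of_dram = DCACHE_BASE + DRAM_BYTES - 4   # last 32-bit word
print(hex(top_of_dram))  # 0x1ffffffc
```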
Third, we want to actually run instructions out of the icache, and not the BIOS area of memory. To do this we will need to memory map the icache into the imem access address space, leading to the following layout for the imem fetches,
Device | R/W | Address Pattern |
---|---|---|
I$ | R | xxx1_xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_xxxx |
BIOS | R | x1xx_xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_xxxx |
this makes sense because we want programs that are fetching instructions from the icache to be pulling data in from the dcache. To handle this, programs wanting to execute out of the icache will need to have their linker script modified, setting the origin as follows,
. = 0x10000000
Examples of helpful programs that can be used to test this are bios150 and something simpler like the following,
li $t0, 0x10000000 # the dcache
li $t1, 0x27bdfff0 # addiu $sp, $sp, -16
sw $t1, 0($t0)
.
.
.
jr $t0
this code will store instructions into the data cache, which will then be evicted to the DRAM and, upon the jr, pulled into the icache. Now there is a small cache coherency problem here, as it is possible the icache will not see the writes to DRAM if they have not yet been evicted from the dcache; that is solved in the following step.
Fourth, we would like to solve the coherency issue mentioned above. To do this we will memory map the icache into the dmem address space, which will allow us to perform stores directly into the icache. Therefore the final dmem access address space looks like the following,
Device | R/W | Address Pattern |
---|---|---|
D$ | R/W | xxx1_xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_xxxx |
I$ | W | xx1x_xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_xxxx |
BIOS | R | x1xx_xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_xxxx |
IO | R/W | 1xxx_xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_xxxx |
Since, in general, the icache will need to be read every cycle, we will limit writing to the icache to the case where the processor is executing BIOS code. This allows us to write to the icache without stalling the instruction fetch. Furthermore, since address decodes are done on a bit basis, writing to the address 0x30000000 should write to both the icache and dcache, keeping them coherent. Therefore the program from the previous example would be,
li $t0, 0x30000000 # the icache and dcache
li $t1, 0x27bdfff0 # addiu $sp, $sp, -16
sw $t1, 0($t0)
.
.
.
jr $t0
If this program is stored in the BIOS memory, it will be allowed to write the instruction loaded as an immediate into both the icache and dcache, then it will jump to those instructions. Eventually both the icache and dcache will evict the block, but since the copies are the same this does not cause problems.
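Under the final dmem decode, an address with both bit 28 and bit 29 set selects both caches at once, which is exactly why 0x30000000 is used above. A small decode sketch (the helper name is our own):

```python
def dmem_write_targets(addr):
    """Devices selected by a dmem store, per the final dmem address map."""
    targets = set()
    if addr & (1 << 28):   # xxx1_...: data cache (R/W)
        targets.add("D$")
    if addr & (1 << 29):   # xx1x_...: instruction cache (W only)
        targets.add("I$")
    if addr & (1 << 31):   # 1xxx_...: IO subsystem
        targets.add("IO")
    return targets

print(sorted(dmem_write_targets(0x30000000)))  # ['D$', 'I$']: a coherent write
```

A store to 0x10000000 would update only the dcache; setting bit 29 as well mirrors the write into the icache, keeping the two copies identical.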
Checkoff will consist of running some programs out of the instruction and data caches.