UC Berkeley CS150

Checkpoint 3: Caches and DRAM Interface

Introduction

Recall the project diagram from the previous document:

[Figure: MIPS150Top block diagram]

In this checkpoint we will be concerned with the red colored blocks. The Micron MT4HTF3264HY is the SODIMM (small outline dual inline memory module) on the XUPv5 development board in the Cory 125 lab. To see this SODIMM, flip over the board (carefully) and it can be seen under the protective cover. Detailed information about this SODIMM can be found in its Micron datasheet.

As can be seen in the diagram, the interface to this SODIMM leverages the work done by Xilinx in developing the MIG (memory interface generator) tool. This tool is run through Coregen and supports a handful of DRAM types and timings. The staff have generated the correct MIG files for use with the XUPv5 boards available in Cory 125 and the SODIMMs installed in them. The task for this checkpoint is to implement data and instruction caches by interfacing with the given MIG cores.

Memory Interface Generator (MIG)

The MIG module of most immediate interest is the mig_v3_61 module. This is the highest-level MIG module and provides the easiest interface to the user. This interface consists of the following system signals,

Name Width Direction Description
clk0 1 input Clock signal input that the MIG core runs on.
clk90 1 input Clock signal input that is clk0 phase shifted by 90 degrees.
clkdiv0 1 input Clock signal input that is half the speed of clk0.
clk200 1 input Clock signal input at 200 MHz used to drive IDELAYCTRL.
locked 1 input Indicates that the driving PLL has locked.
sys_rst_n 1 input Active-low signal indicating that the MIG should reset, synchronized to clk0.
rst0_tb 1 output Indicates that circuits interfacing with MIG should reset.
clk0_tb 1 output Clock signal generated by the MIG for user logic that interfaces with it, same frequency as clk0.
phy_init_done 1 output Indicates initialization of memory complete.

the following data interface signals,

Name Width Direction Description
app_af_afull 1 output Address FIFO almost full.
app_af_wren 1 input Address FIFO write enable.
app_af_addr 31 input Address FIFO address.
app_af_cmd 3 input Address FIFO command (000 = Write, 001 = Read).
rd_data_valid 1 output Read data FIFO output valid.
rd_data_fifo_out 128 output Read data FIFO output.
app_wdf_afull 1 output Write data FIFO almost full.
app_wdf_wren 1 input Write data FIFO write enable.
app_wdf_data 128 input Write data FIFO input.
app_wdf_mask_data 16 input Active low write data mask FIFO input.

For low-level documentation of the MIG core, consult ug086.pdf; the important parts start in Chapter 3, page 123. Documentation on the user interface specifically starts on page 141.

The memory port of our MT4HTF3264HY is 64 bits wide. The MIG core generated by the staff has a burst length of 4, meaning that reads and writes optimally occur in sequences of 4 consecutive addresses (hint: this gives a natural size for a cache block). The timing diagram on page 146 gives examples of write timing, and the timing diagram on page 149 gives the same for reads. Each A_{0-3} is a distinct address, each D_{0-3} is a distinct 64-bit data value, and each M_{0-3} is a write mask for the corresponding D_{0-3}. As can be seen in the diagrams, the DRAM interface is DDR (double data rate), meaning that it performs a memory transfer on each clock edge (positive and negative); therefore the memory controller appears to have a data port width of 128 bits. In reality the port is 64 bits wide and clocked twice as fast as the circuit doing the access.

To execute a write

  1. Supply a 31-bit address to app_af_addr; only the low 25 bits matter, so the upper 6 should be zero.
  2. Set app_af_cmd to 3'b000.
  3. Assert app_af_wren.
  4. Supply 128 bits worth of data to app_wdf_data.
  5. Supply 16 bits worth of byte mask to app_wdf_mask_data; remember this signal is active low.
  6. Assert app_wdf_wren.
  7. Repeat steps 4-6 for the burst length (for full throughput).

To execute a read

  1. Supply a 31-bit address to app_af_addr; only the low 25 bits matter, so the upper 6 should be zero.
  2. Set app_af_cmd to 3'b001.
  3. Wait for rd_data_valid to be asserted and capture rd_data_fifo_out.
  4. Repeat for the burst length (for full throughput).

Notice that the address and write data FIFOs can fill up (indicated by app_af_afull and app_wdf_afull), so your cache should be able to stall in that event.
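The following is a minimal, untested sketch of issuing a single write beat through this interface, assuming the request has already been brought into the clk0_tb clock domain (see Clock Crossing below). The MIG signal names come from the tables above; start_write, line_addr, and wr_data are hypothetical cache-side signals you would replace with your own.

// Sketch only: push one address FIFO entry and one 128-bit write data beat.
module mig_write_sketch (
  input               clk0_tb,          // clock the MIG user interface runs on
  input               start_write,      // hypothetical: request one write beat
  input       [30:0]  line_addr,        // 31-bit MIG address (upper 6 bits zero)
  input       [127:0] wr_data,          // one 128-bit beat of write data
  input               app_af_afull,
  input               app_wdf_afull,
  output reg          app_af_wren,
  output reg  [30:0]  app_af_addr,
  output reg  [2:0]   app_af_cmd,
  output reg          app_wdf_wren,
  output reg  [127:0] app_wdf_data,
  output reg  [15:0]  app_wdf_mask_data
);
  always @(posedge clk0_tb) begin
    app_af_wren  <= 1'b0;
    app_wdf_wren <= 1'b0;
    // Stall whenever either FIFO reports almost full.
    if (start_write && !app_af_afull && !app_wdf_afull) begin
      app_af_addr       <= line_addr;
      app_af_cmd        <= 3'b000;      // 000 = write
      app_af_wren       <= 1'b1;
      app_wdf_data      <= wr_data;
      app_wdf_mask_data <= 16'h0000;    // active low: write all 16 bytes
      app_wdf_wren      <= 1'b1;
    end
  end
endmodule

For a full burst you would repeat the data beats as described in steps 4-6 above, and for a read you would push the address with app_af_cmd set to 3'b001 and then capture rd_data_fifo_out in each cycle that rd_data_valid is high.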

Cache Development

The design space for caches is huge! We recommend choosing a simple direct-mapped design and getting it to work before exploring different designs. We suggest you develop the data cache first, while using your current instruction memory; this at least allows you to execute instructions while testing your data cache. There are two main components to consider in the design of the cache. The first is the memory that stores the data and bookkeeping bits: you will need to calculate the size of this memory (or these memories) and generate them with Coregen. A decent starting size to consider is around 16KB of data per cache. Once you have generated these memories, you must begin the design of the controller. It will consist of one or more fairly complicated FSMs; we suggest initially attempting merely to get a design working rather than trying to cut every possible cycle. A working but non-optimal cache is preferable to a broken cache. Lastly, we suggest using a write-back/write-allocate scheme for handling writes to the cache.
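To get a feel for the sizing calculation, here is one possible direct-mapped organization written out as Verilog parameters; the 512-bit line (four 128-bit MIG transfers) and the exact widths are assumptions to adjust to your own design.

module dcache_geometry_sketch;
  // One possible organization (assumptions, not requirements):
  localparam CACHE_BYTES = 16 * 1024;                      // 16KB of data
  localparam LINE_BYTES  = 64;                             // 512-bit line = four 128-bit MIG words
  localparam NUM_LINES   = CACHE_BYTES / LINE_BYTES;       // 256 lines, direct mapped
  localparam OFFSET_BITS = 6;                              // log2(64) byte-offset bits
  localparam INDEX_BITS  = 8;                              // log2(256) index bits
  // The DRAM is 256MB, so only 28 address bits select a byte within it:
  localparam TAG_BITS    = 28 - INDEX_BITS - OFFSET_BITS;  // 14 tag bits
  // Per-line bookkeeping: TAG_BITS of tag, one valid bit, one dirty bit (write-back).
endmodule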

Clock Crossing

As you get farther into the design you will come to realize that the MIG core runs at a clock frequency (200 MHz) that differs from the cache clock frequency. Because of this we will need to deal with clock-domain crossing issues. These issues come up with the three FIFOs and the reset signal. More concretely, the MIG core expects its signals to be clocked with the clk0_tb clock, which, as stated earlier, is most likely quite different from the cache's clock. This means that all signals interfacing with the MIG must first run through clock-crossing FIFOs: one for the address, one for the write data, and one for the read data. As you can see, the MIG already has FIFOs for these signals; they are just in the wrong clock domain. To solve this, send all addresses and write data going from the cache to the MIG through FIFOs with a read clock of clk0_tb and a write clock of cpu_clk_g (mig_af and mig_wdf). Likewise, send all read data coming from the MIG core through a FIFO with a read clock of cpu_clk_g and a write clock of clk0_tb (mig_rdf). The Coregen files for creating these FIFOs are available in the skeleton files under the names mig_af, mig_wdf, and mig_rdf. The last signal to synchronize is rst0_tb; this signal is clocked with clk0_tb and will need a synchronizer to bring it into the cpu_clk_g domain. This can be done with the synchronizer discussed in lecture: a 2-bit shift register clocked by the receiving domain's clock.
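A minimal sketch of that synchronizer, with placeholder module and port names:

// Two-stage synchronizer: brings the MIG's rst0_tb into the cpu_clk_g domain.
// Only safe for single-bit, slowly-changing signals such as a reset.
module reset_synchronizer (
  input  cpu_clk_g,   // receiving clock domain
  input  rst0_tb,     // reset generated in the clk0_tb domain
  output rst_cpu      // reset usable by logic clocked on cpu_clk_g
);
  reg [1:0] sync;
  always @(posedge cpu_clk_g)
    sync <= {sync[0], rst0_tb};
  assign rst_cpu = sync[1];
endmodule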

Simulation

We have built a simulation model from the provided Xilinx models; it can be found in the mt4htf3264hy module located in the mig_v3_61 directory. Please instantiate this model in your testbench and connect it to the ml505top module in order to simulate your DRAM interface. Furthermore, you may want to set the parameter SIM_ONLY to 1 on the MIG core; this allows it to initialize faster during simulation, but remember to remove this when synthesizing! Lastly, we recommend the liberal use of $display statements; this is the easiest way to debug, and simple scripts can help you sort through and interpret the messages.
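For example, one convenient way to use $display is a small simulation-only monitor hooked onto the MIG address FIFO signals; the module below is a sketch (the name and which signals you tap are up to you).

// Simulation-only monitor: prints every command pushed into the MIG address FIFO.
module mig_af_monitor (
  input        clk0_tb,
  input        app_af_wren,
  input [2:0]  app_af_cmd,
  input [30:0] app_af_addr
);
  always @(posedge clk0_tb)
    if (app_af_wren)
      $display("time=%0t MIG cmd=%b addr=%h", $time, app_af_cmd, app_af_addr);
endmodule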

The staff have written a block RAM memory model generator in Python. This script will allow you to generate Verilog modules that mimic the block RAMs in simulation. Warning: these modules will not synthesize; they are merely for increased design visibility. To run this script, use the following command,

brammodelgen [module-name] [bytes-wide] [address-width] > [module-name].v

Feel free to edit the module that is produced in any way, for instance by adding $display statements to see what is going on inside the model.
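For example, a hypothetical invocation generating a model named dmem_blk_ram that is 4 bytes wide with a 12-bit address would be,

brammodelgen dmem_blk_ram 4 12 > dmem_blk_ram.v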

A testbench exists in the module ddr2_tb_top; this can be used to test the MIG module. Instantiate it and connect it to the correct ports on the MIG core. When running, it will generate a series of addresses and data values, write them to memory, then read them back out and compare. If any errors are detected during this process it will assert error or error_cmp. The parameters for the module should be configured to match those of the generated MIG core.

Simulating this testbench can give a good understanding of how to perform burst reads and writes, and furthermore verify that the MIG core is operating correctly in simulation and on board.

Running Programs

To begin running programs we suggest taking the following steps,

First, you want to attempt to run something that does not have a static data section; a good example is a vector-vector add program that initializes two regions of memory with vectors of values and then adds them together. To guarantee that no static data section is created, we suggest writing an assembly file similar to the following,

li $t0, 0x7000 # some memory address, location of first vector
li $t1, 0x0000
sw $t1, 0($t0)
li $t1, 0x0001
sw $t1, 4($t0)
...
li $t0, 0x8000 # some memory address, location of second vector
li $t1, 0x0200
sw $t1, 0($t0)
li $t1, 0x0020
sw $t1, 4($t0)
...

This type of test should at least exercise your cache and cause evictions to happen. If you use the simulation model provided and a small cache (around 4 or 8 entries), it should be fairly easy to verify correct behaviour. The easiest way to run this program is to map the block RAM (or ROM) storing the program directly into the instruction fetch address space, and the data cache directly into the dmem address space.

Second, once the previous example works, the next step is to execute programs that have a static data section. This takes some modification of the address space, leading to a simplification of the decoding. Since the DRAM is 256MB, we need the low 28 bits of our 32-bit address space to address every byte of the DRAM; this leaves the upper 4 bits to select between different devices. This leads to the following address space allocation for the dmem accesses,

Device R/W Address Pattern
D$ R/W xxx1_xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_xxxx
BIOS R x1xx_xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_xxxx
IO R/W 1xxx_xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_xxxx

and for the imem accesses we leave a few devices off,

Device R/W Address Pattern
BIOS R x1xx_xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_xxxx

What this means is that, from the point of view of dmem accesses (sw, lw, etc.), any access to an address with bit 28 set should go to the dcache, any access with bit 30 set should go to the BIOS (read-only), and any access with bit 31 set should go to the IO subsystem. The new addresses for the IO registers are therefore as follows,

Address R/W Function
0x80000000 R UART Receiver control
0x80000004 R UART Receiver data
0x80000008 R UART Transmitter control
0x8000000C W UART Transmitter data
0x80000010 R ENET Receiver control
0x80000014 R ENET Receiver data
0x80000018 R ENET Transmitter control
0x8000001C W ENET Transmitter data
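To make the decode concrete, here is a minimal sketch of the dmem-side selects for this layout; dmem_addr and mem_write are placeholder names for signals from your datapath.

// dmem address decode for the step-two memory map (sketch only).
module dmem_decoder (
  input  [31:0] dmem_addr,   // address from the load/store unit (placeholder name)
  input         mem_write,   // asserted for sw-type accesses
  output        dcache_sel,  // xxx1_... : data cache
  output        bios_sel,    // x1xx_... : BIOS memory (read-only from dmem)
  output        io_sel       // 1xxx_... : memory-mapped IO
);
  assign dcache_sel = dmem_addr[28];
  assign bios_sel   = dmem_addr[30] & ~mem_write;
  assign io_sel     = dmem_addr[31];
endmodule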

To accommodate this change, the linker will need to know how to offset the BIOS program, and the stack pointer will need to be set to the top of the DRAM. To do this, edit the bios150v2.ld file; there is a line

. = 0x0

This tells the GNU linker that it should offset the code (jump addresses and static memory accesses) to the address 0x0; with the new layout we want the address 0x40000000. Lastly, we will need to modify the immediate loaded into the $sp register in the start.s file. This should be set to the top of the DRAM, or 0x1FFFFFFC.
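Concretely, the edited line in bios150v2.ld becomes

. = 0x40000000

and the $sp setup in start.s becomes something like the following (assuming it uses a li pseudo-instruction; check the actual file),

li $sp, 0x1FFFFFFC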

Third, we want to actually run instructions out of the icache, and not the BIOS area of memory. To do this we will need to memory map the icache into the imem access address space, leading to the following layout for the imem fetches,

Device R/W Address Pattern
I$ R xxx1_xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_xxxx
BIOS R x1xx_xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_xxxx

This makes sense because we want programs that fetch their instructions from the icache to access their data through the dcache (both are backed by the same DRAM addresses). To handle this, programs wanting to execute out of the icache will need to have their linker script modified, setting the origin as follows,

. = 0x10000000

Examples of helpful programs that can be used to test this are bios150, and something simpler like the following,

li $t0, 0x10000000 # the dcache
li $t1, 0x27bdfff0 # addiu $sp, $sp, -16
sw $t1, 0($t0)
...
jr $t0

This code will store instructions into the data cache; they will eventually be evicted to the DRAM and, upon the jr, pulled into the icache. There is a small cache coherency problem here: it is possible the icache will not see the writes to DRAM if they have not yet been evicted from the dcache. That is solved in the following step.

Fourth, we would like to solve the coherency issue mentioned above. To do this we will memory-map the icache into the dmem address space; this will allow us to perform stores directly into the icache. Therefore the final dmem access address space looks like the following,

Device R/W Address Pattern
D$ R/W xxx1_xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_xxxx
I$ W xx1x_xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_xxxx
BIOS R x1xx_xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_xxxx
IO R/W 1xxx_xxxx_xxxx_xxxx_xxxx_xxxx_xxxx_xxxx

Since, in general, the icache will need to be read every cycle, we will limit writing to the icache to the case where the processor is executing BIOS code. This allows us to write to the icache without stalling the instruction fetch. Furthermore, since address decodes are done on a per-bit basis, writing to the address 0x30000000 (bits 28 and 29 both set) should write to both the icache and the dcache, keeping them coherent. Therefore the program from the previous example would be,

li $t0, 0x30000000 # both the icache and the dcache
li $t1, 0x27bdfff0 # addiu $sp, $sp, -16
sw $t1, 0($t0)
...
jr $t0

If this program is stored in the BIOS memory, it will be allowed to write the instructions loaded as immediates into both the icache and the dcache, and then it will jump to those instructions. Eventually both the icache and dcache will evict them, but since the two copies are the same, this does not cause problems.
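A minimal sketch of the step-four write decode, assuming a pc_in_bios signal that is high when the current instruction was fetched from the BIOS region (all names here are placeholders):

// Final dmem write decode: stores can hit the dcache, the icache (only while
// executing BIOS code), or the memory-mapped IO region.
module dmem_write_decoder (
  input  [31:0] dmem_addr,   // store address from the datapath (placeholder name)
  input         mem_write,   // asserted for sw-type accesses
  input         pc_in_bios,  // high when the current instruction came from the BIOS
  output        dcache_we,   // xxx1_... : data cache write enable
  output        icache_we,   // xx1x_... : instruction cache write enable
  output        io_we        // 1xxx_... : memory-mapped IO write enable
);
  assign dcache_we = mem_write & dmem_addr[28];
  // Icache writes are only honored while executing BIOS code, so the
  // instruction fetch port is never contended during normal execution.
  assign icache_we = mem_write & dmem_addr[29] & pc_in_bios;
  assign io_we     = mem_write & dmem_addr[31];
endmodule

With this decode, a store to 0x30000000 (bits 28 and 29 both set) asserts dcache_we and icache_we simultaneously, which is what keeps the two caches coherent while the BIOS loads a program.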

Checkoff

Checkoff will consist of running some programs out of the instruction and data caches.