Pairs and the heap
Today we’ll add a new type to our language: the pair type. A pair, as its name suggests, packages up two values into one value. We can then access the components individually. We’ll implement a few operations to work with this new type:
(pair e1 e2)
makes a new pair out ofe1
ande2
(left e)
gets the left component of the paire
evaluates to(right e)
gets the right component of the paire
evaluates to
This is a simple addition to our language, but it’s quite powerful! In Scheme
and other Lisp-like languages, the pair
operation is often called cons
(short for “construct”) and pairs are called cons-cells. These cons-cells are
used to implement complex nested structures like lists and trees. In our
language, we could implement a list of values like this:
(pair 1 (pair 2 (pair 3 false)))
We can access the first element of the list with left
and the rest of the list
with right
. In Lisp, the left
and right
operations are called car
and
cdr
; the names refer to assembly instructions on the computer on which Lisp
was originally implemented.
Pairs in the interpreter
Our implementation of pairs in the interpreter will be straightforward because
we’ll rely on OCaml’s own pair implementation. Pairs are a kind of value, so
we’ll need to add a constructor to our value
type and implement printing:
type value = Number of int | Boolean of bool | Pair of (value * value) let rec string_of_value (v : value) : string = match v with | Number n -> string_of_int n | Boolean b -> if b then "true" else "false" | Pair (v1, v2) -> Printf.sprintf "(pair %s %s)" (string_of_value v1) (string_of_value v2)
We’ll then implement the pair operations:
let rec interp_exp env (exp : s_exp) : value = match exp with | Lst [Sym "pair"; e1; e2] -> Pair (interp_exp env e1, interp_exp env e2) | Lst [Sym "left"; e] -> ( match interp_exp env e with | Pair (v, _) -> v | _ -> raise (BadExpression exp) ) | Lst [Sym "right"; e] -> ( match interp_exp env e with | Pair (_, v) -> v | _ -> raise (BadExpression exp) )
Pairs in the compiler
OK, so now we’ve got a value which is actually a pair of values. This is going to be a bit tricky to implement in the compiler–remember, all of our values have to fit in a 64-bit register! And since our other values are 64 bits long, it’s going to be hard to fit two of them in 64 bits.
My undergrad advisor, Tia Newhall, frequently quoted the Fundamental theorem of software engineering (“We can solve any problem by introducing an extra level of indirection.”) to explain how systems adapt to this kind of challenge. What if we added a level of indirection? Instead of trying to store a pair in a register, can we store the address of a memory location that stores a pair?
We’ve seen how to write to a section of memory called the stack; we’ve used it to store the values of variables, as well as temporary values used during computation. Can we put pairs on the stack?
If we try to do that, we’re going to run into some trouble. Consider our let expression:
(let ((x <expr)) <body>)
Right now, regardless of what expr
is, we can compile the let
by moving the
value in rax
(i.e., the value of expr
) to the next available stack
address. So how should we compile this expression?
(let ((x (pair 1 2))) (left x))
If that pair lives on the stack, we have to put x
somewhere else! We’d need to
somehow track how much stack space the expression used, and put x
after
it. What about an expression like
(let ((x (if (< 1 2) (pair 1 (pair 2 3)) (pair 1 2))) (left x))
How much stack space will we use for the let? At compile-time, we really don’t know.
We need to put that pair someplace where the let
won’t clobber it. If the
stack is for “short-lived” values whose size is known at compile-time, we need
somewhere to put longer-lived values whose size might only be known at run-time.
This other area of memory is called the heap. The heap is where objects live in
Python and Java. It’s where OCaml puts complex objects (including pairs). In C,
we can allocate space on the heap with the malloc
call.
Just as the stack traditionally “grows” from higher to lower addresses, the heap
will grow from lower addresses to higher addresses. Just as our program
starts with the address of the bottom of the stack in the rsp
register, it
will start with the bottom of the heap in the rdi
register (we’ll see how we
can make that happen in the runtime momentarily). We keep track of indexes into
the stack at compile-time; the rsp
register never changes. As we saw above,
though, we won’t be able to do that for the heap: at compile-time, we don’t
actually know how much heap space we’ll use. So we’ll need to increment the
rdi
register as our program executes in order to make sure it always points at
the next available chunk of heap memory.
Let’s see how that works by implementing the pair
operation:
let rec compile_exp (tab : int symtab) (stack_index : int) (exp : s_exp) : directive list = match exp with | Lst [Sym "pair"; e1; e2] -> compile_exp tab stack_index e1 @ [Mov (stack_address stack_index, Reg Rax)] @ compile_exp tab (stack_index - 8) e2 @ [ Mov (Reg R8, stack_address stack_index) ; Mov (MemOffset (Reg Rdi, Imm 0), Reg R8) ; Mov (MemOffset (Reg Rdi, Imm 8), Reg Rax) ; Mov (Reg Rax, Reg Rdi) ; Add (Reg Rdi, Imm 16) ]
All we’re doing here is moving the arguments to pair
into memory, storing the
right address in rax
, and making sure rdi
still points at the last place in
the heap.
Let’s go ahead and implement the runtime now. First, we’ll need to make sure
rdi
points at a region of memory we can write to. We’ll do that like this:
extern uint64_t entry(void *heap); // some definitions elided... int main(int argc, char **argv) { print_value(entry((void*)malloc(4096))); return 0; }
We’re allocating 4096 bytes of space to use as our heap, then passing its
address into our entry
function. In the x86-64 C calling convention, the first
argument to a function is passed in the rdi
register, so this will do what we
want it to!
Now, let’s implement print_value
for our pairs. Here we’re going to run into
some trouble. After all, at runtime a pair
value is just a memory
address. These addresses are just 64-bit integers! How will we know that we’re
looking at the address of a pair and not at an integer, or a boolean, or some
future other heap-allocated type? We want to tag our pair addresses just like
we’re tagging our other types. We really can’t shift them, though–x86-64 might
use the whole 64-bit number!
We’re going to use a little trick in order to make this work. All of our values
are 64-bit numbers, and we’re storing values in memory; so, we’ve been modeling
memory as an array of 8-byte cells. As long as we only ever access memory like
this, we’ll only ever increment rdi
by multiples of 8. So as long as rdi
starts out as a multiple of 8, all of our addresses will end in 0b000
. And,
malloc is guaranteed to return a multiple of 8 (actually 16). So, since all of
our actual addresses will end in 0b000
, we can use those last three bits to
store a tag.
This tag can’t end in 0b00
, since that’s used by numbers. And it can’t end in
0b111
, since our tags for booleans (and characters, on the homework) end in
0b111
. All of the other three-bit sequences, however, are fair game! We’ll use
0b010
for pairs. Here’s how that looks in the runtime:
#define heap_mask 0b111 #define pair_tag 0b010 void print_value(uint64_t value) { if ((value & num_mask) == num_tag) { int64_t ivalue = (int64_t)value; printf("%" PRIi64, ivalue >> num_shift); } else if ((value & bool_mask) == bool_tag) { if (value >> bool_shift) { printf("true"); } else { printf("false"); } } else if ((value & heap_mask) == pair_tag) { uint64_t v1 = *(uint64_t*)(value - pair_tag); uint64_t v2 = *(uint64_t*)(value - pair_tag + 8); printf("(pair "); print_value(v1); printf(" "); print_value(v2); printf(")"); } else { printf("BAD %" PRIu64, value); } }
(Remember to recompile the runtime with gcc -c runtime.c -o runtime.o
.)
And in the compiler:
let heap_mask = 0b111 let pair_tag = 0b010 let rec compile_exp (tab : int symtab) (stack_index : int) (exp : s_exp) : directive list = match exp with | Lst [Sym "pair"; e1; e2] -> compile_exp tab stack_index e1 @ [Mov (stack_address stack_index, Reg Rax)] @ compile_exp tab (stack_index - 8) e2 @ [ Mov (Reg R8, stack_address stack_index) ; Mov (MemOffset (Reg Rdi, Imm 0), Reg R8) ; Mov (MemOffset (Reg Rdi, Imm 8), Reg Rax) ; Mov (Reg Rax, Reg Rdi) ; Or (Reg Rax, Imm pair_tag) ; Add (Reg Rdi, Imm 16) ]
Now we can correctly compile and run programs that create pairs:
$ compile_and_run "(pair 1 false)";;
The left
and right
operations are now straightforward: we just need to
remove the pair tag and access the right location in memory.
let rec compile_exp (tab : int symtab) (stack_index : int) (exp : s_exp) : directive list = match exp with | Lst [Sym "left"; e] -> compile_exp tab stack_index e @ [Mov (Reg Rax, MemOffset (Reg Rax, Imm (-pair_tag)))] | Lst [Sym "right"; e] -> compile_exp tab stack_index e @ [Mov (Reg Rax, MemOffset (Reg Rax, Imm (-pair_tag + 8)))]
A note about garbage
Once we’re done with some memory on the stack, we decrement our stack index and keep executing the program, using that memory again if necessary. What happens when we’re done with memory on the heap?
Right now, nothing! We never decrement rdi
. For long-running programs, this
could be a big problem. We’re only allocating 4096 bytes for our heap–that
could run out!
Reclaiming unused allocated memory is called garbage collection. Efficiently determining which memory can be safely reclaimed is a tough problem, and fast garbage collection is the subject of ongoing research. For now, we’re not going to worry about reclaiming heap space. We’ll talk more about garbage collection later in the course.