Pairs and the heap
Today we’ll add a new type to our language: the pair type. A pair, as its name suggests, packages up two values into one value. We can then access the components individually. We’ll implement a few operations to work with this new type:
-
(pair e1 e2)
makes a new pair out ofe1
ande2
-
(left e)
gets the left component of the paire
evaluates to -
(right e)
gets the right component of the paire
evaluates to
This is a simple addition to our language, but it’s quite
powerful! In Scheme and other Lisp-like languages, the
pair
operation is often called cons
(short for
“construct”) and pairs are called cons-cells. These
cons-cells are used to implement complex nested structures like lists
and trees. In our language, we could implement a list of values like
this:
(pair 1 (pair 2 (pair 3 false)))
We can access the first element of the list with left
and
the rest of the list with right
. In Lisp, the
left
and right
operations are called
car
and cdr
; the names refer to assembly
instructions on the computer on which Lisp was originally implemented.
Pairs in the interpreter
Our implementation of pairs in the interpreter will be
straightforward because we’ll rely on OCaml’s own pair
implementation. Pairs are a kind of value, so we’ll need to
add a constructor to our value
type and implement
printing:
type value = Number of int | Boolean of bool | Pair of (value * value) let rec string_of_value (v : value) : string = match v with | Number n -> string_of_int n | Boolean b -> if b then "true" else "false" | Pair (v1, v2) -> Printf.sprintf "(pair %s %s)" (string_of_value v1) (string_of_value v2)
We’ll then implement the pair operations:
let rec interp_exp env (exp : s_exp) : value = match exp with | Lst [Sym "pair"; e1; e2] -> Pair (interp_exp env e1, interp_exp env e2) | Lst [Sym "left"; e] -> ( match interp_exp env e with | Pair (v, _) -> v | _ -> raise (BadExpression exp) ) | Lst [Sym "right"; e] -> ( match interp_exp env e with | Pair (_, v) -> v | _ -> raise (BadExpression exp) )
Pairs in the compiler
OK, so now we’ve got a value which is actually a pair of values. This is going to be a bit tricky to implement in the compiler–remember, all of our values have to fit in a 64-bit register! And since our other values are 64 bits long, it’s going to be hard to fit two of them in 64 bits.
My undergrad advisor, Tia Newhall, frequently quoted the Fundamental theorem of software engineering (“We can solve any problem by introducing an extra level of indirection.”) to explain how systems adapt to this kind of challenge. What if we added a level of indirection? Instead of trying to store a pair in a register, can we store the address of a memory location that stores a pair?
We’ve seen how to write to a section of memory called the stack; we’ve used it to store the values of variables, as well as temporary values used during computation. Can we put pairs on the stack?
If we try to do that, we’re going to run into some trouble. Consider our let expression:
(let ((x <expr)) <body>)
Right now, regardless of what expr
is, we can compile
the let
by moving the value in rax
(i.e.,
the value of expr
) to the next available stack address.
So how should we compile this expression?
(let ((x (pair 1 2))) (left x))
If that pair lives on the stack, we have to put
x
somewhere else! We’d need to somehow track how
much stack space the expression used, and put x
after
it. What about an expression like
(let ((x (if (< 1 2) (pair 1 (pair 2 3)) (pair 1 2))) (left x))
How much stack space will we use for the let? At compile-time, we really don’t know.
We need to put that pair someplace where the
let
won’t clobber it. If the stack is for
“short-lived” values whose size is known at
compile-time, we need somewhere to put longer-lived values whose
size might only be known at run-time.
This other area of memory is called the heap. The heap is where
objects live in Python and Java. It’s where OCaml puts complex
objects (including pairs). In C, we can allocate space on the heap
with the malloc
call.
Just as the stack traditionally “grows” from higher to
lower addresses, the heap will grow from lower addresses to
higher addresses. Just as our program starts with the address
of the bottom of the stack in the rsp
register, it will
start with the bottom of the heap in the rdi
register
(we’ll see how we can make that happen in the runtime
momentarily). We keep track of indexes into the stack at
compile-time; the rsp
register never changes. As we saw
above, though, we won’t be able to do that for the heap: at
compile-time, we don’t actually know how much heap space
we’ll use. So we’ll need to increment the
rdi
register as our program executes in order to make
sure it always points at the next available chunk of heap memory.
Let’s see how that works by implementing the
pair
operation:
let rec compile_exp (tab : int symtab) (stack_index : int) (exp : s_exp) : directive list = match exp with | Lst [Sym "pair"; e1; e2] -> compile_exp tab stack_index e1 @ [Mov (stack_address stack_index, Reg Rax)] @ compile_exp tab (stack_index - 8) e2 @ [ Mov (Reg R8, stack_address stack_index) ; Mov (MemOffset (Reg Rdi, Imm 0), Reg R8) ; Mov (MemOffset (Reg Rdi, Imm 8), Reg Rax) ; Mov (Reg Rax, Reg Rdi) ; Add (Reg Rdi, Imm 16) ]
All we’re doing here is moving the arguments to
pair
into memory, storing the right address in
rax
, and making sure rdi
still points at
the last place in the heap.
Let’s go ahead and implement the runtime now. First,
we’ll need to make sure
rdi
points at a region of memory we can write to.
We’ll do that like this:
extern uint64_t entry(void *heap); // some definitions elided... int main(int argc, char **argv) { print_value(entry((void*)malloc(4096))); return 0; }
We’re allocating 4096 bytes of space to use as our heap, then
passing its address into our entry
function. In the
x86-64 C calling convention, the first argument to a function is
passed in the rdi
register, so this will do what we
want it to!
Now, let’s implement print_value
for our pairs.
Here we’re going to run into some trouble. After all, at
runtime a pair
value is just a memory address. These
addresses are just 64-bit integers! How will we know that
we’re looking at the address of a pair and not at an integer,
or a boolean, or some future other heap-allocated type? We want to
tag our pair addresses just like we’re tagging our other
types. We really can’t shift them, though–x86-64 might
use the whole 64-bit number!
We’re going to use a little trick in order to make this work.
All of our values are 64-bit numbers, and we’re storing values
in memory; so, we’ve been modeling memory as an array of
8-byte cells. As long as we only ever access memory like this,
we’ll only ever increment rdi
by multiples of 8.
So as long as rdi
starts out as a multiple of 8, all of
our addresses will end in 0b000
. And, malloc is
guaranteed to return a multiple of 8 (actually 16). So, since all of
our actual addresses will end in 0b000
, we can use
those last three bits to store a tag.
This tag can’t end in 0b00
, since that’s
used by numbers. And it can’t end in 0b111
, since
our tags for booleans (and characters, on the homework) end in
0b111
. All of the other three-bit sequences, however,
are fair game! We’ll use 0b010
for pairs.
Here’s how that looks in the runtime:
#define heap_mask 0b111 #define pair_tag 0b010 void print_value(uint64_t value) { if ((value & num_mask) == num_tag) { int64_t ivalue = (int64_t)value; printf("%" PRIi64, ivalue >> num_shift); } else if ((value & bool_mask) == bool_tag) { if (value >> bool_shift) { printf("true"); } else { printf("false"); } } else if ((value & heap_mask) == pair_tag) { uint64_t v1 = *(uint64_t*)(value - pair_tag); uint64_t v2 = *(uint64_t*)(value - pair_tag + 8); printf("(pair "); print_value(v1); printf(" "); print_value(v2); printf(")"); } else { printf("BAD %" PRIu64, value); } }
(Remember to recompile the runtime with
gcc -c runtime.c -o runtime.o
.)
And in the compiler:
let heap_mask = 0b111 let pair_tag = 0b010 let rec compile_exp (tab : int symtab) (stack_index : int) (exp : s_exp) : directive list = match exp with | Lst [Sym "pair"; e1; e2] -> compile_exp tab stack_index e1 @ [Mov (stack_address stack_index, Reg Rax)] @ compile_exp tab (stack_index - 8) e2 @ [ Mov (Reg R8, stack_address stack_index) ; Mov (MemOffset (Reg Rdi, Imm 0), Reg R8) ; Mov (MemOffset (Reg Rdi, Imm 8), Reg Rax) ; Mov (Reg Rax, Reg Rdi) ; Or (Reg Rax, Imm pair_tag) ; Add (Reg Rdi, Imm 16) ]
Now we can correctly compile and run programs that create pairs:
$ compile_and_run "(pair 1 false)";;
The left
and right
operations are now
straightforward: we just need to remove the pair tag and access the
right location in memory.
let rec compile_exp (tab : int symtab) (stack_index : int) (exp : s_exp) : directive list = match exp with | Lst [Sym "left"; e] -> compile_exp tab stack_index e @ [Mov (Reg Rax, MemOffset (Reg Rax, Imm (-pair_tag)))] | Lst [Sym "right"; e] -> compile_exp tab stack_index e @ [Mov (Reg Rax, MemOffset (Reg Rax, Imm (-pair_tag + 8)))]
A note about garbage
Once we’re done with some memory on the stack, we decrement our stack index and keep executing the program, using that memory again if necessary. What happens when we’re done with memory on the heap?
Right now, nothing! We never decrement rdi
. For
long-running programs, this could be a big problem. We’re only
allocating 4096 bytes for our heap–that could run out!
Reclaiming unused allocated memory is called garbage collection. Efficiently determining which memory can be safely reclaimed is a tough problem, and fast garbage collection is the subject of ongoing research. For now, we’re not going to worry about reclaiming heap space. We’ll talk more about garbage collection later in the course.