Interacting with the environment

Up until this point, our compiler has had a curious property. Here’s another compiler we could have written for our language:

let rec compile_value (stack_index : int) (v : Interp.value) =
  match v with
  | Number n ->
      [Mov (Reg Rax, operand_of_num n)]
  | Boolean b ->
      [Mov (Reg Rax, operand_of_bool b)]
  | Pair (v1, v2) ->
      compile_value stack_index v1
      @ [Mov (stack_address stack_index, Reg Rax)]
      @ compile_value (stack_index - 8) v2
      @ [ Mov (Reg R8, stack_address stack_index)
        ; Mov (MemOffset (Reg Rdi, Imm 0), Reg R8)
        ; Mov (MemOffset (Reg Rdi, Imm 8), Reg Rax)
        ; Mov (Reg Rax, Reg Rdi)
        ; Or (Reg Rax, Imm pair_tag)
        ; Add (Reg Rdi, Imm 16) ]

let compile (program : s_exp) : string =
  [Global "entry"; Label "entry"]
  @ compile_value (-8) (Interp.interp_exp Symtab.empty program)
  @ [Ret]
  |> List.map string_of_directive
  |> String.concat "\n"

(Go ahead and put it into our usual compile.ml file to see that it works.)

That’s quite a bit simpler! It’s even a better compiler–the programs it outputs are short and execute very efficiently. Given that it has been much easier to write our interpreter than to write our compiler, why not just do this? Why does anyone work on compilers?

The answer, of course, is that most programs don’t just compute the answer to a fixed expression! We usually write programs because we want to do the same operation multiple times on different inputs. However, in our current language, there’s no way to get an input from the outside world. Let’s fix that now!

Adding input to the interpreter

let input_channel = ref stdin

let rec interp_exp env (exp : s_exp) : value =
  match exp with
  | Lst [Sym "read-num"] ->
      Number (input_line !input_channel |> int_of_string)
  (* ... *)

Simple enough! We get slightly weird behavior if we do this, though (at least on my machine):

> interp "(pair (read-num) (read-num))";;
1
2
(pair 2 1)

What’s going on here? As it turns out, the problem isn’t with read-num–it’s with pair!

let rec interp_exp env (exp : s_exp) : value =
  match exp with
  (* ... *)
  | Lst [Sym "pair"; e1; e2] ->
      Pair (interp_exp env e1, interp_exp env e2)
  (* ... *)

We’re calling interp_exp twice, and each of those ends up reading input. But it seems like the second one is happening first!

The order in which OCaml evaluates arguments to functions (in this case, the function that constructs a pair) is actually unspecified: the implementation can evaluate them in whatever order it likes. This often doesn’t really matter–for pure expressions like most of the ones we’ve been dealing with, it doesn’t matter when they are evaluated! Now, though, we’ve introduced a side-effect: reading input. It really does matter now when those calls to interp_exp get evaluated!

Our compiler evaluates arguments to binary operations like pair in left to right order. We’ll do the same in the interpreter:

let rec interp_exp env (exp : s_exp) : value =
  match exp with
  (* ... *)
  | Lst [Sym "pair"; e1; e2] ->
      let l = interp_exp env e1 in
      let r = interp_exp env e2 in
      Pair (l, r)
  (* ... *)


let interp (program : string) : string =
  interp_exp Symtab.empty (parse program)

Adding input to the compiler

First, we’ll add a function to the C runtime to read a number:

uint64_t read_num() {
  int r;
  scanf("%d", &r);
  return (uint64_t)(r) << num_shift;
}

We’ll need to let the assembler know that the read_num label is defined in the runtime:

let compile (program : s_exp) : string =
  [ Global "entry"
  ; Extern "error"
  ; Extern "read_num"
  ; Label "entry" ]
  @ compile_exp Symtab.empty (-8) program
  @ [Ret]
  |> List.map string_of_directive
  |> String.concat "\n"

Now we need to actually compile the (read-num) form into a call to this function. In our error-handling code, we were able to “call” C functions just by jumping to the right label. Fundamentally that’s still what’s going to happen, but we’re going to need to do some additional work to make sure our program can keep executing after the function call:

  • read_num might use the rdi register where our heap pointer lives. We’ll need to save the heap pointer to the stack before read_num starts executing.
  • Speaking of the stack, we’ve been writing to stack locations like [rsp - 8], [rsp - 16], etc. read_num might also want to write to [rsp - 8], which will be a problem since we might have data there! The solution is to adjust rsp and restore it after the call.
  • We haven’t modified memory at [rsp + 0] because that’s where our function’s return address is–when we call ret, the processor resumes operations at that address. Instead of calling read_num using jmp, we’ll use call, which will put the new return address on the top of the stack.
  • Finally: C functions on x86-64 require that when they start executing, rsp (the location of the return address) is a multiple of 16. So, we’ll need to further adjust rsp to make sure this property holds.

Here’s the code we end up with:

let align_stack_index (stack_index : int) : int =
  if stack_index mod 16 = -8 then stack_index else stack_index - 8

let rec compile_exp (tab : int symtab) (stack_index : int) (exp : s_exp) :
    directive list =
  match exp with
  | Lst [Sym "read-num"] ->
      [ Mov (stack_address stack_index, Reg Rdi)
      ; Add (Reg Rsp, Imm (align_stack_index stack_index))
      ; Call "read_num"
      ; Sub (Reg Rsp, Imm (align_stack_index stack_index))
      ; Mov (Reg Rdi, stack_address stack_index) ]
  (* ... *)

See the lecture capture for an explanation of why this works, including some worked examples.