# OCaml and S-Expressions

Last week we wrote a compiler for a very small language: the integers. Next time, we’ll grow our language a bit. Today, though, we’re going to step back and talk about how the OCaml features we’ve been learning these last few days actually helped us build our starter compiler. We’ll also talk about how to represent our language’s syntax.

## OCaml basics

Here’s the compiler we wrote last time:

let compile (program: string): string = String.concat "\n" [ "global _entry"; "_entry:"; Printf.sprintf "\tmov rax, %s" program; "\tret" ]

This program illustrates several OCaml features (see the prior lecture capture for details):

- Functions
- Types
- Lists

## S-Expressions

We want to eventually extend our language with support for many new features,
such as conditionals and functions. We will talk in great depth about the
*semantics* of those features: what a conditional expression *means*. We’ll also
need to figure out what the *syntax* corresponding to those features is. For
instance, conditionals look like this in Javascript:

if (x) { return y; } else { return z; }

And like this in Python:

if x: return y else: return z

We will talk more about syntax much later in the course. For now, our language is going to support a very simple syntax: S-expressions.

S-expressions (short for “symbolic expressions”) were introduced by John McCarthy in his paper on the LISP programming language. They were originally intended as an internal representation that programmers wouldn’t manipulate directly, but LISP programmers liked writing them and the intended language syntax (M-Expressions) was never implemented. Here are some S-expressions:

(if x y z) (* (+ 2 6 4) 3) (define (f x) (if (= x 0) 1 (* x (f (- x 1)))))

S-expressions are a simple, convenient way to represent structured expressions. Because of their simple structure, they are very easy to represent and manipulate inside a compiler or an interpreter.

Look at the expressions above. What components are there? Ignoring everything we
might know about what the expressions *mean* and focusing on how they *look*, we
can see three different kinds of expression:

- Symbols
- such as
`f`

and`if`

- Numbers
- such as
`4`

and`0`

- Lists
- sequences of other S-Expressions wrapper in parentheses, like
`(+ 2 6 4)`

or`(* (+ 2 6 4) 3)`

## S-Expressions in OCaml

How should we represent S-expressions in OCaml? One option would be to use strings. So we could have

let e1 = "(if x y z)" let e2 = "(* (+ 2 6 4) 3)"

This representation is convenient to read in: programmers write programs as
sequences of characters in files. It is, however, very difficult to work
with. Imagine extending our compiler to support unary operations on numbers (for
instance, a function that increments a number). Doing so using this string
representation would be **very** difficult. We’d have to see if the first
character of the string is a `(`

, then look at the next characters to find the
operation, then find the argument, and so on. Moreover, we’d need to repeat this
effort for every operation we want to support.

A better representation for S-expressions will use the structure we talked about above. We can add a new type to OCaml to encode this structure, as follows:

type s_exp = Sym of string | Num of int | Lst of s_exp list

`Sym`

, `Num`

, and `Lst`

are all *constructors*: ways of building
s-expressions. So what this type definition is saying is that an s-expression
can be a symbol **or** a number **or** a list.

The “of” after each constructor indicates that that constructor is associated
with some data of a particular type. For instance, the symbol `if`

represented
as an `s_exp`

looks like `Sym "if"`

; 4000000000000 looks like `Num 4000000000000`

.

`Lst`

is defined *recursively*: it takes as its argument a list of other
s-expressions. These could be build using the `Sym`

or `Num`

constructors or
using more `Lst`

constructors. So for instance, `(if x y z)` looks like

let e1 = Lst [Sym "if"; Sym "x"; Sym "y"; Sym "z"]

and `(* (+ 2 6 4) 3)`

looks like

let e2 = Lst [Sym "*"; Lst [Sym "+"; Num 2; Num 6; Num 4]; Num 3]

We can see that compared to the string representation, which is “flat”, this
representation includes the nested structure of s-expressions. We can access
this structure using *pattern matching*:

let which_constructor (e: s_exp) : string = match e with | Sym x -> "a symbol" | Num n -> "a number" | Lst l -> "a list"

This looks like a repeated “if”. The real power of pattern matching is
*destructuring*: we can use names defined on the left-hand side of each pattern.

let rec total (e: s_exp) : int = match e with | Sym _ -> 0 | Num n -> n | Lst l -> List.fold_left ( + ) 0 (List.map total l)

For the `Lst`

constructor, it can sometimes be useful to destructure the list as
well:

let rec has_if (e: s_exp) : bool = match e with | Sym _ -> false | Num _ -> false | Lst (Sym "if" :: _) -> true | Lst l -> List.exists has_if l

## S-expressions in our compiler

The course staff has written a small library for S-expressions; HW1, out this week,
is designed to get you some practice working with this library. The library
defines the same `s_exp`

type listed above; it also defines functions to
convert strings to s-expressions.

If our OCaml project is set up correctly, we can use the library like this:

open S_exp let e1 = parse "(* (+ 2 6 4) 3)"

`parse`

takes in a string and returns the corresponding s-expression. Later in
the course, we will see how this function works.

The compiler we have developed can be modified to use s-expressions:

open S_exp let compile (program: s_exp): string = match program with | Num n -> String.concat "\n" [ "global _entry"; "_entry:"; Printf.sprintf "\tmov rax, %d" n; "\tret" ]

We only support numbers, so we haven’t included the other cases. Our OCaml editor will complain about this program, saying something about “non-exhaustive pattern matching”; this means that there are some s-expressions that we haven’t matched. We can “fix” this error like this:

open S_exp exception BadExpression of s_exp let compile (program: s_exp): string = match program with | Num n -> String.concat "\n" [ "global _entry"; "_entry:"; Printf.sprintf "\tmov rax, %d" n; "\tret" ] | e -> raise (BadExpression e)

Now we will get a somewhat-useful error if we try to compile something that isn’t a number.