Write an interpreter
I spent last week volunteering as a resident at Hacker School, where I helped a number of students write small interpreters. For the most part, the students I worked with weren’t exactly beginning programmers. One of them, for instance, was a robotics Ph.D. student who had written a trajectory follower that helped a robot autonomously navigate through the desert.
Clearly, it’s possible to get quite far in a programming career without having written an interpreter – or, at least, without having consciously done so. So, why does it matter? Why should you write an interpreter?
To me, one reason is that an interpreter is the quintessential program that operates on programs. To be comfortable with interpreters is to be comfortable with the idea of code as data, a powerful and ubiquitous idea. Programs that operate on programs include things like interpreters and compilers, but also things like emulators and debuggers. Much of the programming world regards these kinds of programs as magical, but they aren’t.
So, if you’ve never written an interpreter, or if you’re not sure if you have, let’s write a tiny one, right now!
A tiny language of arithmetic expressions
We’ll start by interpreting simple arithmetic expressions, like (5 + 3)
or ((4 - 2) * (5 + (3 + 9)))
.
To interpret an expression means to evaluate it: to determine its value. For instance, the value of (5 + 3)
might be 8
, and the value of ((4 - 2) * (5 + (3 + 9)))
might be 34
. So, our interpreter will be a program that takes an arithmetic expression as input and returns an integer as output.
Let’s write down a grammar for expressions we want to be able to interpret. For now, we’ll handle integers and a few two-argument operations on them: addition, subtraction, and multiplication.
Expr =:: int
| (Expr + Expr)
| (Expr - Expr)
| (Expr * Expr)
Because the grammar is defined recursively, it allows for nested expressions like (1 + ((3 * 7) - 8))
, as well as simple ones like 5
and (1000 * 3)
.
If we use a language that allows us to define our own data types, we can turn our grammar into a data type for expressions. In Haskell, for instance, it might look like this:
data Expr = Number Int
| Plus Expr Expr
| Minus Expr Expr
| Times Expr Expr
Here, Number
, Plus
, Minus
, and Times
can be thought of as “tags” that we’ve put on each variant of Expr
so that we can tell them apart. These “tags”, though, are really value constructor functions that we can call to construct various values of type Expr
.
The Plus
value constructor takes two arguments, each of which must itself be of type Expr
; it’s the same for Minus
and Times
. The Number
value constructor, though, only expects an Int
, which is a type built in to Haskell.
With our Expr
type defined, we can construct a few Expr
s by calling the value constructors. We can use GHCi’s :t
command to see the type of each one, making sure that it is in fact an Expr
:
*Interp> :t Number 3
Number 3 :: Expr
*Interp> :t Plus (Number 3) (Number 6)
Plus (Number 3) (Number 6) :: Expr
*Interp> :t Plus (Number 3) (Times (Number 6) (Number 1))
Plus (Number 3) (Times (Number 6) (Number 1)) :: Expr
Do things like 3
, (3 + 6)
, and (3 + (6 * 1))
count as Expr
s? Not quite. We’ve skipped a step, which is to write a parser that will convert a string like "(3 + (6 * 1))"
into its Expr
representation of Plus (Number 3) (Times (Number 6) (Number 1))
.
We’ll come back to parsing in a little while. For now, let’s concern ourselves with Expr
s rather than strings. We want to be able to interpret Expr
s like Plus (Number 3) (Times (Number 6) (Number 1))
, evaluating them to integer values like 9
.
The interp
function
We know that any combination of addition, subtraction, and multiplication operations on integers will result in another integer. (We conveniently left out division, which would mean dealing with non-integral values.) So, our interpreter can be a function that takes an expression and returns an integer. In Haskell, its signature would look like this:
interp :: Expr -> Int
Since there are four value constructors for Expr
s, the interp
function will have four cases to handle. The first case is easy: If interp
is passed a Number
, then all it has to do is return the Int
wrapped up inside.
interp :: Expr -> Int
interp e = case e of
Number n -> n
Trying out our code, we see that a Number
just evaluates to itself, represented as an Int
.
*Interp> interp (Number 4)
4
At this point, though, if we try to call interp
on any value that isn’t a Number
, we’re in trouble:
*Interp> interp (Plus (Number 4) (Number 5))
*** Exception: Interp.hs:(10,12)-(14,42): Non-exhaustive patterns in case
GHCi is complaining that we haven’t covered all the possibilities for Expr
s. So, let’s write the rest of the cases:
interp :: Expr -> Int
interp e = case e of
Number n -> n
Plus e1 e2 -> (interp e1) + (interp e2)
Minus e1 e2 -> (interp e1) - (interp e2)
Times e1 e2 -> (interp e1) * (interp e2)
Just as the Plus
, Minus
, and Times
value constructors in the Expr
type are defined recursively, so are the Plus
, Minus
, and Times
clauses of interp
. To interpret a Plus
expression, we interpret its two subexpressions e1
and e2
, then just add the results together using Haskell’s built-in +
operation. We do the analogous thing for Minus
and Times
.
(Plus (Number 4) (Number 5))
works fine now:
*Interp> interp (Plus (Number 4) (Number 5))
9
And fancier, nested expressions work too:
*Interp> interp (Plus (Number 3) (Times (Number 6) (Number 1)))
9
*Interp> interp (Plus (Number 3) (Times (Number 6) (Number (-3))))
-15
Hooray! We’ve written a tiny, ten-line interpreter. Here’s the whole thing:
data Expr = Number Int
| Plus Expr Expr
| Minus Expr Expr
| Times Expr Expr
interp :: Expr -> Int
interp e = case e of
Number n -> n
Plus e1 e2 -> (interp e1) + (interp e2)
Minus e1 e2 -> (interp e1) - (interp e2)
Times e1 e2 -> (interp e1) * (interp e2)
Parsing
It’s cumbersome to have to write long Expr
s like (Plus (Number 3) (Times (Number 6) (Number 1)))
to pass to interp
. Being able to write the string "(3 + (6 * 1))"
would be much more convenient. This is where parsing comes in. The job of a parser is to turn strings of concrete syntax like "(3 + (6 * 1))"
into abstract syntax representations like (Plus (Number 3) (Times (Number 6) (Number 1)))
.
Some people enjoy writing parsers. Others find it unbearably tedious. Fortunately, if you belong to the latter group, parser generators help automate the work. We can use Happy, a parser generator for Haskell, to generate a parser for our tiny language.
In a Happy grammar file of a few dozen lines, we provide a grammar for our language, plus a little supporting code. The Happy documentation explains how to go about creating such a file. (In fact, I had never used a parser generator before writing this post, but I didn’t have any trouble creating the Happy grammar file. It was largely a matter of paring down the provided example in the Happy docs to the smaller grammar we’re using here!)
Happy will take this grammar file and generate a Haskell module that provides lexer
and parser
functions. The lexer
function turns a string into a list of tokens:
*Main> lexer "(3 + (6 * 1))"
[TokenOB,TokenInt 3,TokenPlus,TokenOB,TokenInt 6,TokenTimes,TokenInt 1,TokenCB,TokenCB]
And parser
turns this list of tokens into an Expr
:
*Main> (parser . lexer) "(3 + (6 * 1))"
Plus (Number 3) (Times (Number 6) (Number 1))
Finally, once we have an Expr
, we can pass it on to our interpreter, interp
:
*Main> (interp . parser . lexer) "(3 + (6 * 1))"
9
We intentionally kept the grammar for our language excruciatingly simple in order to make it as easy as possible to create the Happy grammar file. This leads to some annoyances, like the fact that we have to parenthesize all operations. For instance, "(3 + 4)"
parses, but "3 + 4"
doesn’t. The parser doesn’t handle negative integers well, either.
*Main> (parser . lexer) "(3 + 4)"
Plus (Number 3) (Number 4)
*Main> (parser . lexer) "3 + 4"
*** Exception: Parse error
*Main> (parser . lexer) "3 + (-4)"
*** Exception: Parse error
We could fix these bugs at the cost of complicating the grammar. Nothing about the interpreter itself would need to change, though; "(3 + 4)"
and "3 + 4"
both ought to parse into Plus (Number 3) (Number 4)
, and negative integers present no difficulty for interp
. The only hitch is in turning them into Expr
s in the first place.
Next time: variables
So far, we’ve managed to interpret a tiny, severely restricted language of arithmetic expressions. It would be nice to eventually grow our language into some semblance of a real programming language. To do that, sooner or later we’ll need a notion of variables. Luckily, it’s quite straightforward to add support for variables to the interpreter and parser we’ve got now. I’ll cover variables in my next post.
The code from this post, plus a little additional scaffolding, is on GitHub.
Comments