I spent last week volunteering as a resident at Hacker School, where I helped a number of students write small interpreters. For the most part, the students I worked with weren’t exactly beginning programmers. One of them, for instance, was a robotics Ph.D. student who had written a trajectory follower that helped a robot autonomously navigate through the desert.
Clearly, it’s possible to get quite far in a programming career without having written an interpreter – or, at least, without having consciously done so. So, why does it matter? Why should you write an interpreter?
To me, one reason is that an interpreter is the quintessential program that operates on programs. To be comfortable with interpreters is to be comfortable with the idea of code as data, a powerful and ubiquitous idea. Programs that operate on programs include things like interpreters and compilers, but also things like emulators and debuggers. Much of the programming world regards these kinds of programs as magical, but they aren’t.
So, if you’ve never written an interpreter, or if you’re not sure if you have, let’s write a tiny one, right now!
A tiny language of arithmetic expressions
We’ll start by interpreting simple arithmetic expressions, like
(5 + 3) or
((4 - 2) * (5 + (3 + 9))).
To interpret an expression means to evaluate it: to determine its value. For instance, the value of
(5 + 3) might be
8, and the value of
((4 - 2) * (5 + (3 + 9))) might be
34. So, our interpreter will be a program that takes an arithmetic expression as input and returns an integer as output.
Let’s write down a grammar for expressions we want to be able to interpret. For now, we’ll handle integers and a few two-argument operations on them: addition, subtraction, and multiplication.
Expr =:: int | (Expr + Expr) | (Expr - Expr) | (Expr * Expr)
Because the grammar is defined recursively, it allows for nested expressions like
(1 + ((3 * 7) - 8)), as well as simple ones like
(1000 * 3).
If we use a language that allows us to define our own data types, we can turn our grammar into a data type for expressions. In Haskell, for instance, it might look like this:
data Expr = Number Int | Plus Expr Expr | Minus Expr Expr | Times Expr Expr
Times can be thought of as “tags” that we’ve put on each variant of
Expr so that we can tell them apart. These “tags”, though, are really value constructor functions that we can call to construct various values of type
Plus value constructor takes two arguments, each of which must itself be of type
Expr; it’s the same for
Number value constructor, though, only expects an
Int, which is a type built in to Haskell.
Expr type defined, we can construct a few
Exprs by calling the value constructors. We can use GHCi’s
:t command to see the type of each one, making sure that it is in fact an
*Interp> :t Number 3 Number 3 :: Expr *Interp> :t Plus (Number 3) (Number 6) Plus (Number 3) (Number 6) :: Expr *Interp> :t Plus (Number 3) (Times (Number 6) (Number 1)) Plus (Number 3) (Times (Number 6) (Number 1)) :: Expr
Do things like
(3 + 6), and
(3 + (6 * 1)) count as
Exprs? Not quite. We’ve skipped a step, which is to write a parser that will convert a string like
"(3 + (6 * 1))" into its
Expr representation of
Plus (Number 3) (Times (Number 6) (Number 1)).
We’ll come back to parsing in a little while. For now, let’s concern ourselves with
Exprs rather than strings. We want to be able to interpret
Plus (Number 3) (Times (Number 6) (Number 1)), evaluating them to integer values like
We know that any combination of addition, subtraction, and multiplication operations on integers will result in another integer. (We conveniently left out division, which would mean dealing with non-integral values.) So, our interpreter can be a function that takes an expression and returns an integer. In Haskell, its signature would look like this:
interp :: Expr -> Int
Since there are four value constructors for
interp function will have four cases to handle. The first case is easy: If
interp is passed a
Number, then all it has to do is return the
Int wrapped up inside.
interp :: Expr -> Int interp e = case e of Number n -> n
Trying out our code, we see that a
Number just evaluates to itself, represented as an
*Interp> interp (Number 4) 4
At this point, though, if we try to call
interp on any value that isn’t a
Number, we’re in trouble:
*Interp> interp (Plus (Number 4) (Number 5)) *** Exception: Interp.hs:(10,12)-(14,42): Non-exhaustive patterns in case
GHCi is complaining that we haven’t covered all the possibilities for
Exprs. So, let’s write the rest of the cases:
interp :: Expr -> Int interp e = case e of Number n -> n Plus e1 e2 -> (interp e1) + (interp e2) Minus e1 e2 -> (interp e1) - (interp e2) Times e1 e2 -> (interp e1) * (interp e2)
Just as the
Times value constructors in the
Expr type are defined recursively, so are the
Times clauses of
interp. To interpret a
Plus expression, we interpret its two subexpressions
e2, then just add the results together using Haskell’s built-in
+ operation. We do the analogous thing for
(Plus (Number 4) (Number 5)) works fine now:
*Interp> interp (Plus (Number 4) (Number 5)) 9
And fancier, nested expressions work too:
*Interp> interp (Plus (Number 3) (Times (Number 6) (Number 1))) 9 *Interp> interp (Plus (Number 3) (Times (Number 6) (Number (-3)))) -15
Hooray! We’ve written a tiny, ten-line interpreter. Here’s the whole thing:
data Expr = Number Int | Plus Expr Expr | Minus Expr Expr | Times Expr Expr interp :: Expr -> Int interp e = case e of Number n -> n Plus e1 e2 -> (interp e1) + (interp e2) Minus e1 e2 -> (interp e1) - (interp e2) Times e1 e2 -> (interp e1) * (interp e2)
It’s cumbersome to have to write long
(Plus (Number 3) (Times (Number 6) (Number 1))) to pass to
interp. Being able to write the string
"(3 + (6 * 1))" would be much more convenient. This is where parsing comes in. The job of a parser is to turn strings of concrete syntax like
"(3 + (6 * 1))" into abstract syntax representations like
(Plus (Number 3) (Times (Number 6) (Number 1))).
Some people enjoy writing parsers. Others find it unbearably tedious. Fortunately, if you belong to the latter group, parser generators help automate the work. We can use Happy, a parser generator for Haskell, to generate a parser for our tiny language.
In a Happy grammar file of a few dozen lines, we provide a grammar for our language, plus a little supporting code. The Happy documentation explains how to go about creating such a file. (In fact, I had never used a parser generator before writing this post, but I didn’t have any trouble creating the Happy grammar file. It was largely a matter of paring down the provided example in the Happy docs to the smaller grammar we’re using here!)
Happy will take this grammar file and generate a Haskell module that provides
parser functions. The
lexer function turns a string into a list of tokens:
*Main> lexer "(3 + (6 * 1))" [TokenOB,TokenInt 3,TokenPlus,TokenOB,TokenInt 6,TokenTimes,TokenInt 1,TokenCB,TokenCB]
parser turns this list of tokens into an
*Main> (parser . lexer) "(3 + (6 * 1))" Plus (Number 3) (Times (Number 6) (Number 1))
Finally, once we have an
Expr, we can pass it on to our interpreter,
*Main> (interp . parser . lexer) "(3 + (6 * 1))" 9
We intentionally kept the grammar for our language excruciatingly simple in order to make it as easy as possible to create the Happy grammar file. This leads to some annoyances, like the fact that we have to parenthesize all operations. For instance,
"(3 + 4)" parses, but
"3 + 4" doesn’t. The parser doesn’t handle negative integers well, either.
*Main> (parser . lexer) "(3 + 4)" Plus (Number 3) (Number 4) *Main> (parser . lexer) "3 + 4" *** Exception: Parse error *Main> (parser . lexer) "3 + (-4)" *** Exception: Parse error
We could fix these bugs at the cost of complicating the grammar. Nothing about the interpreter itself would need to change, though;
"(3 + 4)" and
"3 + 4" both ought to parse into
Plus (Number 3) (Number 4), and negative integers present no difficulty for
interp. The only hitch is in turning them into
Exprs in the first place.
Next time: variables
So far, we’ve managed to interpret a tiny, severely restricted language of arithmetic expressions. It would be nice to eventually grow our language into some semblance of a real programming language. To do that, sooner or later we’ll need a notion of variables. Luckily, it’s quite straightforward to add support for variables to the interpreter and parser we’ve got now. I’ll cover variables in my next post.
The code from this post, plus a little additional scaffolding, is on GitHub.