I’m excited to announce a new paper draft, “LVars: Deterministic Parallelism through Shared Monotonic Data Structures”, that my advisor Ryan and I just submitted to ICFP 2013!

The abstract:

Programs written using a deterministic-by-construction model of parallel computation are guaranteed to always produce the same observable results, offering programmers freedom from subtle, hard-to-reproduce nondeterministic bugs that are the scourge of parallel software. We present a new model for deterministic-by-construction parallel programming that generalizes existing single-assignment models to allow multiple assignments that are monotonically increasing with respect to a user-specified partial order. Our model achieves determinism by using a novel shared data structure that allows only monotonic writes and “threshold” reads that block until a lower bound is reached. We give a proof of determinism for our model and describe how to extend it to support a limited form of nondeterminism that admits failures but never wrong answers.

LVars, short for lattice variables, have been the focus of my research for the last year and a half. I’ve written about LVars before, but the basic idea is that LVars are a generalization of IVars, which are variables that can only be written to once. (The “I” is for “immutable”.) LVars, on the other hand, allow multiple writes, as long as those writes are monotonically increasing with respect to a user-specified lattice. That is, the state an LVar is in can only “increase” with time, where the user specifies what “increase” means. Meanwhile, LVar reads have to specify what we call a threshold set of possible states. The threshold set forms a kind of “tripwire” across the user-specified lattice. A read will block until one of the states in the threshold set has been reached, and then return that state.

IVars already serve as the foundation for a number of existing deterministic parallel programming models. In our work, we show that LVars enforce determinism, regardless of the lattice you instantiate them with; IVars turn out to be a special case of LVars. That part of the work has been solid for a while, but until recently, we’ve been lacking meaty examples of how to program with LVars, or a good justification for why someone would want to use LVars instead of IVars or some other existing deterministic parallel model. So, for this paper, we put together a prototype implementation of an LVar library in Haskell, and we used it to implement a solution to the following problem, which we felt posed a challenge for existing models.

In a directed graph, find the connected component containing a vertex v (or the connected component of all nodes within k hops of v), and compute a (possibly expensive) function f over all vertices in that component, making the set of results available asynchronously to other computations.

Have a look at the paper for more details!

Update (October 3, 2013): This paper ended up being published at FHPC ‘13 under the title “LVars: Lattice-based Data Structures for Deterministic Parallelism”. The slides from the accompanying talk are available, too.

A debugging story

In my last post, I wrote about pattern matching in Rust, emphasizing the importance of making patterns non-overlapping and therefore order-independent. Some folks wondered what the advantage of order-independence is. In response, here’s a little story. The prototype LVar library we implemented for this paper includes a work-stealing scheduler, in which processors that don’t have enough to do are allowed to “steal” tasks from other processors, to balance out the load; the implementation of work-stealing is pretty much exactly what’s described in the 2011 Haskell Symposium paper about the monad-par library. (In fact, our LVar library is based on monad-par, but replaces some of its guts with LVar guts.)

One of the things we wanted to demonstrate with our prototype, of course, is that it actually did enable parallelism – that if we gave our code more parallel resources to run on, then it would, in fact, run faster! At first, that didn’t seem to be happening. My first couple of experiments looked like this:

$ ./ntimes_minmedmax 5 ./lvar_pure +RTS -N1
[...a bunch of output elided here...]
REALTIME 0.968513s 1.004537s 1.030448s

$ ./ntimes_minmedmax 5 ./lvar_pure +RTS -N2
[...a bunch of output elided here...]
REALTIME 0.97458s 0.981732s 0.991555s

Here, lvar_pure is a program that starts at one corner of a three-dimensional grid graph of 125,000 nodes, finds the connected component within 25 hops of that node1 and applies a particular function f to each node in that component. As the code runs, the results of all those applications of f are gradually written into an LVar, where they are available to other computations. The +RTS -N1 and +RTS -N2 options tell the GHC run-time system to run on 1 or 2 cores, respectively, and ntimes_minmedmax is a little script that runs the program a given number of times (here, 5 times) and returns the minimum, median, and maximum running times of the bunch.2

This isn’t a serious attempt at benchmarking, but it should give a rough idea of what was going on; as you can see, I didn’t seem to be getting any kind of parallel speedup at all when I added a second core. I was distraught, until I realized that the scheduler’s steal function looked like this:

-- | Attempt to steal work or, failing that, give up and go idle.
steal :: Sched -> IO ()
steal _ = return ()
steal q@Sched{ idle, scheds, no=my_no } = do
  -- elided: a bunch of code that, you know,
  -- actually implements work-stealing ...

That steal _ = return () line, with its catch-all _ pattern, was making it so that the steal function never actually did anything. We never even got to the code that actually did the stealing!

In fact, the extra line was there because we had stubbed out the steal function, then forgotten to remove the stub after implementing the real thing. Now, GHC is kind enough to warn you about overlapping patterns:

    Warning: Pattern match(es) are overlapped
             In an equation for `steal':
                 steal q@(Sched {idle, scheds, no = my_no}) = ...

But we had blithely ignored that warning as it rolled by in a flurry of other build messages. Oops.

After getting rid of the steal _ = return () line, we got much nicer parallel running times.3 Hooray!

$ ./ntimes_minmedmax 5 ./lvar_pure +RTS -N1
[...a bunch of output elided...]
REALTIME 1.001359s 1.013753s 1.022911s

$ ./ntimes_minmedmax 5 ./lvar_pure +RTS -N2
[...a bunch of output elided...]
REALTIME 0.529671s 0.53914s 0.542408s

There are lots of examples I could give of situations where I’ve been bitten by overlapping patterns, but I thought this was a particularly insidious one. They usually manifest as programs producing wrong answers, which are relatively easy to catch; this might be the first one I’ve seen that manifested as a program running more slowly. (Then again, that might just be a sign of not having done a lot of RTS or scheduler hacking in my career so far.) I do think that GHC’s “Pattern match(es) are overlapped” warning is nice; I wonder if we could get the Rust compiler to do that.

Thanks for reading! I’d appreciate any feedback on the paper. The code also needs a lot of work, but we’re making it available anyway, in the spirit of the CRAPL.

  1. Exercise for the reader: how much of the graph can you cover in 25 hops from a corner node? 

  2. I should probably mention that this is just an example, and that the experiments we actually ran and describe in the paper used a different graph, a different number of hops, a different f, and a more principled approach to benchmarking (which is to say, I didn’t just run them on my laptop while a bunch of other junk was running). 

  3. Well, that’s a lie; we then had to fix some other problems, mainly having to do with laziness and parallelism not playing nicely together. But it helped to have an implementation of work-stealing that actually ever stole work.