Last week, I presented a talk at JuliaCon 2016 about ParallelAccelerator.jl. Thanks to everyone at JuliaCon for being a great audience for this talk!

Everything in the latter twenty minutes of my talk is already pretty well covered in my blog post from March that introduced the ParallelAccelerator package, so in this post, I’ll focus on just the first ten minutes or so.


Laura, a geophysicist who studies the physics behind ice-penetrating radar. This picture probably wasn't taken in Antarctica.

This is my friend Laura. She’s a geophysicist who studies the physics behind ice-penetrating radar. Her research group collects radar data by flying a plane over Antarctica, transmitting radar pulses, and listening for and recording the result, which appears in radargrams like this one.

By studying the radar profile, they can learn more about the ice and the rocks beneath. Right now, she’s working on developing signal processing techniques that will let them discriminate between echoes originating directly below the aircraft and those arriving from off to the side.

The mathematical tools in her toolbox for doing these things include convolutions, FFTs, and so on. She’s an expert in those things, and they’re what she wants to be thinking about when she writes code.
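To make that concrete, here’s a hypothetical sketch (my example, not Laura’s actual code) of the kind of thing she’d want to write: matched filtering of a radar trace against the transmitted pulse, done with FFTs in a few lines. In the Julia of that era, `fft` and `ifft` lived in Base; today you’d add `using FFTW`.

```julia
# Hypothetical example: matched filtering of a radar trace via FFTs --
# cross-correlating the received signal with the transmitted pulse in
# the frequency domain. (fft/ifft were in Base Julia circa 2016.)
function matched_filter(trace::Vector{Float64}, pulse::Vector{Float64})
    padded = [pulse; zeros(length(trace) - length(pulse))]
    real(ifft(fft(trace) .* conj(fft(padded))))
end
```

The point isn’t this particular filter; it’s that the whole computation stays at the level of signals and transforms, with no buffers to manage by hand.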

She hates doing manual memory management, and she doesn’t know how to schedule parallel tasks across compute resources – or even if she did, she still wouldn’t want to have to worry about it when she could be thinking about signal processing. So, she prefers to use a programming language that frees her from having to think about that stuff, at least when she’s at the prototyping stage. In other words, she wants what I’ll call a “productivity” language – a language that enables her to get stuff done at the level of abstraction that matches her expertise.

The productivity-performance tradeoff: Productivity languages: Matlab, Python, R, Julia, ... . How do you "scale up" a productivity-language prototype? The answer today: Get an expert to port the code to an efficiency language. The result is fast...and also brittle, hard to experiment with, and hard to maintain. Can we do better?

So Laura, and her colleagues, and others like them, typically use a language like MATLAB, Python, R, or Julia. These languages give you a high-level programming model; they generally don’t make you think about things like memory management or thread scheduling; and they typically come with an extensive set of libraries for doing the kinds of things you might want to do, like matrix multiplication or FFTs.

But once you’ve written a working prototype in one of these languages, a dilemma arises when you want to scale up your code to run on a larger input size, or for more iterations, or at a finer granularity. The productivity language that you’re using turns out to be too slow.

So now what do you do? Well, the usual approach is that you have to port the code from the productivity language to what I’ll call an “efficiency language”, like C or C++, or, more likely, you enlist the help of someone to do that for you. In fact, maybe some of you have been the unfortunate grad student who was brought on to do this for some project.

This takes considerable time and effort. Doing it right requires expertise in the efficiency language, plus expertise in high-performance computing tools like MPI. You end up with code that is indeed much faster and more efficient than the original, but it’s also more brittle, harder to experiment with, and harder to maintain. Plus, you’re likely to introduce bugs at this stage.

Is there any way around this tradeoff between productivity and efficiency?

How about high-performance EDSLs? Idea: trade off generality for productivity and performance. Delite (Brown et al., 2011), SEJITS (Catanzaro et al., 2009), ... . Great results! But, two challenges: The learning curve, and the rest of the productivity story...

It turns out that there is a well-regarded way to avoid this productivity/efficiency dilemma: using high-performance domain-specific languages, or DSLs. Quite often these DSLs are embedded in a general-purpose language, so we speak of embedded DSLs, or EDSLs. One example is the family of languages built with Delite, a framework from Stanford for developing EDSLs in Scala. Then there’s the SEJITS project at Berkeley, a framework for developing high-level, high-performance DSLs embedded in Python. And there are others as well.

The three-way tradeoff that the Delite team at Stanford formulated was between generality, productivity, and performance: in other words, by using DSLs we give up generality to gain productivity and performance. And the idea is that the combination of all three is the “magical unicorn” that doesn’t exist.

What the Delite people found is that by restricting things to a particular domain, you can offer both a high-level programming model and high performance, because domain-specificity allows the compiler to make smarter optimization choices.

But, if these embedded DSLs for scientific computing are so great, why are they not more widely used by practitioners?

We think that there are two problems that hinder adoption. First, there’s the learning curve. Computational scientists are likely already familiar with a general-purpose productivity language for scientific computing such as MATLAB, R, or Julia, and they already have a body of code written in that language. Learning an EDSL and porting code to it, although not as painful as porting directly to an efficiency language, is still a burden. And it’s hard for even the most well-designed DSL to anticipate every use case that the programmer will eventually encounter.

And then there’s the rest of the productivity story.

The rest of the EDSL productivity story: Several dimensions to productivity beyond offering the “right” abstractions for a domain. Fast compilation time; robust to a wide variety of inputs; debuggable using familiar techniques; available on the platforms users want to use.

In the three-way tradeoff between generality, productivity, and performance, by “productivity” we usually mean something like “offering the right high-level abstractions so you can work at the level of your expertise”. But what we find is that there are several dimensions to “productivity” in an EDSL beyond just the question of whether it offers the right high-level language abstractions for the domain (which is already hard enough).

In order to be a useful replacement for a general-purpose productivity language, an EDSL needs to have fast compilation time.

It needs to be robust – that is, the compiler must be able to at least handle, if not optimize, a wide variety of inputs that a non-expert user might throw at it. A lot of EDSLs fall down here; if the compiler encounters something it can’t deal with, then it just fails.

Programs in the EDSL must be reasonably debuggable, using debugging techniques that the domain programmer is already familiar with. This is related to the previous point. EDSLs often have terrible debugging support. The error messages are often really confusing, because they’re in terms of the host language rather than the embedded language.

Finally, the EDSL must be available on the domain programmer’s platform of choice. There are a lot of people who are using MATLAB on Windows! If you give them something that’s difficult or impossible to run on Windows, they won’t use it.

Failure to satisfy any of these criteria is likely to hinder the adoption of a high-performance EDSL, no matter how good its performance is and no matter how suitable its high-level abstractions are. So, what can we do about this?

ParallelAccelerator: A combination compiler-library solution. Accelerate existing language constructs: map, reduce, comprehension. Support additional domain-specific constructs (runStencil) ...with two implementations: library-only and native. Run in library-only mode during development and debugging. Run in native mode for high performance at deployment.

ParallelAccelerator is a system that aims to overcome these challenges. It’s a combination of a compiler and a library working together to provide high performance in a productivity language.

As with embedded DSLs, our starting point is an existing programming language – in this case Julia, which I’ll say more about in a bit. But, ParallelAccelerator mostly focuses on speeding up execution of the constructs already present in the host language. In particular, it’s targeting numeric code that can be expressed with implicitly parallel array operations like map, reduce, and comprehension.
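For instance, here’s a minimal sketch (the function and names are mine, not from the talk) of the style of code ParallelAccelerator targets, using its documented `@acc` macro:

```julia
using ParallelAccelerator

# A minimal sketch: implicitly parallel array operations that the
# ParallelAccelerator compiler can recognize and speed up.
@acc function normalized_energy(x::Vector{Float64})
    y = x .* x                  # elementwise map
    total = sum(y)              # reduction
    [v / total for v in y]     # comprehension
end
```

Nothing here is ParallelAccelerator-specific syntax; it’s ordinary Julia that the compiler can parallelize.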

It’s also more than just an optimizing compiler for an array language, though – ParallelAccelerator supports additional domain-specific constructs, such as a runStencil construct for stencil computations. These domain-specific constructs have two implementations in ParallelAccelerator: a library function, and a high-performance native implementation in the ParallelAccelerator compiler, using the same parallel execution framework as the constructs that are part of the Julia language. So, any program that leverages ParallelAccelerator is still a valid Julia program that can be executed without acceleration.
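Here’s a sketch of runStencil, modeled on the blur example in the package’s documentation: the kernel writes the destination buffer using relative indexing, and `return a, b` rotates the buffers between iterations.

```julia
using ParallelAccelerator

# Sketch of a 3x3 averaging blur with runStencil, modeled on the
# package's documented blur example. :oob_skip skips points whose
# neighborhoods fall outside the image border.
@acc function smooth(img::Array{Float32,2}, iterations::Int)
    buf = similar(img)
    runStencil(buf, img, iterations, :oob_skip) do b, a
        b[0,0] = (a[-1,-1] + a[-1,0] + a[-1,1] +
                  a[ 0,-1] + a[ 0,0] + a[ 0,1] +
                  a[ 1,-1] + a[ 1,0] + a[ 1,1]) / 9.0f0
        return a, b
    end
    return img
end
```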

This means that it’s possible to turn off the compiler and run in library-only mode during development and debugging, and then, enable the compiler when it’s time to deploy. And that lets us conveniently sidestep – not solve, but sidestep – all the issues that I mentioned earlier that hinder adoption of high-performance embedded DSLs.
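One way to picture that workflow (an illustration of the idea, not necessarily the package’s own switch): since accelerated code is still valid Julia, the same function body can run un-accelerated during development and under `@acc` at deployment.

```julia
using ParallelAccelerator

# Development and debugging: plain Julia, so familiar tools and error
# messages apply; any ParallelAccelerator constructs fall back to
# their library implementations.
function sum_of_squares(x)
    sum([v * v for v in x])
end

# Deployment: the same body under @acc for compiled, parallel execution.
@acc function sum_of_squares_acc(x)
    sum([v * v for v in x])
end
```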


That’s it for the first ten minutes of my talk! For the rest, see my post on the Julia blog, or take a look at the rest of the slides. Sooner or later, there should also be a video. Thanks to the JuliaCon organizers for having me, to Laura Lindzey for graciously agreeing to appear in my talk, and to the Northeastern PRL students for hosting a practice talk and offering lots of helpful feedback.
