Notes on Halide and Distributed Halide
Halide is a newish domain-specific language for writing high-performance image processing pipelines, originally created by a group of researchers at MIT and Adobe and now used in lots of places. Although the original paper on Halide appeared at SIGGRAPH 2012, I like the PLDI 2013 paper better as an introduction to this line of work.
Halide is really two languages: one for expressing the algorithm you want to compute, and one for expressing how you want that computation to be scheduled. In fact, the key idea of Halide is this notion of separating algorithm from schedule.
Understanding the scheduling tradeoff space
The paper uses a two-stage image processing pipeline as a running example. Let’s say you want to blur an image by averaging each pixel with its neighboring pixels to the left and right, as well as with the pixels above and below. We could write an algorithm for this 3x3 box blur something like the following, where in is the original image, bh is the horizontally blurred image, and bv is the final image, now also vertically blurred:1
bh(x, y) = (in(x-1, y) + in(x, y) + in(x+1, y)) / 3;
bv(x, y) = (bh(x, y-1) + bh(x, y) + bh(x, y+1)) / 3;
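For concreteness, here’s a rough sketch (my own, not the paper’s) of what this algorithm might look like in Halide’s actual C++-embedded syntax; the 640x480 float input buffer and the repeat_edge boundary condition are placeholder assumptions, and the official tutorials are the authoritative reference:

#include "Halide.h"
using namespace Halide;

int main() {
    Buffer<float> input(640, 480);   // placeholder image data (assumed, not from the post)

    Var x("x"), y("y");
    // Clamp at the image borders so that in(x-1, y), bh(x, y+1), etc. stay in bounds.
    Func in = BoundaryConditions::repeat_edge(input);

    // The algorithm: a horizontal blur followed by a vertical blur.
    Func bh("bh"), bv("bv");
    bh(x, y) = (in(x - 1, y) + in(x, y) + in(x + 1, y)) / 3;
    bv(x, y) = (bh(x, y - 1) + bh(x, y) + bh(x, y + 1)) / 3;

    // With no scheduling directives, Halide falls back to a default schedule.
    Buffer<float> output = bv.realize({input.width(), input.height()});
    return 0;
}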
The Halide paper illustrates how, even with a simple two-stage algorithm like this, there are lots of challenging scheduling choices to be made. For instance, you could do all of the horizontal blurring before doing the vertical blurring (the “breadth-first” approach), which offers lots of parallelism. Or you could compute each pixel of the horizontal blur just before you need it for the vertical blur (the “total fusion” approach), which offers good data locality. Or you could take a “sliding window” approach that interleaves the horizontal and vertical stages in a way that avoids redundant computation. Or you could do some combination of all those approaches in search of a sweet spot. The choices are overwhelming, and that’s just for a toy two-stage pipeline! For more sophisticated pipelines, such as the one described in the paper that computes the local Laplacian filters algorithm, the scheduling problem is much harder.2
To me, perhaps the most interesting contribution of the PLDI ‘13 Halide paper – perhaps more interesting than the design of Halide itself – is its model of the three-way tradeoff space among parallelism, locality, and avoiding redundancy that arises when scheduling image processing pipelines.
Separating algorithm from schedule
That said, I think the design of Halide itself is also pretty awesome! The scheduling-language part of Halide allows one to express programs that fall anywhere we like in that three-dimensional space of choices. For instance, for the 3x3 box blur algorithm above, we could express a breadth-first schedule with the Halide scheduling directive bh.compute_root(), which would compute all of bh before moving on to bv. For the “total fusion” approach, we could instead write bh.compute_at(bv, y), which would compute each value from bh only when it is needed for computing bv. But you don’t have to try each of these possibilities yourself by hand; there’s an autotuner that, given an algorithm, can come up with “high-quality” (although not necessarily optimal) schedules, starting from a few heuristics and using stochastic search.
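To make those two extremes (and the sliding-window option in between) concrete, here’s a hedged sketch of how each might be written against the bh and bv definitions above, using standard Halide scheduling calls rather than code taken from the paper; only one of the three alternatives would actually be applied in a given program:

// 1. Breadth-first: compute every row of bh before any row of bv.
//    Plenty of parallelism, but bh has left the cache by the time bv reads it.
bh.compute_root();
bv.parallel(y);

// 2. Total fusion: compute the bh values each row of bv needs right before
//    that row of bv. Good locality, but neighboring rows of bv redundantly
//    recompute two of the same three bh rows.
bh.compute_at(bv, y);
bv.parallel(y);

// 3. Sliding window: keep bh's storage allocated at the root, but compute new
//    rows of bh lazily as bv's (serial) loop over y advances, reusing the rows
//    already computed for earlier iterations.
bh.store_root().compute_at(bv, y);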
There’s been lots of work on attempts to free programmers from having to schedule parallelizable tasks manually. We all wish we could just write an algorithm in a high-level way and have the computer do the work of figuring out how to parallelize it on whatever hardware resources are at hand. But decades of research haven’t yet produced a general-purpose parallelizing compiler; automatic parallelization only works in a few limited cases. The cool thing about Halide is that it makes the scheduling language explicit, instead of burying scheduling choices somewhere deep inside a too-clever-by-half compiler. If you know how you want your algorithm to be scheduled, you can write both the algorithm and the schedule yourself in Halide. If someone else wants to schedule the algorithm differently (say, to run on a different device), they can write their own schedule without touching the algorithm you wrote. And if you don’t know how you want your algorithm to be scheduled, you can just write the algorithm, let the Halide autotuner loose on it, and (after many hours, perhaps) have a pretty good schedule automatically generated for you. The advantage of separating algorithm from schedule isn’t just prettier, cleaner code; it’s the ability to efficiently explore the vast space of possible schedules.
Processing really large images with Halide
A follow-up paper, “Distributed Halide”, appeared last year at PPoPP.3 This work is about extending Halide for distributed execution, making it possible to take on image processing tasks that are too big for one machine.
Running a bunch of independent tasks on a cluster for, say, processing individual frames of video is one thing, but what we’re talking about here are single images that are too big to process on one machine. Is it ever really necessary to distribute the processing of a single image? Sometimes! As the paper explains, we’re entering the age of terapixel images.4 For example, the Microsoft Research Terapixel project involved stitching together hundreds of 14,000×14,000-pixel images, then using a global solver to remove seams, resulting in a single terapixel-sized image of the night sky. Wikipedia has a list of other very large images, such as this 846-gigapixel panoramic photo of Kuala Lumpur. So it’s not out of the question that a single image would be too large to process on a single machine. That’s the problem that the Distributed Halide work addresses.
Distributed Halide extends Halide by adding a couple of new distributed scheduling directives to the scheduling language, along with new MPI code generation in the Halide compiler. The new scheduling directives are distribute() and compute_rank(). distribute() applies to dimensions of individual pipeline stages, so each stage can be distributed across nodes (or not) independently. compute_rank() is like compute_root(), but for a particular MPI rank (“rank” being MPI-speak for “process ID”).
Adding distribution to the mix adds yet another dimension to the scheduling tradeoff space: communication between nodes. We can save time and energy by minimizing communication, but less communication generally means more redundant recomputation. Let’s suppose we want to run our two-stage 3x3 box blur on a really large image. The schedule
bh.compute_at(bv, y);
bv.parallel(y).distribute(y);
results in no communication between compute nodes, but lots of redundant recomputation at different nodes. On the other hand, the schedule
bh.compute_root().distribute(y);
bv.parallel(y).distribute(y);
results in lots of communication, but no redundant recomputation: all pixels of bh are computed only once.
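For illustration, here’s one possible middle-ground schedule, based only on my reading of the directives described above rather than on an example from the paper: computing bh once per MPI rank should avoid cross-node communication for bh while limiting the redundant work to the boundary rows shared between neighboring ranks’ slices.

// Middle ground (a sketch, assuming compute_rank() behaves as described above):
// each rank computes its own local copy of the bh region its slice of bv needs,
// so no bh pixels cross the network, and only rows near slice boundaries are
// computed more than once.
bh.compute_rank();
bv.parallel(y).distribute(y);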
A few questions
For most Halide programs, the optimal schedule is probably somewhere in the middle of the space of possible schedules. Increasing the dimensionality of the tradeoff space by adding distributed scheduling directives unfortunately means that automatically generating schedules gets a lot harder, and so the Distributed Halide work covers handwritten schedules only. Figuring out how to handle schedule synthesis for this larger space of possible schedules still seems to be an open question.
Something else I’m curious about is the idea of synthesizing Halide schedules from incomplete sketches of schedules. At this point, Halide auto-scheduling is all-or-nothing: you either write a schedule yourself by hand, or you give your algorithm to the autotuner to do all the work. But what if we could combine human and machine effort? Ravi Teja Mullapudi and co.’s 2016 paper on improved automatic scheduling for Halide pipelines concluded with the sentence, “We are interested in exploring interfaces for the auto-scheduler to accept partially written schedules by experts, and then fill in the missing details”, and it’s here that I wonder if sketch-based synthesis could be useful.
1. This isn’t necessarily what Halide syntax actually looks like; for that, see the tutorials. ↩
2. The local-Laplacian-filters pipeline apparently has 99 stages, not all of which are obvious to me from looking at Figure 1 in the PLDI ‘13 paper. A while back, after spending some time staring at the figure and trying to get the stages to add up to 99, I eventually went to the halide-dev mailing list and asked about it, and someone there was able to account for 80 of the alleged 99 stages. The point is, it’s a lot of stages. This example also demonstrates that despite its name, an image processing “pipeline” is not necessarily linear! It’s really a graph of interdependent stages, which further complicates the scheduling task. ↩
3. There are more details in first author Tyler Denniston’s 2016 master’s thesis. ↩
4. A terapixel is 10^12 pixels, or 1000 gigapixels, or 1,000,000 megapixels. By comparison, an iPhone 7 can take 12-megapixel photos. ↩