Research roundup, 2022-2023

August 1, 2023

My research group was busy this past year. Here’s a summary of what we’ve been up to in 2022-2023!

HasChor: Choreographic programming in Haskell

This year, my students and I started getting into choreographic programming. In choreographic programming, one implements a distributed system by writing a single program – a “choreography” – that describes the behavior of all nodes in the system, instead of writing individual programs that run on each node. The choreography is then compiled to programs that run on individual nodes, via a compilation step called endpoint projection (EPP).¹ If EPP is correct, the resulting collection of programs will be deadlock-free.

Several choreographic programming languages exist, but they’re mostly research languages, and some only exist on paper or in a proof assistant. We created HasChor, a Haskell library that implements choreographic programming as an embedded domain-specific language in Haskell. In HasChor, choreographies are expressed as computations in a monad. To our knowledge, choreographies-as-monads is a new idea, and HasChor is the first implementation of choreographic programming as a library, monadic or otherwise, that we are aware of.

Because HasChor is “just a library”, it can be installed like any Haskell library, compiled and run like any Haskell program, and can use any Haskell tools for development and debugging. HasChor also naturally supports higher-order choreographies and location polymorphism, both cutting-edge features of choreographic languages that enable code reuse, but we didn’t have to do a lot of work to get these features – we get them almost “for free” from the host language.

We also find that Haskell lends itself to an especially nice implementation of endpoint projection. In particular, my student Gan Shen had the idea to use freer monads to make it so that one choreography can be interpreted to different programs depending on which participant you are (say, a server or a client). The choreographic primitives of HasChor are deeply embedded so that we can freely interpret them, while other parts of the language are shallowly embedded so that we can reuse existing Haskell constructs as much as possible. (I used to think that such a mixed embedding was something weird and exotic, but now that I’ve seen how it works in HasChor, it feels very natural!) Our implementation approach also means that HasChor supports swappable backends that implement message transport. We provide an HTTP backend that’s intended for web programming, but it’s completely decoupled from the implementation of EPP, and programmers can implement their own backend if they want to use HasChor in other message-passing settings.

As a case study, we used HasChor to implement a distributed key-value store as a choreography. We start with a simple client-server setup with no replication, then add primary-backup replication, and then move to a version that factors the choice of replication strategy out to its own choreography. My student Shun Kashiwa designed and implemented all these examples.

I’m thrilled to announce that our paper on HasChor will appear at ICFP this year – don’t miss Gan’s talk in Seattle this September, and in the meantime, you can read the paper on arXiv or play with the HasChor implementation!

We’re newcomers to choreographic programming, and we’re grateful to experts like Fabrizio Montesi and Andrew Hirsch for all the guidance they’ve given us over the last year. If you want to learn about choreographies, I recommend checking out Fabrizio’s new book.

Causal message delivery as a refinement type

This year, we finally published “Verified Causal Broadcast with Liquid Haskell” at IFL ‘22! We used Liquid Haskell’s refinement types to verify the safety of a classic algorithm for causal message broadcast, resulting in a running Haskell implementation that we actually deployed. In this setting, “safety” means that messages will never be delivered out of causal order. I’ve already written and spoken at length elsewhere about the specifics of this particular project, so here I’ll just say something about refinement types and Liquid Haskell more broadly.

A refinement type can be thought of as the type of a dependent pair where the second component is some kind of proof about the first component. So in that sense, refinement types are just a particular use case of dependent types (and we can find them in, for instance, the Agda standard library). However, to construct such a pair, you’d need to supply both the value and the proof, and when people talk about “refinement types” these days², they usually mean they’re asking some under-the-hood system to supply the proof, typically with help from an external SMT solver. This is an implementation detail in some sense, but an extremely important one in terms of programmer experience. (And you could imagine hooking it into a typical dependently typed language – see our proposal/plea for SMT-powered refinement types in Agda.)

I think Liquid Haskell is fascinating because it comes from the other direction. It was originally “just” an SMT-backed decidable refinement type system, then evolved the ability to supply one’s own manual (but still solver-assisted!) proofs. It can now be said to be an honest-to-goodness proof assistant – albeit one that’s pretty hard to use, and so far lacking any interactive proof development features (which is the topic of another proposal/plea from our group). But – with effort – you can actually prove things about deployable code with it!

I was new to research-level mechanized verification – a few Software Foundations exercises don’t count – when we started this verified causal message delivery project³, and for the most part, so were my students. I thought that using Liquid Haskell would make things easy because we were already familiar with Haskell, but it turns out that mechanized verification is hard regardless of the mechanization tool. Writing idiomatic Haskell and then trying to verify it didn’t work; we had to think about verification from the start. At the talk I gave at Dagstuhl this spring, I joked that this project is what happens when a group of stubborn Haskell programmers attempt to do mechanized verification of distributed systems. We flailed around, made false starts, learned a lot, and finally got a result we’re proud of: we can write down a refinement type to express the property that all messages delivered by the processes in a distributed system are delivered in causal order, and we can use Liquid Haskell to prove that this property is actually true of our implementation. Of course, by “we”, I mostly mean my student, Patrick Redmond, without whom this project couldn’t have happened. I am very proud of our paper (led by Patrick, with contributions from Gan and from our collaborator, Liquid Haskell creator Niki Vazou), which distills what we learned over two years of hard work!

What goes wrong in serverless runtimes?

So-called “serverless” computing is a cloud computing abstraction in which programmers specify functions to run, and the platform provides everything else (like an operating system, resource allocation, load balancing, and fault tolerance). Knative is a popular open-source serverless platform whose development is backed by Google; paid Google products like Cloud Run are (said to be) based on Knative.

This spring, my student Tim Goodwin (co-advised by my colleague Andrew Quinn) dug into the issue tracker for Knative Serving, the runtime component of Knative, with the goal of getting an an idea of the ways in which serverless computing platforms tend to be buggy. Tim considered every single bug reported on the Knative Serving public issue tracker between January 2020 and February 2023. He narrowed this set of bugs down to those that pertained to runtime behavior (as opposed to, say, installation bugs), ending up with a dataset of 103 bugs, and then looked at each of those bugs in more detail and taxonomized them. The most common classes of bugs were: bugs relating to Knative’s use of status conditions (e.g., issue #10267), bugs relating to how Knative interacts with the underlying Kubernetes container orchestration platform (e.g., issue #13204), bugs relating to user-provided Knative configuration parameters (e.g., issue #11926), and bugs relating to Knative’s autoscaling behavior (e.g., issue #8610). These categories aren’t mutually exclusive; bugs could (and lots did) fall into more than one of them. This is the first time I’ve ever worked on a bug-study paper like this, and I learned a lot! Unlike a lot of my work, which proceeds more or less from first principles, this project involved actually observing bugs in the wild – being naturalists. (Entomologists, if you will.)

To me, configuration parameter bugs are a particularly interesting class of bugs. These aren’t situations where the user could be said to have misconfigured Knative; rather, they are situations where the user picked a certain combination of configuration parameters that didn’t work as expected, due to a genuine issue with the platform. Issue #11926 is representative of this category of bugs. A high-level summary of what’s going on with this bug: during periods of high load, Knative tries to speed up request processing by turning on a particular performance optimization; what “high load” means in this context depends on a combination of user-configurable settings. Unfortunately, under a particular combination of settings – including two of the default settings – the performance optimization would flip on and off with every request, which actually ended up degrading performance. (One Knative developer coined the term “flipageddon” for this behavior.) The Knative developers resolved this particular issue by changing the default value for one of the settings in a way that presumably makes the bug less likely to be triggered, though not impossible. Perhaps many such bugs could be found through systematic testing.

Tim presented our bug study paper at the SESAME workshop at EuroSys in May. The paper describes our methodology and findings in more detail, discusses more bugs that are representative of the classes we identified, and suggests some future research directions that interest us (and, hopefully, you!).

CRDTs as coalgebras

Conflict-free replicated data types, or CRDTs, are data structures designed for replication and high availability across multiple physical nodes in a distributed system.⁴ A well-known result about CRDTs is that so-called “state-based” CRDTs, in which replicas send copies of their state to other replicas, and “operation-based” CRDTs, in which replicas broadcast copies of the operations that have been applied to them, are equivalent, in the sense that any CRDT specified in the state-based style can be emulated by one specified in the op-based style, and vice versa. However, while Shapiro et al. give a recipe for constructing an op-based CRDT given a state-based one and vice versa, they don’t go so far as to prove that an object constructed using this recipe is in fact equivalent to the original object, in the sense of having the same behavior from the perspective of clients. To tackle this question, my students Nathan Liittschwager and Jonathan Castello began investigating the semantics of CRDTs from a new perspective: that of viewing CRDTs as coalgebras. (If you don’t know what coalgebras are and you want a computer-science-y definition, “a generalization of transition systems” is close enough for rock n’ roll.)

At first glance, the question looks like a typical contextual equivalence problem: do the two implementations have the same behavior when dropped into the same context? However, state-based and op-based CRDTs have different interfaces, exposing different operations. (For instance, state-based CRDTs have a merge operation, and op-based CRDTs don’t.) They have different types! How do we talk about equivalence in this setting? One approach is to separate the query and update operations that clients can do – which are part of the interface of both state-based and op-based CRDTs – from internal operations, like merge, which differ between the state-based and op-based flavors. We can therefore think about the equivalence of different types of CRDTs by only looking at the parts that both types have in common.⁵ Moreover, by modeling CRDTs as coalgebras, we get a nice notion of equivalence of CRDTs in terms of bisimulation between coalgebras. We outline this idea in an abstract that Nathan presented at CALCO⁶ this spring, and we’re currently working on fleshing out the idea further, together with our collaborator Stelios Tsampas. We think that the coalgebraic perspective will be useful not only for thinking about CRDTs, but for all kinds of replicated systems, including those that (unlike CRDTs) provide strong consistency among replicas.

Exceptional actors

And one more thing!

Asynchronous exceptions are an interesting feature of the GHC runtime system (RTS). Unlike typical (“synchronous”) exceptions, which come from the current thread, asynchronous exceptions can be thrown to the current thread by a different thread (or by the runtime system itself, as in the case of timeouts). Perhaps worryingly, asynchronous exceptions can therefore be repurposed for general inter-thread communication. My student Patrick demonstrated this by using asynchronous exceptions to implement an actor framework – or, perhaps more accurately, he discovered that this actor framework already exists within the GHC RTS. Sending a message is implemented with throwTo; receiving a message is implemented with catch. While we don’t actually condone using asynchronous exceptions this way, we do think it’s a neat hack, so we wrote a paper about it (as a literate Haskell program). Patrick will be presenting our “exceptional actor system” at the Haskell Symposium in Seattle this September – stay for an extra day after ICFP and catch his talk!

Work on multiparty session types (MPSTs) also uses the term “endpoint projection” to describe projecting global types to local types. Choreographic programming has its roots in the session types community, and its version of EPP is similar in spirit to the MPST version, but in the setting of choreographic programming, it means projecting global programs to local programs. ↩
Although, as Michael Greenberg points out, it hasn’t always been this way. ↩
I did my PhD between 2008 and 2014, as the current golden age of PL research was getting rolling, and I now think that I was part of the last generation who could escape with a PL PhD without really learning how to use either a proof assistant or an SMT solver (more’s the pity). ↩
I’ve been interested in CRDTs for a long time; my first experience with Liquid Haskell was on a CRDT verification project, and one reason I was interested in verifying causal broadcast in the first place was because I suspected that a separately verified causal broadcast component would simplify the end-to-end verification of op-based CRDTs. Before we could confirm that suspicion ourselves, Abel Nieto and collaborators wrote a cool paper that beat us to it, using a separation-logic-based approach rather than refinement types. ↩
Indeed, in a new paper, Nieto et al. observe that state-based and op-based CRDTs both turn out to be implementations of the same abstract data type that supports get_state and update operations. ↩
This year, CALCO was in Bloomington, Indiana, on the campus of Indiana University, where I did my PhD. I was bummed I couldn’t go myself. A lot of the talks took place in the very classroom where I took Kent Dybvig’s compilers class fifteen years ago! ↩

Lindsey Kuper