In my distributed systems course a couple weeks ago, we were discussing how the anti-entropy mechanism in the Dynamo key-value store uses Merkle trees to efficiently compare the data stored on different replicas. The idea is this: suppose that two replicas (on different physical machines, perhaps on opposite sides of the planet) are each storing some large number of key-value pairs. Suppose you want to make sure that the values of the keys on replica A are the same as those of the corresponding keys on replica B. (They might not agree if, say, replica A got updated at a time when replica B was unreachable.)

So, you need to compare the values of keys somehow, and this comparison must be computed somewhere – either on replica A or replica B. If you’re going to do it on replica A, then the value to be compared with has to be sent over the network from replica B, and vice versa.1 Since these values could be quite large, you can send a hash of each value instead and compare the values by their hashes rather than by their actual contents. But if you want to do this comparison for every key-value pair, even just sending the hashes could take more bandwidth than you want to use.

Merkle trees address this issue by storing data blocks (in this case, key-value pairs) at the leaves of a tree. Each leaf node is labeled with the hash of its data block, and each non-leaf node is labeled with the hash of its children’s labels. The root of the tree, then, is labeled with a single hash value, and this single value is what you send over the network when you need to do a comparison. If the root-level hashes on replicas A and B agree, great! The replicas are in the same state. If they disagree, though, then you need to make more comparisons, one for each child of the root (two comparisons if your tree is a binary tree; k comparisons if it’s a k-ary tree). Wherever those hashes agree, you can ignore that node’s entire subtree, because you know the replicas agree on all the data beneath it. You only have to continue working down the tree in the places where the hashes don’t match. In this way, Merkle trees enable efficient comparison of large data sets. In asymptotic terms, if you have N data blocks and there happens to be one block where your replicas disagree on the data, then you’d only need to make O(log N) comparisons of hashes to find the point of disagreement, rather than O(N) comparisons.
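To make that concrete, here’s a tiny Python sketch of the whole dance, under some simplifying assumptions of my own: both replicas hold the same keys in the same order, the number of leaves is a power of two, and the names (`Node`, `build`, `diff`) are made up rather than taken from Dynamo.

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

class Node:
    def __init__(self, label, left=None, right=None, key=None):
        self.label = label   # hash labeling this node
        self.left = left
        self.right = right
        self.key = key       # set only on leaf nodes

def build(pairs):
    """Build a binary Merkle tree over (key, value) pairs, sorted by key.

    Assumes the number of pairs is a power of two, for simplicity."""
    nodes = [Node(h(value), key=key) for key, value in pairs]
    while len(nodes) > 1:
        nodes = [Node(h(a.label + b.label), left=a, right=b)
                 for a, b in zip(nodes[0::2], nodes[1::2])]
    return nodes[0]

def diff(a, b, out):
    """Walk two trees top-down, collecting keys whose values differ."""
    if a.label == b.label:      # subtrees agree: prune the whole thing
        return
    if a.key is not None:       # both are leaves for the same key
        out.append(a.key)
        return
    diff(a.left, b.left, out)
    diff(a.right, b.right, out)

# Two replicas agree on keys k0..k3 except for the value of k2.
replica_a = build([(f"k{i}".encode(), f"v{i}".encode()) for i in range(4)])
replica_b = build([(f"k{i}".encode(), b"oops" if i == 2 else f"v{i}".encode())
                   for i in range(4)])
mismatches = []
diff(replica_a, replica_b, mismatches)
print(mismatches)   # [b'k2']
```

In the common case where nothing has diverged, the root labels match and that single comparison is the only work done.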

When we talked about this in class, the example I drew on the board had two binary trees, each with four leaves and a key-value pair at each leaf; the two trees had the same four keys and differed in only one of the values. I made a big show of pointing out how we could ignore the side of the tree that was the same. But then a student in my class asked a great question that I didn’t know the answer to: how does this Merkle tree thing work if the replicas don’t know about the same set of keys?

Put on the spot in class, I didn’t have a good answer. When I go back and look at the Dynamo paper, it seems like the answer is that since all the replicas know which key ranges they’re responsible for, they can just have an entry in the Merkle tree for each key in the range, whether or not there’s actually a value associated with that key on a given replica. (Because these ranges are pretty big, you might want to use an efficient “sparse Merkle tree” representation for the tree to be of tractable size.)
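If I’m reading that right, a toy version of the sparse idea might look something like the following (Python again, my own made-up names, definitely not the real Dynamo code): every key in the range gets a slot in the tree, but subtrees containing no actual data all hash to the same per-height “empty” label, so you never have to materialize them.

```python
import hashlib
from functools import lru_cache

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

EMPTY = h(b"")   # label for a key in the range that has no value stored

@lru_cache(maxsize=None)
def empty_root(height: int) -> bytes:
    """Root label of a completely empty subtree, computed once per height."""
    if height == 0:
        return EMPTY
    child = empty_root(height - 1)
    return h(child + child)

def sparse_root(values: dict, lo: int, height: int) -> bytes:
    """Root label over the key range lo .. lo + 2**height - 1.

    `values` maps the handful of keys this replica actually stores to
    their values; every other key in the range counts as empty."""
    if not any(lo <= k < lo + 2 ** height for k in values):
        return empty_root(height)        # an empty subtree, however huge
    if height == 0:
        return h(values[lo])
    half = 2 ** (height - 1)
    return h(sparse_root(values, lo, height - 1) +
             sparse_root(values, lo + half, height - 1))

# A replica responsible for keys 0 .. 2**20 - 1 but holding only three values:
print(sparse_root({5: b"a", 70: b"b", 900_000: b"c"}, lo=0, height=20).hex())
```

Two replicas responsible for the same range can then compare root labels and descend exactly as before, even if one of them is missing some of the keys.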

A couple of days later, though, I happened to be hanging out with Graydon and posed the question to him, and he had a more interesting answer. He said that what you could do is store the data in a bitwise prefix tree (or radix tree, or trie, or whatever it is in your idiolect), keyed on hashes of the keys. But you also hash the values themselves and label leaves with hashes of values, and internal nodes with hashes of children’s labels, in Merkle tree fashion. This took a while for me to get my head around, because there are two different things to hash. (It wouldn’t make any sense to only hash the keys for the Merkle tree, because it’s the values themselves that you actually want to compare. But for the prefix tree representation, you do want to only hash the keys, because you need a given key-value pair to be stored at the same place in the tree, regardless of the value. And the two hashes are unrelated; you wouldn’t need to use the same hash function.)
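Here’s roughly how I picture it, as a Python sketch: an uncompressed binary trie (a real radix tree would collapse the single-child chains), made-up names, and two deliberately different hash functions to underline that the key hash and the value hash are unrelated.

```python
import hashlib

def key_bits(key: bytes) -> str:
    """Where the pair lives in the trie: the key's hash, as a bit string."""
    digest = hashlib.sha256(key).digest()
    return format(int.from_bytes(digest, "big"), "0256b")

def value_label(value: bytes) -> bytes:
    """Merkle label for a leaf; needn't be the same hash as key_bits."""
    return hashlib.blake2b(value).digest()

class MerklixNode:
    def __init__(self):
        self.children = {}     # bit ("0" or "1") -> MerklixNode
        self.key = None        # stored on leaves, for reporting
        self.leaf_label = None
        self.label = None      # Merkle label, filled in by relabel()

def insert(root, key, value):
    """Walk the bits of hash(key) and hang hash(value) at the leaf."""
    node = root
    for bit in key_bits(key):
        node = node.children.setdefault(bit, MerklixNode())
    node.key = key
    node.leaf_label = value_label(value)

def relabel(node):
    """The Merkle overlay: leaves get the value hash, internal nodes get
    the hash of their children's labels."""
    if node.leaf_label is not None:
        node.label = node.leaf_label
    else:
        node.label = hashlib.sha256(
            b"".join(relabel(node.children[b])
                     for b in sorted(node.children))).digest()
    return node.label

root = MerklixNode()
insert(root, b"apple", b"red")
insert(root, b"pear", b"green")
print(relabel(root).hex())
```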

The resulting data structure is called a “merkleized” (or “merkelized”; spellings vary) prefix tree, a prefix tree with a “Merkle overlay”, or my favorite, a Merklix tree. But it isn’t only for blockchain and blockchain-related program activities, despite what the search results for any of those names might have you believe. Graydon told me that this was in fact the data structure he implemented for synchronizing copies of objects in the Monotone version control system, circa 2004! I asked what name he used for it, and he said that at the time he just called it a Merkle tree because it didn’t occur to him that there was a version that wasn’t a prefix tree. That amuses me because these days, if I do, say, a Google image search for “merkle trees”, all I see are boring old perfect binary trees like the one I drew on the board in class.

Look at all those perfect little trees.

In any case, if replicas had different sets of keys and you didn’t have a priori knowledge of key ranges, you could use Merklix trees: determine that the replicas have different sets of keys by looking only at the prefix tree structure, and then use the Merkle overlay to sync up the keys whose values disagree. This doesn’t seem to be what Dynamo actually does, but I was happy to learn about it anyway.
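As a rough sketch of that, continuing the Merklix code from above (it reuses the hypothetical `MerklixNode`, `insert`, and `relabel` from that sketch, so it isn’t self-contained on its own): a subtree present on only one replica is a key-set difference you can read off the trie structure alone, and matching Merkle labels let you skip everything the replicas already agree on.

```python
def merklix_diff(a, b, path="", out=None):
    """Compare two relabeled Merklix trees (see the sketch above).

    Reports subtrees that exist on only one replica (by bit-path prefix of
    the key hash) and leaves where both replicas have the key but the
    value labels disagree."""
    if out is None:
        out = []
    if a is None or b is None:
        out.append(("only on one replica", path))
        return out
    if a.label == b.label:              # Merkle overlay: prune agreement
        return out
    if a.leaf_label is not None:        # both leaves hold the same key
        out.append(("values differ", a.key))
        return out
    for bit in sorted(set(a.children) | set(b.children)):
        merklix_diff(a.children.get(bit), b.children.get(bit),
                     path + bit, out)
    return out

# Replica A has apple and pear; replica B has apple and plum.
a, b = MerklixNode(), MerklixNode()
for k, v in [(b"apple", b"red"), (b"pear", b"green")]:
    insert(a, k, v)
for k, v in [(b"apple", b"red"), (b"plum", b"purple")]:
    insert(b, k, v)
relabel(a)
relabel(b)
print(merklix_diff(a, b))
```

The bit-path prefixes don’t tell you which keys are missing, only where in the trie to look; presumably a real implementation would then ask the other replica for whatever it has under those prefixes.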

  1. Presumably you would also be storing some kind of timestamp for each key, whether logical or physical, that would tell you which copy of it had been updated more recently, but that’s a whole other big ol’ can of worms.
