2025-03-09

On the design of Compact Elastic Binary Trees (cebtree)

Those who often hear me talk about my week-end projects have become accustomed to hearing about deuterium fusion (that's for another post), laser engraving, and the compact version of the ebtrees, aka compact elastic binary trees, without knowing all the details. That's what we'll be discussing here.

A long history

Self-organizing trees

It started very early during the development of ebtrees, in which the same node appears twice, once as a node and once as a leaf. In 2007, when I managed to build self-organizing trees to handle duplicates, I noticed that all that was needed to insert, delete, and visit nodes was to remember the node from which to restart or where to attach, and that it was never necessary to go upwards in the tree for such operations. That made me think that maybe this type of structure could also arrange data without an uplink and without keeping any relation between a node's depth and the key it conveys. Adding a new duplicate value would only require measuring the depth of the penultimate branch and letting the next one grow until it becomes as long. If it was already as long, then another one just had to be added above it. For example below, we're adding 4 successive entries to a duplicate sub-tree that already contains 4 identical keys ordered by their insertion sequence. The number in the leaf indicates the arrival order for each of them:

XORing branches

18 months later, when optimizing the lookup algorithm by comparing the XOR between the looked-up key and the node's key with the node's position, I figured that the XOR between the two branches was always necessarily lower, given that these branches have more bits in common, and that maybe we could afford not to store the depth if the XOR between the key and the node, and the XOR between the two next branches, already allowed us to figure out where we're located in the tree. An intuition was that, just like in the ebtrees, we can detect that the searched key is not there when the XOR between the key and the current node reaches a higher bit than the XOR of the two branches, and that we're reaching a leaf when the branches' XOR is no longer smaller than the previous one.

In the example below, we successively insert the keys 6, 4, then 5. 6 being the first entry, it's the node-less entry (it only has a leaf). Then 4 is inserted, with leaves 4 and 6 below it, then 5 is inserted below 4, so it has 4 and 5 as its leaves:

In red is indicated the XOR between the branches. The node-less leaf is a special case that loops to itself, so its descendants are the same. 4 splits between 5 and 6, so the XOR of the branches here is 5^6=3. 5 splits 4 and 5 so its XOR is 4^5=1.

When descending along the branch to key 4, an XOR of 3 is first met at node 4, then an XOR of 1 is met at node 5, then when going down to node 4 (which is the same as the one on top), we see an XOR of 3 again. 3 being higher than the previous value 1 proves that we have looped, so 4 here is no longer a node but a leaf.

Similarly, when looking up a value, XORing the searched value with the node's value indicates whether we have a chance to find it below or not. As soon as the XOR between the looked-up value and the node's value reaches a higher bit than the XOR between the node's branches, we're certain it cannot be there anymore.
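To make this more concrete, here is a minimal, purely illustrative lookup following these rules. It does not use the real cebtree types or API: the key is embedded in the node for simplicity (the real library keeps it in the containing structure), and the early-exit test described above is only mentioned in a comment:

struct toy_node {
    struct toy_node *b[2];        /* left and right branches, nothing else */
    unsigned int key;             /* stored here only for the sake of the example */
};

static struct toy_node *toy_lookup(struct toy_node *root, unsigned int key)
{
    struct toy_node *n = root;
    unsigned long long prev_xor = ~0ULL;  /* larger than any possible branch XOR */
    unsigned int branch_xor, bit;

    if (!n)
        return NULL;

    while (1) {
        branch_xor = n->b[0]->key ^ n->b[1]->key;

        /* The XOR between the two branches shrinks at each level. When it
         * no longer does, or when both branches carry the same key (the
         * node-less leaf looping to itself), the current node is a leaf.
         */
        if (!branch_xor || branch_xor >= prev_xor)
            break;

        /* One could also stop early here when (key ^ n->key) reaches a
         * higher bit than branch_xor, as explained above: the key cannot
         * be anywhere in this subtree.
         */

        /* isolate the highest bit of branch_xor (the split bit) and follow
         * the branch matching the key's value at that position */
        bit = branch_xor;
        bit |= bit >> 1; bit |= bit >> 2; bit |= bit >> 4;
        bit |= bit >> 8; bit |= bit >> 16;
        bit -= bit >> 1;

        prev_xor = branch_xor;
        n = n->b[!!(key & bit)];
    }
    return n->key == key ? n : NULL;
}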

All of this was fun and interesting, but turning intuitions into new week-end projects is never trivial nor necessarily desired when there are already plenty of other activities, so aside from a few ideas from time to time when improving ebtrees, not much progress was made in this area for a while.

First limited implementation

Several years later, in 2014, while thinking about a super-compact memory allocator that I'd call "water drop alloc" ("wdalloc"), to illustrate how a drop of water released between two other ones that it touches instantly becomes a larger one, I was thinking that a more compact variant of ebtrees would be particularly suited for that purpose if they indexed the pointer itself as the key instead of a stored key. An example of the principle is seen below, where freeing a block of length 4 at address 5 results in a new free block of 11 by fusing it with the two adjacent blocks, thanks to trivial adjacent address lookups:

Thus I implemented what I called "space efficient trees" back then for that purpose, proving that the concept of descending with only two pointers and nothing else is sufficient to build a complete tree:

That's it, only two pointers and nothing else, i.e. 16 bytes on 64-bit architectures. Compared to ebtrees which take 40 bytes (or even 48 with alignment) on such architectures, it's a nice saving!
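Conceptually, such a node boils down to nothing more than its two branch pointers. The sketch below only illustrates the idea and is not necessarily the exact definition used in the project:

struct ceb_node {
    struct ceb_node *b[2];   /* left and right branches: 2 x 8 = 16 bytes on 64-bit */
};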

Here the leaf keys would just be the nodes' addresses, so there was no additional storage, and two pointers were really the only required storage for each allocatable area, with nothing having to be stored as long as the object is in use.

That application was well suited to the limited experience I had accumulated by then about these new trees:

  • no need to look up the pointer upon free(): the node is at a fixed place relative to the pointer passed to free(). An optional lookup would still allow detecting double-free, though.
  • lookup_ge() and lookup_le() were sufficient to find adjacent entries, as well as to index free objects by their size (see the sketch right after this list).
  • there were no duplicates by definition since the key is the object's address
  • all operations require to know the root, which in this case was the origin of the allocation arena.
  • space saving: only two pointers per entry, thus limiting granularity to dual-word, which preserves the dual-word alignment guarantees generally assumed by C ABIs.
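To illustrate how such an allocator could merge adjacent free areas, here is a hypothetical sketch of the free() path. All the names below (wd_free, wd_lookup_le, wd_lookup_ge, wd_insert, wd_delete, struct free_blk) are made up for the illustration and are not the wdalloc nor the cebtree API; the idea is simply that free blocks are indexed by their own address, so both neighbours are found with one "lower-or-equal" and one "greater-or-equal" lookup:

struct free_blk {
    struct ceb_node by_addr;   /* two-pointer node stored at the start of the free area */
    size_t size;               /* size of the free area in bytes */
};

static void wd_free(struct ceb_root **root, void *ptr, size_t size)
{
    struct free_blk *blk = ptr;
    char *end = (char *)ptr + size;
    struct free_blk *prev = wd_lookup_le(root, ptr);  /* highest free block at or below ptr */
    struct free_blk *next = wd_lookup_ge(root, ptr);  /* lowest free block at or above ptr */

    if (prev && (char *)prev + prev->size == (char *)ptr) {
        /* the previous free block ends exactly where ours starts: extend it */
        wd_delete(root, prev);
        blk = prev;
    }

    if (next && (char *)next == end) {
        /* the next free block starts exactly where ours ends: absorb it */
        wd_delete(root, next);
        end = (char *)next + next->size;
    }

    blk->size = end - (char *)blk;
    wd_insert(root, blk);  /* re-index the merged block by its address */
}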

The first tests showed that the performance was significantly lower than with other allocators, and the idea of scaling that to threads and having to lock the tree put the final nail in the project's coffin.

Next attempts

However this confirmed that the basic algorithm worked. I then started to try to better organize the ebtree code to support the relative addressing variants that had been lying in their own branch for 3 years, to support multiple addressing modes, and to try to merge both the standard and the compact models. That was tough work that spanned from 2014 to about 2017, gaining in complexity over time and progressively slowing down. Worse, every time I had an idea on that topic, it required a full week-end just to get back into the code being reorganized, and the week-end was often over before I had the time to evaluate my idea.

In early 2020, we were locked down with week-ends becoming particularly boring, and I had to keep my mind busy, so it was a sane week-end activity to get back to that project and try to complete it. I tried to force myself to proceed in baby steps, in order to avoid the trap of getting stuck in the code being reorganized:

  • first, I actually had to re-learn how that whole thing worked, verify the properties that I could use, and demonstrate them. That kept me busy with pen and paper for several week-ends in the garden during a fortunately sunny spring.
  • second, I had to verify whether storing ordered duplicates was possible at all, given that my wdalloc experiment never made use of them. It took me a while to convince myself that the self-organizing tree was not suitable here, since we were not storing the depth to distinguish between them:



    It's visible above that the XOR between all branches is 0 in the duplicates, so there's no way to get out of them and find a leaf. In ebtrees, what actually breaks the loop is the fact that we rely on the depth counter associated with each node: when finding a node whose depth is higher than the previous one, it indicates we have looped, thus that the last node was in fact a leaf.
  • For a long time I also considered storing a pair made of the node's pointer and the key to differentiate them, but I didn't like it because keys would be retrieved in their in-memory order, not in the insertion order, and that was not great for certain operations where it's mandatory to know which one will be used. In addition, that could have cost a lot to delete an element, requiring many duplicates to be visited before finding the right one.

     
  • Finally I managed to design a list-based mechanism that worked on paper using tagged pointers. I had the intuition that it might possibly also work without tagged pointers, but at least I knew I had a solution which allowed any duplicate to be deleted in O(1) after locating it in O(logN). There were many special cases though, and that was significantly complicating the design.

I left that quest aside for a while. I had proven my design, and the lockdown period was not something that could be associated with a desire for creativity :-/ Adding to that the fact that diving into this on week-ends is often unproductive, as it takes a whole week-end to figure out again how all of this works, it became discouraging after a few attempts.

I forced myself to work on it again 9 months later during the new year's vacation, continuing to integrate it into the reorganized branch of ebtree. And that was a big mistake: the state of that branch made it too hard to make any progress efficiently, to the point that I'm now wondering if there's anything to reuse in the branch collecting that long work.

Restart from scratch with first success

Finally, during a summer vacation in 2023, I decided that the compact tree code needed to be forked into its own project, and I restarted from scratch there, intentionally leaving the complexity of dealing with duplicates aside for a first implementation, ignoring some of the initially planned optimizations, and trying to make it work without tagged pointers so as to keep them for later if needed (e.g. for duplicates), all done with a single function dealing with all types and operations in order not to reproduce the maintenance difficulties met with ebtrees. Another element taken into consideration was the extension of the API to support a different key offset, permitting in some cases to avoid repeating a key in certain structures when it's already present but not immediately after the tree node:
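As an illustration of that key-offset idea, consider a hypothetical structure where the key is not placed right after the tree node; the offset between the two is what such an API extension would take into account instead of requiring the key to be duplicated (the structure and field names below are made up, and the exact form of the offset-aware cebtree calls may differ):

#include <stddef.h>

struct server {
    struct ceb_node by_name;   /* node used to index the server by name */
    int weight;                /* other fields sitting between the node and the key */
    int maxconn;
    char *name;                /* the key, not immediately after the node */
};

/* distance from the tree node to the key, as an offset-aware API would use it */
ptrdiff_t name_ofs = offsetof(struct server, name) - offsetof(struct server, by_name);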

After all this, v0.1 was finally released mid-September 2024 with support for all types but no duplicates, as planned for the first version. I could find a good use for it in haproxy 3.1: variables generally have a short name with a small value, and couldn't afford a huge ebtree indexing node, so they were still in a linked list. We already knew that variables were gaining in popularity and that their cost would increase linearly with their number. Thus it was a perfect opportunity to migrate their indexing to cebtree at no extra cost. With this change, a config with 100 variables improved its performance by 37% with no size increase. Great!

Need for duplicates again

Studying opportunities in other areas of haproxy like maps, ACLs, timers, etc. revealed that it was not that trivial, because all of these need ordered duplicates. So that was a good opportunity to try to attack that problem again.

The previous work done during lockdown consisted in inserting lists between a node and a leaf. The idea was that in all cases we're dealing with pointers to nodes, and that duplicates can be arranged as a list without losing information nor needing extra storage, if we only use list elements as spacers:

The principle is not very complex. When adding a duplicate key, we're inserting it as a list element between the previous leaf and the node that pointed to it. The inserted node has its left pointer pointing to the leaf and the right one pointing to itself. When adding new duplicates, the new duplicate is inserted exactly the same way, and the previous list element points to it from its right pointer. The list can thus be extended infinitely. What is nice is that when looking up the value, it is detected that there are multiple keys because, regardless of the number of list elements, the left and the right pointers point to a node having the same value (possibly the same node itself). When such a duplicate entry is detected, it is always the last inserted one (e.g. the blue one on the diagram above), and the first leaf is always reachable via last->right->left. Then, in order to visit the next duplicate, either the current node is this first leaf, in which case the next one is last->right, or it isn't, in which case it is necessarily one of the duplicates and the next node is always node->right, until the last one is reached (the one that was first found); a small traversal sketch follows the list below. This construct is super interesting because it provides the following properties:

  • appending a new entry always places it at the end
  • visiting entries always starts from the first inserted ones
  • it is always possible from a node to instantly find its neighbors in order to delete it, thus making it relatively easy to "pick" elements (i.e. find and detach the next one) in insertion order. 
  • list entries behave like nodes in that they always point to adjacent values that fall within as large a prefix length as was previously encountered (since they are equal).
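Here is a small sketch of that visiting rule, written exactly as described above. It is only illustrative: the left/right field names and the visit() callback are placeholders rather than the real cebtree types, and the last element is assumed to have been found by a regular descent:

struct dup_node {
    struct dup_node *left, *right;   /* the two branch pointers, named as in the text */
};

/* 'last' is the duplicate found during the descent (the last inserted one);
 * the first inserted leaf is reachable through last->right->left, and the
 * following ones are chained through their right pointers up to 'last'. */
static void visit_duplicates(struct dup_node *last, void (*visit)(struct dup_node *))
{
    struct dup_node *first = last->right->left;   /* first inserted leaf */
    struct dup_node *node = first;

    while (1) {
        visit(node);                              /* visited in insertion order */
        if (node == last)
            break;                                /* the last inserted one closes the walk */
        node = (node == first) ? last->right : node->right;
    }
}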

Finally v0.2 was issued with this support for duplicates mid-February 2025. It's now considered feature-complete!

Working on performance

Performance comparisons show that, as expected, this cebtree is less performant than ebtree due to having to compare two nodes per level during the descent, only one of which turns out to be useful, thus effectively doubling the number of memory reads compared to ebtree.

In addition, large strings are severely affected because, just like in non-radix trees such as rbtrees, they have to be compared from the beginning at each level in order to detect a leaf, so the average lookup cost is L*log2(N) where L is the average string length and N the number of strings. Since we didn't make use of tagged pointers yet, it was time to introduce them to designate leaves, allowing the comparison to restart from the previous split point so that strings are now entirely read only once during the lookup.

The approach that seemed to work on paper during lockdown proved its value without affecting duplicate lists, and it is super simple: only pointers to a leaf are tagged. This means that pointers to lists are not tagged, and that a tagged pointer inside a list of duplicates necessarily indicates the first leaf. Also, insertion/removal operations remain extremely simple as the pointer is preserved. The total number of changes in the code to support tagged pointers remained very low, and most of them were related to the addition of wrappers to avoid directly dereferencing tagged pointers. For this, pointer types have been changed in the code so as to always distinguish what points to a node from what is a tagged node pointer. Code that used to declare a cebtree root as a "struct ceb_node *" just needs to be changed to use "struct ceb_root *", but that's all.
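The tag itself relies on the classic low-bit trick: since nodes are at least pointer-aligned, the lowest bit of a pointer is always zero and can carry the "this points to a leaf" flag. The helpers below only illustrate that principle; their names are made up and are not the actual cebtree wrappers:

#include <stdint.h>

struct ceb_node;
struct ceb_root;

/* illustrative helpers (hypothetical names): the low bit of a pointer
 * stored in the tree marks a pointer that designates a leaf. */
static inline struct ceb_root *tag_leaf(struct ceb_node *node)
{
    return (struct ceb_root *)((uintptr_t)node | 1);             /* mark as leaf */
}

static inline int is_leaf(const struct ceb_root *root)
{
    return (uintptr_t)root & 1;                                   /* tagged => leaf */
}

static inline struct ceb_node *untag(const struct ceb_root *root)
{
    return (struct ceb_node *)((uintptr_t)root & ~(uintptr_t)1);  /* strip the tag */
}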

An update of the "6,4,5,5,5" tree above with tagged pointers for leaves now gives the following (tagged pointers are on red links):


Now it's only slightly slower than ebtree on large strings (still due to twice as many keys being read during the descent), but sometimes even slightly faster on small strings, probably due to the lower memory bandwidth requirement thanks to the much smaller nodes. This is currently found in the master branch of the project.

Current state

The current state looks functional. The code contains some functional and performance tests. The usage remains very close to ebtree, except that the tree root has to be passed to all operations (in ebtree, delete/next/prev etc. didn't require the root since it was possible to navigate upwards).

The choice between duplicate or unique keys is now specified in the function name; we no longer tag the root to indicate how it will be used. It was indeed found, after using ebtrees like this for a while, that it makes the code not particularly obvious to understand, as one needs to locate the initialization of the tree to know what to expect after an insert() call. Now at least the intent will be explicit in the code.

This also simplifies the root initialization, as the root now only needs to be set to a NULL pointer for an empty tree, after which insert() calls can be performed to add elements. Example below (warning! no error checking):

struct ceb_root *root = NULL;   /* empty tree */
struct ceb_node *node;
struct entry {
    struct ceb_node node;
    char *string;
} *entry; 
 
/* index all arguments (argc/argv as received by main()) */
int arg = 1;
while (arg < argc) {
    entry = calloc(1, sizeof(*entry));
    entry->string = strdup(argv[arg]); 
    cebis_insert(&root, &entry->node);
    arg++;
}
 
/* list all arguments, sorted */
node = cebis_first(&root);
while (node) {
    printf("@%p = %s\n", node, container_of(node, struct entry, node)->string);
    node = cebis_next(&root, node);
}

Just use cebuis_* above to enforce unique keys.

Cost comparison with ebtree

All operations are in O(logN), while ebtree has several operations in O(1) (next/prev/delete). In addition, for certain operations such as next/prev/delete, cebtree requires recording a restart pointer during the descent, reaching the desired leaf, then completing the operation from the recorded pointer. Just like with a next/prev, this extra cost is amortized to one extra level, but it can reach up to logN for 1/N nodes and could be averaged to logN/N.

In addition, the extra cost compared to ebtree comes from the fact that cebtree, just like other non-radix trees (rbtree etc.), needs to visit two branches to figure out which one to descend. ebtree avoids this by storing the depth value inside the node. These extra reads may consume more memory bandwidth for large keys, but given that the node itself is 2.5-3 times smaller, the extra cost for small keys can still be absorbed by the lower cost of retrieving the node from memory. However, this extra cost could have an impact on TLB misses when using indirect accesses (cebis_* etc.).

Tests performed on haproxy on a laptop with the map/acl pattern reference showed:

  • a tiny performance increase when loading a map made of 17 million IPv4 addresses and masks (22.0 seconds down to 21.5 seconds); these are short 17-character strings. The resulting memory saving was 400 MB.
  • a moderate time increase when loading a map made of 1 million user agents (much larger strings): the load time went up from 2.475 seconds to 2.711 seconds. The memory saving was 24 MB.

Thus it can make sense to use this for data that's mostly related to configuration, or that doesn't change often and is mostly looked up (e.g. config object names).

 

What's left to be done

First, the documentation in the code is still incomplete. Now that the API has become clearer, the doc should be finished. In the worst case some types will get renamed, which is no big deal. The doc must be done before 0.3 is released anyway.

Some things are still being thought through:

  • shouldn't we use a different root type for duplicate vs unique trees? It could avoid mistakes caused by copy-paste. We could very well do that in the exposed function prototypes only, so as not to carry these types down the code.
  • some operations that exist in ebtree were not implemented. One of them is the longest match, typically used to look up IP addresses within ranges. It's still unknown at this point whether this can be implemented with the current model (maybe it would require slightly changing the key representation; in ebtree the prefix length is passed and stored separately).
  • some long but still incomplete studies indicate that the code could almost support properly ordered atomic operations. The only real limitation is deletion, which is not compatible with parallel insertion or lookup. A first approach could be to use upgradable rwlocks like progressive locks, take a shared lock for all operations, and upgrade it to an exclusive lock only when committing the removal. This would result in a very short locking time, though all visitors would still need to write to a shared memory area to grab the read lock, and a moderate deletion rate could be enough to disturb visitors too often. Another solution might be to consider alternate approaches such as RCU.
  • implementation of relative pointers was started but is on pause for now. It's convenient to index files in place (e.g. logs etc.), but I didn't have an immediate use for this and it further complicates the development. It will have to be done, though.

Future work 

Simplifying the trees to the extreme of requiring only two pointers, just like a list, opens some perspectives:

  • maybe it would be useful to optionally support parent pointers in the node, so as to let users decide between performance and space.
  • the work initially done on ebtrees to try to factor the code went in the wrong direction by making it too difficult to use and maintain. By contrast, the work done on cebtree preserved maintainability despite supporting multiple types (u32, u64, long, addr, blocks, strings, indirect blocks, indirect strings). It looks like ebtree would benefit from rolling back to before that change and adopting a cebtree-like approach.
  • the duplicate lists of cebtree look way better than the duplicate trees of ebtree, and maybe we could significantly improve ebtree's performance in schedulers by adopting the same duplicates (duplicate values are frequent in schedulers due to the timers' resolution).
  • maybe ebtree could be built as an extended cebtree in the end, then it could be nice to re-merge them into a single project.
  • studying the value of self-locking nodes following the same principle as mt_lists could also be interesting, though the visit cost might become high.

Links

The code in its current state is available here:

  • cebtree: version 0.1 is the most basic one (no duplicates). Version 0.2 supports duplicates without using tagged pointers; it can be seen as a reference for what can be done when tagged pointers are not desired. The current version supports everything including tagged pointers and is still in development.
  • ebtree: version 6 is the production one. Version 7 is the one that tries to factor all the code and is not considered production-ready.
  • plock: the upgradable R/W progressive locks will probably be the next step to support parallel insertions and lookups with exclusivity only on the delete operation.
  • mt_list: the self-locked lists have concepts that might be reusable here, probably even lighter since nodes are accessible in a single direction at a time. This deserves to be studied.
