Willy Tarreau's stuff

2025-06-22

Real-world performance comparison of ebtree/cebtree/rbtree

In the previous synthetic comparison between ebtree and cebtree it was quite clear that performance varies quite a bit between key types, key distribution, lookup-to-update ratio, and tree type. While the initial utility was fine to get a rough performance estimate, it appeared important to measure the cost per operation (insert, lookup, delete). Also, the impact of the key distribution made me want to compare how balanced trees compare, since they can sometimes be interesting as well. Thus a new tool was written ("ops-time") and stored with the previous stress utility into a new project, treebench. The utility was written so that all basic operations are defined by macros, easing the integration of various tree primitives in the comparison.

Principles

The tool generates a pre-defined number of keys whose distribution matches a specific profile. Key may be read from stdin, can be full randoms, ordered with small increments in order to match timers used in schedulers, poorly distributed randoms with long chains, or short strings; Key types may be u32, u64 or strings.

Then these keys are inserted into a tree, the timing is measured over the second half (so as not to be affected by too fast operations on empty trees), then they're all looked up, then they're all deleted, and again, the timing is taken only during the first half when the tree is mostly full.

The tests chosen here were representative of specific known use cases:

insertion/deletion of u32/u64 timer-like keys: this is what happens in schedulers. Lookups are basically inexistent (only a first() or a lookup_ge() happens then only next() is used). Nodes are inserted when the timer is created and deleted when it's removed, most of the time without ever having been visited when this is used to store timeouts (which rarely trigger). One property of such keys is that they're groupped and inserted mostly following an ever-increasing pattern. Here we're using the small positive increments (we avoid duplicates in order to support all tree types).
insertion/lookup/deletion of 64 bit randoms. This matches what happens in caches where hashes are looked up and stored if non-existing. Keys are normally very well distributed since deriving from a secure enough hashing algorithm. Lookup are very frequent, and if the cache has a good hit ratio, they can represent close to 100% of the operation. However in case of low hit ratio, insertion and deletion become important as well since such trees are usually of fixed size and entries are quickly recycled.
short strings: the idea here is to have the same distribution as above with the randoms, but represented as strings. It emphasizes the cost of string comparison operations in the tree and of accessing larger nodes. This is commonly used for a few use cases: looking up session cookies (which are normally secure randoms since non-predictable), where lookups and creations are important, and they can be used to look up config elements such as UUIDs, which are created once and looked up often. In the case of haproxy for example, lookup of backend and server names is done like this. In such cases, insertion of many objects can directly impact a service's startup time, the lookup impacts runtime performance, and deletions are never used.
real IPv4 addresses: it's known that IP addresses are not well distributed on the net, with some ranges having many servers and others many clients, as well as some addresses never being assigned, or sometimes being assigned to routing equipments, thus leaving holes. Here a sample of ~880k real IPv4 addresses was collected from the logs of a public web service, and these were used to run a test to see the impact of a different distribution. Note that the frequency of these addresses was not considered for the test, and each address appeared only once.
real user-agents: similarly, user-agents strings are a perfect example of strings that do not vary much and have long chains in common. Often variations are only a single digit in an advertised plugin version. And such strings can be large (even if these days browsers have considerably trimmed them). The test was run against ~46k distinct user-agents collected from logs as well, ignoring their frequency, for the sole purpose of observing behaviors under such real-world conditions. The average user-agent length was 138 bytes.

All these tests will be run for tree sizes from 512 keys to 1M keys (1048576), except for real input values where the test will stop at the size of the input data. Also, when relevant (i.e. all cases but timers), the tests will also be run with multiple tree heads selected based on a hash of the key. Indeed, trees are commonly used to build robust hash tables when there's no ordering constraints, which allow to reduce the lookup cost thanks to the hash that reduces the average number of levels, while not suffering from the risk of attacks that regular list-based hash tables suffer from. The tests will be run with 1, 256 and 65536 tree heads, selected by an XXH3 hash of the key, in order to reduce the tree sizes.

Implementations under test

Three implementations were compared:

Elastic Binary Trees (ebtree), using stable branch 6.0. This is the commonly used branch. Tree nodes consume 40 bytes on 64-bit machines. Being a prefix tree the depth depends on differences between keys and is bound by their size. Insertions are O(logN), deletions are O(1). Duplicates are supported with a O(logN) insertion time.
Compact Elastic Binary Trees (cebtree), using the master branch (3 commits after cebtree-0.3), which is currently considered as feature-complete. Tree nodes only consume 16 bytes on 64-bit machines (only two pointers, just like a list element). Just like ebtrees, insertions are O(logN), however deletions are O(logN) since there's no up pointer. Duplicates are supported and inserted in O(1) time.
Red-Black tree (rbtree), using pvachon's implementation. This one was found to be of good quality, clean, easy to integrate and a good performer. The tree node consumes 24 bytes (3 pointers), thus a generic approach takes 32 bytes due to alignment (though by having key-specific types it's possible to store the key or a pointer to it inside the node). Insertions are O(logN), and deletions are an amortized O(logN) that's closer to O(1) with a higher cost due to tree rebalancing that doesn't happen often but that can be costly if multiple nodes have to be rebalanced.

Test results

The tests were run on an Arm-v9 device (Radxa Orion-O6) and on a recent Xeon W3-2435 (8 Sapphire Rapids cores at 4.5 GHz). The patterns were mostly identical on both devices, showing different inflexion points due to cache sizes, and higher general costs for the Arm running at a lower frequency.

We're showing graphs for insert time, lookup time, deletion time, and what we call "replace", which is in fact the sum of the deletion and the insertion, which corresponds to a node recycling in a cache, or to the only operations that matter for timers (one insertion and one deletion).

The graphs are plotted with full lines for the single-head mode, dashed for the 256-head mode, and dotted for the 65536-head mode. Just click on the images to see the details.

1- Timers

Here we're essentially interested in the last "replace" graph. It perfectly illustrates something that we had observed in haproxy a long time ago (in version 1.2) when rbtrees were being used for the scheduler: the cost of rebalancing and the resulting imbalance are overkill for such regularly distributed keys. Here ebtrees are 3x faster, which is approximately in line with observations made by then when replacing them in haproxy. Some instrumentation was made on the tree traversal code to measure the number of levels, and it appears that in this specific case, ebtrees are shorter than rbtrees. Indeed, rbtrees try not to rebalance too often to amortize the cost, and end up guaranteeing that no branch will be more than twice as long as any other one, which means that it's possible to end up with 2*log(N) levels in the worst case. And for keys that are constantly renewed, this is exactly what happens. With ebtrees on the opposite, the good distribution ensures that the tree depth is also mostly equal between branches, and roughly matches log(N)+1..2 most of the time. It's interesting to note that except for the removal where cebtrees are much more expensive than the two others, cebtrees are as fast as ebtrees for insertion and as slow as rbtrees for lookups, and end up between the two for the whole pattern. This means that in cases where timers are not under heavy stress and rbtrees could be considered to save memory, then cebtrees would in fact be both smaller and faster for this type of keys and distribution. This could concern cache or certificate expiration for example where key size is more important than raw performance.

2- 64-bit hashes

Here on randomly distributed keys, we're observing the same phenomenon as in the previous test. It took a while to figure the cause but again it's the imbalance of rbtrees vs the better balance of ebtrees on such well distributed keys that make the difference. It's also visible that insertion is faster for cebtrees than ebtrees when the tree grows (256k keys and above), clearly due to the significantly higher memory footprint of ebtrees. It's interesting to note that single-headed ebtrees here are inserted as fast as 256-headed rbtrees and follow the curve all along. It's also visible that all trees insert much faster when hashed enough, since randomly distributed keys result in properly balanced sub-trees. It's visible that using 64k heads divides the insertion time by roughly 4 for all types.

For the lookups however, cebtrees are almost as slow as rbtrees. The reason seems to be related to the need to consult adjacent nodes all along the descent, resulting in reading 3 times more data before knowing where to go, which doesn't cope well with prefetching and branch prediction. and here the rbtree and cebtree curves are mostly fused all along the tests. Hashing does considerably help but doesn't benefit as much to cebtrees as it benefits to others: even at 256 heads, cebtree lookups become the slowest of the 3. And while at 256 heads, ebtrees are much faster than rbtrees, at 64k heads rbtrees win by a small margin. It is possible that a few imbalanced chains are longer in ebtrees than the roughly balanced ones in rbtrees in this case, as there should be on average 16 keys per tree a 1M entries. But once we consider the replacement cost (since such hashed keys typically have to be recycled in caches, tables etc), then the gain is well in favor of ebtrees which can still be 30% faster in total. Overall for such use cases, it looks like ebtree remains the best choice, and that cebtrees do not make much sense, unless space saving is the primary driver.

3- short strings

For short strings, ebtrees remain faster than rbtrees to insert and lookup, by roughly 30% for insertion and 20% for lookups. One reason might simply be that we don't need to re-compare the whole string, the comparison only restarts from the current level's position. However these are typical cases where it makes sense to hash the keys (unless they have to be delivered ordered, but session cookies do not have this problem for example). In this case we see that when hashing with sufficient heads (64k), then rbtree becomes twice as fast for lookups with a steep divergence around 256k entries, very likely because again, there might exist longer chains in the resulting small ebtrees than in the similarly sized rbtrees. In all cases, cebtrees follow ebtrees, except for removal where the O(logN) deletion time increases the cost.

The conclusion on this test would be that if ordering is required, then one must go with ebtree or cebtree, which show roughly equivalent performance except for deletion where cebtrees are not good. If ordering is not a concern and topmost speed is required and hashing over a large number of buckets is acceptable in order to end up with small trees, then better go with rbtrees that will have shorter long chains on average.

4- IPv4 addresses

IPv4 address storage and lookup is much more chaotic. Due to the anticipated poor address distribution, it's clearly visible that below 256k IPv4 per tree head, lookups via ebtree are significantly faster than rbtree, and that above, rbtree is significantly faster than ebtree. This clearly indicates the existence of long chains (up to 32 levels for IPv4). Also, insertion is always faster on ebtrees even when hashing at 64k. And if that wasn't complicated enough, cebtree is always faster both for insertions and lookups (but as usual, removal is way more expensive). However, this also means that it would likely be sufficient to apply a bijective (non-colliding) transformation like an avalanche hash to the stored keys to flatten everything and get a much nicer distribution, and get back to the same scenario as for 64-bit hashes above.

So what this means is that if IPv4 keys are needed (e.g. store per-client statistics), no single implementation is better over the whole range, and it would be better to first hash the key and use it like this, provided that the hashing algorithm is cheap enough. Given that XXH3 is used with good results for hashed heads, it's likely that XXH32 would be efficient here. But overall, indexing raw IPv4 addresses is suboptimal.

5- user agents

Here the first observation is that for large mostly-similar strings, one must not use cebtrees, which are twice slower than rbtrees, both for inserts and lookups. And ebtrees are not good either, being ~10% slower to insert and ~30% slower to look up. This is not surprising, because as anticipated, user-agents only differ by a few chars, and can be long, so there can be long chains of a single-char difference, implying many more levels in the prefix trees and in the balanced tree. However, here again, if sorting is not a requirement, it could make sense to hash the inputs using a fast non-colliding hash and index the hashes instead of the raw keys. Also, all operations made on cebtrees are significantly slower, again likely due to having to fetch adjacent keys, hence increasing the required memory bandwidth.

First conclusions

Prefix trees are much more sensitive to the input keys distribution, so it is not surprising to see ebtrees and cebtrees perform much less well when facing poorly distributed keys like client IP addresses or user agents. IPv6 addresses could behave worse, even letting the client choose to create 64-deep sub-trees for their own network. However, it's not like lists, because the slowest path is quickly reached and is bounded by the key size.

When high performance on large keys is desired, better use rbtrees, or split the trees using a hash if ordering is not a requirement. In this case it can even make sense to index the hashed key instead of the key itself in order to regain a good distribution as is observed in the 64-bit hashes tests.

Compact trees were initially expected to be way slower, while it's mostly the delete operation which is slower, but overall, they perform quite well, often better than rbtrees on small keys. They suffer from large keys due to the descent algorithm that requires to XOR the two next branches in order to figure where the split bit is located, hence consuming more memory bandwidth. For cases where removals are rare and keys are reasonably small, they can be very interesting. Tests on haproxy's pattern reference subsystem (map/acl indexing) even showed a boot speedup compared to existing code that uses ebtrees, and a significant memory saving. In general, using them to look up config-level elements seems to make sense. For more dynamic keys (session cookies etc), it may depend on the expected average lookup hit ratio. And they seem to have an interesting role to play in hash tables where they could replace lists and avoid the linear performance degradation that occurs when the load factor increases. This could be a drop-in replacement using the same storage model as the lists.

Rbtrees remain good overall (which was expected, there's a reason why they've been so widely adopted), and their performance doesn't really depend on the nature of the keys. This is also why, while they are much better on large keys, they cannot benefit from certain properties such as the timers or even well-distributed small keys.

It really looks like running a non-colliding hash on inputs and indexing only that could be very interesting with all trees. One must just make sure not to meet collisions. A 128-bit hash could be sufficient (though it would double the cebtree size). Another option would be to stay to 64-bit with a keyed hash, and change the key when a collision is encountered. It would be so rare that it might make sense, but could possibly not be compatible with all use cases. And some bijective algorithms exist that will not cause collisions on inputs of their native size (avalanche, CRC etc), allowing a great input mixing without introducing collisions.

Other trees like AVL could be evaluated. Their insertion/removal cost is higher as they will try hard to stay well balanced, but these could be the fastest performers for pure lookups.

Also here we've only spoken about indexing unique keys, but sometimes duplicate keys are needed (particularly for timers). The mechanism created for cebtrees would be portable to other trees by stealing one bit in the pointers. It would add to their complexity, but could theoretically work for rbtrees, making them suitable for more use cases. The could also be adapted to ebtrees to further speed up insertion of duplicates, which exists a lot in timers.

It's also visible that all trees degrade very quickly when the data set doesn't fit into the L3 cache anymore, and cebtree even more due to the need to check adjacent nodes. And there is a double problem: not fitting means more cache misses at each level, but not fitting also means a large tree hence more levels. This explains why the performance systematically degrades very fast with the tree size: we increment both the number of levels and the risk of a cache miss at each level, making the degradation grow faster than the tree size. This indicates that trees alone are not necessarily relevant on super-large datasets (e.g. database indexing). Again that's where combining them with fast hashes can make sense.

2025-05-04

Performance comparison of EB trees and CEB trees

Background

This is a follow up for the previous article on CEB trees. It documents a performance test that was run on both libraries while measuring the time spent in the different operations in order to compare how some operations behave on one of the other. This is mostly a copy of the performance test report stored in the tree here.

Tests were run on a single core of an Intel Core i7-6700K running at 4.4 GHz equipped with dual-channel DDR4-2400, using the "bench" utility from the "tests/" directory, comparing the performance of the two libraries at the following commits:

ebtree stable-6.0 @ccf9d9f ("TESTS: bench: enable the unique-keys tests as well")
cebtree master @6b5cd28 ("OPTIM: descent: stop at the node when the path to the leaf is not used")

The principle of the test consists in creating 1 million nodes, and randomly picking one of them, if the node is not in the tree, then a random value is assigned (of the type of the key) and the node is inserted, otherwise if it's in the tree, it is either looked up or deleted. The distribution between lookups and deletes is affected by a config parameter indicating how many lookups are performed per delete, in order to simulate different use cases where a certain number of lookups are performed before a node is removed.

The program reports statistics about operation rates and tree state. In parallel, perf record is run at 4000 samples per second to measure the average ratio of time spent in each function. After processing, the results below indicate the average time spent in each function, in nanoseconds.

The raw results are stored in /results/bench-eb-ceb/lk*, where "lk* " represents the average number of lookups performed per node, and where we find the following files:

bench-ceb*.out: the output of "bench" on ceb trees
bench-eb*.out: the output of "bench" on eb trees
perf-ceb*.out: the filtered output of "perf report" on ceb trees
perf-eb*.out: the filtered output of "perf report" on eb trees
graph.gnup: the gnuplot script used to produce the graphs
meas.out: the whole test
meas-m.out: the tests involving multiple keys
meas-u.out: the tests involving unique keys
multi.png: graph of operations on trees supporting multiple identical keys
unique.png: graph of operations on trees only supporting unique keys

Tests

1 average lookup between insert() and delete()

The first test is run with one single lookup between insert and delete. This is convenient because it results in the following distribution of operations:

for trees storing unique keys: 30% inserts, 30% deletes, 40% lookups. The tree stabilizes at 60% fill rate.
for trees storing duplicate keys: 33.3% inserts, 33.3% deletes, 33.3% lookups. The tree stabilizes at 66.6% fill rate.

Thus it allows to roughly compare the individual cost of each operation under load.

unique keys	duplicates

What's seen on the graphs is that compact trees are much slower on deletion since that's one area where ebtrees shine by being O(1). The slow down on insertion and lookups is minimal (around 10% for integers), and for memory- based keys (blocks, strings), both insertion and lookups are faster on compact trees than on eb trees, probably thanks to the reduced memory bandwidth usage, and a bit also for the O(1) duplicate handling while ebtrees are in O(logN), which explains why the difference looks a bit more important with duplicates.

10 average lookups between insert() and delete()

The second test concerns keys that are looked up 10 times before being deleted. This results in more keys in the tree since less operations are deletion (916k keys for dups, 892k for unique).

unique keys	duplicates

The insert and delete operations are much rarer so their amortized average time is much lower. In addition the strong reduction on the number of writes seems to improve the overall walk time through the trees, probably just due to better data caching, since it's unlikely that branch prediction plays a role when descending a tree.

1000 average lookups between insert() and delete()

In the final test, 1000 lookups are performed between insertion and removal. The goal essentially is to illustrate how the trees behave on lookup-mostly workloads. The tree is much more full, at 999k entries in both tests. Insertions and deletions are pretty rare, roughly 1.5k per second each for 1.5M operations.

unique keys	duplicates

The 64-bit integer test still doesn't benefit from extra lookups, which is not surprising since the lookup was already more expensive from the start due to the necessity to consult both branches during the descent, hence delay the operation by one extra memory lookup. However the gap between eb and ceb continues to widen for memory storage like strings, where ceb now becomes 33% faster with dups, and 12% faster with unique keys (which is more likely to be used for multiple lookups).

Preliminary conclusions

At first glance this at least confirms initial suspicions that these much more compact trees will not be used for heavy changes due to the O(logN) deletion, so schedulers are out of the equation. However often looked up information such as config elements, GUID etc can make quite some sense since the storage size is the same as for a list but with O(logN) lookups. More generally, the benefit can appear when it comes to indexing any data that's never deleted (config elements, logs etc). What was initially considered as not very likely was that it would even be faster than ebtrees on strings or memory blocks. It's worth noting that this has been made possible thanks to the tagging of leaf pointers, as in the most generic design, keys have to be re-read at each layer to determine if it's a node or a leaf, which is more expensive than ebtree or the current cebtree which are a form of prefix trees (comparison at each layer restarts from the previous layer).

Future work

The performance gap on string lookups has widened since the last commit which tries to stop as soon as the looked up key is found on any node. It is interesting because it restarts a potential interest in minimizing the longest length by trying to optionally balance the tree for lookup-mostly use cases, and/or to reconsider the option to connect up to 4 branches per layer, that was once envisioned for ebtrees, in order to cut the average length in half, at the expense of storing 4 pointers in each node.

2025-03-09

On the design of Compact Elastic Binary Trees (cebtree)

Those who often hear me discuss about my week-end projects have been accustomed to hearing about deuterium fusion (that's for another post), laser engraving, and the compact version of the ebtrees, aka compact elastic binary trees, without knowing all the details. That's what we'll be discussing here.

A long history

Self-organizing trees

It started very early during the development of ebtrees, in which the same node appears twice, once as a node and once as a leaf. In 2007, when I managed to build self-organizing trees to handle duplicates, and noticed that all that was needed to insert, delete, and visit nodes, was to remember the node from which to restart or where to attach, and that it was never needed to go upwards in the tree for such operations, that made me think that maybe this type of structure would also work to arrange data without uplink and without keeping any relation between the node's depth and the key it conveys. Adding a new duplicate value would only require to measure the depth of the penultimate branch and allow to grow the next one until it becomes as long. If it was already as long, then another one just had to be added above it. For example below, we're adding 4 successive entries to a duplicate sub-tree that already contains 4 identical keys ordered by their insertion sequence. The number in the leaf indicates the arrival order for each of them:

XORing branches

18 months later, when optimizing the lookup algorithm by comparing the XOR between the looked up key and the node's key with the node's position, I figured that the XOR between the two branches was always necessarily lower given that these branches have more bits in common, and that maybe we could afford not to store the depth if the XOR between the key and node, and the XOR between the two next branches already allowed us to figure where we're located in the tree. An intuition was that, just like in the ebtrees, we can detect that the searched key is not there when the XOR of the two branches is no longer smaller than between the key and current node, or that we're reaching a leaf if the XOR is higher than the previous one.

In the example below, we successively insert the keys 6, 4, then 5. 6 being the first entry, it's the node-less entry (it only has a leaf). Then 4 is inserted, it has 4 and 5 below it, then 5 is inserted below 4, so it has 4 and 5 as its leaves:

In red is indicated the XOR between the branches. The node-less leaf is a special case that loops to itself, so its descendants are the same. 4 splits between 5 and 5, so the XOR of the branches here is 5^6=3. 5 splits 4 and 5 so its XOR is 4^5=1.

When descending along the branch to key 4, a XOR of 3 is first met at node 4, then a XOR of 1 is met at node 5, then when going down to node 4 (which is the same as the one on top), we're seeing a XOR of 3 again. 3 being higher than previous value 1 proves that we have looped, so 4 here is no longer a node but a leaf.

Similarly, when looking up a value, XORing the searched value with the node's value indicates if we have a chance to find it below or not. As soon as the XOR between a node's branches is higher than the XOR between the looked up value and the node's value, we're certain it cannot be there anymore.

All of this was fun and interesting, but turning intuitions into new week-end projects is never trivial nor necessarily desired when there are already plenty of other activities, so aside a few ideas from time to time when improving ebtrees, not much progress was done in this area for a while.

First limited implementation

Several years later, in 2014, while thinking about a super-compact memory allocator that I'd call "water drop alloc" ("wdalloc") to illustrate how releasing a drop of water in the middle of two other ones that touches them instantly becomes a larger one, I was thinking that a more compact variant of ebtrees would be particularly suited for that purpose if they would index the pointer itself as the key instead of a stored key. An example of the principle is seen below where freeing a block of length 4 at address 5 results in a new free block of 11 by fusing it with the two adjacent blocks, thanks to trivial adjacent address lookups:

Thus I implemented the "space efficient trees" by then for that purpose, proving that the concept of descending with only two pointers and nothing else is sufficient to build a complete tree:

That's it, only two pointers and nothing else, i.e. 16 bytes on 64-bit architectures. Compared to ebtrees which take 40 bytes (or even 48 with alignment) on such architectures, it's a nice saving!

Here the leaf keys would just be the nodes' addresses, so there was no additional storage, and two pointers were really the only required storage for each allocatable area, with nothing having to be stored for as long as the node is in use.

That application was well suited to the limited experience I had accumulated by then about these new trees:

no need to look up the pointer upon free(), the node is at a fixed place relative to the pointer passed to free(). An optional lookup would allow to detect double-free, though.
lookup_ge() and lookup_le() were sufficient to find adjacent entries, as well as to index free objects by their size.
there was no duplicate by definition since the key is the object's address
all operations require to know the root, which in this case was the origin of the allocation arena.
space saving: only two pointers per entry, thus limiting granularity to dual-word, which preserves the dual-word alignment guarantees generally assumed by C ABIs.

The first tests showed that the performance was significantly lower than with other allocators, and the idea of scaling that to threads and having to lock the tree achieved to put the final nail in the project's coffin.

Next attempts

However this confirmed that the basic algorithm worked. I then started to try to better organize the ebtree code to support the relative adressing variants that had been lying in their own branch for 3 years, to support multiple addressing modes, and try to merge both the standard and the compact models. That was tough work that spanned from 2014 to about 2017, while gaining in complexity over time and slowing progressively down. Worse, every time I had an idea about that topic, it required a full week-end to re-enter that code in the process of being reorganized, and the week-end was often over before I had the time to evaluate my idea.

In early 2020, we were locked down with week-ends becoming particularly boring, and I had to keep my mind busy, so it was a sane week-end activity to get back to that project and try to complete it. I tried to force myself to proceed in baby steps, in order to avoid the trap of getting stuck into the code being reorganized:

first, I actually had to re-learn how that whole thing worked, verify the properties that I can use, and demonstrate them. That kept me busy with pen and paper for a several week-ends in the garden during a fortunately sunny spring.
second, I had to verify if storing ordered duplicates was possible at all, given that my wdalloc experiment never made use of them. It took me a while to convince myself that the self-organizing tree was not suitable here since we were not storing the depth to distinguish between them:

It's visible above that the XOR between all branches is 0 in the duplicates, so there's no way to get out of them and find a leaf. In ebtrees actually what breaks the loop is the fact that we rely on the depth counter associated with each node: when finding a node whose depth is higher than the previous one, it indicates we have looped, thus that the last node was in fact a leaf.

For a long time I also considered storing a couple made of the node's pointer and the key to differentiate them, but I didn't like it because keys would be retrieved in their in-memory order, not in the insertion order, and that was not great for certain operations where it's mandatory to know which one will be used. In addition, that could have costed a lot to delete an element, requiring to visit many duplicates before finding the right one.
Finally I managed to design a list-based mechanism that worked on paper using tagged pointers. I had the intuition that it might possibly also work without tagged pointers, but at least I knew I was having a solution which allowed to delete any duplicate in O(1) after locating it in O(logN). There were many special cases and that was significantly complicating the design.

I left that quest for a while. I had proven my design, and the lockdown period was not something that could be associated with desires of creativity :-/ Adding to that the fact that diving into this on week-ends is often non-productive as it takes a whole week-end to figure again how all of this works, it became discouraging after a few attempts.

I forced myself to work on it again 9 months later during the new year's vacation, continuing to integrate it into the reorganized branch of ebtree. And that was a big mistake: the state of that branch made it too hard to make any progress efficiently, to the point that I'm now wondering if there's anything to reuse in the branch collecting that long work.

Restart from scratch with first success

Finally I decided on a summer vacation of 2023 that the compact tree code needed to be forked into its own project, and restarted from scratch there, intentionally leaving the complexity of dealing with duplicates aside for a first implementation, ignoring some of the initially planned optimizations, trying to make it work without tagged pointers so as to keep them for later if needed (e.g. for duplicates), and all done with a single function to deal with all types and operations in order not to reproduce the maintenance difficulties met with ebtrees. Another element taken into consideration is the extension of the API to support a different key offset, permitting in some cases to avoid repeating a key in certain structures when it's already present but not immediately after the tree node:

After all this, v0.1 was finally released mid-september 2024 with support for all types but no duplicates, as planned for the first version. I could find a good use for this in haproxy 3.1: variables generally have a short name with a small value, and couldn't afford to use a huge ebtree indexing node, so they were still in a linked list. We already knew that variables were gaining in popularity and that their cost would increase linearly with their number. Thus it was a perfect opportunity to migrate their indexing to cebtree at no extra cost. This change revealed that a config with 100 variables improved its performance by 37% with no size increase. Great!

Need for duplicates again

Studying opportunities for other areas of haproxy like maps, acls, timers etc revealed that it was not that trivial, because all of these need ordered duplicates. So that was a good opportunity to try to attack that problem again.

The previous work done during lockdown consisted in inserting lists between a node and a leaf. The idea was that in all cases we're dealing with pointers to nodes, and that duplicates can be arranged as a list without losing information nor needing extra storage, if we only use list elements as a spacer:

The principle is not very complex. When adding a duplicate key, we're inserting it as a list element between the previous leaf and the node that pointed to it. The inserted node has its left pointer pointing to the leaf and the right one pointing to itself. When adding new duplicates, the new duplicate is inserted exactly the same way, and the previous list element points to from its right pointer it. The list can thus be extended infinitely. What is nice is that when looking up the value, it is detected that there are multiple keys because regardless of the number of list elements, the left and the right pointers point to a node having the same value (possibly the same node itself). When such a duplicate entry is detected, it is always the last inserted one (e.g. the blue one on the diagram above), then the first leaf is always reachable via last->right->left. Then in order to visit the next duplicate, either the current node is this leaf, in which case the next one is last->right, or it isn't, then it is necessarily one of the duplicates, and the next node is the always node->right until the last one is reached (the one that was first found). This construct is super interesting because it provides the following properties:

appending a new entry always places it at the end
visiting entries always start from the first inserted ones
it is always possible from a node to instantly find its neighbors in order to delete it, thus making it relatively easy to "pick" elements (i.e. find and detach the next one) in insertion order.
list entries behave like nodes in that they always point to adjacent values that fall within as large a prefix length as was previously encountered (since they are equal).

Finally v0.2 was issued with this support for duplicates mid february 2025. It's now considered feature-complete!

Working on performance

Performance comparisons show that, as expected, this cebtree is less performant than ebtree due to having to compare two nodes per level during the descent, only one of which is to be useful, thus effectively doubling the number of memory reads compared to ebtree.

In addition, large strings are severely affected because just like in non-radix trees such as rbtrees, they have to be compared from the beginning at each level in order to detect a leaf, so the average lookup cost is L*log2(N) where L is the average string length and N the number of strings. Since we didn't make use of tagged pointers yet, it was time to try to introduce them to designate leaves, allowing to restart the comparison from the previous split point so that strings are now entirely read only once during the lookup.

The approach that seemed to work on paper during lockdown proved its value without affecting duplicate lists, in that it is super simple: only pointers to a leaf are tagged. This means that pointers to lists are not tagged, and that a tagged pointer inside a list of duplicates necessarily indicates the first leaf. Also, insertion/removal operations remain extremely simple as the pointer is preserved. The total number of changes in the code to support tagged pointers remained very low, most of them were related to the addition of wrappers to avoid directly dereferencing tagged pointers. For this, pointer types have been changed in the code so as to always know what points to a node and what is a tagged node pointer. This requires to change the code where a cebtree was created by a "struct ceb_node*", to change it to "struct ceb_root*", but that's all.

An update of the "6,4,5,5,5" tree above with tagged pointers for leaves now gives the following (tagged pointers are on red links):

Now it's only slightly slower than ebtree on large strings (still due to twice the keys being read during the descent), but sometimes even slightly faster on small strings, probably due to the lower memory bandwidth needed thanks to the much smaller nodes. This is currently found in the master branch of the project.

Current state

The current state looks functional. The code contains some functional and performance tests. The usage remains very close to ebtree except that the tree root has to be passed to all operations (in ebtree, delete/next/prev etc didn't require the root since it was possible to navigate upwards).

The choice between duplicates or unique keys is specified in the function name now, we don't tag the root to indicate how it will be used anymore; it was indeed found after using ebtrees like this for a while that it makes the code not particularly obvious to understand, as one needs to locate the initialization of the tree to know what to expect after an insert() call. Now at least the intent will be explicit in the code.

This also simplifies the root initialization as the root node now only needs to be set to a NULL pointer for an empty tree, then some insert() calls can be performed to insert elements. Example below (warning! no error checking):

struct ceb_root *root;

struct ceb_node *node;

struct entry {

struct ceb_node node;

char *string;

} *entry;

/* index all arguments */

while(arg < argc) {

entry = calloc(1, sizeof(*entry));

entry->string = strdup(argv[arg]);

cebis_insert(&root, &entry->node);

arg++;

}

/* list all arguments, sorted */

node = cebis_first(&root);

while (node) {

printf("@%p = %s\n", node, container_of(node, struct entry, node)->string);

node = cebis_next(&root, node);

}

Just use cebuis_* above to enforce unique keys.

Cost comparison with ebtree

All operations are in O(logN) while ebtree has several operations in O(1) (next/prev/delete). In addition, for certain operations such as next/prev/delete, cebtree requires to record the restart pointer during a descent, reach the desired leaf and complete the operation from the recorded pointer. Just like with a next/prev, this extra cost is amortized to one extra level, but it can reach up to logN for 1/N nodes and could be averaged to logN/N.

In addition, the extra cost compared to ebtree comes from the fact that cebtree, just like other non-radix trees (rbtree etc), needs to visit two branches to figure which one to descend. ebtree avoids this by storing the depth value inside the node. These extra reads may consume more memory bandwidth for large keys, but given that the node itself is 2.5-3 times smaller, the extra cost for small keys can still be absorbed by the lower cost of retrieving the node from memory. However, this extra cost could have an impact on TLB misses when using indirect accesses (cebis_* etc).

Tests performed on haproxy on a laptop with the map/acl pattern reference showed:

a tiny performance increase when loading a map made of 17 million IPv4 addresses and masks (22.0 seconds down to 21.5 seconds); these are short 17-character strings. The resulting memory saving was 400 MB.
a moderate increase when loading a map made of 1 million user agents (much larger strings). The time went up from 2.475 seconds to 2.711 seconds. The memory saving was 24 MB.

Thus it can make sense to use this for stuff that's mostly related to configuration or that doesn't change often and is mostly looked up (i.e. config object names).

Update (2025-05-04): performance measurements were done here.

What's left to be done

First, the documentation in the code is still incomplete. Now that the API got clearer, the doc should be should be finished. In the worst case some types will get renamed, which is no big deal. The doc must be done before 0.3 is released anyway.

Some things are still in the state of reflection:

shouldn't we use a different root type for duplicates vs unique trees ? It could avoid mistakes caused by copy-paste. We could very well do that in exposed function prototypes only so as not to carry these down the code.
some operations that exist in ebtree were not implemented. One of them is the longest match, typically used to look up IP addresses within ranges. It's still unknown at this point if this can be implemented with the current model (maybe it would require to slightly change the key representation, in ebtree the prefix length is passed and stored separately).
some long but still incomplete studies indicate that the code could almost support properly ordered atomic operations. The only real limitation is deletion that is not compatible with parallel insertion or lookup. One first approach could be to use upgardable rwlocks like progressive locks and take a shared lock for all operations, and upgrade it to an exclusive lock only when committing the removal. This would result in a very short locking time, though all visitors would still need to write-access a shared memory area to grab the read lock, and a moderate deletion rate could suffice to purge visitors too often. Another solution might be to consider alternate approaches such as RCU.
implementation of relative pointers was started but put to a pause for now. It's convenient to index files in-place (e.g. logs etc), but I didn't have immediate use for this and it further complicates the development. It will have to be done, though.

Future work

Simplifying the trees to the extreme of requiring only two pointers, just like a list, opens some perspectives:

maybe it would be useful to support to optionally implement the parent pointers in the node so as to decide between performance and space.
the work initially done on ebtrees to try to factor the code went in the wrong direction by making it too difficult to use and maintain. On the opposite, the work done on cebtree preserved maintainability despite supporting multiple types (u32, u64, long, addr, blocks, strings, indirect blocks, indirect strings). It looks like ebtree would benefit from rolling back to before the change and adopting a cebtree-like approach.
the duplicate lists of cebtree look way better than the duplicate trees of ebtree, and maybe we could significantly improve ebtree's performance in schedulers by adopting the same duplicates (duplicate values are frequent in scheduler due to the timer's resolution).
maybe ebtree could be built as an extended cebtree in the end, then it could be nice to re-merge them into a single project.
studying the interest of self-locking nodes following the same principle as mt_lists could also be interesting, though the visit cost might become high.

Links

The code in its current state is available here:

cebtree : version 0.1 is the most basic (no duplicates). Version 0.2 supports duplicates without using tagged pointers. It can be seen as a reference about what can be done when tagged pointers are not desired. The current version supports everything including tagged pointers and is still in development.
ebtree : version 6 is the production one. Version 7 is the one that tries to factor all the code and that is not considered production ready.
plock: the upgradable R/W progressive locks will probably be the next step to support parallel insertions and lookups with exclusivity only on the delete operation.
mt_list: the self-locked lists have concepts that might be reusable here, probably even lighter since nodes are accessible in a single direction at a time. This deserves to be studied.

2024-12-08

Adding a cheap and simple RTC to Rockchip devices

Background

There are plenty of nice devices these days designed around ARMv8.2 SoCs such as RK3568 or variations around RK3588. Many of them have been using the HYM8563 I2C RTC chip for about a decade. This device is reasonably cheap, requires few components, consumes very little power and is long proven to work well. Despite its low price, entry-level devices are often lacking it and only have the pads on the board, which is understandable when every dollar counts. Some such devices include Radxa's E20C and E52C mini routers/versatile servers, which are absolutely awesome devices, which come with either dual-1Gbps or dual-2.5Gbps, feature a USB console so as to always provide a local access, and have a metal enclosure. But they're lacking the RTC, which is quite annoying for a firewall or mini-server, as a power outage always has a painful effect on the dependency chain at home (typically the OS boots faster than the ISP's box and has to start with a bad date). At least once that issue was reported, the products' creator, Tom Cubie (aka @hipboi) acknowledged the problem and suggested that new devices should have it.

Can I do it myself ?

How to proceed with existing devices in the short term ? Isn't it possible to just order the chip and solder it on board ?

One problem with HYM8563 is that it's almost always only found in TSSOP8 format, which means it's a few millimeters wide, with a pitch of 0.65mm, which means that pins are roughly 0.35mm wide with a spacing of 0.3mm between. Actually that's not that difficult to deal with, provided that you have a fine enough soldering iron. The real problem is the components that come around are also of the same scale and very close to each other, as can be seen on the photo of the E52C below (click on the photo to zoom):

The resistors and capacitors are in 0201 format, which is 0.6mm tall by 0.3mm wide, and are very difficult to solder without causing a short circuit. To get a sense of the scale above, the chip is 3mm*3mm. The crystal oscillator is a 32.768 kHz in a flat format that's not easy to find for a hobbyist.

However, the I2C pads (SDA and SCL) are "large enough" and moderately accessible, and there's power on the other side, let's keep that in mind.

Finding a pre-made board

There are various I2C RTC boards available on the net. Most of them are DS1307, and a very nice small one is based on DS3231 with a battery like this one. However, there's no HYM8563 one, and I'd prefer to stay on the same that is referenced in the DTS so that it works out of the box. But I could find the chip alone.

Making a new board instead

Then I started to think whether I could make a board myself based on the chip. Looking at the HYM8563 datasheet above reveals that it's actually not that hard to assemble the few components around. I drew a schematic on by hand (takes one minute vs one hour in Eagle):

I found a few rare occurrences of the chip in SOP8 package, which is easy to deal with, so I could make a PCB with that chip and the 6 components around it. I ordered a pack of 10pcs of that chip, and unfortunately received the tiny TSSOP8 ones instead :-(

But that reminded me that I had some small TSSOP8/SOP8/DIL PCB adapters, which support SOP8 on one side and TSSOP8 on the other side, with the DIL pads in the holes. These are convenient for on-air cabling since pads of both sides are connected together and to the holes:

After all, there were so few components that I could probably solder them directly on that PCB on either side. So let's try.

Bill of materials

Based on the schematic, I'll need this:

1 such PCB
1 HYM8563TS chip
1 diode in 0402 format
2 4.7k resistors in 0402 format
1 82pF capacitor
1 crystal oscillator
1 "large enough" capacitor (I counted about 6mn per 100µF), supporting at least 3.3V

The secret to avoid shorts when soldering TSSOP chips is to use solder paste. I have some in a syringe that I keep cool in the fridge (otherwise it dries in a few months and is unusable the day you need it):

Assembly steps

As a first step, we'll need to cut a trace on the PCB so as to isolate the VCC input pin from its connection to the IC's pin 8, as we'll want to place the diode there instead. We'll need to keep the resistors connected to the external VCC so that the capacitors doesn't discharge its energy into them when off. Thus we'll cut the trace after the via that connects to the other side. What's nice with these boards is that the pads of both sides correspond to the ones of the other side at the same location, so they're easy to match. The final pinout of the board will look like this:

So let's cut the trace:

Now we're going to place the chip before the diode, because it's already painful enough to solder such small chips, we don't want to be hindered by the diode. The approach for this is to place a very little drop of solder paste over all the pads on each side of the chip. Think in terms of volume, considering that the solder paste is mostly flux that will disappear (the flux will avoid shorts by making it difficult for the solder to make bridges between pins). Count that the resulting volume will be roughly 1/3 of the disposed one. Making a small trail roughly as wide as a pad is a good estimate. Make very sure not to leave any beneath the chip, as it will never melt and will stay them forever, risking to make shorts later. Only cover the pads and areas you can clean later. Once done, just place the chip over the paste:

Now solder everything with the soldering iron, without worrying about the risk of shorts, just focus on aligning the chip as best as possible with the pads, and melt absolutely all the paste. Then check with a magnifier or better, a microscope that everything's OK. With a multi-meter you need to check there's no short between adjacent pins:

Now's time to scratch the right side of the track and to place the diode, with the positive on the right and the negative on the left:

Let's now flip the board to solder the resistors. They will attach to the two bottom left pins, and connect to the beneath pad corresponding to the Vdd pin, which is in fact connected to the positive pin at the bottom right. It has the pleasant advantage of being placed immediately next to the two pins, and close enough to have the resistors directly touching on both ends:

Now let's connect the capacitor between the chip's Vss pin and the top rightmost hole that's connected to the chip's pin 1:

On the other side, the oscillator can be soldered. The pins on the left (when reading the reference) or under the notch are the useful ones. The two other ones on the right are not connected so I could connect them to the pin1 pad, though that will depend on the model since some have integrated capacitors and might not work well when doing so, but I could verify with my multi-meter that there was no capacitor there. In order to solder these pins beneath the component, solder paste is your friend again, being cautious again not to put too much:

We're only left with the capacitor that serves as an energy storage during outages. Ideally you'd use 5.5V super capacitors of 0.1F for several days, or the new 3.8V lithium-based supercaps that have even higher capacities (typically 10F and above). But given that my goal here really to just cover occasional power cuts from the mains, when a power supply dies or when I need to move cables inside my rack, I don't need more than a few minutes. And I did happen to have a 4V/150µF capacitor that perfectly matched my needs (should support 7 to 10mn of outage, and was super flat).

Soldering it is just like for the oscillator above except that it's not easy to use solder paste. The positive terminal (the one with the colored bar) needs to be connected to pin 8 of the chip (the top leftmost here) and the negative to the topmost right hole. Soldering that one is not too hard, just melt some solder inside the hole. Alternately, using thin wires is OK as well.

Now's time to connect wires and test:

I connected these to another board that I'm using for testing I2C, and ran i2cdetect:

Good, the device appears at address 0x51, so at least it's detected. Now we need to verify that its oscillator is properly ticking. For this we'll have to write the current time to the chip (which appears as rtc1 on this board). It will emit an "invalid argument" because it first tries to read the time which contains a bit "VL" ("Voltage Low") set in the second field indicating the the device lost power since last time it was set. But we can ignore it when writing the date, reading it next will indicate whether it works or not (the time must change):

Perfect, it works! Now's time to replace my testing wires with thin ones, to put all of that into shrink tube to protect it and solder it in the device.

Installation

Let's spot the 4 pads we'll need inside the E52C. First, at the bottom of the board, we'll find the I2C SDA and SCL pins at the bottom on these photos, where we'll solder our SDA/SCL wires (violet and green here). They'll have to pass between the board and the enclosure so they must be very thin, but where I'm passing them, there's enough room:

Next step is to install the module on the other side of the board. We're going to glue it on top of the micro SD card reader with some double-sided tape. The capacitor close to it has both positive (3.3V) and negative and is large enough to support soldering directly to it:

Conclusion

It was fun to make but took me most of the day to build; soldering small components requires delicate manipulations, and dealing with flux and solder paste requires lots of cleaning along the operations. I've made two modules so I still have one extra left. This will allow me to migrate another of my machines to a new one, but I'm impatient to see them produced with the chip already soldered so that I don't have to do this anymore!