Willy Tarreau's stuff: ebtree

Showing posts with label ebtree. Show all posts

2025-05-04

Performance comparison of EB trees and CEB trees

Background

This is a follow up for the previous article on CEB trees. It documents a performance test that was run on both libraries while measuring the time spent in the different operations in order to compare how some operations behave on one of the other. This is mostly a copy of the performance test report stored in the tree here.

Tests were run on a single core of an Intel Core i7-6700K running at 4.4 GHz equipped with dual-channel DDR4-2400, using the "bench" utility from the "tests/" directory, comparing the performance of the two libraries at the following commits:

ebtree stable-6.0 @ccf9d9f ("TESTS: bench: enable the unique-keys tests as well")
cebtree master @6b5cd28 ("OPTIM: descent: stop at the node when the path to the leaf is not used")

The principle of the test consists in creating 1 million nodes, and randomly picking one of them, if the node is not in the tree, then a random value is assigned (of the type of the key) and the node is inserted, otherwise if it's in the tree, it is either looked up or deleted. The distribution between lookups and deletes is affected by a config parameter indicating how many lookups are performed per delete, in order to simulate different use cases where a certain number of lookups are performed before a node is removed.

The program reports statistics about operation rates and tree state. In parallel, perf record is run at 4000 samples per second to measure the average ratio of time spent in each function. After processing, the results below indicate the average time spent in each function, in nanoseconds.

The raw results are stored in /results/bench-eb-ceb/lk*, where "lk* " represents the average number of lookups performed per node, and where we find the following files:

bench-ceb*.out: the output of "bench" on ceb trees
bench-eb*.out: the output of "bench" on eb trees
perf-ceb*.out: the filtered output of "perf report" on ceb trees
perf-eb*.out: the filtered output of "perf report" on eb trees
graph.gnup: the gnuplot script used to produce the graphs
meas.out: the whole test
meas-m.out: the tests involving multiple keys
meas-u.out: the tests involving unique keys
multi.png: graph of operations on trees supporting multiple identical keys
unique.png: graph of operations on trees only supporting unique keys

Tests

1 average lookup between insert() and delete()

The first test is run with one single lookup between insert and delete. This is convenient because it results in the following distribution of operations:

for trees storing unique keys: 30% inserts, 30% deletes, 40% lookups. The tree stabilizes at 60% fill rate.
for trees storing duplicate keys: 33.3% inserts, 33.3% deletes, 33.3% lookups. The tree stabilizes at 66.6% fill rate.

Thus it allows to roughly compare the individual cost of each operation under load.

unique keys	duplicates

What's seen on the graphs is that compact trees are much slower on deletion since that's one area where ebtrees shine by being O(1). The slow down on insertion and lookups is minimal (around 10% for integers), and for memory- based keys (blocks, strings), both insertion and lookups are faster on compact trees than on eb trees, probably thanks to the reduced memory bandwidth usage, and a bit also for the O(1) duplicate handling while ebtrees are in O(logN), which explains why the difference looks a bit more important with duplicates.

10 average lookups between insert() and delete()

The second test concerns keys that are looked up 10 times before being deleted. This results in more keys in the tree since less operations are deletion (916k keys for dups, 892k for unique).

unique keys	duplicates

The insert and delete operations are much rarer so their amortized average time is much lower. In addition the strong reduction on the number of writes seems to improve the overall walk time through the trees, probably just due to better data caching, since it's unlikely that branch prediction plays a role when descending a tree.

1000 average lookups between insert() and delete()

In the final test, 1000 lookups are performed between insertion and removal. The goal essentially is to illustrate how the trees behave on lookup-mostly workloads. The tree is much more full, at 999k entries in both tests. Insertions and deletions are pretty rare, roughly 1.5k per second each for 1.5M operations.

unique keys	duplicates

The 64-bit integer test still doesn't benefit from extra lookups, which is not surprising since the lookup was already more expensive from the start due to the necessity to consult both branches during the descent, hence delay the operation by one extra memory lookup. However the gap between eb and ceb continues to widen for memory storage like strings, where ceb now becomes 33% faster with dups, and 12% faster with unique keys (which is more likely to be used for multiple lookups).

Preliminary conclusions

At first glance this at least confirms initial suspicions that these much more compact trees will not be used for heavy changes due to the O(logN) deletion, so schedulers are out of the equation. However often looked up information such as config elements, GUID etc can make quite some sense since the storage size is the same as for a list but with O(logN) lookups. More generally, the benefit can appear when it comes to indexing any data that's never deleted (config elements, logs etc). What was initially considered as not very likely was that it would even be faster than ebtrees on strings or memory blocks. It's worth noting that this has been made possible thanks to the tagging of leaf pointers, as in the most generic design, keys have to be re-read at each layer to determine if it's a node or a leaf, which is more expensive than ebtree or the current cebtree which are a form of prefix trees (comparison at each layer restarts from the previous layer).

Future work

The performance gap on string lookups has widened since the last commit which tries to stop as soon as the looked up key is found on any node. It is interesting because it restarts a potential interest in minimizing the longest length by trying to optionally balance the tree for lookup-mostly use cases, and/or to reconsider the option to connect up to 4 branches per layer, that was once envisioned for ebtrees, in order to cut the average length in half, at the expense of storing 4 pointers in each node.

2025-03-09

On the design of Compact Elastic Binary Trees (cebtree)

Those who often hear me discuss about my week-end projects have been accustomed to hearing about deuterium fusion (that's for another post), laser engraving, and the compact version of the ebtrees, aka compact elastic binary trees, without knowing all the details. That's what we'll be discussing here.

A long history

Self-organizing trees

It started very early during the development of ebtrees, in which the same node appears twice, once as a node and once as a leaf. In 2007, when I managed to build self-organizing trees to handle duplicates, and noticed that all that was needed to insert, delete, and visit nodes, was to remember the node from which to restart or where to attach, and that it was never needed to go upwards in the tree for such operations, that made me think that maybe this type of structure would also work to arrange data without uplink and without keeping any relation between the node's depth and the key it conveys. Adding a new duplicate value would only require to measure the depth of the penultimate branch and allow to grow the next one until it becomes as long. If it was already as long, then another one just had to be added above it. For example below, we're adding 4 successive entries to a duplicate sub-tree that already contains 4 identical keys ordered by their insertion sequence. The number in the leaf indicates the arrival order for each of them:

XORing branches

18 months later, when optimizing the lookup algorithm by comparing the XOR between the looked up key and the node's key with the node's position, I figured that the XOR between the two branches was always necessarily lower given that these branches have more bits in common, and that maybe we could afford not to store the depth if the XOR between the key and node, and the XOR between the two next branches already allowed us to figure where we're located in the tree. An intuition was that, just like in the ebtrees, we can detect that the searched key is not there when the XOR of the two branches is no longer smaller than between the key and current node, or that we're reaching a leaf if the XOR is higher than the previous one.

In the example below, we successively insert the keys 6, 4, then 5. 6 being the first entry, it's the node-less entry (it only has a leaf). Then 4 is inserted, it has 4 and 5 below it, then 5 is inserted below 4, so it has 4 and 5 as its leaves:

In red is indicated the XOR between the branches. The node-less leaf is a special case that loops to itself, so its descendants are the same. 4 splits between 5 and 5, so the XOR of the branches here is 5^6=3. 5 splits 4 and 5 so its XOR is 4^5=1.

When descending along the branch to key 4, a XOR of 3 is first met at node 4, then a XOR of 1 is met at node 5, then when going down to node 4 (which is the same as the one on top), we're seeing a XOR of 3 again. 3 being higher than previous value 1 proves that we have looped, so 4 here is no longer a node but a leaf.

Similarly, when looking up a value, XORing the searched value with the node's value indicates if we have a chance to find it below or not. As soon as the XOR between a node's branches is higher than the XOR between the looked up value and the node's value, we're certain it cannot be there anymore.

All of this was fun and interesting, but turning intuitions into new week-end projects is never trivial nor necessarily desired when there are already plenty of other activities, so aside a few ideas from time to time when improving ebtrees, not much progress was done in this area for a while.

First limited implementation

Several years later, in 2014, while thinking about a super-compact memory allocator that I'd call "water drop alloc" ("wdalloc") to illustrate how releasing a drop of water in the middle of two other ones that touches them instantly becomes a larger one, I was thinking that a more compact variant of ebtrees would be particularly suited for that purpose if they would index the pointer itself as the key instead of a stored key. An example of the principle is seen below where freeing a block of length 4 at address 5 results in a new free block of 11 by fusing it with the two adjacent blocks, thanks to trivial adjacent address lookups:

Thus I implemented the "space efficient trees" by then for that purpose, proving that the concept of descending with only two pointers and nothing else is sufficient to build a complete tree:

That's it, only two pointers and nothing else, i.e. 16 bytes on 64-bit architectures. Compared to ebtrees which take 40 bytes (or even 48 with alignment) on such architectures, it's a nice saving!

Here the leaf keys would just be the nodes' addresses, so there was no additional storage, and two pointers were really the only required storage for each allocatable area, with nothing having to be stored for as long as the node is in use.

That application was well suited to the limited experience I had accumulated by then about these new trees:

no need to look up the pointer upon free(), the node is at a fixed place relative to the pointer passed to free(). An optional lookup would allow to detect double-free, though.
lookup_ge() and lookup_le() were sufficient to find adjacent entries, as well as to index free objects by their size.
there was no duplicate by definition since the key is the object's address
all operations require to know the root, which in this case was the origin of the allocation arena.
space saving: only two pointers per entry, thus limiting granularity to dual-word, which preserves the dual-word alignment guarantees generally assumed by C ABIs.

The first tests showed that the performance was significantly lower than with other allocators, and the idea of scaling that to threads and having to lock the tree achieved to put the final nail in the project's coffin.

Next attempts

However this confirmed that the basic algorithm worked. I then started to try to better organize the ebtree code to support the relative adressing variants that had been lying in their own branch for 3 years, to support multiple addressing modes, and try to merge both the standard and the compact models. That was tough work that spanned from 2014 to about 2017, while gaining in complexity over time and slowing progressively down. Worse, every time I had an idea about that topic, it required a full week-end to re-enter that code in the process of being reorganized, and the week-end was often over before I had the time to evaluate my idea.

In early 2020, we were locked down with week-ends becoming particularly boring, and I had to keep my mind busy, so it was a sane week-end activity to get back to that project and try to complete it. I tried to force myself to proceed in baby steps, in order to avoid the trap of getting stuck into the code being reorganized:

first, I actually had to re-learn how that whole thing worked, verify the properties that I can use, and demonstrate them. That kept me busy with pen and paper for a several week-ends in the garden during a fortunately sunny spring.
second, I had to verify if storing ordered duplicates was possible at all, given that my wdalloc experiment never made use of them. It took me a while to convince myself that the self-organizing tree was not suitable here since we were not storing the depth to distinguish between them:

It's visible above that the XOR between all branches is 0 in the duplicates, so there's no way to get out of them and find a leaf. In ebtrees actually what breaks the loop is the fact that we rely on the depth counter associated with each node: when finding a node whose depth is higher than the previous one, it indicates we have looped, thus that the last node was in fact a leaf.

For a long time I also considered storing a couple made of the node's pointer and the key to differentiate them, but I didn't like it because keys would be retrieved in their in-memory order, not in the insertion order, and that was not great for certain operations where it's mandatory to know which one will be used. In addition, that could have costed a lot to delete an element, requiring to visit many duplicates before finding the right one.
Finally I managed to design a list-based mechanism that worked on paper using tagged pointers. I had the intuition that it might possibly also work without tagged pointers, but at least I knew I was having a solution which allowed to delete any duplicate in O(1) after locating it in O(logN). There were many special cases and that was significantly complicating the design.

I left that quest for a while. I had proven my design, and the lockdown period was not something that could be associated with desires of creativity :-/ Adding to that the fact that diving into this on week-ends is often non-productive as it takes a whole week-end to figure again how all of this works, it became discouraging after a few attempts.

I forced myself to work on it again 9 months later during the new year's vacation, continuing to integrate it into the reorganized branch of ebtree. And that was a big mistake: the state of that branch made it too hard to make any progress efficiently, to the point that I'm now wondering if there's anything to reuse in the branch collecting that long work.

Restart from scratch with first success

Finally I decided on a summer vacation of 2023 that the compact tree code needed to be forked into its own project, and restarted from scratch there, intentionally leaving the complexity of dealing with duplicates aside for a first implementation, ignoring some of the initially planned optimizations, trying to make it work without tagged pointers so as to keep them for later if needed (e.g. for duplicates), and all done with a single function to deal with all types and operations in order not to reproduce the maintenance difficulties met with ebtrees. Another element taken into consideration is the extension of the API to support a different key offset, permitting in some cases to avoid repeating a key in certain structures when it's already present but not immediately after the tree node:

After all this, v0.1 was finally released mid-september 2024 with support for all types but no duplicates, as planned for the first version. I could find a good use for this in haproxy 3.1: variables generally have a short name with a small value, and couldn't afford to use a huge ebtree indexing node, so they were still in a linked list. We already knew that variables were gaining in popularity and that their cost would increase linearly with their number. Thus it was a perfect opportunity to migrate their indexing to cebtree at no extra cost. This change revealed that a config with 100 variables improved its performance by 37% with no size increase. Great!

Need for duplicates again

Studying opportunities for other areas of haproxy like maps, acls, timers etc revealed that it was not that trivial, because all of these need ordered duplicates. So that was a good opportunity to try to attack that problem again.

The previous work done during lockdown consisted in inserting lists between a node and a leaf. The idea was that in all cases we're dealing with pointers to nodes, and that duplicates can be arranged as a list without losing information nor needing extra storage, if we only use list elements as a spacer:

The principle is not very complex. When adding a duplicate key, we're inserting it as a list element between the previous leaf and the node that pointed to it. The inserted node has its left pointer pointing to the leaf and the right one pointing to itself. When adding new duplicates, the new duplicate is inserted exactly the same way, and the previous list element points to from its right pointer it. The list can thus be extended infinitely. What is nice is that when looking up the value, it is detected that there are multiple keys because regardless of the number of list elements, the left and the right pointers point to a node having the same value (possibly the same node itself). When such a duplicate entry is detected, it is always the last inserted one (e.g. the blue one on the diagram above), then the first leaf is always reachable via last->right->left. Then in order to visit the next duplicate, either the current node is this leaf, in which case the next one is last->right, or it isn't, then it is necessarily one of the duplicates, and the next node is the always node->right until the last one is reached (the one that was first found). This construct is super interesting because it provides the following properties:

appending a new entry always places it at the end
visiting entries always start from the first inserted ones
it is always possible from a node to instantly find its neighbors in order to delete it, thus making it relatively easy to "pick" elements (i.e. find and detach the next one) in insertion order.
list entries behave like nodes in that they always point to adjacent values that fall within as large a prefix length as was previously encountered (since they are equal).

Finally v0.2 was issued with this support for duplicates mid february 2025. It's now considered feature-complete!

Working on performance

Performance comparisons show that, as expected, this cebtree is less performant than ebtree due to having to compare two nodes per level during the descent, only one of which is to be useful, thus effectively doubling the number of memory reads compared to ebtree.

In addition, large strings are severely affected because just like in non-radix trees such as rbtrees, they have to be compared from the beginning at each level in order to detect a leaf, so the average lookup cost is L*log2(N) where L is the average string length and N the number of strings. Since we didn't make use of tagged pointers yet, it was time to try to introduce them to designate leaves, allowing to restart the comparison from the previous split point so that strings are now entirely read only once during the lookup.

The approach that seemed to work on paper during lockdown proved its value without affecting duplicate lists, in that it is super simple: only pointers to a leaf are tagged. This means that pointers to lists are not tagged, and that a tagged pointer inside a list of duplicates necessarily indicates the first leaf. Also, insertion/removal operations remain extremely simple as the pointer is preserved. The total number of changes in the code to support tagged pointers remained very low, most of them were related to the addition of wrappers to avoid directly dereferencing tagged pointers. For this, pointer types have been changed in the code so as to always know what points to a node and what is a tagged node pointer. This requires to change the code where a cebtree was created by a "struct ceb_node*", to change it to "struct ceb_root*", but that's all.

An update of the "6,4,5,5,5" tree above with tagged pointers for leaves now gives the following (tagged pointers are on red links):

Now it's only slightly slower than ebtree on large strings (still due to twice the keys being read during the descent), but sometimes even slightly faster on small strings, probably due to the lower memory bandwidth needed thanks to the much smaller nodes. This is currently found in the master branch of the project.

Current state

The current state looks functional. The code contains some functional and performance tests. The usage remains very close to ebtree except that the tree root has to be passed to all operations (in ebtree, delete/next/prev etc didn't require the root since it was possible to navigate upwards).

The choice between duplicates or unique keys is specified in the function name now, we don't tag the root to indicate how it will be used anymore; it was indeed found after using ebtrees like this for a while that it makes the code not particularly obvious to understand, as one needs to locate the initialization of the tree to know what to expect after an insert() call. Now at least the intent will be explicit in the code.

This also simplifies the root initialization as the root node now only needs to be set to a NULL pointer for an empty tree, then some insert() calls can be performed to insert elements. Example below (warning! no error checking):

struct ceb_root *root;

struct ceb_node *node;

struct entry {

struct ceb_node node;

char *string;

} *entry;

/* index all arguments */

while(arg < argc) {

entry = calloc(1, sizeof(*entry));

entry->string = strdup(argv[arg]);

cebis_insert(&root, &entry->node);

arg++;

}

/* list all arguments, sorted */

node = cebis_first(&root);

while (node) {

printf("@%p = %s\n", node, container_of(node, struct entry, node)->string);

node = cebis_next(&root, node);

}

Just use cebuis_* above to enforce unique keys.

Cost comparison with ebtree

All operations are in O(logN) while ebtree has several operations in O(1) (next/prev/delete). In addition, for certain operations such as next/prev/delete, cebtree requires to record the restart pointer during a descent, reach the desired leaf and complete the operation from the recorded pointer. Just like with a next/prev, this extra cost is amortized to one extra level, but it can reach up to logN for 1/N nodes and could be averaged to logN/N.

In addition, the extra cost compared to ebtree comes from the fact that cebtree, just like other non-radix trees (rbtree etc), needs to visit two branches to figure which one to descend. ebtree avoids this by storing the depth value inside the node. These extra reads may consume more memory bandwidth for large keys, but given that the node itself is 2.5-3 times smaller, the extra cost for small keys can still be absorbed by the lower cost of retrieving the node from memory. However, this extra cost could have an impact on TLB misses when using indirect accesses (cebis_* etc).

Tests performed on haproxy on a laptop with the map/acl pattern reference showed:

a tiny performance increase when loading a map made of 17 million IPv4 addresses and masks (22.0 seconds down to 21.5 seconds); these are short 17-character strings. The resulting memory saving was 400 MB.
a moderate increase when loading a map made of 1 million user agents (much larger strings). The time went up from 2.475 seconds to 2.711 seconds. The memory saving was 24 MB.

Thus it can make sense to use this for stuff that's mostly related to configuration or that doesn't change often and is mostly looked up (i.e. config object names).

Update (2025-05-04): performance measurements were done here.

What's left to be done

First, the documentation in the code is still incomplete. Now that the API got clearer, the doc should be should be finished. In the worst case some types will get renamed, which is no big deal. The doc must be done before 0.3 is released anyway.

Some things are still in the state of reflection:

shouldn't we use a different root type for duplicates vs unique trees ? It could avoid mistakes caused by copy-paste. We could very well do that in exposed function prototypes only so as not to carry these down the code.
some operations that exist in ebtree were not implemented. One of them is the longest match, typically used to look up IP addresses within ranges. It's still unknown at this point if this can be implemented with the current model (maybe it would require to slightly change the key representation, in ebtree the prefix length is passed and stored separately).
some long but still incomplete studies indicate that the code could almost support properly ordered atomic operations. The only real limitation is deletion that is not compatible with parallel insertion or lookup. One first approach could be to use upgardable rwlocks like progressive locks and take a shared lock for all operations, and upgrade it to an exclusive lock only when committing the removal. This would result in a very short locking time, though all visitors would still need to write-access a shared memory area to grab the read lock, and a moderate deletion rate could suffice to purge visitors too often. Another solution might be to consider alternate approaches such as RCU.
implementation of relative pointers was started but put to a pause for now. It's convenient to index files in-place (e.g. logs etc), but I didn't have immediate use for this and it further complicates the development. It will have to be done, though.

Future work

Simplifying the trees to the extreme of requiring only two pointers, just like a list, opens some perspectives:

maybe it would be useful to support to optionally implement the parent pointers in the node so as to decide between performance and space.
the work initially done on ebtrees to try to factor the code went in the wrong direction by making it too difficult to use and maintain. On the opposite, the work done on cebtree preserved maintainability despite supporting multiple types (u32, u64, long, addr, blocks, strings, indirect blocks, indirect strings). It looks like ebtree would benefit from rolling back to before the change and adopting a cebtree-like approach.
the duplicate lists of cebtree look way better than the duplicate trees of ebtree, and maybe we could significantly improve ebtree's performance in schedulers by adopting the same duplicates (duplicate values are frequent in scheduler due to the timer's resolution).
maybe ebtree could be built as an extended cebtree in the end, then it could be nice to re-merge them into a single project.
studying the interest of self-locking nodes following the same principle as mt_lists could also be interesting, though the visit cost might become high.

Links

The code in its current state is available here:

cebtree : version 0.1 is the most basic (no duplicates). Version 0.2 supports duplicates without using tagged pointers. It can be seen as a reference about what can be done when tagged pointers are not desired. The current version supports everything including tagged pointers and is still in development.
ebtree : version 6 is the production one. Version 7 is the one that tries to factor all the code and that is not considered production ready.
plock: the upgradable R/W progressive locks will probably be the next step to support parallel insertions and lookups with exclusivity only on the delete operation.
mt_list: the self-locked lists have concepts that might be reusable here, probably even lighter since nodes are accessible in a single direction at a time. This deserves to be studied.

2018-02-16

Progressive Locks : fast, upgradable read/write locks

Context

A few years after I created the Elastic Binary Trees (ebtree), I started to get some feedback from users who needed to use them in multi-threaded environments, asking if an MT-safe version was available. By then, nothing was available, and all I could suggest them was to place some locks around all the tree manipulation code. While this generally works well, it was sub-optimal because these trees feature some really nice properties that are partially impaired if surrounded by frequent, heavy locks. Thus it required some extra thinking, leading to a first minimal solution emerging in 2012 and having improved since.

Common use cases

It looks like the most common use cases of ebtrees are :

data caches (like the LRU cache example from ebtree that ended up in HAProxy as well)
log parsing / analysis (i.e.: count number of occurrences of a number of criteria and their relation with other criteria)
data conversion (like in HAProxy's maps, mostly used for IP-based Geolocation)

These use cases share a few important properties :

the trees are mostly read from, and seldom updated (typically less than 2-3% updates)
these operations are performed as a very small part of something more complex, and often a single high-level operation will require multiple tree accesses. Thus these operations need to remain extremely fast.
the workloads involving these operations tend to be heavy and to benefit from multi-threading

At first glance, some R/W locks would perfectly fit this purpose. But R/W locks tend to be slow because they involve two atomic operations when taking the lock for reading and when dropping it. This means performing 4 atomic operations per tree lookup. This can end up being slower than the tree lookup itself in certain cases. Often taking the lock for writing is faster but that's the case we're the least interested in here. And I got the confirmation by some users that due to low concurrency on their tree, they preferred to use a faster spinlock around the tree operations, even if that implied not doing any parallelism at all.

For other users requiring parallelism and some insertions (caches, log processing), R/W locks limit the scalability to the ratio of time spent writing vs the time spent doing something else not requiring access to the tree. Indeed, the write lock being exclusive, each time there is a write operation, the tree is not usable and all readers have to wait.

Observing how the time is spent in ebtrees

Ebtrees present an average lookup and insertion time of O(log(N)) where N is the number of entries in the tree. In practice this is not totally exact for small trees, because an ebtree is a modified radix tree, so the number of visited nodes during a lookup or an insertion in fact corresponds to the number of common prefixes along the chain to the requested node, which can be as high as the number of bits in the data element. The tree descent is done by comparing some bit fields with the data found at each location and deciding to go left or right based on this. The final operation (returning the data found, or inserting/deleting a node) is very fast as it's only a matter of manipulating 4 pointers in the worst case. The sequence below shows how a node "3" would be added to a tree already containing nodes 1, 2, 4, and 5 :

On a core i7-6700K running at 4.4 GHz, a lookup in a tree containing 1 million nodes over a depth of 20 levels takes 40 ns when all the traversed nodes are hot in the L1 cache, and up to 600 ns when looking up non-existing random values causing mostly cache misses. This fixes a limit to 1.67 million lookups per second in the worst case (e.g. cache being intentionally attacked).

This ratio is interesting because it shows that for some use cases, a thread would possibly want to benefit from the fast lookup of common keys (e.g. 40ns to look up the source IP address of a packet belonging to a recent communication), without being too much disturbed by another thread having to insert some random values locking the tree for 600 ns.

For a workload consisting in 1% writes in the worst case conditions above, a single core would spend 87% of its time in reads, and 13% in writes, for a total of 22 million lookups per second. This means that even if all reads could be parallelized on an infinite number of cores, the maximum performance that could be achieved would be around 160 million lookups per second, or less than 8 times the capacity of a single core.

Below is a graphical representation of how the time would be spent between fast reads and slow writes on this worst-case workload, on a single core machine, then on a 8-core machine, showing that the 8-core machine would be only 4 times faster than the single-core one, spending 48% of its total CPU resources waiting while the write lock is held. The 100 operations are performed every 1095 ns, bringing 91 million operations per second.

Note that this problem is not specific to trees in fact. It also affects other structures such as hash tables, to certain extents. Hash tables are made of linked lists attached to an array of list heads. Most of the time, elements are not ordered, so additions are performed to the head (or tail) of the list and it is possible to lock a list head (or the whole hash table depending on requirements) during a simple insertion. But there is a special case which concerns uniqueness : if you want to insert an element and ensure it is unique (like a cache usually does), you need to atomically look it up then insert the new one, and the same problem happens, except that this time the worst lookup time can be O(N) :

Designing a solution

Since most of the time is spent looking up a position in the storage structure (tree, list, etc), and this lookup by itself doesn't result in modifying the structure, it seems pointless to prevent other readers from accessing the structure. In an ideal case, we would like to have the ability to "upgrade" the lock from Read to Write only after the lookup. This cannot be done without some constraints since if it were universally possible, it would mean that all readers could be upgraded to writer some of which would be stuck at certain positions while a writer is currently modifying these same locations.

But most of the time we know that we're performing a lookup planning for a write operation. So we could have an intermediary lock state that does not exclude readers, and which at the same time is guaranteed to be atomically upgradable to a write lock once all readers are gone.

This is exactly what the initial design of the Progressive Locks (plock) does. The new intermediary lock state is called "S" for "seek" as it is only used while seeking over the storage area and without performing any modification. Just like for W locks, a single S lock may be held at a time, but while the W and R locks are exclusive, the S and R locks are compatible.

In the worst case above, is we consider that among the 600ns it takes to insert a key into the tree, 595ns are used for the lookup and 5ns are used to update the pointers, it becomes obvious that the whole tree is locked only for 5ns every 600ns, so 595ns out of the 600ns are usable for read operations that happen in parallel. Thus the sequence becomes the following one :

a slow writer starts to seek through the whole tree using the S lock, taking 595ns
in parallel, other threads continue their read operations. During the same time frame, each thread may initiate 15 read operations of 40ns
once the writer finds the location where it wants to insert its new node, it upgrades its S lock to W.
other threads stop initiating read operations and complete existing ones
the writer waits for all readers to be gone. In the worst case it takes the time it takes for a full read operation, hence 40ns max (20ns avg)
after 635ns max, the writer is alone and can modify the tree
after 640ns max, the write operation is completed, the lock is removed and readers can enter again.

With 8 total cores, all cores can stay busy handling a 99% read ratio and 1% write ratio, since during the 595-635ns the S lock is held by one core, 7 cores can process 105 read operations of 40ns each.

The writer will wait between 0 and 40ns to get exclusive access, thus 20ns on average. As a result, this will bring one write every 620ns and all read requests in parallel. This means that 100 operations are performed every 620ns, bringing a throughput of 161 million operations per second, or 77% faster than when using an R/W lock as above :

Initial implementation

It appeared critical that a single atomic operation could be used to take a lock, so as not to repeat the performance problem of classical R/W locks. Given that the number of readers needs to be known all the time, the lock needs to be large enough to store a number as large as the maximum number of readers, in addition to other information like the other lock states, into a single word-sized memory area guaranteeing the possibility of atomic accesses.

Usually locks are implemented using a compare-and-swap operation (e.g. CMPXCHG on x86), which allows to check for contention before performing an operation. But while using compare-and-swap is convenient when dealing with a few distinct values, it's not convenient at all when dealing with an integer which can vary very quickly depending on the number of readers, because it would require many failed attempts and could even never succeed for some users.

On x86 systems, we have a very convenient XADD instruction. This is an example of fetch-and-add operation : it atomically reads a value from a memory location, adds it to a register, and stores the result back to memory, while returning the original value into the register.

Using XADD definitely helps counting readers, but it is subject to overflows, and doesn't appear very convenient to count write requesters. While we do know that we'll never have many write requesters at once, a safe design mandates to be able to let all threads compete in parallel in this situation.

Thus the idea emerged of splitting an integer into 3 parts :

one to count the readers
one to count the demands for S locks
one to count the demands for W locks

By splitting a 64-bit integer into 3 equal parts, it becomes possible to assign 21 bits per part, hence support up to 2097151 threads without ever causing an overflow. On 32-bit systems, this would only support 1023 threads without overflowing. This limit on 32-bit systems would definitely make the solution unsuitable to some existing software, so some improvements were needed.

As mentionned, the W locks need to be exclusive to all other locks. The S locks need to be exclusive to S and W locks. The R locks need to be exclusive to W locks. This means that it is possible to overflow certain values into the other ones without a risk :

if the R bits overflow into the S bits, since S locks are stricter than R locks, the guarantees offered to readers are preserved
if the S bits overflow into the W bits, since W locks are stricter than S, the guarantees offered to S users are preserved

But we need to protect W against overflowing or to assign it enough bits.

So in practice the limit is really dictated by the number of bits to represent W. And in order to get optimal performance, and avoid forcing readers to wait, it seems important to have the same number of bits for R. In theory, a single bit is needed for S. But given that concurrent accesses on S would result in occasional appearance of W, this would occasionally make some readers wait while not needed. Thus the choice was made to use two bits for S, supporting up to 3 concurrent seek attempts without perturbing readers, which leaves enough time for these seekers to step back.

This provides the following distribution :

on 32-bit platforms :

14 bits for W
2 bits for S
14 bits for R

on 64-bit platforms :

30 bits for W
2 bits for S
30 bits for R

The astute reader will have noted that this results in 30 bits for 32-bit platforms and 62 bits on 64-bit platforms. This leaves us with 2 extra bits used for any application purpose, particularly the ebtree configuration bits (unicity etc) :

The initial naive encoding showed that certain combinations correspond to non-existing cases in practices and that a more advanced encoding could help store one extra state.

Indeed, the progress made on ebtrees makes it apparent that certain O(1) operations like the removal of a node might one day be implemented as atomic operations, without requiring any locking. But how is this related to a lock implementation then ? Porting an application or a design from locked to lock-free is never a straightforward operation and can sometimes span over several years making small progress at each attempt. Being able to use a special lock state to mention that it is compatible only with itself and exclusive to all other ones can be very convenient during the time it takes to transform the application.

So a new "A" (for "atomic") state was introduced and all the encoding of the states based on the bits above was redefined to be a bit more complex but still fast and safe, as described below.

Current implementation

The 5 known states of Progressive Locks are now :

U (unlocked) : nobody claims the lock
R (read-locked) : some users are reading the shared resource
S (seek-locked) : reading is still OK but nobody else may seek nor write
W (write-locked) : exclusive access for writing
A (atomic-locked) : some users are performing safe atomic operations which are potentially not compatible with the ones covered by R/S/W.

The U, R and W state are the common states for classical R/W locks. The S state is an intermediary step between R and W, which still allows readers to enter, but rejects other W and S requests and guarantees a quick upgrade to (as well as return from) the W state. The S lock also makes it possible to implement some "lockless" operations following a "single-writer multiple-readers" model where a single careful writer can perform changes without preventing readers from accessing the data. The A state is for atomic operations mainly consisting in ordered writes, that can be performed in parallel but must not be mixed with R, S nor W.

The S lock cannot be taken if another S or W lock is already held. But once the S lock is held, the owner is automatically granted the right to upgrade it to W without checking for other writers. And it can take and release the W lock multiple times atomically if needed. It must only wait for last readers to
leave.

The A lock supports concurrent write accesses and is used when certain atomic operations can be performed on a structure which also supports non-atomic operations. It is exclusive with the other locks. It is weaker than the S/W locks (S being a promise of "upgradability"), and will wait for any readers to leave before proceeding.

The lock compatibility matrix now looks like below. Compatible locks are immediately obtained. Incompatible locks simply result in the requester to wait for the incompatible ones to leave :

Many transitions are possible between these states for a given thread, as represented on the transition graph below :

Some additional state transitions may be attempted but are not guaranteed to succeed. This is the case when holding a Read lock and trying to turn it into another lock which is limited to a single instance : another thread might attempt to perform exactly the same transition and by definition this can not work for both of them. Moreover, the one succeeding will wait for readers to quit, so the one failing will also have to drop its Read lock in this case before trying again. Nevertheless, the plock API provides an easy access to these uncommon features.

The transitions to W and A are performed in two phases :

request the lock
wait for previous incompatible users to leave

Claiming the Read lock is performed by first checking if nobody holds the W lock. The rationale behind this is to ensure that a thread holding the Write lock and waiting for readers to quit doesn't have to wait for too long seeing new readers come.
The locks are implemented using cumulable bit fields representing from the
lowest to the highest bits :

R: the number of readers (read, seek, write)
S: the number of seek requests
W: the number of write requests

Each regular lock (R, S, W) counts as a reader. Since all of them may operate
on a concurrent read, the maximum number of concurrent threads supported is dictated by the number of bits used to encode the readers.

Since the writers are totally exclusive to any other activity, they require the same number of bits in order to accept the worst possibly imaginable scenario where all threads would request a write lock at the exact same instant so the number of bits used to represent writers is the same as the number used to represent the number of readers.

The number of seek requests remains on a low bit count and this number is placed just below the write bit count so that if it overflows, it temporarily overflows into the write bits and appears as requesting an exclusive write access. This allows the number of seek bits to remain very low, 1 technically, but 2 to avoid needless lock/unlock sequences during common conflicts.

In terms of representation, we now have this :

R lock is made of the R bits
S lock is made of S+R bits
W lock is made of W+S+R bits
A lock is made of W bits only

Due to W being made of W+S+R, and S overflowing into W, each time a W lock is directly taken, both W and S bits are added resulting in the W field growing faster than the number of write requests. Indeed, with two bits for S below W, the W:S word is increased by 0b101 = 5 while W represents 1/4 of the W:S word, so each writer consumes 5/4 of a position. But given that 5 doesn't divide any power-of-two quantity, we're guaranteed to always keep at least one non-null bit in the W part of the word for any supported number of writers, and given that the W lock is stronger than the S lock, other access types are protected.

Having only the W bits set in W:S could be confused with the A lock except that the A lock doesn't hold the R bits and is exclusive with regular locks so this situation is unambiguous as well.

In practice, it is useful to understand all states that can be observed in the lock, including the transient ones shown below. "N" means "anything greater than zero". "M" means "greater than 1 (many)". "X" means "anything including zero" :

The lock can be upgraded and downgraded between various states at the demand of the requester :

U⇔A : pl_take_a() / pl_drop_a() (adds/subs W)
U⇔R : pl_take_r() / pl_drop_r() (adds/subs R)
U⇔S : pl_take_s() / pl_drop_s() (adds/subs S+R)
U⇔W : pl_take_w() / pl_drop_w() (adds/subs W+S+R)
S⇔W : pl_stow() / pl_wtos() (adds/subs W)
S⇒R : pl_stor() (subs S)
W⇒R : pl_wtor() (subs W+S)

This is where the XADD instruction really shines : it is possible to take or upgrade any lock using a single atomic instruction by carefully setting the constant value. And while doing this, the current value is retrieved so that the requester can check whether or not it is causing a conflict and has to rollback the change. Dropping, downgrading a lock, or cancelling a failed attempt to take a lock simply requires an atomic subtract.

In case of conflict, the requester has to periodically observe the lock's status by reading it, and trying again once the lock's status appears to be compatible with the request. It is very important not to try to take the lock again without reading first because of the way CPU caches work : each write attempt to a shared memory location (here the lock) will broadcast a request to all other caches to flush their pending changes and to invalidate the cache line. If the lock is located close to a memory location being modified under this lock's protection, repeated write attempts to the lock will significantly slow down the progress of the other thread holding the lock since the cache line will have to bounce back and forth between competing threads.

Instead, by only reading the lock, the other caches will still have to flush their pending changes but not to invalidate the cache line. This way it will still make progress, albeit not as fast as what it could do without this.

In order to improve the situation, Progressive locks implement an exponential backoff : each failed attempt causes the failing thread to wait twice as long as the previous time before trying again. This well known method used in locks and in network protocols comes with three immediate benefits :

it limits the disturbance caused to the thread holding the lock
it avoids cache storms when using many concurrent threads
it provides a much better fairness to all threads by increasing the randomness of their attempts

When a thread is silently waiting, it uses some CPU-specific instructions like the PAUSE instruction on x86 processors, which stops execution for a few cycles and leaves the CPU pipelines available to the other thread of the same CPU core, thus further speeding up processing on simultaneous multi-threaded environments.

Performance measurements

A new locking mechanism couldn't work without some performance benchmarks. One such utility implementing a simple LRU cache using a hash table is provided with the software package, under the name lrubench. This utility looks up and inserts random values into a cache if they are not present yet. It allows to set the number of concurrent threads, the cache size (number of keys kept before eviction), the key space to control the cache hit ratio, the cost of a cache miss (number of loops to perform on a simple operation mimicking a real, more expensive operation like a regex call), and the locking strategy.

For each locking strategy, the principle is the same :

perform a lookup of the value in the cache
if the value is present, use it
otherwise perform the expensive operation
insert the result of this expensive operation into the cache, only if it was not added in the mean time by another thread, otherwise replace this older entry with the new one
in case of insertion, remove possible excess elements from the cache to keep it to the desired size

Eight locking strategies are supported :

pthread spinlock : the first lookup is protected using pthread_spinlock(). During the insertion, the lookup, the possible removal, and the insertion are atomically performed using pthread_spinlock(). No parallel operation is possible on the cache.
pthread rwlock : the first lookup is protected using pthread_rwlock_rdlock(). During the insertion, the lookup, the possible removal, and the insertion are atomically performed using pthread_rwlock_wrlock(). Initial lookups may be processed in parallel but not the second ones.
plock W lock : the first lookup is protected using a W lock. During the insertion, the lookup, the possible removal, and the insertion are atomically performed using a W lock. No parallel operation is possible on the cache. This is comparable to the pthread spinlock strategy.
plock S lock : the first lookup is protected using an S lock. During the insertion, the lookup, the possible removal, and the insertion are atomically performed using an S lock. No parallel operation is possible on the cache. This is comparable to the plock W lock above but is slightly faster because the S lock doesn't have to check for readers. It is the closest plock alternative to the pthread spinlock.
plock R+W lock : the first lookup is protected using a R lock. During the insertion, the lookup, the possible removal, and the insertion are atomically performed using a W lock. Initial lookups may be processed in parallel but not the second ones. This is comparable to the pthread rwlock strategy, but is much faster due to the single atomic operation.
plock R+S>W lock : the first lookup is protected using a R lock. During the insertion, the lookup is performed using an S lock, then the possible removal, and the insertion are atomically performed by upgrading the S lock to a W lock. Initial lookups may be processed in parallel as well as in parallel to the second ones. Only one second lookup may be performed at a time, and initial lookups are only blocked during the short write period at the end.
plock R+R>S>W lock : the first lookup is protected using a R lock. During the insertion, the lookup is performed using an R lock, then the possible removal, and the insertion are atomically performed by upgrading trying to upgrade the R lock to an S lock, or performing the operation again using an S lock in case of conflict. Then the S lock is upgraded to aa W lock to complete the operation. Initial lookups may be processed in parallel as well as in parallel to the second ones. The second lookups may be performed in parallel as well, unless a write access is already claimed, forcing a second lookup to be done.
plock R+R>W lock : the first lookup is protected using a R lock. During the insertion, the lookup is performed using an R lock, then the possible removal, and the insertion are atomically performed by upgrading trying to upgrade the R lock to a W lock, or performing the operation again using a W lock in case of conflict. Initial lookups may be processed in parallel as well as in parallel to the second ones. The second lookups may be performed in parallel as well, unless a write access is already claimed, forcing a second lookup to be done.

The utility has been run on a 12-core Xeon E5-2680v3 at 2.5 GHz with HyperThreading enabled.

The graphs below represent the number of cache lookup operations per second achieved using these different locks, for 6 different cache hit ratios (50%, 80%, 90%, 95%, 98%, 99%), for 3 different cache miss costs (30, 100, 300 snprintf() loops), and with a varying number of threads. The left part of the graph uses only one thread per CPU core (no HyperThreading involved), and the right part of the graph uses two threads per CPU core (HyperThreading).

Low miss cost (30 loops)

(click on the graphs to zoom)

Medium miss cost (100 loops)

(click on the graphs to zoom)

High miss cost (300 loops)

(click on the graphs to zoom)

It is obviously not a surprise that the difference between locking mechanisms fades away when the miss cost increases and the hit ratio diminishes since more time is proportionally spent computing a value than competing for a lock. But the difference at high hit ratios for large numbers of threads is stunning : at 24 threads, the plock can be respectively 5.5 and 8 times faster than R/W locks and spinlocks! And even at low hit ratios and high miss cost, the plocks are never below the pthread locks and become better past 8-10 threads.

The more time the lookup takes, the wider the gap between the different mechanisms, since in the case of spinlocks or R/W locks, exclusive access is required during the insertion, making even a single thread doing an occasional insert stop all other threads from using the cache. This is not the case with plocks, except in the R+R>W mode upon failure to upgrade due to two threads trying to concurrently insert an entry. This is what causes the R+R>W mechanism to converge to the R+W mechanism when the thread count increases.

It looks like the upgrade attempts from the R lock are pointless on this workload, because there is rarely a contention between two insertions. It starts to be slightly beneficial only when dealing with important lookup times and cache miss ratios. But in this case one of the threads has to retry the operation, which is not much different from being serialized around an S lock.

At low hit ratios (50%), the spinlocks are cheaper than R/W locks (less operations, balance around 80% and 10 threads), and the plock S and W variants are slightly faster than the spinlocks and converge towards them with more threads. One explanation might be the use of exponential back-off in plock that makes it behave slightly better for a comparable workload. But at even only 80% cache hit ratio (hence 20% miss), all strict locks behave similarly and the benefit of the upgradable locks becomes obvious.

It is also visible on graphs with high hit ratios that the spinlock performance degrades as the number of threads increases. One could think that this would indicate that the spinlocks do not use an exponential back-off to avoid polluting the bus and caches, but the curve is perfectly merged with the plock-S and plock-W curves, and plock does use exponential back-off. In fact the reason here is different. It's simply that plock-S and plock-W start with an aggressive locked XADD operation, and that the spinlocks use a locked CMPXCHG operation. Both operations unconditionally issue a bus write cycle, even the CMPXCHG! This is explained in the Intel Software Developer's manual in the CMPXCHG instruction description : "To simplify the interface to the processor’s bus, the destination operand receives a write cycle without regard to the result of the comparison. The destination operand is written back if the comparison fails; otherwise, the source operand is written into the destination. (The processor never produces a locked read without also producing a locked write.)". These write operations have a negative effect on the other CPU's caches, forcing invalidations, which explains why the performance slowly degrades.

For plock, an attempt was made to read before writing, but this results in a small performance loss for most common cases which isn't worth the change. In the end this preliminary read is only performed before taking the R lock, because in general this lock is taken before a long operation so a double read is easily amortized, and because not doing it increases the time it takes for a W lock holder to see the last reader quit.

Integrating Progressive locks into a program

The plock Git tree can be consulted and cloned from here. The code was made to have the least possible dependencies. It only consists in an include file bringing all the code as macros, and a second include file providing the atomic operations. The choice of macros was made so that the calls may automatically be used on 32- and 64- bit locks. 32-bit platforms may only use 32-bit locks, but 64-bit platforms may use both. For most use cases, it's likely that even a 64-bit platform will not need more than 16383 threads and may prefer to save space by using only 32-bit locks.

A program only needs to do this to use plocks :

#include <plock.h>

From this point, adding a lock to a structure simply requires an integer, long or even a pointer :

struct cache {
     struct list head;
     int plock;
};

A lock value of zero describes and unlocked state, which is easy to initialize since any structure allocated with calloc() starts with an initialized lock.

Appending an entry without requiring any lookup would consist in simply taking the W lock around the cache access :

struct cache_node *__cache_append(struct cache *, const char*);
void cache_insert(struct cache *cache, const char *str)
{
    struct cache_node *node;

    pl_take_w(&cache->plock);
    node = __cache_append(cache, str);
    pl_drop_w(&cache->plock);
}

Then reading from the cache would basically consist in taking the R lock before operating :

struct cache_node *__cache_find(const struct cache *, const char*);

int cache_find(struct cache *cache, const char *str)  {
    const struct cache_node *node;
    int ret;

    pl_take_r(&cache->plock);
    node = __cache_find(cache, str);
    ret = node ? node->value : -1;
    pl_drop_r(&cache->plock);
    return ret;
}

Deleting an entry from the cache would consist in taking the S lock during the lookup and upgrading it to a W lock for the final operation :

struct cache_node *__cache_find(const struct cache *, const char*);
void __cache_free(struct cache *, struct cache_node *);

void cache_delete(struct cache *cache, const char *str)  {
    struct cache_node *node;

    pl_take_s(&cache->plock);
    node = __cache_find(cache, str);
    if (!node) {
       pl_drop_s(&cache->plock);
       return;
    }
    pl_stow(&cache->plock);
    __cache_free(cache, node);
    pl_drop_w(&cache->plock);
}

Inserting an entry only if it doesn't already exist can be done in multiple ways (typically taking the S lock for a lookup and upgrading it if no node is found). One interesting way when the risk of collision is low and the insertion rate is high (eg: log indexing) consists in using a read lock and to attempt an upgrade to W, or falling back to an S lock in case of failure. This way many threads may insert in parallel and their atomicity checks will be parallelized yet will guarantee atomicity :

struct cache_node *__cache_find(const struct cache *, const char*);
struct cache_node *__cache_append(struct cache *, const char*);

void cache_insert_unique(struct cache *cache, const char *str)
{
    struct cache_node *node;

    pl_take_r(&cache->plock);
    node = __cache_find(cache, str);
    if (node) {
        pl_drop_r(&cache->plock);
        return;
    }
    if (!pl_try_rtow) {
        /* upgrade conflict, must drop R and try again */
        pl_drop_r(&cache->plock);
        pl_take_s(&cache->plock);
        node = __cache_find(cache, str);
        if (node) {
           pl_drop_s(&cache->plock);
           return;
        }
        pl_stow(&cache->plock);
    }
    /* here W is held */
    node = __cache_append(cache, str);
    pl_drop_w(&cache->plock);
}

The lrubench.c test program provided with plock gives a good illustration of how the locks can be used using various strategies compared to spinlocks and standard RW locks.

Extra work

Some cleanup work remains to be done on the locks, particularly on the atomic operations which confusingly are prefixed with "pl_". Some of them will move to a distinct low-level library. This is not important since users should normally not use these functions directly.

It seems useless to implement progressive locks on 16-bit values (would support at most 127 threads, or 63 if keeping the 2 application bits). But maybe some environments could make good use of them by reusing holes in existing structures.

I made a few attempts at using the same mechanism to implement tickets to grant an even better fairness to many threads, but didn't find a way to guarantee uniqueness of ticket numbers without performing multiple atomic operations.

It would seem that sort of an atomic "test-and-add" operation would be beneficial here to avoid taking the lock then rolling it back in case of conflict. It would be a mix of a test-and-set and an addition replacing the XADD so that the addition is only performed if certain bits are not set (typically S and W bits). While no such instruction exists for now, a few different options should probably be considered :

HLE (Hardware Lock Elision), which is a new mechanism introduced with Intel's TSX extensions. The idea is to define a code section where locks are ignored and the operation is attempted directly into the L1 cache. If it succeeds, we've saved some bus locks. If it fails, it is attempted again with the real locks. It seems that it could be used for the read+XADD at least when taking an R lock, and avoid having to grab the bus lock at this point. Some experimentation is required. What's nice is that code written like this is backwards compatible with older processors.
RTM (Restricted Transactional Memory) is a more complex mechanism allowing to know if a sequence of operations has a chance to succeed lockless or requires locking. In case of failure the operation will be attempted again. It matches very closely the R->S and R->M opportunistic upgrades that we have in the example above. This could also be used to insert or delete a node in the tree after a lockless lookup.

Annex A: detailed locking rules

The rights and expectations these locks impose to their users are the following :

unlocked (U) :

nobody may perform visible modifications to the protected area at all
anybody can read the protected area at their own risk
anybody may immediately turn the lock to any other state

read-locked (R) :

requester must wait for writers to leave before being granted the lock
owner may read any location from the protected structure without any requirement for ordering
others may read any location from the protected structure without any requirement for ordering
others may also be granted an R lock
one other user may also be granted an S lock
one other may take a W lock but will first have to wait for all readers to leave before proceeding with writes

seek-locked (S) :

requester must wait for writers and seekers to leave before being granted the lock
owner may read any location from the protected structure without any requirement for ordering
others may read any location from the protected structure without any requirement for ordering
others may also be granted an R lock
owner may upgrade the lock to a W lock at any instant without prior notification
owner must wait for readers to leave after the lock is upgraded before performing any write
owner may release the lock without performing an upgrade
owner may also downgrade the S lock to an R lock

write-locked (W) :

requester must wait for writers and seekers to leave before being granted the lock (this is guaranteed when upgrading from S)
others may not try to grab any lock once this lock is held
the lock is granted once all other users have left
owner may perform any modification to the protected structure
others may not access the protected structure at all
owner may downgrade the lock to an S lock
owner may downgrade the lock to an R lock

atomic (A) :

requester must wait for writers and seekers to leave before being granted the lock, but doesn't exclude other write-atomic users
the lock is granted once all other users have left
owner is not guaranteed to be alone since other users may hold the A lock in parallel and perform modifications at the same time ; observations and retries might be necessary to complete certain operations
owner may only carefully apply modifications to the protected structure using atomic operations and memory barriers in a way that is compatible with its own usage
others may only carefully apply modifications to the protected structure using atomic operations and memory barriers in a way that is compatible with others' usage
the protections involved with this type of lock are 100% application specific. Most commonly this will be used to reinitialize some elements, release a series of pointers using atomic operations and all compatible functions will use this lock instead of any of the 3 above.