2011-12-18

Elastic Binary Trees - ebtree

Administrivia

This article was initially posted on Wikipedia in 2008, which explains why it's written at the 3rd person and looks familiar to articles on binary trees. A few moderators decided that the article was only self-promotion and removed it without notification a few days after its publication. After many arguments with them, I was told that Wikipedia cannot hold original work, only copies of what can be found somewhere else, and I gave up the fight, realizing that for once I found dumber than me! At least they accepted to let me retrieve the original work so that I can publish it somewhere else (I didn't even have a copy of it). Needless to say I have stopped donating to Wikipedia since then. Three years later, I spent part of a week-end putting this online here. When I have time, I'll rewrite it differently.

Introduction

In computer science, an elastic binary tree (or EB tree or EBtree) is a binary search tree specially optimized to very frequently store, retrieve and delete discrete integer or binary data without having to deal with memory allocation. It is particularly well suited for operating system schedulers where fast time-ordering and priority-ordering are strong requirements. Insertion and lookups are performed in O(log n) while removal is done in O(1). The tree is not balanced but its height is bound by the type of data to store in it. Ordered duplicate entries are also natively supported.

The elastic binary tree was invented between 2002 and 2007 by Willy Tarreau as part of a research work on an event-based task scheduler for user-space network applications. This type of components require a sorted list of future events, and insertion and removal are very common operations which must be optimized. An early implementation relied on a naive linked list approach, soon replaced with a fast radix tree. The disadvantages of the radix tree is that it is often necessary to allocate memory for one node to store a value, and that some garbage collection must be performed after removal in order to eliminate unused nodes. There are situations where this is not desirable nor even possible. An alternative was to use a balanced tree, but the very frequent removal operation costs O(log n).

The solution was to find some sort of hybrid type of tree between both, where each inserted entry would carry both an intermediate nodes and a leaf, both of which would stray from each other as the tree leaves, thus stretching the initial node across levels, hence the name "elastic binary tree".

License

The concept and algorithms have been released under public domain. The author's implementation was first released under GPL then switched to LGPL for more openness. The algorithm is simple enough to allow other implementations to be initiated for very specific needs.

Definitions

In an EB tree, data are stored in EB nodes. An EB node contains two parts :
  • the node part, which is responsible for tying together upper and lower nodes or leaves ;
  • the leaf part, which carries the data (an integer key), and keeps a pointer to its upper node (leaf_p).

The node part itself is composed of a parent pointer (node_p), a level (bit), and a root, which itself links to lower nodes constituting a subtree.

The level part represents the lowest bit position in the key, above which all bits are equal for all keys in the lower subtrees.

A root only contains two links called branches, which represent the value of the bit below current node's. There is always a left and a right branch, except for the top of the tree. Left branch represents bit value 0 while right branch reprensents value 1.

Branches are typed. This means that a node knows whether its branches point to other nodes or to leaves. Parent pointers are typed too. This means that each node or leaf knows if it is attached to the left or the right branches of the upper root.

A typical EB tree begins with a root at its head, containing 0 or 1 branch, below which the tree is complete. The right branch of an EB tree root is always empty.

Features

EB trees offer a wide range of useful data manipulation features :
  • Lookup :
    • lookup first or last key
    • exact match: lookup exact key (signed/unsigned integer, poitner, string or memory block)
    • longest match: lookup longest matching key (network address match)
    • lookup first matching prefix: retrieve first entry matching the beginning of a key
    • lookup closest smaller value
    • lookup closest greater value
    • lookup previous or next different value: quickly skip duplicates
    • lookups of duplicate keys always performed in key insertion order
  • Insertion :
    • standard key insertion : if the key exists, create a duplicate entry
    • unique key insertion : if the key exists, return the existing one

Complexity

EB trees algorithmic complexity can be derived from 3 variables :
  • the number of possible different keys in the tree : P
  • the number of entries in the tree : N
  • the number of duplicates for one key : D
EB trees are deliberately not balanced. For this reason, the worst case may happen with a small tree (for instance, 32 distinct keys of one bit). But the operations required to manage such data are so much cheap that they make it worth using these trees even under such conditions. For instance, a balanced tree may require only 6 levels to store those 32 keys when an EB tree will require 32. But if per-level operations are 5 times cheaper, EB tree wins.
Minimal, Maximal and Average times are specified in number of operations. Minimal is given for best condition, maximal for worst condition, and the average is reported for a tree containing random keys. An operation generally consists in jumping from one node to another.
Complexity :
  • lookup  : min=1, max=log(P), avg=log(N)
  • insertion from root : min=1, max=log(P), avg=log(N)
  • insertion of dups  : min=1, max=log(D), avg=log(D)/2 after lookup
  • deletion  : min=1, max=1, avg=1
  • prev/next  : min=1, max=log(P), avg=2 :

 N/2 nodes need 1 hop=> 1*N/2
 N/4 nodes need 2 hops => 2*N/4
 N/8 nodes need 3 hops => 3*N/8
 ...
 N/x nodes need log(x) hops => log2(x)*N/x
 Total cost for all N nodes : sum[i=1..N](log2(i)*N/i) = N*sum[i=1..N](log2(i)/i)
 Average cost across N nodes = total / N = sum[i=1..N](log2(i)/i) = 2

Current EB tree design is limited to only two branches per node. Most of the tree descent algorithm would be compatible with more branches (eg: 4, to cut the height in half), but this would probably require more complex operations and the deletion algorithm would be problematic.

Properties

EB trees have several useful properties :
  • a node is always added above the leaf it is tied to, and never can get below nor in another branch. This implies that leaves directly attached to the root do not use their node part, which is indicated by a NULL value in node_p. This also enhances the cache efficiency when walking down the tree, because when the leaf is reached, its node part will already have been visited (unless it's the first leaf in the tree).
  • pointers to lower nodes or leaves are stored in "branch" pointers. Only the root node may have a NULL in either branch, it is not possible for other branches. Since the nodes are attached to the left branch of the root, it is not possible to see a NULL left branch when walking up a tree. Thus, an empty tree is immediately identified by a NULL left branch at the root. Conversely, the one and only way to identify the root node is to check that it right branch is NULL. Note that the NULL pointer may have a few low-order bits set on some implementations.
  • a node connected to its own leaf will have one and only one of its branches pointing to itself, and leaf_p pointing to itself.
  • a node can never have node_p pointing to itself.
  • a node is linked in a tree if and only if it has a non-null leaf_p.
  • a node can never have both branches equal, except for the root which can have them both NULL.
  • deletion only applies to leaves. When a leaf is deleted, its parent must be released too (unless it's the root), and its sibling must attach to the grand-parent, replacing the parent. Also, when a leaf is deleted, the node tied to this leaf will be removed and must be released too. If this node is different from the leaf's parent, the freshly released leaf's parent will be used to replace the node which must go. A released node will never be used anymore, so there's no point in tracking it.
  • the bit index in a node indicates the bit position in the key which is represented by the branches. That means that a node with (bit == 0) is just above two leaves. Negative bit values are used to build a duplicate tree. The first node above two identical leaves gets (bit == -1). This value logarithmically decreases as the duplicate tree grows. During duplicate insertion, a node is inserted above the highest bit value (the lowest absolute value) in the tree during the right-sided walk. If bit -1 is not encountered (highest < -1), we insert above last leaf. Otherwise, we insert above the node with the highest value which was not equal to the one of its parent + 1.
  • the "eb_next" primitive walks from left to right, which means from lower to higher keys. It returns duplicates in the order they were inserted. The "eb_first" primitive returns the left-most entry.
  • the "eb_prev" primitive walks from right to left, which means from higher to lower keys. It returns duplicates in the opposite order they were inserted. The "eb_last" primitive returns the right-most entry.
  • a tree which has 1 in the lower bit of its root's right branch is a tree with unique nodes. This means that when a node is inserted with a key which already exists will not be inserted, and the previous entry will be returned.

Principles of operations

Insertion



Initially, a tree is empty. It only consists in a root with two NULL pointers (one for each branch).

When the first EB node is inserted, its leaf is directly linked to the root 's left branch. Thus, its node part is not used. It is the only situation where a node part is not used. After that, a new EB node is inserted. A node part is needed between the root and existing leaf to connect the new leaf. This is where the node part provided with the new EB node will be used. The second EB node's node part will have one of its branch pointing to the previous leaf, and the other branch pointing to its own leaf. For this reason, it is not possible to have an empty branch below a node.

When the third EB node is added, it may very well insert itself between the node and the leaf of previous EB node, which will be stretched by the operation. It is important to note that the node part of a stretched EB node is always located somewhere between the root and the leaf of the same EB node. The same principle goes on with further keys, as shown on the diagrams below.
An EB tree with keys 1 to 5
Adding keys 21 to 25 to a previous tree.
In Yellow, the links which changed.

Insertion of duplicates

An EB tree supports duplicates and keeps them ordered. This is a very important property for a fair scheduler because events stored with the same wake-up date must be processed in the same sequence to avoid risks of starvation.
In order to store a duplicate key, we grow a binary sub-tree at the place where the leaf was to be located, using the same algorithm as for the upper tree, but with negative bit offsets. This allows duplicates to be stored without overhead data, and to be walked through using the same functions, without any processing overhead. The nodes order is respected and all nodes will be walked in their insertion order.

Deletion

When a leaf is deleted, its parent must be released too (unless it's the root), and its sibling must attach to the grand-parent, replacing the parent. Also, when a leaf is deleted, the node tied to this leaf will be removed and must be released too. If this node is different from the leaf's parent, the freshly released leaf's parent will be used to replace the node which must go. EB trees do not need to be rebalanced after deletion. This results in tree deletion being performed in O(1). A released node will never be used anymore, so there's no point in tracking it.
In yellow, links affected by the
removal of key 1 from previous EB tree.
In yellow, links affected by removal
of keys 2 to 5 from previous tree.

Tree walk

A tree walk may be performed from left to right or right to left, always following numerical order and duplicates insertion order if applicable. It relies on the following primitives :
  • eb_first(root): return the first node of the tree starting at [root]
  • eb_last(root): return the last node of the tree starting at [root]
  • eb_next(node): return next node after [node]
  • eb_prev(node): return previous node before [node]
All of these functions return either a node or NULL when the end of tree is reached. Additionnally, two primitives are particularly useful to skip duplicates :
  • eb_next_unique(node) : return the next node with a different key from current [node].
  • eb_prev_unique(node) : return the previous node with a different key from current [node].

Implementation specifics

The only currently known implementation is the author's. It works on 32- and 64-bit platforms, either little or big endian. It uses the fact that pointer storage is 32-bit aligned at least, causing structure pointers to always have their two lowest bits cleared. This allows for pointer type to be stored in the pointer itself :
  • a node pointer will be stored verbatim (with lowest bit cleared)
  • a leaf pointer will be stored with lowest bit set (which equals pointer + 1)
  • a parent pointer in a left branch will be stored verbatim (with lowest bit cleared)
  • a parent pointer in a right branch will be stored with lowest bit set (which equals pointer + 1)
This is an important optimisation which reduces the number of memory dereferences and the memory structures sizes.
The right branch of the root must always be NULL. However, applying the same principle as above lets us put global flags in this location. Currently, the only used flag is EB_ROOT_UNIQUE, indicated by setting the lowest bit of the right branch pointer in the root. It indicates that duplicates are not desired and that when a key already exists during an insertion, nothing may be inserted and the existing node will be returned instead.
The current implementation is split between generic code and type-specific code. This makes it easier to later add support for newer types. Currently, pointers, signed/ unsigned 32/64 bit integers, strings and arbitrary memory blocks are supported.
Sixth version of the author's implementation brought support for prefix-based matching of memory blocks, which supports prefix insertion and longest match which are still compatible with the rest of the common features (walk, duplicates, delete, ...). This is typically used for network address matching or route selection. This update is not (yet) covered by this document.
Seventh version of the author's implementation brings support for a remappable variant of the ebtree called rebtree (rremappable elastic binary tree). This version is slightly slower but works on relative pointers, making it suitable for use in mapped files.
The current implementation distinguishes between eb_node which is type-agnostic and ebXX_nodewhich is a node of type XX. In practise, an eb_node has everything except the key itself. Only insertion and lookup need the key, all other operations only operate on the structure of the tree. This allows most primitives to be shared between type-specific implementations :

/* generic types */
typedef void eb_troot_t;

struct eb_root {
 eb_troot_t*b[EB_NODE_BRANCHES]; /* left and right branches */
};

struct eb_node {
 struct eb_root branches; /* branches, must be at the beginning */
 eb_troot_t*node_p; /* link node's parent */
 eb_troot_t*leaf_p; /* leaf node's parent */
 int bit; /* link's bit position. */
};

/* 32-bit specific types */
typedef unsigned int u32;

struct eb32_node {
 struct eb_node node; /* the tree node, must be at the beginning */
 u32 key;
};

The structures were designed to be located at the very beginning of any structure making use of EB trees. It is important for cache efficiency and memory access time that the EB tree data are located at the beginning of a CPU cache line, especially the 4 first pointers.

Performance considerations

The author uses EB trees in areas where red-black trees were burning too many CPU cycles. The haproxy load balancer has switched to EB trees and shows minor CPU usage at speeds up to 500000 insertions/removals per second, and a firewall log analyser relies on it to sort logs at speeds above one million lines per second. An insertion typically takes less than 100 nanoseconds on modern machines.
EB trees are much faster than conventional balanced trees when looking up strings or memory blocks because during the tree descent, only the bits specific to the new node are compared, while in a balanced tree, the string compare has to be performed on the whole string for each step. This advantage is emphasized in the halog log analyser which is able to process more than 4 million lines per second while indexing URLs.
The ability to lookup prefixes has made EB trees able to perform an operation which previously was possible only in radix trees but not in binary trees. Having the ability to use the same tree for various purposes can sensibly reduce complexity in some implementations, and reduce the number of operations (eg: network routing and filtering). Tests have shown that network lookups within a BGP table containing 450000 routes could be performed several million times a second on a modern machine.
However, a tree node is typically 20-bytes long on a 32-bit system. On systems with very small caches or low memory bandwidth, EB trees can be slower than red-black trees if insertions/removals are not much frequent and do not compensate for the increased bandwidth usage. Also, red-black trees rely on the ability to compare values, which make them suitable for many data types. Conversely, EB trees heavily rely on bit-addressable data (typically integers), which confer them a very high performance due to the cheap operations involved, but limit their application field. Trying to use EB trees to store non-bit-adressable data would completely defeat their purpose and will likely provide poor performance.

Practical uses

The HAProxy load balancer uses EB trees for the following features :
  • timers: timers are stored ordered, which ensures that checks for timeouts are always performed in O(1).
  • scheduler: the scheduler maintains the list of runnable tasks by weighting their wake-up order with their nice value. EB trees make this operation very cheap, typically O(Log(N)), allowing priorities to be used with huge amounts of tasks without any noticeable cost.
  • ACL: string and IP address matches rely on EB trees so that even very large pattern file do not cause any noticeable slowdown (eg: full internet BGP routing table).
  • stick-tables: data learned from traffic (cookies, SSL ID, headers, URLs, ...) are stored and retrieved using EB trees
  • The multi-peer sync protocol uses the ability to lookup ranges in EB trees.

The halog log analyser provided with HAProxy uses EB trees to sort counters, stats, URLs and servers.
The Scalable TLS Unwrapping Daemon (stud) uses EB trees for its shared session cache.
At least one proprietary log analysis solution makes extensive use of the remappable version of the EB trees for fast log indexing and lookup.
At least one thread-safe variant is known to exist, though very little is known about it at this time.

See also

External links

2010-05-30

GuruPlug Server Plus : don't waste your money on it!

Introduction

Guruplug Server and Guruplug Server Plus are small computers that fit into a power plug and which are made by GlobalScale. I was first informed about them from someone who asked on the haproxy mailing list if haproxy had ever been tested on them. These devices run on a Marvell 88F6281 processor, which is a system-on-chip (SoC) powered by an ARM-derived Sheeva processor core at 1.2 GHz, coming with 512 MB of RAM and as much of flash. The Server Plus also has 2 Gigabit Ethernet ports, and the whole is supposed to consume just a few watts, so I found it very appealing for building high performance, low consumption haproxy servers.
I pre-ordered one at NewIT about two months ago and it finally arrived on Friday. The packaging was clean and the device looked compact and solid. A free JTAG+console adapter board was also provided, which is absolutely required to be able to connect to the serial port. The box also contains a european plug, as well as power and Ethernet cables. The devices come pre-installed with a debian system so that it's not needed to install anything to test it.

First connection

I plugged my Guruplug on my network and observed my DHCP server's logs in order to find the address I should connect to (no doc is shipped with the device, everything has to be downloaded from the net or guessed). Well, the dhcp client looked a bit buggy, it was sending DHCP_DISCOVER packets forever, ignoring the DHCP_OFFER it got in response to all of its packets. So I couldn't connect to the device over the network. I had to use the serial console.
Trying to connect using the serial console was another adventure. The JTAG adapter was detected by the usbserial driver, but all I got were endless "1". I finally found that I had to build and load the ftdi_siodriver too to use the JTAG adapter. Nice, I can now connect via minicom, using login root and password nosoup4u. I manually killed dhclient3 and restarted it by hand without all the strange arguments on its command line and this time it worked. I did not spend more time investigating this, probably that some of these args passed via the init script are not very functional.

First network test

Being interested in running haproxy on it, I had to run a benchmark. I installed the haproxy-1.4.4 package which was already packaged for this distro, because it was faster than building a cross-compiler on my machine. I started it with a simple test configuration which only forwards traffic to a single server without logging nor playing with headers or cookies. This is the configuration with which I push my ALIX to 2700 hits/s, and a Celeron 1.3 GHz to about 11000 hits/s. Being running at 1.2 GHz with some DDR2 memory, I expected to see something between 8000 and 10000 hits/s. I carefully unloaded iptables and fired the test. What a disappointment : 2200 hits/s at 100% CPU ! Even 20% less than my ALIX running at 500 MHz on DDR-400 and 100 Mbps ports ! I tried changing some TCP optimisations in the configuration but I could only hardly reach 2400 hits/s with a config in which my ALIX reaches 3000. Still 20% less.
Then I decided to run a bit rate test which should favor this CPU which integrates the network controllers. Instead of fetching empty objects, I fetch large ones (10 MB). Result : only 310 Mbps at 100% CPU where the ALIX gives 200 Mbps with some CPU margin (two 100 Mbps ports), and the Celeron slightly more than 800. Quite obviously this CPU is a donkey !

Test on other kernels

I finally decided to give this box a second chance by booting another distro's kernel on it, just in case we'd have a problem with debian's 2.6.32. Fedora and Slackware have pre-built kernels and images that should be able to run there. Following the Slackware's howto hangs at boot. The howto also says that some plugs were shipped with a buggy bootloader which hangs during USB detection. Mine does and is 6 months old. A new bootloader is available on plugcomputer.org so I carefully updated it, hoping not to brick the box. That was successful, I could boot the Slackware and run my test with the same haproxy binary.It was slightly faster, 2250 hits/s and 320 Mbps, but still far from what I'd expect.
After a reboot, I discovered that the debian kernel shipped from factory does not boot anymore after the bootloader has been upgraded. There is a platform identifier which is different between what is documented as being a Guruplug (0x0A63) and what was installed in this beast (0x0A29). I was starting to get a bit fed up with the low level of platform preparation at GlobalScale, so I unplugged the thing.

Massive heat dissipation

While unplugging wires, I burnt my fingers on the RJ45 plug! The metal was at something like 70 or 80 degrees celsius, I don't know. It was simply untouchable, even after a long idle period. The Geode in my ALIX runs under less than 1W at full load and around 0.1W when in idle. It does not even have a heatsink. Many people have reported major heating issues on these Guruplugs in the past, with power supplies dying after less than 3 months, but GlobalScale said they had replaced the PSUs in recent units. Check here for photos taken with a thermal camera and reporting more than 72 degrees C on external parts.

So I decided to open the unit to see how it was inside. There were two signs that the power supply got replaced. The first one is that the "Warranty void" seal was a new one stuck over the previous one, whose remains were pinched in the plastic. So this box was reopened, closed and had a new stick on it.






The second sign is that two plastic legs in it were cut to accomodate the new, larger power supply. Click on the images on the right to get a full sized view.






I ran the tests again with the box open. This power supply does not heat at all. It's only the CPU which heats like mad. The small aluminum plate it completely untouchable, it really hurts to touch it even half a second. And the power supply is located just on top of that plate. So now I realized that it was not the power supply that was killing the Guruplugs, but the CPU which is cooking the power supplies. I can't leave that running when I'm not at home, I would fear that it puts my flat into fire !

Poor design

Considering the amount of heat emitted by this Marvell CPU, the box would have needed a large heatsink. I don't think anybody would have complained if the box had been 1.5cm thicker to accept a normal heatsink, with aeration holes on the sides so that the heat dissipates. Instead we have dangerous plugs which are too hot to touch, and which die in a few months of idling because capacitors can't stand sustained heat.
Also, the internal wiring is of poor quality. Two of the power output wires broke while I was taking photos of them, and a third one is about to break too. The mains input wires are similarly bad and risk to break too. And it looks like the PSU was modded by adding resistors to it, maybe to slightly increase output power. This PSU features a 105 degree capacitor though, so it may last slightly longer than previous ones.



Close-up of the 5V connector

Inscriptions at the bottom

The PSU in the bottom block

Bottom with battery, SD connector,
Blutooth controller and WiFi

Connectors side

Left side with UART and JTAG
connectors

Right side with reset button and
Micro-SD

Motherboard at the top


PSU cover removed

PSU removed

The 105°C capacitor

The PSU was partched

The PSU

First 5V wire broke instantly

Second 5V wire followed

Next one will be the ground wire

Mains wire about to break

Conclusion

The Guruplug was announced by some sites as the NSLU2 killer, but for me it's just right now a Slow Heater, and by no way will it replace even one NSLU2 as long as it heats like that. It's slower than my old ALIX, and heats a lot more. I think that one of the reasons for it to be so slow despite the high frequency is the RAM bus width : at first I thought it would be a fast system because it runs on DDR2-800, but when you see that it's only a 16-bit bus, it is equivalent to only 200 MHz for a 64-bit bus, but with higher latencies due to DDR2. My ALIX's Geode LX processor runs on a 64-bit, 400 MHz bus, so it has twice the memory bandwidth the Guruplug has. Marvell also has another CPU in the family (MV78200) with a 64-bit RAM bus, which should be better, but I have not seen it anywhere yet. The fact that the Marvell CPU consumes 5 times more power than the Geode LX and is slower is also an indication of something wrong with the design. Apparently this CPU makes it way into entry-level NAS appliances. It's probably one of the applications where the heat issues and limited memory bandwidth will not be too much of a concern. I'm really disappointed, I would have expected it either to replace my old NSLU2 (but it heats too much) or to become sort of an ALIX upgrade with Gigabit ports, but now I think that if I need Gigabit, I'll turn to these dual-gigabit mini-PCI cards. They will give me about the same level of performance without having an overheating CPU.

I don't even think I will waste my time trying to repair and adapt that Guruplug, it will probably spend the next 10 years lying in a box with other dead boards, waiting for my solering iron to pick some components out of it. However, I think that if/when the manufacturer finally decides to do a complete redesign of the product (ie include a BIG heatsink and leave some holes for airflow), then it may be an acceptable replacement for the NSLU2.

External links

2007-12-30

How to add a capacitor to keep RTC running on PC Engines ALIX

Introduction

ALIX is the name of a nice motherboard family designed by PC Engines to build small fanless, quiet, cheap and still powerful servers. Those motherboards are sold without battery for the RTC clock, so as soon as the power is lost, the clock is lost and at next boot, the clock will be set to Jan 1st 2000.
When working on packaging a Linux distribution for these motherboards, it is normal to encounter various boot errors and to have to power cycle the board tens of times a day. After a while, having to set the clock manually becomes irritating, and at least being able to keep it running for a few minutes would be very useful.

Solution

Since the RTC clock is known to run from very low power, I decided to experiment with a few capacitors with pretty interesting results. A 1000µF charged at 3V3 and connected to the 3V battery input is able to keep the clock running for about 30 minutes. This is more than enough for most of the power cycles I have to run through.
This capacitor will have to be charged during operation. Since the power supply does not charge the battery, a diode will have to be connected between the 3V3 power line and the capacitor. Unfortunately, a silicon diode shows an important voltage drop (0V6) which considerably reduces the efficiency of the solution.
Experiments with a germanium diode with a lower drop (0V3) connected to the 3V3 power line charges the capacitor to 3V. Another test with a silicon diode connected to the 5V power line charges the capacitor to 4V4, providing slightly increase longevity.
I finally decided that both of my 2C3 motherboards used as servers will have a 1000µF capacitor charged at 3V through a Ge diode, to power the clock for about 30 minutes. The 3C2 board which I carry everywhere with me for development and experimentation purposes will have a bigger 0.1F capacitor charged at 4V4, managing to keep the clock running for two days.

Installation on the 2C3 motherboard

Locate the power lines

 On the 2C3 motherboard, it is very easy to locate the power lines. The 3V3 is available on C172, very close to the battery connector, and a small capacitor already brings the ground even closer, basically at the distance of a capacitor, so this will be quite helpful. Click on the image on the right to get a full size view.

Identify the connection points

Here is a close-up of the work area enlightened above. The view has been rotated for easier identification. The components will be connected as indicated on the image. Click on the image to get a full size view. First, prepare the terminals with some solder. Put a drop of solder on the pin of C172 facing U12, as the diode will be connected there. Do the same on the "-" pin of C26, which is also facing U12 ; the capacitor will be connected there.

Prepare the components

Prepare a Germanium (or Schottky) diode as indicated on the image (click for a closer view). The pins must be bent several times for the diode to pass over U12 and provide a connection for the capacitor.

Then, get a 1000µF 6V3 radial capacitor, cut its pins to 8mm, and bend the positive terminal perpendicularly for horizontal soldering (not shown on the image).

Install the diode


Solder the diode first. Its anode connects to C172, on the pin facing U12, and the cathode connects to the battery's connector, in the hole designated as BT1+ (in fact it's the + terminal of battery BT1). It is important to leave the cathode exposed (not isolated), because the positive pin of the capacitor will be soldered directly on it.

Install the capacitor

Solder the negative pin of the capacitor to the negative pin of C26. Orientate it so that the positive pin joins the diode's cathode on BT1+.



Installation on the 3C2 motherboard


Install the diode


Here on this photo, the +5V is taken from the left pin of C23. The diode is not easily identifiable, it is the small black plastic thing located just above the R20 printing. Its anode is connected to the +5V via the yellow wire, and its cathode is connected to the BT1+ pin on the board.

Install the capacitor

The capacitor is what is sometimes called a "SuperCap". Its capacity is 0.1F, or 100000µF, which is 100 times the capacity of the ones used on the 2C3 motherboards. It is slightly bigger and there was no easy place to install it. I finally decided to stick it on top of the CS5536 (U11) which is approximately the same size and is perfectly flat.


Another advantage of this location is that the power line from the BT1+pin enters the CS5536 via the C38 capacitor, which provides an easy access to both the positive and negative terminals. Also, it is important to note that BT1+ is not directly connected to C38, but passes through R20 first, a 47 ohm resistor.

In the end, it is a good thing that there is a small resistor between the power supply and this capacitor. It has a very big capacity and a small internal resistance, and having R20in the path ensures that it does not pull too much power through the diode at start-up.

Connect the capacitor to C38 as on the image on the right, with the negative pin to the left ot C38 and the positive pin on the right. That's all, you're done.

Various links


2007-12-29

How to build a cheap UPS for PC Engines ALIX

Introduction

ALIX is the name of a nice motherboard family designed by PC Engines to build small fanless, quiet, cheap and still powerful servers. Those motherboards drain so little power that they can be powered by a simple 9V battery ! Typical power usage sits around 3W for a 500 MHz processor.

I recently experienced a power outage which made me realize that not having a battery backup for such servers is a shame, given how easy it is to power them. I experienced a bit with complex circuits involving MOSFETs and transistors in order to achieve the lowest power drop but finally went back to a very simple 3-component design.

Schematics


Circuit diagram

The principle is very simple. The ALIX power supply delivers 18V to the motherboard. In parallel, a 9V rechargeable battery (B1) is installed in series with a current-limiting resistor (R1). The charging current is defined as the the difference between the power supply's voltage and the battery's voltage, divided by the resistor. I chose 1k5 for R1, which sets the charging current to (18-9)/1500 = 6mA.

A Zener diode installed in parallel with the battery prevents it from over-charging, by draining all the charging current when the battery is fully charged. The diode's voltage must equal the battery's full charge voltage. 9V NiCd battery packs are not 9V in reality, but 8V4 assembled from 7 1V2 batteries. Since they reach 9V6 when fully charged, I used the same value for the diode. The diode may heat a little bit depending on the charging current and the voltage. The power dissipated by the diode equals the difference between the power supply's voltage and the diode's multiplied by the current, which is (18-9.6)²/1500 = 47mW. A small 100mW diode is enough.
A low voltage drop Schottky diode is installed in the discharge path between the battery and the power line so that nearly all of the battery power goes to the motherboard during power outages. The diode has to support at least 1A of continuous current depending on how the motherboard will operate. Also, since the motherboard is equipped with a switching voltage regulator, it will drain a higher current when the battery is low. The diode I used only shows a drop of 0V2, which is quite acceptable.

Test results

My first tests show that the cheapest 120mAh battery maintains the motherboard online for about 10 minutes until the voltage drops to 7V. It requires about 24 hours for a complete charge with the 1k5 resistor. The charging rate may be increased to 10mA with a 820 ohms resistor, but above, a bigger Zener diode will be required.
I also noticed that my SpeedTouch ADSL modem uses a 9V input... I tried the same module on it. Bingo! it worked too. This means that with a few of those modules, I can maintain my network access up during short power outages.

Possible improvements


The 8V4 nominal voltage of this battery pack is very close to the low voltage limit of the motherboard (7V). This leaves a very small work margin. Other battery packs from 9V6 to 12V may be more interesting to experiment with. However, they will require a bigger Zener diode and will probably cost much more.
An interesting enhancement consists in daisy-chaining as many of such modules as there are motherboards to power. This will ensure that all power supplies are able to backup any one which would fail, and it will also optimize the offline duration of the batteries in the even that some batteries are less sollicited. In this case, it would also make sense to use a bigger battery such as a 12V/7Ah as commonly found in medium-sized UPS. Such a battery could power an ALIX motherboard for a full day!

Approximative cost

I got this 9V battery for 1 Euro, but they are generally sold around 5 Euros. There are between 1 and 2 Euros of components, including the male and female jacks. A small plastic box and a PCB to hold the battery and the connectors would be a nice improvement. What would be very nice would be if PC Engines could sell such a module as an option, just as they do with the PoE injector.

Various links

2006-11-19

Making applications scalable with Load Balancing

Why this article

As the author of the HAProxy Load Balancer, I'm often questionned about Load Balancing architectures or choices between load balancers. Since there is no obvious response, it's worth enumerating the pros and cons of several solutions, as well as a few traps to avoid. I hope to complete this article soon with a deeper HTTP analysis and with architecture examples. You can contact me for further information. This document is now available as a PDF file :

Introduction

Originally, the Web was mostly static contents, quickly delivered to a few users which spent most of their time reading between rare clicks. Now, we see real applications which hold users for tens of minutes or hours, with little content to read between clicks and a lot of work performed on the servers. The users often visit the same sites, which they know perfectly and don't spend much time reading. They expect immediate delivery while they inconsciously inflict huge loads on the servers at every single click. This new dynamics has developped a new need for high performance and permanent availability.

What is load balancing ?

Since the power of any server is finite, a web application must be able to run on multiple servers to accept an ever increasing number of users. This is called scaling. Scalability is not really a problem for intranet applications since the number of users has little chances to increase. However, on internet portals, the load continuously increases with the availability of broadband Internet accesses. The site's maintainer has to find ways to spread the load on several servers, either via internal mechanisms included in the application server, via external components, or via architectural redesign. Load balancing is the ability to make several servers participate in the same service and do the same work. As the number of servers grows, the risk of a failure anywhere increases and must be addressed. The ability to maintain unaffected service during any predefined number of simultaneous failures is called high availability. It is often mandatory with load balancing, which is the reason why people often confuse those two concepts. However, certain load balancing techniques do not provide high availability and are dangerous.

Load balancing techniques

1. DNS

The easiest way to perform load balancing is to dedicate servers to predefined groups of users. This is easy on Intranet servers, but not really on internet servers. On common approach relies on DNS roundrobin. If a DNS server has several entries for a given hostname, it will return all of them in a rotating order. This way, various users will see different addresses for the same name and will be able to reach different servers. This is very commonly used for multi-site load balancing, but it requires that the application is not impacted to the lack of server context. For this reason, this is generally used to by search engines, POP servers, or to deliver static contents. This method does not provide any means of availability. It requires additional measures to permanently check the servers status and switch a failed server's IP to another server. For this reason, it is generally used as a complementary solution, not as a primary one.

$ host -t a google.com
google.com. has address 64.233.167.99
google.com. has address 64.233.187.99
google.com. has address 72.14.207.99

$ host -t a google.com
google.com. has address 72.14.207.99
google.com. has address 64.233.167.99
google.com. has address 64.233.187.99

$ host -t a mail.aol.com
mail.aol.com. has address 205.188.162.57
mail.aol.com. has address 205.188.212.249
mail.aol.com. has address 64.12.136.57
mail.aol.com. has address 64.12.136.89
mail.aol.com. has address 64.12.168.119
mail.aol.com. has address 64.12.193.249
mail.aol.com. has address 152.163.179.57
mail.aol.com. has address 152.163.211.185
mail.aol.com. has address 152.163.211.250
mail.aol.com. has address 205.188.149.57
mail.aol.com. has address 205.188.160.249
Fig.1 : DNS roundrobin : multiple addresses assigned to a same name, returned in a rotating order

2. Reducing the number of users per server

A more common approach is to split the population of users across multiple servers. This involves a load balancer between the users and the servers. It can consist in a hardware equipment, or in a software installed either on a dedicated front server, or on the application servers themselves. Note that by deploying new components, the risk of failure increases, so a common practise is to have a second load balancer acting as a backup for the primary one.
Generally, a hardware load balancer will work at the network packets level and will act on routing, using one of the following methods :
  • Direct routing : the load balancer routes the same service address through different local, physical servers, which must be on the same network segment, and must all share the same service address. It has the huge advantage of not modifying anything at the IP level, so that the servers can reply directly to the user without passing again through the load balancer. This is called "direct server return". As the processing power needed for this method is minimal, this is often the one used on front servers on very high traffic sites. On the other hand, it requires some solid knowledge of the TCP/IP model to correctly configure all the equipments, including the servers.
  • Tunnelling : it works exactly like direct routing, except that by establishing tunnels between the load balancer and the servers, theses ones can be located on remote networks. Direct server return is still possible.
  • IP address translation (NAT) : the user connects to a virtual destination address, which the load balancer translates to one of the servers's addresses. This is easier to deploy at first glance, because there is less trouble in the server configuration. But this requires stricter programming rules. One common error is application servers indicating their internal addresses in some responses. Also, this requires more work on the load balancer, which has to translate addresses back and forth, to maintain a session table, and it requires that the return traffic goes through the load balancer as well. Sometimes, too short session timeouts on the load balancer induce side effects known as ACK storms. In this case, the only solution is to increase the timeouts, at the risk of saturating the load balancer's session table.
On the opposite side, we find software load balancers. They most often act like reverse proxies, pretending to be the server and forwarding them the traffic. This implies that the servers themselves cannot be reached directly from the users, and that some protocols might never get load-balanced. They need more processing power than those acting at the network level, but since they splice the communication between the users and the servers, they provide a first level of security by only forwarding what they understand. This is the reason why we often find URL filtering capabilities on those products.

2.1. Testing the servers

To select a server, a load balancer must know which ones are available. For this, it will periodically send them pings, connection attempts, requests, or anything the administrator considers a valid measure to qualify their state. These tests are called "health checks". A crashed server might respond to ping but not to TCP connections, and a hung server might respond to TCP connections but not to HTTP requests. When a multi-layer Web application server is involved, some HTTP requests will provide instant responses while others will fail. So there is a real interest in choosing the most representative tests permitted by the application and the load balancer. Some tests might retrieve data from a database to ensure the whole chain is valid. The drawback is that those tests will consume a certain amount of server ressources (CPU, threads, etc...). They will have to be spaced enough in time to avoid loading the servers too much, but still be close enough to quickly detect a dead server. Health checking is one of the most complicated aspect of load balancing, and it's very common that after a few tests, the application developpers finally implement a special request dedicated to the load balancer, which performs a number of internal representative tests. For this matter, software load balancers are by far the most flexible because they often provide scripting capabilities, and if one check requires code modifications, the software's editor can generally provide them within a short timeframe.

Load balancing techniques (cont'd)

2.2. Selecting the best server

There are many ways for the load balancer to distribute the load. A common misconception is to expect it to send a request to "the first server to respond". This practise is wrong because if a server has any reason to reply even slightly faster, it will unbalance the farm by getting most of the requests. A second idea is often to send a request to "the least loaded server". Although this is useful in environments involving very long sessions, it is not well suited for web servers where the load can vary by two digits factors within seconds.
For homogeneous server farms, the "roundrobin" method is most often the best one. It uses every server in turn. If servers are of unequal capacity, then a "weighted roundrobin" algorithm will assign the traffic to the servers according to their configured relative capacities.
These algorithms present a drawback : they are non-deterministic. This means that two consecutive requests from the same user will have a high chance of reaching two different servers. If user contexts are stored on the application servers, they will be lost between two consecutive requests. And when complex session setup happens (eg: SSL key negociation), it will have to be performed again and again at each connection. To work around this problem, a very simple algorithm is often involved : address hashing. To summarize it, the user's IP address is divided by the number of servers and the result determines the right server for this particular user. This works well as long as the number of servers does not change and as the user's IP address is stable, which is not always the case. In the event of any server failure, all users are rebalanced and lose their sessions. And the few percent of the users browsing through proxy farms will not be able to use the application either (usually 5-10%). This method is not always applicable though, because for a good distribution, a high number of source IP addresses is required. This is true on the Internet, but not always within small companies or even ISP's infrastructures. However, this is very efficient to avoid recomputing SSL session keys too often.

Persistence

The solution to the limits above is then to use persistence. Persistence is a way to ensure that a given user will keep going to the same server for all his requests, where its context is known. A common cheap solution is for the application to send back a redirection to the local server address using an HTTP 302 response. A major drawback is that when the server fails, the user has no easy way to escape, and keeps trying to reach the dead server. Just like with DNS, this method is useful only on big sites when the address passed to the user is guaranteed to be reachable.
A second solution is for the load balancer to learn user-server associations, the cheapest way being to learn which user's IP went to which server last time. This generally solves the problem of the server farm size varying due to failures, but does not solve the problem of users with a variable IP address.
So what's left ? From the start, we're stating that we're trying to guarantee that a user will find his context on subsequent requests to the same server. How is the context identified by the server ? By a cookie. Cookies were invented exactly for that purpose : send an information to the user which he will pass back on future visits so that we know what to do with him. Of course, this requires that the user supports cookies, but this is generally the case on applications which need persistence.
If the load balancer could identify the server based on the cookies presented by the user, it would solve mosts of the problems. There are mainly two approaches regarding the cookies : the load balancer can learn the session cookies assigned by the servers, or they can insert cookie for server identification.

1. Cookie learning

Cookie learning is the least intrusive solution. The load balancer is configured to learn the application cookie (eg. "JSESSIONID"). When it receives the user's request, it checks if it contains this cookie and a known value. If this is not the case, it will direct the request to any server, according to the load balancing algorithm. It will then extract the cookie value from the response and add it along with the server's identifier to a local table. When the user comes back again, the load balancer sees the cookie, lookups the table and finds the server to which it forwards the request. This method, although easily deployed, has two minor drawbacks, implied by the learning aspect of this method :
  • The load balancer has finite memory, so it might saturate, and the only solution to avoid this is to limit the cookie lifetime in the table. This implies that if a user comes back after the cookie expiration, he will be directed to a wrong server.
  • If the load balancer dies and its backup takes over, it will not know any association and will again direct users to wrong servers. Of course, it will not be possible to use the load balancers in active-active combinations either. A solution to this is real-time session synchronization, which is difficult to guarantee.
  • A workaround for these drawbacks is often to choose a deterministic load balancing algorithm such as the user's address hashing when possible. This way, should the cookie be lost on the load balancer, at least the users who have a fixed IP address will be kept on their server.

2. Cookie insertion

If the users support cookies, why not add another one with fixed text ? This method, called "cookie insertion", consists in inserting the server's identifier in a cookie added to the response. This way, the load balancer has nothing to learn, it will simply reuse the value presented by the user to select the right server. This solves both problems of cookie learning : memory limitations and load balancer takeover. However, it requires more efforts from the load balancer, which must open the stream an insert data. This is not easily done, especially on ASIC-based load balancers which have very limited knowledge of TCP and HTTP. It also requires that the user agent accepts multiple cookies. This is not always the case on small mobile terminals for instance, which are sometimes limited to only one cookie. Then, several variations on this concept may be applied, such as cookie modification, consisting in prefixing an already existing cookie with the server's identifier. Special care should also be taken on the load balancer regarding the response's cacheability : front caches might store the load balancer cookie and pass it to all users requesting the index page for instance, which would lead to all users being directed to the same server.

3. Persistence limitations - SSL

We've reviewed methods involving snooping on the user-server exchanges, and sometimes even modifications. But with the ever growing number of applications relying on SSL for their security, the load balancer will not always be able to access HTTP contents. For this reason, we see more and more software load balancers providing support for SSL. The principle is rather simple : the load balancer acts as a reverse proxy and serves as the SSL server end. It holds the servers's certificates, deciphers the request, accesses the contents and directs the request to the servers (either clear text or slightly re-ciphered). This gives a new performance boost to the application servers which do not have to manage SSL anymore. However, the load balancer becomes the bottleneck : a single software-based load balancer will not be faster to process SSL than an eight-servers farm. Because of this architectural error, the load balancer will saturate before the application servers, and the only remedy will be to put another level of load balancers in front of it, and adding more load balancers to handle the SSL load. Obviously, this is wrong. Maintaining a farm of load balancers is very difficult, and the health-checks sent by all those load balancers will induce a noticeable load to the servers. The right solution is to have a dedicated SSL farm.

4. Dedicated SSL-Cache Farm

The most scalable solution when applications require SSL is to dedicate a farm of reverse-proxies only for this matter. There are only advantages to this solution :
  1. SSL reverse proxies are very cheap yet powerful. Anyone can build powerful Apache-based SSL reverse proxies for less than $1000. This has to be compared to the cost of a same performance SSL-enabled load balancer (more than $15000), and to the cost of multi-processor servers used for the application, which will not have to be wasted to do SSL anymore.
  2. SSL reverse proxies almost always provide caching, and sometimes even compression. Most e-business applications present a cacheability rate between 80 and 90%, which means that these reverse proxy caches will offload the servers by at least 80% of the requests.
  3. adding this to the fact that SSL proxies can be shared between multiple applications, the overall gain reduces the need for big and numerous application servers, which impacts maintenance and licencing costs too.
  4. the SSL farm can grow as the traffic increases, without requiring load balancer upgrades nor replacements. Load balancers' capacities are often well over most site's requirements, provided that the architecture is properly designed.
  5. the first level of proxy provides a very good level of security by filtering invalid requests and often providing the ability to add URL filters for known attacks.
  6. it is very easy to replace the product involved in the SSL farm when very specific needs are identified (eg: strong authentication). On the opposite, replacing the SSL-enabled load balancer for this might have terrible impacts on the application's behaviour because of different health-checks methods, load balancing algorithms and means of persistence.
  7. the load balancer choice will be based only on load balancing capabilities and not on its SSL performance. It will always be more adapted and far cheaper than an all-in-one solution.
An SSL-proxy-cache farm consists in a set of identical servers, almost always running any flavour of Apache + ModSSL (even in disguised commercial solutions), which are load-balanced either by an external network load balancer, or by an internal software load balancer such as LVS under Linux. These servers require nearly no maintenance, as their only tasks is to turn the HTTPS traffic into HTTP, check the cache, and forward the non-cacheable requests to the application server farm, represented by the load balancer and the application servers. Interesting to note, the same load balancer may be shared between the reverse proxies and the application servers.

Fig.2 : SSL farm before the layer 7 LB

How to choose between hardware and software load balancer

Although all solutions tend to look like they can do everything, this is not the case. Basically, there are three aspects to consider to choose a load balancer :
  • performance
  • reliability
  • features
Performance is critical, because when the load balancer becomes the bottleneck, nothing can be done. Reliability is very important because the load balancer takes all the traffic, so its reliability must be orders of magnitude above the servers's, in terms of availability as well as in terms of processing quality. Features are determinant to choose a solution, but not critical. It depends on what tradeoffs are acceptable regarding changes to the application server.

1. Session count

One of the most common questions when comparing hardware-based to proxy-based load balancers is why is there such a gap between their session count. While proxies generally announce thousands of concurrent sessions, hardware speaks in millions. Obviously, few people have 20000 Apache servers to drain 4 millions of concurrent sessions ! In fact, it depends whether the load balancer has to manage TCP or not. TCP requires that once a session terminates, it stays in the table in TIME_WAIT state long enough to catch late retransmits, which can be seen several minutes after the session has been closed. After that delay, the session is automatically removed. In practise, delays between 15 and 60 seconds are encountered depending on implementations. This causes a real problem on systems supporting high session rates, because all terminated sessions have to be kept in the table. A 60 seconds delay would consume 1.5 million entries at 25000 sessions per second. Fortunately, sessions in this state do not carry any data and are very cheap. Since they are transparently handled by the OS, a proxy never sees them and only announces how many active sessions it supports. But when the load balancer has to manage TCP, it must support very large session tables to store those sessions, and announces the global limit.

Fig.3 : How session tables fill up with traffic

2. Layer 3/4 load balancing

Obviously, the best performance can be reached by adding the lowest overhead, which means processing the packets at the network level. For this reason, network-based load balancers can reach very high bandwidths when it comes to layer-3 load balancing (eg: direct routing with a hashing algorithm). As layer-3 is almost always not enough, a network load balancer has consider layer-4. Most of the cost lies in the session setup and tear down, which can still be performed at very high rates because they are a cheap operations. More complex operations such as address translation require more processing power to recompute checksums and look up tables. Everywhere we see hardware acceleration providing performance boosts, and load balancing is no exception. A dedicated ASIC can route network packets at wire speed, set up and tear down sessions at very high rates without the overhead imposed by a general purpose operating system.
Conversely, using a software proxy for layer-4 processing leads to very limited performance, because for simple tasks that could be performed at the packet level, the system has to decode packets, decapsulate layers to find the data, allocate buffer memory to queue them, pass them to the process, establish connections to remote servers, manage system ressources, etc... For this reason, a proxy-based layer-4 load balancer will be 5 to 10 times slower than its network-based counterpart on the same hardware.
Memory is also a limited resource : the network-based load balancer needs memory to store in-flight packets, and sessions (which basically are addresses, protocols, ports and a few parameters, which globally require a few hundreds of bytes per session). The proxy-based load balancer needs buffers to communicate with the system. Per-session system buffers are measured in kilobytes, which implies practical limits around a few tens of thousands of active sessions.
Network-based load balancers also permit a lot of useful fantasies such as virtual MAC addresses, mapping to entire hosts or networks, etc... which are not possible in standard proxies relying on general purpose operating systems.
One note on reliability though. Network-based load balancers can do nasty things such as forwarding invalid packets just as they received them, or confuse sessions (especially when doing NAT). We talked earlier about the ACK storms which sometimes happen when doing NAT with too low session timeouts. They are caused by too early re-use of recently terminated sessions, and cause network saturation between one given user and a server. This cannot happen on proxies because they rely on standards-compliant TCP stacks offered by the OS, and they completely cut the session between the client and the server.
Overall, a well tuned network-based load balancer generally provides the best layer-3/4 solution in terms of performance and features.

3. Layer 7 load balancing

Layer-7 load balancing involves cookie-based persistence, URL switching and such useful features. While a proxy relying on a general purpose TCP/IP stack obviously has no problem doing the job, the network-based load balancers have to resort to a lot of tricks which don't always play well with standards and often cause various trouble. The most difficult problems to solve are :
  1. multi-packet headers : the load balancer looks for strings within packets. When the awaited data is not on the first packet, the first ones have to be memorized and consume memory. When the string starts at the end of one packet and continues on the next one, it is very difficult to extract. This is the standard mode of operation for TCP streams though. For reference, a 8 kB request as supported by Apache will commonly span over 6 packets.
    Fig.4 : An HTTP request can span over several packets
  2. reverse-ordered packets : when a big and a small packets are sent over the Internet, it's very common for the small one to reach the target before the large one. The load balancer must handle this gracefully.
    Fig.5 : How HTTP can be difficult to handle on some hardware LB
  3. fragments : when a packet is too big to be routed on a medium, an intermediate router is allowed to cut it in small fragments. This is rare today on the Internet, but it is more and more common on internal networks because of VPNs which limit the payload size. Fragments are very hard to process because they arrive in varying order, need to be buffered to reconstruct the packet, and only the first one contains the session information. Network-based load balancers generally don't cope well with fragments, which they sometimes qualify as "attacks" as an excuse for not processing them.
  4. packets losses and retransmits : over the Internet, there are a huge number of packets lost between a user and a server. They are transparently retransmitted after a short delay, but the processing performed on one packet has to be performed again, so the load balancer must not consider that what has been done will not have to be done again (typically when end of headers is reached).
  5. differenciate headers and data : in HTTP, protocol information, known as headers, lies at the beginning of the exchanges, and data follows the first empty line. Loose packet matching on network-based load balancers sometimes leads to cookies being matched or modified in the data section, which corrupts information. A related problem often happens after a session ends : if the user reuses the same source port too early, often the load balancer confuses it with data and will not perform the L7 processing.
  6. data insertion : when data (eg: a cookie) is inserted by the load balancer, TCP sequence numbers as well as checksums have to be recomputed for all packets which pass through, causing a noticeable performance impact and sometimes erroneous behaviours.
    Fig.6 : Common problems in simplified TCP stacks
That said, it is very hard to process layer-7 at the packet level. Sometimes, contents will be mis-identified, and requests will be sent to a wrong server, and sometimes a response will not be processed before being returned to the user. In fact, the full-featured load balancers often involve local proxies to perform the most complicated actions. Some switch-based load balancers transfer part the layer-7 processing to the management processor which is often very small, and others have dedicated processors for this, but generally they are not as powerful as what can be found in latest server hardware. Also, as a side effect, they are hit by the memory consumption caused by buffering.
Generally, they don't announce how many simultaneous sessions they can process in layer-7, or they tend to claim that this has no impact, which is obviously wrong : for a load balancer to be compatible with the Apache web server, it would have to support 8 kB per request, which means keeping up to 8 kB of data per established session at one moment. Supporting 4 millions of simultaneous sessions at 8 kB each would required 32 GB of RAM which they generally don't have. Tests must then be performed by the customer to detect the load balancer's behaviour in corner cases, because easy denials of services are commonly found.
Conversely, the proxy-based load balancer sees nearly no overhead in layer-7 processing, because all problems listed above are naturally solved by the underlying operating system. The load balancer only has to focus on content and perform the right actions. Most often, they will also provide very detailed logs at no extra cost, which will not only offload the servers, but also provide vital data for capacity planning.
Last, layer-7 is something which evolves very fast, due to new persistence methods or content analysis required by constantly evolving application software. While software updates are very easy to provide for an editor, and to install for the customer, complete image updates for hardware load balancers require a higher level of validation and it is very difficult for their editors to offer the same level of reactivity.

4. The best combination, for those who can afford it

The best combination is to use a first level of network-based load balancers (possibly hardware-accelerated) to perform layer-4 balancing between both SSL reverse proxies and a second level of proxy-based layer-7 load balancers. First, the checks performed by the layer-4 load balancer on the layer-7 load balancers will always be more reliable than the self-analysis performed by proxies. Second, this provides the best overall scalability because when the proxies will saturate, it will always be possible to add more, till the layer-4 load balancer saturates in turn. At this stage, we are talking about filling multi-gigabit pipes. It will then be time to use DNS round robin techniques described above to present multiple layer-4 load balancers, preferably spread over multiple sites.

Fig.7 : Typical example of a scalable architecture
Installing a load balancer is good for scaling, but it does not exempt anyone from tuning the application and the servers.

1. Separate static and dynamic contents

It might sound very common, but it is still rarely done. On many applications, about 25% of the requests are for dynamic objects, the remaining 75% being for static objects. Each Apache+PHP process on the application server will consume between 15 and 50 MB of memory depending on the application. It is absolutely abnormal to sollicitate such a monster to transfer a small icon to a user over the Internet, with all the latencies and lost packets keeping it active longer. And it is even worse when this process is monopolized for several minutes because the user downloads a large file such as a PDF !
The easiest solution consists in relying on a reverse proxy cache in front of the server farm, which will directly return cacheable contents to the users without querying the application servers. The cleanest solution consists in dedicating a lightweight HTTP server to serve static contents. It can be installed on the same servers and run on another port for instance. Preferably, a single-threaded server such as lighttpd or thttpd should be used because their per-session overhead is very low. The application will then only have to classify static contents under an easily parsable directory such as "/static" so that the front load balancer can direct the traffic to the dedicated servers. It is also possible to use a completely different host name for the static servers, which will make it possible to install those servers at different locations, sometimes closer to the users.

2. What can be tuned on the server side - Example with Apache

There are often a few tricks that can be performed on the servers and which will dramatically improve their users capacity. With the tricks explained below, it is fairly common to get a two- to three-fold increase in the number of simultaneous users on an Apache + PHP server without requiring any hardware upgrade.
First, disable keep-alive. This is the nastiest thing against performance. It was designed at a time sites were running NCSA httpd forked off inetd at every request. All those forks were killing the servers, and keep-alive was a neat solution against this. Right now, things have changed. The servers do not fork at each connection and the cost of each new connection is minimal. Application servers run a limited number of threads or processes, often because of either memory constraints, file descriptor limits or locking overhead. Having a user monopolize a thread for seconds or even minutes doing nothing is pure waste. The server will not use all of its CPU power, will consume insane amounts of memory and users will wait for a connection to be free. If the keep-alive time is too short to maintain the session between two clicks, it is useless. If it is long enough, then it means that the servers will need roughly one process per simultaneous user, not counting the fact that most browsers commonly establish 4 simultaneous sessions ! Simply speaking, a site running keep-alive with an Apache-like server has no chance of ever serving more than a few hundreds users at a time.
Second, observe the average per-process memory usage. Play with the MaxClient parameter to adjust the maximum number of simultaneous processes so that the server never swaps. If there are large differences between processes, it means that some requests produce large data sets, which are a real waste when kept in memory. To solve this, you will need to tell Apache to make its processes die sooner, by playing with the MaxRequestsPerChild value. The higher the value, the higher the memory usage. The lower the value, the higher the CPU usage. Generally, values between 30 and 300 provide best results. Then set the MinSpareServers and MaxSpareServers to values close to MaxClient so that the server does not take too much time forking off new processes when the load comes in.
With only those tricks, a recent server with 2 GB of RAM will have no problem serving several hundreds clients. The rest is the load balancer's job.

Miscellaneous links

  • http://haproxy.org/ : HAProxy, the Reliable, High Performance TCP/HTTP Load Balancer
  • ALOHA : a commercial implementation of HAProxy in a fast and robust appliance.