Willy Tarreau's stuff: Line rate HTTP server on the OpenBlocks AX3

This article explains how I'm using the OpenBlocks AX3 as a line-rate HTTP server for testing purposes.

Original idea

In 2003, while doing some performance tests on Netfilter, I realized how frustrating it was to always be limited by the load generators' performance. You generally need at least 4-6 machines to load a firewall, with 2-3 HTTP clients and 2-3 HTTP servers. The second one of each is here to ensure that the bandwidth is never limited by a single machine, and the third one is here to prove that the limit reached with the first two cannot be overcome with more clients. And it's generally hard to find that many similar machines: you generally know that some are faster for sending, others for receiving, or that some are more efficient with large packets and others with small packets. In practice you're never totally confident in your own tests.

Two years later, while running some network benchmarks to compare several firewall products for a customer, I faced the same issue again, especially when trying to stress the firewall with many short requests to maximize the connection rate. Then I got the idea of a dummy HTTP server which would only work in packet mode, without creating real TCP sessions. That would make it lighter and improve its ability to get close to line rate. Unfortunately, working with SOCK_PACKET at the time was not really faster than the local TCP stack, so I temporarily gave up on this idea.

After I recently became the lucky owner of an OpenBlocks AX3/4 microserver, the prospect of pushing its high networking capabilities to their limits immediately revived my old idea of a stateless server. The platform is very recent and I needed to go deep into some kernel drivers, which explains why it took quite some time to reach a point where it's working.

The OpenBlocks AX3/4 microserver

The OpenBlocks AX3/4 microserver is a very neat device built by Japanese company Plat'Home.

The microserver compared to a 3.5" floppy disk for scale

This fanless device runs a dual-core 1.33 GHz Marvell Armada XP CPU (ARMv7), has 3 GB of RAM, 128 MB of NOR flash, 20 GB of SATA SSD, and, best of all, 4 true Gigabit Ethernet ports (I mean not over USB, not behind an internal switch, and not crippled by design like in many Cortex-A9 based CPUs). In terms of average performance, it is comparable to a dual-core Atom running at the same frequency, though it consumes a quarter of the power. And indeed, even at full load, it becomes just warm to the touch. The design is robust and compact, so I now carry it everywhere with me as it's a very convenient device for many usages. The only criticism I could make is that it's a bit expensive: it clearly targets the enterprise market, which will value its benefits for building an ideal router, firewall, web server or monitoring device. But even then, many companies will prefer a cheaper low-end x86 box if they don't value the device's strong differentiators.

Where this device really shines is in the area of network communications. The 4 GigE ports are included in the Armada XP itself, so they're much closer to the CPU caches than usual devices which communicate via a PCIe bus. And this design pays off. After hacking the mvneta driver a little, it becomes obvious that each port is capable of both sending and receiving in parallel at line rate for all packet sizes, which means exactly 1.488 million packets per second (Mpps) in each direction with the smallest frames. This is something rare and very hard to achieve with more conventional hardware, so that made me want to try to port some network stress testing tools to this platform.
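
The 1.488 Mpps figure follows directly from Ethernet framing: a minimum-size frame is 64 bytes, and each frame additionally costs 8 bytes of preamble and 12 bytes of inter-packet gap on the wire. Here is a minimal sketch of that arithmetic; the function and constant names are mine and are not taken from any of the tools discussed here.

```python
# Per-frame overhead on the wire: preamble (8 bytes) + inter-packet gap (12 bytes).
WIRE_OVERHEAD = 8 + 12
MIN_FRAME = 64              # minimum Ethernet frame size, FCS included
LINK_BPS = 1_000_000_000    # Gigabit Ethernet


def wire_bytes(frame_len):
    """Bytes consumed on the wire by one frame, with padding and overhead."""
    return max(frame_len, MIN_FRAME) + WIRE_OVERHEAD


def max_pps(frame_len, link_bps=LINK_BPS):
    """Theoretical maximum number of frames per second for one frame size."""
    return link_bps / (8 * wire_bytes(frame_len))


if __name__ == "__main__":
    for size in (64, 512, 1518):
        print(f"{size:5d}-byte frames: {max_pps(size):10.0f} packets/s")
```

Frames shorter than 64 bytes are padded, so they cost exactly as much wire time as a 64-byte frame.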

Note that there are other devices using the same family of CPUs. I also have a Mirabox running an Armada 370, which is a low-end single-core CPU with a 16-bit memory bus and a smaller cache. It includes two of the same network controllers. What I'm describing here also works with the Mirabox to a certain extent. The limited memory bandwidth and the fact that it's a single core prevent this from scaling to multiple ports. The peak performance is also about 10% lower.

Stateless HTTP server : principle

HTTP is a pretty simple protocol when you only look at the exchanges on the wire. It's what I call a "ping-pong" protocol : each side sends one thing and waits for the other side to respond. This is only true for small data transfers, and does not take pipelining into consideration. But for what I need in tests, it's very simple.

I've long been wondering if it was possible to use this "ping-pong" property to build a totally stateless server, which means a server which would only consider the information it gets from the packets and which would not store any session. Looking at what a transfer looks like at the TCP level, it's clear that it is possible. Even when optimized, there's everything there for the job (please consult RFC 793 if you have difficulties following these exchanges, as I won't paraphrase it here) :

Basic HTTP fetch

Faster HTTP fetch

For the server, all the information is provided in the client's ACK. If you look at the ACK and compare it to the initial SEQ sent by the server, you can determine exactly what step is being processed, and therefore how to act. The problem is that after the response is sent, the server does not necessarily know how long the response was, and therefore by how much it could have shifted the next sequence numbers. So the idea was to use only the lower bits of the sequence numbers to store the state. That way, each response size just needs to be adjusted so that the next sequence number matches the value we want to assign it.

For this first implementation, I wanted to support multi-packet responses, so I decided to have a limit of 16 states, resulting in 4 bits for the state and the rest for the transfers. That means that responses have to be rounded up to the next multiple of 16 bytes plus or minus the shift to reach the desired state. In HTTP we can easily do this using headers. So I added an "X-Pad" header which serves exactly that purpose. Another point is that the size of the Content-Length header varies with the size of the response. So we need to adjust X-Pad last. Both the SYN flag and the FIN flag count as one unit in sequence numbers (just like one byte), so when we plan on sending any of them, we must also count one unit. This imposes some constraints on the ordering of the states, but they are easily met. For example, the response contains both the data and the FIN packet. Some clients will ACK the data first, then the FIN. This results in two ACKs offset by exactly one unit. So in order to properly handle these two different acknowledgements, the two respective states must have values that differ by exactly one.
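
To make the padding rule concrete, here is a small sketch of how a response could be sized so that the sequence number following it carries a chosen state in its 4 low bits. It is only an illustration: the helper names, the status line, the filler character and the state numbers are assumptions of mine, not the actual server code.

```python
STATE_BITS = 4
STATE_MASK = (1 << STATE_BITS) - 1   # 16 states fit in the 4 low bits


def state_of(seq):
    """State carried by a sequence (or acknowledgement) number."""
    return seq & STATE_MASK


def pad_needed(seq, length, target_state, fin=False):
    """Extra bytes so that seq + length (+1 if FIN) lands on target_state."""
    end = seq + length + (1 if fin else 0)   # FIN counts as one unit
    return (target_state - end) & STATE_MASK


def build_response(seq, body, target_state, fin=True):
    """Build a response whose final sequence number encodes target_state."""
    head = (b"HTTP/1.1 200 OK\r\n"
            b"Content-Length: " + str(len(body)).encode() + b"\r\n"
            b"X-Pad: ")
    tail = b"\r\n\r\n" + body
    # X-Pad is sized last, once the variable-width Content-Length is known.
    pad = pad_needed(seq, len(head) + len(tail), target_state, fin)
    return head + b"x" * pad + tail


if __name__ == "__main__":
    seq = 0x12345670
    resp = build_response(seq, b"hello", target_state=5)
    print(len(resp), state_of(seq + len(resp)), state_of(seq + len(resp) + 1))
```

With `fin=True`, a client that acknowledges the data alone lands on the state just below the target and a client that acknowledges data plus FIN lands on the target itself, which is the "difference of exactly one" constraint described above.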

The beauty of this mechanism is that it even supports HTTP keep-alive (serving multiple objects over the same connection) and resists packet losses, since the client will retransmit either a request or an acknowledgement and the server will always do the same thing in response. Note that the multi-packet feature is not totally reliable for two reasons :

clients generally wait 40ms before acknowledging one segment, so the transfer is slow, unless segments are sent two at a time, but then we need a reliable way to distinguish their acks and to recover from partial losses
if a client's ACK for an intermediate packet is lost, the session will remain stuck as nobody will retransmit.

I found one ugly solution to all of these issues, which can work when the client supports the SACK extension. The principle is to send all segments but the first one so that the client constantly acks the first one and indicates in the SACK extension what parts were received. But this becomes complex, is not universally usable and in the end does not provide much benefit. Indeed, when I designed this mechanism, I had objects up to 5-10 kB in mind in order to try to fill the wire. I didn't imagine I would saturate a wire with single-packet objects! So a next implementation will probably only use 2 bits to store the 4 states needed to perform a single-packet transfer and will not support the multi-packet mode anymore. Also with only 4 states, we'll be able to send even-sized packets more often than now. The complete state machine looks like this :

Complete state machine

Stateless HTTP server : first implementation

The first implementation of this server was made as a module for Linux kernel 3.10.x. This module registered a dummy interface which responds to any TCP port accessed through it. The concept is ugly but it was easy to implement. The performance was quite good. On the OpenBlocks, 42000 connections per second were achieved this way, using a single external NIC bound to a single CPU core. This means that about 84 kcps could be reached with incoming traffic split across two NICs, which was confirmed. This is not bad at all: it's basically the same level of performance that httpterm gives me on a Core2 Duo at 2.66 GHz. But it's not huge. The issue is that the packets have to pass through the whole routing stack, which somewhat defeats the purpose of the server. However this mode is convenient to run locally because there is no inter-CPU communication: a response packet is produced for each incoming packet in the context of the sending process.

Stateless HTTP server : NFQueue implementation

The second implementation was done using NFQueue (Netfilter Queue). It's very easy to use and allows packets to be returned very early (in the raw table). So I wanted to give it a try. The result is basically the same as with the interface, except that two CPU cores are involved this time, one for the network and the other one for the user process acting as the server. However for local tests when you have lots of spare cores, it becomes more interesting than the interface version because it reduces the overhead in the network stack, increasing the limit of performance a single process may observe (typically 105k conn/s vs 73k on a Core2 Quad 3 GHz, with one CPU at 100% for the server).

Ndiv framework to the rescue

These numbers are both encouraging and frustrating. They're encouraging because they prove that the mechanism is good and efficient. And they're frustrating because we spend most of our time at places we'd prefer to avoid as much as possible.

So I decided it was time for me to be brave and finish the work I started 6 months ago on my ndiv framework. This is the Ethernet Diverter framework with which I could verify that the mvneta NICs are able to saturate the wire in both directions. Basically it consists in intercepting incoming packets as close as possible to where they're collected in the drivers, and deciding whether to let them pass, drop them or emit another packet in response. I already had an unfinished line-rate packet capture module using it. I temporarily stopped developing it for lack of time, need and feedback. I needed to implement the ability to forge response packets but I was not happy with its API, which was already difficult to use and inefficient. I presented it in detail to my coworker Emeric Brun, with whom I could define a new "ideal" API that would be optimal for hardware-assisted drivers and well balanced so that neither the application nor the driver has too much work to do.
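
The pass / drop / respond decision can be pictured with a toy user-space model. To be clear, this is not the ndiv API, which lives in the kernel and is written in C: the `Verdict` names, the `handle_frame` function, the port number and the placeholder reply (only the MAC addresses are swapped) are all hypothetical and exist purely to show the shape of the idea.

```python
import struct
from enum import Enum


class Verdict(Enum):
    PASS = 0    # hand the frame to the regular network stack
    DROP = 1    # discard it
    REPLY = 2   # emit another frame in response


ETH_HLEN = 14
ETHERTYPE_IPV4 = 0x0800
IPPROTO_TCP = 6


def handle_frame(frame, served_port=8000):
    """Return (verdict, reply_or_None) for one incoming Ethernet frame."""
    if len(frame) < ETH_HLEN + 20:
        return Verdict.PASS, None                 # too short to be IPv4
    (ethertype,) = struct.unpack_from("!H", frame, 12)
    if ethertype != ETHERTYPE_IPV4 or frame[ETH_HLEN + 9] != IPPROTO_TCP:
        return Verdict.PASS, None                 # not ours, let it through
    ihl = (frame[ETH_HLEN] & 0x0F) * 4            # IPv4 header length in bytes
    tcp_off = ETH_HLEN + ihl
    if len(frame) < tcp_off + 20:
        return Verdict.DROP, None                 # truncated TCP header
    (dport,) = struct.unpack_from("!H", frame, tcp_off + 2)
    if dport != served_port:
        return Verdict.PASS, None
    # Placeholder reply: a real application would forge a full TCP segment.
    reply = frame[6:12] + frame[0:6] + frame[12:]
    return Verdict.REPLY, reply
```

The point of the model is only that the decision is taken per frame, from the frame's own content, before the regular stack ever sees it.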

After one full day of work, I could adapt the mvneta driver to the new ndiv API and make it respond to packets! The driver looks like the diagram below with the framework plugged into it. The beige part is the ndiv "application" called by the ndiv-compatible driver.

How NDIV is inserted into the network stack

Among the cool things provided by the framework is the fact that it considers it the role of the driver (or NIC) to validate incoming protocols and checksums, and to compute outgoing checksums if the application needs it. This makes sense because nowadays, most NICs do all this stuff for free and we'd rather not have the application do it. Similarly, if some checksums have to be computed by the driver or NIC on outgoing packets, it's the responsibility of the application to indicate the various header lengths because it already knows them.

Stateless HTTP server as an Ndiv application

After completing the port of ndiv to mvneta, I was absolutely impatient to see the stateless server run directly in the driver as an ndiv application. It did not take long to port it, just a few hours, and these hours were spent changing the sequencing of the code to clean it up, since it was no longer necessary to compute checksums in the application.

The results are astonishing. First, when bombarded with a SYN flood from 5 machines, the theoretical limit is immediately reached with 1.488 Mpps in both directions. The CPU usage remains invisible since the periods are too short for the system to measure them. I developed a tool just for this instead.

Second, it appears that line rate is almost always achieved, whatever the object size. In keep-alive mode, line rate is achieved for objects of 64 bytes and above, at 564000 requests per second and 94% of one CPU core. Empty responses go higher, at 663000 requests per second, but the wire is not full (816 Mbps). The reason is that Ethernet frames are padded to 64 bytes, so responses that are too short automatically get some padding appended. It is also important at these rates not to forget about Ethernet's preamble (8 bytes) and inter-packet gap (IPG) of 12 bytes, totaling 20 bytes. This overhead is represented in yellow on the diagram below.

Performance at various object sizes

The transfers in HTTP close mode are excellent as well. The OpenBlocks reaches 340000 HTTP connections per second. Each connection consists of a connection establishment, an HTTP request, and a fast close (FIN then RST). This is 3 packets in one direction, 2 in the other one. The theoretical limit for this test is 496000 connections per second (1.488 M/3). It happens that my client (inject36) sends very large requests (about 166 IP bytes). So if we do the math, we have :

64 + 8 + 12 bytes for the SYN packet = 84 bytes
166 + 14 + 8 + 12 bytes for the request = 200 bytes
64 + 8 + 12 bytes for the RST packet = 84 bytes

So for each request, the clients have to upload 368 bytes on the wire. This times 340000 equals almost exactly one gigabit per second (about 125000000 bytes per second). So in practice we're still not saturating the device nor its CPU, just the wire again. Just for comparison, it's 3 times as fast as what I can achieve on a Core i7 3.4 GHz using httpterm.
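
For those who want to replay this arithmetic, here is a short sketch that simply adds up the terms itemized above and compares the result with the capacity of a gigabit link. The variable names are mine, and the 166-byte request size is the approximate figure quoted for inject36.

```python
GIGABIT_BYTES_PER_S = 1_000_000_000 // 8   # 125,000,000 bytes per second
LINE_RATE_PPS = 1_488_095                  # minimum-size frames per second
PREAMBLE, IPG, ETH_HEADER = 8, 12, 14

# Wire bytes uploaded by the client for one connection, as itemized above.
syn = 64 + PREAMBLE + IPG                  # padded minimum-size frame
request = 166 + ETH_HEADER + PREAMBLE + IPG
rst = 64 + PREAMBLE + IPG
per_connection = syn + request + rst

conn_rate = 340_000
upload_bytes_per_s = per_connection * conn_rate

# Upper bound if all 3 client packets were minimum-size frames.
packet_bound = LINE_RATE_PPS // 3

if __name__ == "__main__":
    print(f"{per_connection} bytes per connection")
    print(f"{upload_bytes_per_s} bytes/s uploaded, "
          f"{100 * upload_bytes_per_s / GIGABIT_BYTES_PER_S:.1f}% of a gigabit")
    print(f"{packet_bound} connections/s packet-rate bound")
```

The upload direction is the one that fills up first here, which is why the measured rate stays below the bound given by the packet rate alone.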

Conclusion

First, one may note that I rarely spoke about CPU usage. That's the beauty of this device. The CPU is fast enough that parsing a whole HTTP request and producing the response takes less than 1.4 microseconds, which allows it to be done at line rate. The second point is that the network connectivity inside it is fantastic. I can achieve with this device packet rates that I cannot achieve with some very respectable 10G NICs. Now I urge Marvell to develop a next generation of Armada XP with a 10G NIC on chip! What is absolutely cool is that I finally know I won't ever again have a problem in benchmarks with the components falling short. Well I still need the clients... By the way, in theory it is possible to develop a client on the same model. The only thing is that the applications I implement in ndiv are reactive, which means they need some traffic to respond to. So we won't initiate a connection this way. One elegant solution however could be to use a classical SYN flooder on the device to initiate connections to the server, which in turn will respond and solicit the client. But I'm still not completely convinced.

Other things I'd like to experiment with in the near future are porting the ndiv framework to more NICs (at least my laptop's e1000e) and to the loopback interface, so that we can even use the stateless server when developing on the local machine. I've started the ndiv project with a line-rate packet capture module which is not complete. I'm wondering if other uses can arise from this framework (e.g. accelerators, load balancing, bridges, routing, IDS/IPS, etc.). Thus I'm not sure whether it's worth submitting for mainline. Any feedback would be much appreciated.

As for the stateless HTTP server itself, it has limited uses beyond test environments. But still I can think about delivering very small objects (favicon, redirects, ...) that fit in a single TCP segment and do not require any security. It can also be used for various types of monitoring devices which are Ethernet-connected and which prefer to report measurements using HTTP to make it easier for their clients to retrieve them. Some system identification or configuration might also be retrieved using such a mechanism embedded in very dumb devices which don't even have an IP stack.

Downloads

The code is available for various Linux kernel versions here. The most up-to-date version is ndiv_v5. The commits are grouped in 6 categories :

for mvneta : add support for retrieving the device's MAC address from the boot loader. Not strictly needed but quite convenient as this avoids running with random MAC addresses ;
for mvneta: some fixes for the mvneta driver ; they are required.
for mvneta: improvements for the mvneta driver ; they are required as well.
the NDIV network frame diverter framework. Required of course!
driver support for the NDIV framework (currently only mvneta, ixgbe, e1000e, e1000, igb).
the SLHTTPD server.

Useful links

Plat'Home's OpenBlocks AX3 devices
GlobalScale's Mirabox
Marvell Armada XP (MV78260) CPU
httpterm : the load generator (HTTP server)
inject36 : the load generator (HTTP client)

2013-12-13