2019-01-06

Build farm, version 2 (2016-2017)

[this is a follow-up to this article on version 1 of the build farm]

Background

The first version of my build farm was described here. A significant number of shortcomings surfaced :
  • poor hardware build quality resulting in instabilities
  • performance limited by the single channel DDR2 memory used
  • difficult cooling
  • difficulties holding the cards together
  • no easy access to a console port for debugging
  • inefficient load balancing
  • poor recovery abilities when a board crashes
For these reasons I was constantly watching the arrival of new hardware offerings.

New players

Once the first build farm was up and running, I had already identified and tested a few other solutions :
  • NanoPI-Fire2A : quad-cortex-A9 at 1.4 GHz, very small form factor, easy to power, to stack and to cool, but much slower than the RK3288's quad-A17.
  • Odroid-C2 : quad-cortex-A53 at 1.5 GHz, easy enough to cool but significantly slower than the quad-A17.
  • NanoPI-M3 / NanoPC-T3 : octa-cortex-A53 at 1.4 GHz (twice the C2's cores), but the form factor makes it inconvenient to stack and to cool.
  • UP board : this board features a quad-core Atom Z8350 at 1.92 GHz (2 cores running) or 1.68 GHz (4 cores running). It doesn't run too hot but is about 50% more expensive than the CS008 for the same performance.
A very new one, called MiQi and made by mqmaker, featuring an RK3288, was announced starting at $35. Since the $35 model (1GB) wasn't available yet, I ordered the $65 one (2GB) and was very pleased to discover an excellent build quality and fairly good software support, with a very responsive team. I started to hack a little bit on the board to fix the issues plaguing almost all RK3288 boards, namely DRAM frequency and CPU frequency limitations, and the board proved to be excellent.



The board's layout made it easy to install a large heat sink reaching one side, there were 3mm holes available for screws and spacers, there was a console port, and dual-channel memory! Another interesting point is the GPIO-controlled fan connector on the edge of the board. Given that I know this CPU can draw a lot of power, it definitely makes sense to have the ability to blow a little bit of air when it's too hot to avoid throttling.

Everything was OK so I asked my employer if it was possible to order 10 of them to make a correct build farm this time, and my request was accepted :-)

Upgrade to MiQi

The boards at the office were successfully replaced with 5 MiQis for a quick test. Some packing foam was used to try to hold them in place. The vendor provided excellent quality micro-USB cables which were quite rigid and made it even more difficult to hold the cards in place. But all of them could be connected to the previous farm's USB power supply :


The board is extremely stable even when strongly abused. Mine were overclocked to 2.0 GHz. At this frequency they draw a lot of power and the power supply is not enough for 5 boards: it can only reliably power 4. Test results were published here, and looking back at them, this card held the performance record for almost two years, being 40% faster than the previous farm and very stable.

I ordered some plastic M3 spacers to hold them in place (and pulled one Ethernet LED that had the bad idea of living too close to the hole) :

A test was also run at home with more power supplies and more switch ports. We still see the previous farm on the photo and the good old DC-DC converter :-) The test was performed with all 10 boards and 12V fans powered at 5V so that they would be almost inaudible and just blow the hot air away from the boards.


The build time was irregular but everything was overall extremely fast. There was no significant gain between 6 and 10 boards at this point. Definitely distcc was showing some fairness limits.

Adding a build controller

The switch on top of which the 6 boards are placed is in fact my SolidRun Clearfog Pro board :

It contains a 6-port GigE switch but most importantly it's based on a Marvell Armada 388 SoC which contains a dual-core Cortex-A9 processor. It's not usable as a build farm but it's extremely powerful on I/O intensive workloads and has no problem forwarding more than 1 Gbps of traffic. I started HAProxy on it, to fairly balance the distcc traffic to the boards. Using the leastconn algorithm, it sends each new connection to the board that currently has the fewest active connections. As a result the boards are loaded evenly, and at any instant each board's memory bandwidth is shared between as few busy CPU cores as possible.
Doing just this improved the build time by 15-20%. With a little bit more hacking it was even possible to pre-buffer the build traffic to accelerate delivery to the boards, thus further reducing their idle time.
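
To make the leastconn idea concrete, here is a minimal Python sketch of the selection rule. It is not HAProxy's code, only an illustration of "send the next job to the board with the fewest active connections". The board names and the initial connection counts are made up for the example, and ties are simply resolved in list order.

```python
from dataclasses import dataclass


@dataclass
class Board:
    """One build node behind the balancer (names are hypothetical)."""
    name: str
    active: int = 0  # compile jobs currently running on this board


def pick_leastconn(boards):
    """Return the board with the fewest active connections.

    Ties go to the earliest board in the list, which is a simplification
    chosen for this sketch.
    """
    return min(boards, key=lambda b: b.active)


def dispatch(boards, jobs):
    """Assign each job to the least-loaded board and record the choice."""
    placement = []
    for job in jobs:
        board = pick_leastconn(boards)
        board.active += 1  # the new connection now counts against this board
        placement.append((job, board.name))
    return placement


if __name__ == "__main__":
    # Made-up starting load, only to show how the imbalance gets evened out.
    farm = [Board("miqi1", active=3), Board("miqi2", active=1), Board("miqi3", active=2)]
    for job, name in dispatch(farm, ["a.c", "b.c", "c.c"]):
        print(job, "->", name)
```

In this toy run the least-loaded boards absorb the new jobs until all three carry the same number of connections, which is the smoothing effect described above.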

The build is now much easier to deal with: I just have to know a single IP:port to aim distcc at, and the controller takes care of spreading the load to the available devices. If I need to pull a board from the farm for whatever reason, there is nothing to modify. And the haproxy stats page shows nice info about the farm's status and the build performance, which is how I saw peaks between 50 and 120 files compiled per second.

A full kernel build took 13 min on my quad-core Skylake at 4.4 GHz, and it only took 4 min 36 s with the 10 boards there, and 4 min 45 s with only 6 boards.
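
For reference, the speedups implied by these timings can be worked out with a few lines of Python. Only the three build times come from the measurements above; the helper names are chosen for this sketch.

```python
def to_seconds(minutes, seconds=0):
    """Convert a 'minutes, seconds' build time to seconds."""
    return minutes * 60 + seconds


def speedup(baseline_s, farm_s):
    """How many times faster the farm is than the baseline machine."""
    return baseline_s / farm_s


skylake = to_seconds(13)        # quad-core Skylake at 4.4 GHz
ten_boards = to_seconds(4, 36)  # farm with 10 boards
six_boards = to_seconds(4, 45)  # farm with 6 boards

if __name__ == "__main__":
    print(f"10 boards vs Skylake: {speedup(skylake, ten_boards):.2f}x")
    print(f" 6 boards vs Skylake: {speedup(skylake, six_boards):.2f}x")
    print(f"10 boards vs 6 boards: {speedup(six_boards, ten_boards):.2f}x")
```

The last ratio shows that the four extra boards only bring a gain of a few percent on this workload.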

More details on the setup and tests were shared here.

Thus I needed another build controller for the office. I didn't want to buy another Clearfog because it's quite overkill and while it's an excellent board it's too expensive for this task.

I found a very small and inexpensive gigabit router called EdgeRouter-X from Ubiquiti Networks, which runs Linux on a dual-core MIPS processor :


I could install HAProxy on it, and after fixing a few issues related to the early firmware image I had, I could reach about 500 Mbps of forwarded L7 traffic! That was more than I needed for 4 boards. One nice thing about this device is that it's supposed to be powered by 12V, but it boots fine starting around 5V, so simply connecting a USB-to-jack cable works fine. With 4 ports available on this one for devices, using all ports from the power supply, and with 6 ports available on the Clearfog, I finally went with two build farms, a 4-node one at work for HAProxy builds and a 6-node one at home for kernel builds.

The one at work had the nice property that everything fits perfectly packed together :


Power issues

This is a recurrent problem now. When overclocking a little bit (2.0 GHz), the power losses in the USB cables and connectors become significant. The excellent quality USB cables provided with the board significantly improve the situation but there are still quite some losses measured in the micro-USB connector. I ended up soldering thick cables directly to the boards and to some USB-A male connectors. This saved up to around 700mV while overclocking, and significantly reduced the power draw and heating of the on-board DC-DC regulators. And the stability was much better after this operation.
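
To give an order of magnitude for what such a drop costs, here is a small Python sketch using P = V × I. The 700mV figure is the one quoted above. The 2 A board current and the 5.0V supply voltage are assumptions picked purely for illustration, not measurements.

```python
def cable_loss_watts(drop_volts, current_amps):
    """Power dissipated in the cable and connectors for a given voltage drop."""
    return drop_volts * current_amps


def board_voltage(supply_volts, drop_volts):
    """Voltage actually reaching the board after the cable drop."""
    return supply_volts - drop_volts


if __name__ == "__main__":
    drop = 0.7     # volts, the "up to around 700mV" figure from the text
    current = 2.0  # amps, ASSUMED for illustration only
    supply = 5.0   # volts, ASSUMED nominal USB supply voltage
    print(f"lost in cabling: {cable_loss_watts(drop, current):.1f} W")
    print(f"left at the board: {board_voltage(supply, drop):.1f} V")
```

Under these assumptions the board would see well under 5V, which forces the on-board regulators to draw more current for the same output power.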


For the home build farm I had to buy a new 5V/6 ports 60W power supply (and patch it again). The Clearfog is powered by a 5-to-12V converter stealing power from several MiQi boards. It was not a great idea though, since stopping two boards was enough to brown out the Clearfog.

Heating issues

In summer when it's above 25 degrees C in the room, the cards become extremely hot. The thresholds have been increased from 80 to 92 degrees C to prevent them from throttling too early, but the 4x4cm heat sink was definitely too small to spread all the heat, especially in the middle boards.
The one in the office had a large 12V fan connected to the 5V GPIO-controlled output, so that the system starts the fan as a cooling device when the board is above 70 degrees C. The fan is almost inaudible and never runs for more than a minute at a time.
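
The fan logic boils down to a threshold decision, sketched below in Python. This is not the actual mechanism used on the boards, only an illustration. The 70 degrees C trip point comes from the text, while the 65 degrees C switch-off point (hysteresis, to avoid rapid on/off toggling) is an assumption added for the example.

```python
def fan_should_run(temp_c, running, on_above=70.0, off_below=65.0):
    """Decide the fan state from the current temperature.

    on_above comes from the text; off_below is a HYPOTHETICAL hysteresis
    value chosen for this sketch.
    """
    if temp_c > on_above:
        return True
    if temp_c < off_below:
        return False
    return running  # between the two thresholds, keep the previous state


def simulate(temps):
    """Replay a list of temperature readings and return the fan states."""
    states = []
    running = False
    for t in temps:
        running = fan_should_run(t, running)
        states.append(running)
    return states


if __name__ == "__main__":
    readings = [60, 68, 71, 69, 66, 64, 68]  # made-up readings in degrees C
    for t, on in zip(readings, simulate(readings)):
        print(t, "fan on" if on else "fan off")
```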

For the 6 boards at home it was more difficult. So I ended up buying much larger heat sinks that would reach the edge of the board and connect to another much larger heat sink. I had to desolder the fan connector for the large heat sink to reach the edge. One difficult part was to assemble all this together, so after several attempts, I had quite some success with some thermal adhesive tape from 3M, which is 15mm wide, exactly like the CPU. It's not perfect but still good enough for the job :



The same tape was used to attach the cluster-wide heatsink to all boards at once. It's very stable since the boards are already held together via the M3 spacers, so they already constitute a solid block :


I could get these boards to work together with the 4 other ones at the office, all connected to the EdgeRouter-X using a cheap 8-port Gigabit Ethernet switch. In fact one board was not connected because I was using the 5-port DC-DC converter to power the farm (yes I know it doesn't look pretty but it works fine) :



Final assembly

This big cluster is placed on top of the Clearfog board, above an aluminum plate screwed to the Clearfog's M3 spacers to protect the Clearfog against scratches or short circuits. It aligns well with the RJ45 connectors :


All this was fixed to a plexiglass plate with rubber feet, with some of the adhesive tape used to firmly hold the power supply in place. This results in a nice, compact, fanless block delivering nearly 3 times the build performance of my quad-core Skylake at 4.4 GHz!




Updates

The power issues the Clearfog suffered when boards were stopped made me want to replace the power supply with one that has more ports. Also, this power supply gets very hot during long builds, and I wanted to have a bit of headroom to add more boards. I ended up buying a cheap fanless 5V/30A framed power supply and USB-A female connectors, which I arranged to provide 12 outputs for whatever I need to plug there, including temporary boards for experimentation. This power supply is adjustable so I increased the output voltage to 5.3V to reduce heating resulting from power losses and to lower the strain on the DC/DC regulators. Overall everything works even better now.
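
A quick Python sketch compares the current budget of the two supplies. The ratings (5V/60W with 6 ports, then 5V/30A with 12 outputs) come from the text. Spreading the current evenly across outputs is a simplification made for this sketch, since real boards do not all draw the same.

```python
def supply_budget(volts, watts, outputs):
    """Return (total amps, average amps per output) for a power supply."""
    total_amps = watts / volts
    return total_amps, total_amps / outputs


if __name__ == "__main__":
    old_total, old_per_port = supply_budget(volts=5, watts=60, outputs=6)
    new_total, new_per_port = supply_budget(volts=5, watts=5 * 30, outputs=12)
    print(f"old supply: {old_total:.0f} A total, {old_per_port:.1f} A per output")
    print(f"new supply: {new_total:.0f} A total, {new_per_port:.1f} A per output")
```

Even with twice as many outputs, the new supply leaves more average current per output than the old one did.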


Links

Unsurprisingly, this project was followed by a number of people, regularly asking for updates. I've put the various kernel patches here. The various performance measures of the board at different CPU and RAM frequencies are reported here. I've posted the first series of updates to the build farm on mqmaker's forum here. A new presentation was made at Kernel Recipes 2017 upon various attendees' request; the slides are available here and the video is here. The MiQi code, schematics and docs are available here on GitHub.

Version 1 of the build farm is described here. Version 3 of the build farm is described here.
