I recently stumbled upon this llama.cpp fork that supports AMD Radeon Instinct MI50-32GB cards ("gfx906"). That's how I discovered this card and noticed that some of them were amazingly cheap, around $250 used; the 16GB variant is even cheaper, around $100. I had been watching AI-capable hardware for quite a long time, considering various devices (Mac Ultra, AMD's AI-Max-395+, DGX Spark, etc.), knowing that memory size and bandwidth are always the two limiting factors: either you use a discrete GPU with little RAM but high bandwidth (suitable for text generation but little context), or you use a shared-memory system like those above, with lots of RAM but much lower bandwidth (more suitable for prompt processing and large contexts). This suddenly made me realize that this board, with 32 GB and 1024 GB/s of bandwidth, offers both at once.
I checked on Aliexpress, where cheap cards were everywhere, though all with very high shipping costs. I found a few on eBay in Europe at decent prices and bought one to give it a try. I was initially cautious because these cards require forced airflow, which can be complicated to set up and extremely noisy, to the point where the card might end up never being used. Some assemblies I've seen rely on 4cm high-speed fans.
When I received the card, I disassembled it and found that it has room inside for a 75mm fan, as can be seen below:
I didn't know, however, whether that would be sufficient to cool it, so I temporarily installed a CPU fan on it, filling the gaps with foam:
Then I installed Ubuntu 24.04 on an SSD, along with the amdgpu drivers and the ROCm stack, and built llama.cpp. It initially didn't work: all the machines I tested the card in failed to boot, crashed, etc. I figured out that I needed "Above 4G decoding" enabled in the BIOS... except that none of my test machines have that option. I had to buy a new motherboard (which I'll use to upgrade my PC) to test the card.
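For reference, the build itself follows the standard llama.cpp HIP recipe once ROCm is installed; a minimal sketch (upstream repository and default paths shown, adjust if you use the gfx906 fork):

```sh
# Minimal sketch of a ROCm/HIP build targeting gfx906; assumes ROCm is
# installed in its default location and hipconfig is on the PATH.
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
    cmake -B build -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx906 -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j"$(nproc)"
```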
And there it worked fine, with the GPU staying around 80 degrees C. The card is quite fast thanks to its bandwidth: on Llama-7B Q4_0, it processes about 1200-1300 tokens/s and generates around 100-110 tokens/s (without/with flash attention). That's about 23% of the prompt-processing speed of an RTX 3090-24GB and 68% of its generation speed, for 33% more RAM at 15% of the price!
It was then time to cut the cover. I designed a cut path in Inkscape and marked it with my laser engraver so I could cut it with my Dremel. It's not very difficult; the cover is only about 0.5mm thick aluminum, so it takes around two cutting discs and 15 minutes to cut it all:
Then I installed the fan, here a Delta 12V 0.24A:
It was also necessary to plug the holes on the back: since it's wide open, some boards are sold with a shroud holding a loud fan that pushes air in from the back. I just cut some foam to the shape of the back, and after a few attempts it worked pretty well:
## Software testing
| model | size | params | backend | ngl | n_batch | n_ubatch | fa | test | t/s |
| ------------------------------ | --------: | ------: | ------- | --: | ------: | -------: | -: | ----: | ------------: |
| qwen3moe 30B.A3B Q5_K - Medium | 20.42 GiB | 30.53 B | ROCm | 99 | 1024 | 2048 | 1 | pp512 | 587.85 ± 0.00 |
| qwen3moe 30B.A3B Q5_K - Medium | 20.42 GiB | 30.53 B | ROCm | 99 | 1024 | 2048 | 1 | tg128 | 75.65 ± 0.00 |
| qwen3vl 32B Q4_K - Medium | 18.40 GiB | 32.76 B | ROCm | 99 | 1024 | 2048 | 1 | pp512 | 216.40 ± 0.00 |
| qwen3vl 32B Q4_K - Medium | 18.40 GiB | 32.76 B | ROCm | 99 | 1024 | 2048 | 1 | tg128 | 19.78 ± 0.00 |
| qwen3vl 32B Q4_0 | 17.41 GiB | 32.76 B | ROCm | 99 | 1024 | 2048 | 1 | pp512 | 278.20 ± 0.00 |
| qwen3vl 32B Q4_0 | 17.41 GiB | 32.76 B | ROCm | 99 | 1024 | 2048 | 1 | tg128 | 24.50 ± 0.00 |
build: 4206a600 (6905)
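The tables are llama-bench's markdown output; an invocation along these lines reproduces a row (the model path is a placeholder, the other flags match the table columns):

```sh
# Placeholder model path; -ngl/-b/-ub/-fa mirror the table columns above.
./build/bin/llama-bench -m /models/Qwen3-30B-A3B-Q5_K_M.gguf \
    -ngl 99 -b 1024 -ub 2048 -fa 1
```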
I connected the fan to a FAN header on the motherboard and adjusted the PWM to slow it down enough to keep it mostly silent. At around 35-40% speed, the noise is bearable and the temperature stabilizes at 100-104 degrees C while processing large inputs. It's also interesting to see that as soon as the board switches to generation, it's limited by the RAM bandwidth, so the GPU cores drop to a lower frequency and the card starts to cool down again.
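If you'd rather pin the duty cycle from software than rely on the motherboard's fan curve, the standard hwmon sysfs interface works too; a rough sketch (the hwmon index and pwm channel below are hypothetical and vary per board):

```sh
# Hypothetical hwmon/pwm numbers: check /sys/class/hwmon/*/name (or `sensors`)
# first to find the right controller and channel on your board.
echo 1   | sudo tee /sys/class/hwmon/hwmon2/pwm6_enable   # 1 = manual control
echo 100 | sudo tee /sys/class/hwmon/hwmon2/pwm6          # duty 0-255, ~40% here
```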
| model | size | params | backend | ngl | n_batch | n_ubatch | fa | test | t/s |
| ------------------------------ | --------: | ------: | ------- | --: | ------: | -------: | -: | ----: | ------------: |
| qwen3moe 30B.A3B Q5_K - Medium | 20.42 GiB | 30.53 B | ROCm | 99 | 1024 | 2048 | 1 | pp512 | 590.24 ± 0.00 |
| qwen3moe 30B.A3B Q5_K - Medium | 20.42 GiB | 30.53 B | ROCm | 99 | 1024 | 2048 | 1 | tg128 | 76.70 ± 0.00 |
| qwen3vl 32B Q4_K - Medium | 18.40 GiB | 32.76 B | ROCm | 99 | 1024 | 2048 | 1 | pp512 | 229.45 ± 0.00 |
| qwen3vl 32B Q4_K - Medium | 18.40 GiB | 32.76 B | ROCm | 99 | 1024 | 2048 | 1 | tg128 | 19.86 ± 0.00 |
| qwen3vl 32B Q4_0 | 17.41 GiB | 32.76 B | ROCm | 99 | 1024 | 2048 | 1 | pp512 | 259.87 ± 0.00 |
| qwen3vl 32B Q4_0 | 17.41 GiB | 32.76 B | ROCm | 99 | 1024 | 2048 | 1 | tg128 | 24.99 ± 0.00 |
build: db9783738 (7310)
## Scaling to two cards
```
$ cat /sys/class/hwmon/hwmon8/fan{6,7}_input
1536
1480
$ rocm-smi
=========================================== ROCm System Management Interface ===========================================
===================================================== Concise Info =====================================================
Device  Node  IDs              Temp    Power     Partitions          SCLK    MCLK    Fan     Perf  PwrCap  VRAM%  GPU%
              (DID,     GUID)  (Edge)  (Socket)  (Mem, Compute, ID)
=========================================================================================================================
0       1     0x66a1,   29631  38.0°C  20.0W     N/A, N/A, 0         930Mhz  350Mhz  14.51%  auto  225.0W  0%     0%
1       2     0x66a1,   8862   34.0°C  17.0W     N/A, N/A, 0         930Mhz  350Mhz  14.51%  auto  225.0W  0%     0%
=========================================================================================================================
================================================= End of ROCm SMI Log ==================================================
```
| model | size | params | backend | ngl | n_batch | n_ubatch | fa | mmap | test | t/s |
| --------------------------------------- | ---------: | ---------: | ---------- | --: | ------: | -------: | -: | ---: | ----: | ------------: |
| qwen3vlmoe 235B.A22B IQ1_S - 1.5625 bpw | 50.67 GiB | 235.09 B | ROCm | 99 | 1024 | 2048 | 1 | 0 | pp512 | 137.10 ± 0.00 |
| qwen3vlmoe 235B.A22B IQ1_S - 1.5625 bpw | 50.67 GiB | 235.09 B | ROCm | 99 | 1024 | 2048 | 1 | 0 | tg128 | 19.93 ± 0.00 |
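No extra flags are needed to use the second card: by default, llama.cpp splits the layers across all visible ROCm devices. The rows above correspond to something like the following llama-bench call (placeholder model path; -mmp 0 matches the mmap column):

```sh
# Both MI50s are used automatically via the default layer split;
# -mmp 0 disables mmap as in the table. The model path is a placeholder.
./build/bin/llama-bench -m /models/Qwen3-VL-235B-A22B-IQ1_S.gguf \
    -ngl 99 -b 1024 -ub 2048 -fa 1 -mmp 0
```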
## Power management
```
$ rocm-smi -d 0 --setpoweroverdrive 180
$ rocm-smi -d 1 --setpoweroverdrive 180
$ rocm-smi
====================================================== Concise Info ======================================================
Device  Node  IDs              Temp    Power     Partitions          SCLK     MCLK    Fan     Perf  PwrCap  VRAM%  GPU%
              (DID,     GUID)  (Edge)  (Socket)  (Mem, Compute, ID)
===========================================================================================================================
0       1     0x66a1,   29631  81.0°C  181.0W    N/A, N/A, 0         1485Mhz  800Mhz  100.0%  auto  180.0W  85%    100%
1       2     0x66a1,   8862   76.0°C  178.0W    N/A, N/A, 0         1485Mhz  800Mhz  30.2%   auto  180.0W  80%    100%
===========================================================================================================================

====================================================== Concise Info ======================================================
Device  Node  IDs              Temp    Power     Partitions          SCLK     MCLK    Fan     Perf  PwrCap  VRAM%  GPU%
              (DID,     GUID)  (Edge)  (Socket)  (Mem, Compute, ID)
===========================================================================================================================
0       1     0x66a1,   29631  89.0°C  76.0W     N/A, N/A, 0         930Mhz   350Mhz  100.0%  auto  180.0W  59%    100%
1       2     0x66a1,   8862   86.0°C  152.0W    N/A, N/A, 0         1485Mhz  800Mhz  30.59%  auto  180.0W  59%    100%
===========================================================================================================================
```
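Note that the power cap does not survive a reboot; if you want it to stick, reapply it at boot, for instance from an @reboot cron entry or a small systemd unit (just a sketch, not part of the setup above):

```sh
# Reapply the 180 W cap on both cards at boot (run as root).
for d in 0 1; do
    rocm-smi -d "$d" --setpoweroverdrive 180
done
```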


