I recently stumbled upon this llama.cpp fork which supports AMD Radeon Instinct MI50-32GB cards ("gfx906"). That's how I discovered this card, and I noticed that some of them were amazingly cheap, around $250 used (the 16GB variant is even cheaper, around $100). I had been watching AI-capable hardware for a long time, considering various devices (Mac Ultra, AMD's AI-Max-395+, DGX Spark, etc.), knowing that memory size and bandwidth are always the two limiting factors: either you use a discrete GPU with little RAM but high bandwidth (suitable for text generation but small contexts), or you use a shared-memory system like those above, with lots of RAM but much lower bandwidth (more suitable for prompt processing and large contexts). This suddenly made me realize that this board, with 32 GB and 1024 GB/s of bandwidth, would offer both at once.
I checked on AliExpress, and cheap cards were everywhere, though all with very high shipping costs. I found a few on eBay in Europe at decent prices and bought one to give it a try. I was initially cautious because these cards require forced airflow, which can be complicated to set up and extremely noisy, with the risk that the card would end up never being used. Some builds I had seen used 4cm high-speed fans.
When I received the card, I disassembled it and found that it had dedicated room inside for a 75mm fan, as can be seen below:
However, I didn't know whether that would be sufficient to cool it, so I installed a temporary CPU fan on it, filling the gaps with foam:
Then I installed Ubuntu 24.04 on an SSD along with the amdgpu drivers and the ROCm stack, and built llama.cpp. It initially didn't work: every machine I tested the card in failed to boot, crashed, etc. I figured out that "Above 4G decoding" needed to be enabled in the BIOS... except that none of my test machines have that option. I had to buy a new motherboard (which I'll use to upgrade my PC) to test the card.
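For reference, the build boils down to something like the following. This is a sketch: the flag names match the current llama.cpp ROCm build instructions, but exact paths and ROCm versions vary from setup to setup.

```shell
# Sketch of a llama.cpp ROCm build targeting gfx906 (MI50); assumes the
# ROCm stack is installed and we are inside a llama.cpp checkout.
GFX_ARCH=gfx906
if command -v hipconfig >/dev/null 2>&1; then
  # Use ROCm's clang as the HIP compiler and target only gfx906
  HIPCXX="$(hipconfig -l)/clang" \
    cmake -S . -B build \
      -DGGML_HIP=ON \
      -DAMDGPU_TARGETS="$GFX_ARCH" \
      -DCMAKE_BUILD_TYPE=Release
  cmake --build build --config Release -j"$(nproc)"
else
  echo "hipconfig not found; install the ROCm stack first"
fi
```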
And there it worked fine, keeping the GPU around 80 degrees C. The card is quite fast thanks to its bandwidth: on Llama-7B-Q4_0 it processes about 1200-1300 tokens/s and generates around 100-110 tokens/s (without/with flash attention). That's about 23% of the prompt-processing speed of an RTX 3090 24GB and 68% of its generation speed, with 33% more RAM at 15% of the price!
It was then time to cut the cover. I designed a cut path in Inkscape and marked it with my laser engraver so I could cut it with my Dremel. It's not very difficult: the aluminum cover is only about 0.5mm thick, so it takes around 2 cutting discs and 15 minutes to cut it all:
I then installed the fan, here a Delta 12V 0.24A:
It was also necessary to plug the holes on the back: since it's wide open, some boards are sold with a shroud holding a loud fan that pushes air in from the back. I just cut some foam to the same shape as the back plate, and after a few attempts it worked pretty well:
Software testing
| model | size | params | backend | ngl | n_batch | n_ubatch | fa | test | t/s |
| ------------------------------ | --------: | ------: | ------- | --: | ------: | -------: | -: | ----: | ------------: |
| qwen3moe 30B.A3B Q5_K - Medium | 20.42 GiB | 30.53 B | ROCm | 99 | 1024 | 2048 | 1 | pp512 | 587.85 ± 0.00 |
| qwen3moe 30B.A3B Q5_K - Medium | 20.42 GiB | 30.53 B | ROCm | 99 | 1024 | 2048 | 1 | tg128 | 75.65 ± 0.00 |
| qwen3vl 32B Q4_K - Medium | 18.40 GiB | 32.76 B | ROCm | 99 | 1024 | 2048 | 1 | pp512 | 216.40 ± 0.00 |
| qwen3vl 32B Q4_K - Medium | 18.40 GiB | 32.76 B | ROCm | 99 | 1024 | 2048 | 1 | tg128 | 19.78 ± 0.00 |
| qwen3vl 32B Q4_0 | 17.41 GiB | 32.76 B | ROCm | 99 | 1024 | 2048 | 1 | pp512 | 278.20 ± 0.00 |
| qwen3vl 32B Q4_0 | 17.41 GiB | 32.76 B | ROCm | 99 | 1024 | 2048 | 1 | tg128 | 24.50 ± 0.00 |
build: 4206a600 (6905)
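These numbers come from llama-bench; an invocation matching the table's columns (ngl=99, n_batch=1024, n_ubatch=2048, fa=1) looks roughly like this. The model path is a hypothetical example.

```shell
# Sketch of a llama-bench run matching the table's columns;
# the model path is hypothetical, adjust to your own files.
MODEL="models/Qwen3-30B-A3B-Q5_K_M.gguf"
BENCH="./build/bin/llama-bench"
if [ -x "$BENCH" ]; then
  "$BENCH" -m "$MODEL" -ngl 99 -b 1024 -ub 2048 -fa 1
else
  echo "llama-bench not built yet: $BENCH"
fi
```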
I connected the fan to a FAN connector on the motherboard and adjusted the PWM to slow it down enough to keep it mostly silent. At around 35-40% speed, the noise is bearable and the temperature stabilizes at 100-104 degrees C while processing large inputs. It's also interesting to see that as soon as the board switches to generation, it becomes limited by the RAM bandwidth, so the GPU cores drop to a lower frequency and the card starts to cool again.
| model | size | params | backend | ngl | n_batch | n_ubatch | fa | test | t/s |
| ------------------------------ | --------: | ------: | ------- | --: | ------: | -------: | -: | ----: | ------------: |
| qwen3moe 30B.A3B Q5_K - Medium | 20.42 GiB | 30.53 B | ROCm | 99 | 1024 | 2048 | 1 | pp512 | 590.24 ± 0.00 |
| qwen3moe 30B.A3B Q5_K - Medium | 20.42 GiB | 30.53 B | ROCm | 99 | 1024 | 2048 | 1 | tg128 | 76.70 ± 0.00 |
| qwen3vl 32B Q4_K - Medium | 18.40 GiB | 32.76 B | ROCm | 99 | 1024 | 2048 | 1 | pp512 | 229.45 ± 0.00 |
| qwen3vl 32B Q4_K - Medium | 18.40 GiB | 32.76 B | ROCm | 99 | 1024 | 2048 | 1 | tg128 | 19.86 ± 0.00 |
| qwen3vl 32B Q4_0 | 17.41 GiB | 32.76 B | ROCm | 99 | 1024 | 2048 | 1 | pp512 | 259.87 ± 0.00 |
| qwen3vl 32B Q4_0 | 17.41 GiB | 32.76 B | ROCm | 99 | 1024 | 2048 | 1 | tg128 | 24.99 ± 0.00 |
build: db9783738 (7310)
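A fan curve along these lines (quiet at idle, ramping with temperature) can be sketched as a small shell helper. The thresholds, the ~35% idle duty and the hwmon path are all assumptions to adapt to your own board.

```shell
# Sketch: map the GPU edge temperature (degrees C) to a PWM duty (0-255):
# ~35% duty below 45 degrees, linear ramp to full speed at 70 and above.
fan_duty() {
  t=$1
  if [ "$t" -lt 45 ]; then
    echo 90                                      # ~35% duty, near silent
  elif [ "$t" -lt 70 ]; then
    echo $(( 90 + (t - 45) * (255 - 90) / 25 ))  # linear ramp 90 -> 255
  else
    echo 255                                     # full speed
  fi
}
# Example write (root required); the hwmon index is hypothetical:
# echo "$(fan_duty 80)" > /sys/class/hwmon/hwmon2/pwm1
```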
Scaling to two cards
$ cat /sys/class/hwmon/hwmon8/fan{6,7}_input
1536
1480
$ rocm-smi
=========================================== ROCm System Management Interface ===========================================
===================================================== Concise Info =====================================================
Device Node IDs Temp Power Partitions SCLK MCLK Fan Perf PwrCap VRAM% GPU%
(DID, GUID) (Edge) (Socket) (Mem, Compute, ID)
========================================================================================================================
0 1 0x66a1, 29631 38.0°C 20.0W N/A, N/A, 0 930Mhz 350Mhz 14.51% auto 225.0W 0% 0%
1 2 0x66a1, 8862 34.0°C 17.0W N/A, N/A, 0 930Mhz 350Mhz 14.51% auto 225.0W 0% 0%
========================================================================================================================
================================================= End of ROCm SMI Log ==================================================
| model | size | params | backend | ngl | n_batch | n_ubatch | fa | mmap | test | t/s |
| --------------------------------------- | ---------: | ---------: | ---------- | --: | ------: | -------: | -: | ---: | ----: | ------------: |
| qwen3vlmoe 235B.A22B IQ1_S - 1.5625 bpw | 50.67 GiB | 235.09 B | ROCm | 99 | 1024 | 2048 | 1 | 0 | pp512 | 137.10 ± 0.00 |
| qwen3vlmoe 235B.A22B IQ1_S - 1.5625 bpw | 50.67 GiB | 235.09 B | ROCm | 99 | 1024 | 2048 | 1 | 0 | tg128 | 19.93 ± 0.00 |
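With two cards, llama.cpp splits the model's layers across both GPUs by default; the split can also be controlled explicitly. A sketch, with a hypothetical model path:

```shell
# Sketch: run a model across both MI50s with an explicit layer split.
MODEL="models/Qwen3VL-235B-A22B-IQ1_S.gguf"   # hypothetical path
CLI="./build/bin/llama-cli"
if [ -x "$CLI" ]; then
  "$CLI" -m "$MODEL" -ngl 99 \
    --split-mode layer \
    --tensor-split 1,1 \
    -p "Hello"
else
  echo "llama-cli not built yet: $CLI"
fi
```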
Power management
$ rocm-smi -d 0 --setpoweroverdrive 180
$ rocm-smi -d 1 --setpoweroverdrive 180
$ rocm-smi
====================================================== Concise Info ======================================================
Device Node IDs Temp Power Partitions SCLK MCLK Fan Perf PwrCap VRAM% GPU%
(DID, GUID) (Edge) (Socket) (Mem, Compute, ID)
==========================================================================================================================
0 1 0x66a1, 29631 81.0°C 181.0W N/A, N/A, 0 1485Mhz 800Mhz 100.0% auto 180.0W 85% 100%
1 2 0x66a1, 8862 76.0°C 178.0W N/A, N/A, 0 1485Mhz 800Mhz 30.2% auto 180.0W 80% 100%
==========================================================================================================================
====================================================== Concise Info ======================================================
Device Node IDs Temp Power Partitions SCLK MCLK Fan Perf PwrCap VRAM% GPU%
(DID, GUID) (Edge) (Socket) (Mem, Compute, ID)
==========================================================================================================================
0 1 0x66a1, 29631 89.0°C 76.0W N/A, N/A, 0 930Mhz 350Mhz 100.0% auto 180.0W 59% 100%
1 2 0x66a1, 8862 86.0°C 152.0W N/A, N/A, 0 1485Mhz 800Mhz 30.59% auto 180.0W 59% 100%
==========================================================================================================================
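One caveat: the power cap set with rocm-smi typically does not survive a reboot, so it has to be reapplied at startup (e.g. from a systemd unit or rc.local). A sketch:

```shell
# Sketch: reapply the 180 W power cap to both cards at startup.
# Falls back to printing the command when rocm-smi isn't available.
apply_power_caps() {   # usage: apply_power_caps WATTS DEV...
  watts=$1; shift
  for dev in "$@"; do
    if command -v rocm-smi >/dev/null 2>&1; then
      rocm-smi -d "$dev" --setpoweroverdrive "$watts"
    else
      echo "would run: rocm-smi -d $dev --setpoweroverdrive $watts"
    fi
  done
}
apply_power_caps 180 0 1
```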



Hi, just FYI, AMD is changing its way of supporting cards. It's not off the table, and we as a community are trying to keep especially gfx906 on the table.
So far no one has successfully ventured into enabling Infinity Fabric for 2/3/4 cards (it's around 240GB/s per card), and it's unknown whether it can be made usable by LLMs. If so, this would break more ground. In any case, don't discount this card's future yet. Debian's ROCm build apparently supports it out of the box, too.
Some of the trouble is still with the extended ecosystem, i.e. some projects were in the process of dropping it because the policy at AMD _has not_ officially changed yet, which creates negative momentum. But I think the community is winning here. The other part is tinygrad and, in consequence, Exo. This is still open, but I hope it'll also work out, especially as rocm-smi etc. get better support.
The card per se has some weaknesses in lacking bfloat16 and flash attention (I honestly rarely understand the first and not at all the second), but as you mentioned, it wins with its fp32/fp64 features.
Potentially one day we'll also fully crack how to manage it with SR-IOV (some people already got it sorted on ESXi), and I think some of the weaknesses from missing instructions etc. will also be nullified to some extent.
Overall, there's a good story ahead: as people find more and more ways to shrink models, the card will make sense for complex models that suddenly fit into 32GB and can use its bandwidth to the max.
I also underclock mine using different firmware, and it serves as a desktop GPU alongside its compute tasks.
On a good day I'll optimize my fan control too, plus use some PCoIP or similar setup to put the server in its closet. I bought the card a few weeks before you, after doing price estimates and seeing it would not go lower, and the AMD ecosystem was no longer as much of a nightmare as it had been in mid-2024.
I would want a second one if the computations could parallelize better, but since it's a 2U server I can't fit them side by side, so I'll likely never do it.
Hi, thanks for sharing your thoughts on this. Yes, I've noticed some community efforts to keep gfx906 running. Apparently ROCm v9 should support it again as part of a generic driver (though I'm allowing myself to have doubts about the resulting performance). I'm also personally using the regularly rebased llama.cpp-gfx906 project, which is significantly faster than mainline; it also shows that this card is popular, given the amount of RAM it provides for a low price. I've seen as well the existence of this Infinity Fabric, but at 1/4 of the card's bandwidth I'm wondering how useful it will be.
For now I intend to use these cards as long as they remain usable with evolving software. They're definitely way faster than what I was doing on CPU, and I realize that since I assembled them I've no longer used llama.cpp on my Orion O6 nor on $work's Ampere Altra, which was my reference machine on this point.
Also regarding cooling, I've finally found a solution to automate the fan. My motherboard's PWM control is rubbish (it passes through IPMI and does whatever it wants). Instead, I bought these fan controllers: https://fr.aliexpress.com/item/1005006625689097.html and https://fr.aliexpress.com/item/1005008742111227.html , and installed the first one with the second one's sensor inside the card, placed at the output of the heat sink. I adjusted it to run silently at idle and to progressively ramp up between 45 and 70 degrees, and I now see the card run for long periods at 1600 to 1900 MHz (it no longer throttles). It's a bit noisy under load, but that doesn't last long, so it's OK.