I recently stumbled upon this llama.cpp fork that supports AMD Radeon Instinct MI50-32GB cards ("gfx906"). That's how I discovered this card and noticed that some of them were amazingly cheap, around $250 used; the 16GB variant is even cheaper, around $100. I had been watching AI-capable hardware for quite a long time, considering various devices (Mac Ultra, AMD's AI-Max-395+, DGX Spark, etc.), knowing that memory size and bandwidth are always the two limiting factors: either you use a discrete GPU with little RAM but high bandwidth (suitable for text generation but little context), or you use a shared-memory system like those above, with lots of RAM but much lower bandwidth (more suitable for prompt processing and large contexts). This suddenly made me realize that this board, with 32 GB and 1024 GB/s of bandwidth, offers both at once.
I checked on Aliexpress, where cheap cards were everywhere, though all with very high shipping costs. I found a few on eBay in Europe at decent prices and bought one to give it a try. I was initially cautious because these cards require forced airflow, which can be complicated to set up and extremely noisy, to the point where the card might end up never being used. Some assemblies I've seen rely on 4cm high-speed fans.
When I received the card, I disassembled it and found that it has room inside for a 75mm fan, as can be seen below:
I didn't know, however, whether that would be sufficient to cool it, so I temporarily installed a CPU fan on it, filling the gaps with foam:
Then I installed Ubuntu 24.04 on an SSD, along with the amdgpu drivers and the ROCm stack, and built llama.cpp. It initially didn't work: all the machines I tested the card in failed to boot, crashed, etc. I figured out that I needed "Above 4G decoding" enabled in the BIOS... except that none of my test machines have that option. I had to buy a new motherboard (which I'll use to upgrade my PC) to test the card.
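For reference, the build itself follows the standard llama.cpp HIP recipe once ROCm is installed; a minimal sketch (upstream repository and default paths shown, adjust if you use the gfx906 fork):

```sh
# Minimal sketch of a ROCm/HIP build targeting gfx906; assumes ROCm is
# installed in its default location and hipconfig is on the PATH.
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
    cmake -B build -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx906 -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j"$(nproc)"
```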
And there it worked fine, with the GPU staying around 80 degrees C. The card is quite fast thanks to its bandwidth: on Llama-7B Q4_0, it processes about 1200-1300 tokens/s and generates around 100-110 tokens/s (without/with flash attention). That's about 23% of the prompt-processing speed of an RTX 3090-24GB and 68% of its generation speed, for 33% more RAM at 15% of the price!
It was then time to cut the cover. I designed a cut path in Inkscape and marked it with my laser engraver so I could cut it with my Dremel. It's not very difficult; the cover is only about 0.5mm thick aluminum, so it takes around two cutting discs and 15 minutes to cut it all:
Then I installed the fan, here a Delta 12V 0.24A:
It was also necessary to plug the holes on the back: since it's wide open, some boards are sold with a shroud holding a loud fan that pushes air in from the back. I just cut some foam to the shape of the back, and after a few attempts it worked pretty well:
## Software testing
| model | size | params | backend | ngl | n_batch | n_ubatch | fa | test | t/s |
| ------------------------------ | --------: | ------: | ------- | --: | ------: | -------: | -: | ----: | ------------: |
| qwen3moe 30B.A3B Q5_K - Medium | 20.42 GiB | 30.53 B | ROCm | 99 | 1024 | 2048 | 1 | pp512 | 587.85 ± 0.00 |
| qwen3moe 30B.A3B Q5_K - Medium | 20.42 GiB | 30.53 B | ROCm | 99 | 1024 | 2048 | 1 | tg128 | 75.65 ± 0.00 |
| qwen3vl 32B Q4_K - Medium | 18.40 GiB | 32.76 B | ROCm | 99 | 1024 | 2048 | 1 | pp512 | 216.40 ± 0.00 |
| qwen3vl 32B Q4_K - Medium | 18.40 GiB | 32.76 B | ROCm | 99 | 1024 | 2048 | 1 | tg128 | 19.78 ± 0.00 |
| qwen3vl 32B Q4_0 | 17.41 GiB | 32.76 B | ROCm | 99 | 1024 | 2048 | 1 | pp512 | 278.20 ± 0.00 |
| qwen3vl 32B Q4_0 | 17.41 GiB | 32.76 B | ROCm | 99 | 1024 | 2048 | 1 | tg128 | 24.50 ± 0.00 |
build: 4206a600 (6905)
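The tables are llama-bench's markdown output; an invocation along these lines reproduces a row (the model path is a placeholder, the other flags match the table columns):

```sh
# Placeholder model path; -ngl/-b/-ub/-fa mirror the table columns above.
./build/bin/llama-bench -m /models/Qwen3-30B-A3B-Q5_K_M.gguf \
    -ngl 99 -b 1024 -ub 2048 -fa 1
```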
I connected the fan to a FAN header on the motherboard and adjusted the PWM to slow it down enough to keep it mostly silent. At around 35-40% speed, the noise is bearable and the temperature stabilizes at 100-104 degrees C while processing large inputs. It's also interesting to see that as soon as the board switches to generation, it's limited by the RAM bandwidth, so the GPU cores drop to a lower frequency and the card starts to cool down again.
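If you'd rather pin the duty cycle from software than rely on the motherboard's fan curve, the standard hwmon sysfs interface works too; a rough sketch (the hwmon index and pwm channel below are hypothetical and vary per board):

```sh
# Hypothetical hwmon/pwm numbers: check /sys/class/hwmon/*/name (or `sensors`)
# first to find the right controller and channel on your board.
echo 1   | sudo tee /sys/class/hwmon/hwmon2/pwm6_enable   # 1 = manual control
echo 100 | sudo tee /sys/class/hwmon/hwmon2/pwm6          # duty 0-255, ~40% here
```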
| model | size | params | backend | ngl | n_batch | n_ubatch | fa | test | t/s |
| ------------------------------ | --------: | ------: | ------- | --: | ------: | -------: | -: | ----: | ------------: |
| qwen3moe 30B.A3B Q5_K - Medium | 20.42 GiB | 30.53 B | ROCm | 99 | 1024 | 2048 | 1 | pp512 | 590.24 ± 0.00 |
| qwen3moe 30B.A3B Q5_K - Medium | 20.42 GiB | 30.53 B | ROCm | 99 | 1024 | 2048 | 1 | tg128 | 76.70 ± 0.00 |
| qwen3vl 32B Q4_K - Medium | 18.40 GiB | 32.76 B | ROCm | 99 | 1024 | 2048 | 1 | pp512 | 229.45 ± 0.00 |
| qwen3vl 32B Q4_K - Medium | 18.40 GiB | 32.76 B | ROCm | 99 | 1024 | 2048 | 1 | tg128 | 19.86 ± 0.00 |
| qwen3vl 32B Q4_0 | 17.41 GiB | 32.76 B | ROCm | 99 | 1024 | 2048 | 1 | pp512 | 259.87 ± 0.00 |
| qwen3vl 32B Q4_0 | 17.41 GiB | 32.76 B | ROCm | 99 | 1024 | 2048 | 1 | tg128 | 24.99 ± 0.00 |
build: db9783738 (7310)
## Scaling to two cards
```
$ cat /sys/class/hwmon/hwmon8/fan{6,7}_input
1536
1480
$ rocm-smi
=========================================== ROCm System Management Interface ===========================================
===================================================== Concise Info =====================================================
Device  Node  IDs              Temp    Power     Partitions          SCLK    MCLK    Fan     Perf  PwrCap  VRAM%  GPU%
              (DID,     GUID)  (Edge)  (Socket)  (Mem, Compute, ID)
=========================================================================================================================
0       1     0x66a1,   29631  38.0°C  20.0W     N/A, N/A, 0         930Mhz  350Mhz  14.51%  auto  225.0W  0%     0%
1       2     0x66a1,   8862   34.0°C  17.0W     N/A, N/A, 0         930Mhz  350Mhz  14.51%  auto  225.0W  0%     0%
=========================================================================================================================
================================================= End of ROCm SMI Log ==================================================
```
| model | size | params | backend | ngl | n_batch | n_ubatch | fa | mmap | test | t/s |
| --------------------------------------- | ---------: | ---------: | ---------- | --: | ------: | -------: | -: | ---: | ----: | ------------: |
| qwen3vlmoe 235B.A22B IQ1_S - 1.5625 bpw | 50.67 GiB | 235.09 B | ROCm | 99 | 1024 | 2048 | 1 | 0 | pp512 | 137.10 ± 0.00 |
| qwen3vlmoe 235B.A22B IQ1_S - 1.5625 bpw | 50.67 GiB | 235.09 B | ROCm | 99 | 1024 | 2048 | 1 | 0 | tg128 | 19.93 ± 0.00 |
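No extra flags are needed to use the second card: by default, llama.cpp splits the layers across all visible ROCm devices. The rows above correspond to something like the following llama-bench call (placeholder model path; -mmp 0 matches the mmap column):

```sh
# Both MI50s are used automatically via the default layer split;
# -mmp 0 disables mmap as in the table. The model path is a placeholder.
./build/bin/llama-bench -m /models/Qwen3-VL-235B-A22B-IQ1_S.gguf \
    -ngl 99 -b 1024 -ub 2048 -fa 1 -mmp 0
```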
## Power management
```
$ rocm-smi -d 0 --setpoweroverdrive 180
$ rocm-smi -d 1 --setpoweroverdrive 180
$ rocm-smi
====================================================== Concise Info ======================================================
Device  Node  IDs              Temp    Power     Partitions          SCLK     MCLK    Fan     Perf  PwrCap  VRAM%  GPU%
              (DID,     GUID)  (Edge)  (Socket)  (Mem, Compute, ID)
===========================================================================================================================
0       1     0x66a1,   29631  81.0°C  181.0W    N/A, N/A, 0         1485Mhz  800Mhz  100.0%  auto  180.0W  85%    100%
1       2     0x66a1,   8862   76.0°C  178.0W    N/A, N/A, 0         1485Mhz  800Mhz  30.2%   auto  180.0W  80%    100%
===========================================================================================================================

====================================================== Concise Info ======================================================
Device  Node  IDs              Temp    Power     Partitions          SCLK     MCLK    Fan     Perf  PwrCap  VRAM%  GPU%
              (DID,     GUID)  (Edge)  (Socket)  (Mem, Compute, ID)
===========================================================================================================================
0       1     0x66a1,   29631  89.0°C  76.0W     N/A, N/A, 0         930Mhz   350Mhz  100.0%  auto  180.0W  59%    100%
1       2     0x66a1,   8862   86.0°C  152.0W    N/A, N/A, 0         1485Mhz  800Mhz  30.59%  auto  180.0W  59%    100%
===========================================================================================================================
```
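Note that the power cap does not survive a reboot; if you want it to stick, reapply it at boot, for instance from an @reboot cron entry or a small systemd unit (just a sketch, not part of the setup above):

```sh
# Reapply the 180 W cap on both cards at boot (run as root).
for d in 0 1; do
    rocm-smi -d "$d" --setpoweroverdrive 180
done
```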


