this post was submitted on 09 Apr 2025
51 points (100.0% liked)

Selfhosted


The problem is simple: consumer motherboards don't have that many PCIe slots, and consumer CPUs don't have enough lanes to run 3+ GPUs at full PCIe gen 3 or gen 4 speeds.

My idea was to buy 3-4 computers for cheap, slot a GPU into each of them, and use them in tandem. I imagine this will require some sort of agent running on each node, with the nodes connected over a 10GbE network. I can get a 10GbE network running for this project.

Does Ollama or any other local AI project support this? Getting a server motherboard with a CPU is going to get expensive very quickly, so this would be a great alternative.

Thanks

top 50 comments
[–] [email protected] 34 points 3 weeks ago (2 children)

A 10 Gbps network is MUCH slower than even the smallest, oldest PCIe slot you have. So cramming the GPUs into any old slot that'll fit is a much better option than distributing them over multiple PCs.

[–] [email protected] 13 points 3 weeks ago (1 children)

I agree with the idea of not using a 10 Gbps network for GPU work. Just one small nitpick: PCIe Gen 1 in an x1 slot is only capable of 2.5 GT/s, which translates to about 2 Gbit/s, making it about 5x slower than a 10 Gbps line-rate network.

I sincerely hope OP is not running modern AI work on a mobo with only Gen 1...
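As a quick sanity check of those numbers (Python as a calculator): PCIe Gen 1 uses 8b/10b encoding, so the 2.5 GT/s raw rate works out to roughly 2 Gbit/s of usable bandwidth per lane.

```python
# Back-of-the-envelope check of the PCIe Gen 1 x1 vs. 10GbE comparison above.
pcie_gen1_x1_gbit = 2.5 * (8 / 10)       # 8b/10b encoding: 2.5 GT/s -> ~2.0 Gbit/s usable
ten_gbe_gbit = 10.0                      # 10GbE line rate, before protocol overhead

print(pcie_gen1_x1_gbit)                 # 2.0
print(ten_gbe_gbit / pcie_gen1_x1_gbit)  # 5.0 -> the "about 5x slower" figure
```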

[–] [email protected] 3 points 3 weeks ago

Thanks for the comment. I don't want to use a networked distributed cluster for AI if I can help it. I'm looking at other options and maybe I'll find something

[–] [email protected] 3 points 3 weeks ago

Your point is valid. Originally I was looking for deals on cheap CPU + motherboard combos that would offer a lot of PCIe lanes without being very expensive, but I couldn't find anything good for EPYC. I'm now looking at used Supermicro motherboards and maybe I'll find something I like. I didn't want to do networking for this project either, but it was the only idea I could think of a few hours ago.

[–] [email protected] 17 points 3 weeks ago (1 children)

There are several solutions:

https://github.com/b4rtaz/distributed-llama

https://github.com/exo-explore/exo

https://github.com/kalavai-net/kalavai-client

https://petals.dev/

Didn't try any of them and haven't looked for 6 months, so maybe something better has arrived since.
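For a taste of what the Petals approach looks like on the client side, here is a minimal sketch (untested; the model name is just a placeholder, and a private swarm would need servers actually hosting that model):

```python
# Rough Petals client sketch based on its documented API; model name is a placeholder.
from transformers import AutoTokenizer
from petals import AutoDistributedModelForCausalLM

model_name = "bigscience/bloom-560m"  # placeholder: use a model your swarm serves
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoDistributedModelForCausalLM.from_pretrained(model_name)

# Layers of the model are served by remote peers; generate() streams activations
# over the network between them.
inputs = tokenizer("Self-hosted clusters are", return_tensors="pt")["input_ids"]
outputs = model.generate(inputs, max_new_tokens=10)
print(tokenizer.decode(outputs[0]))
```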

[–] [email protected] 6 points 3 weeks ago (1 children)

Thank you for the links. I will go through them

[–] [email protected] 2 points 3 weeks ago (1 children)

I've tried Exo and it worked fairly well for me. Combined my 7900 XTX, GTX 1070, and M2 MacBook Pro.

[–] [email protected] 2 points 3 weeks ago

+1 on exo, worked for me across the 7900xtx, 6800xt, and 1070ti

[–] [email protected] 11 points 3 weeks ago* (last edited 3 weeks ago) (1 children)

Basically no GPU needs a full PCIe x16 slot to run at full speed. There are motherboards out there which will give you 3 or 4 slots of PCIe x8 electrical (x16 physical). I would look into those.

Edit: If you are willing to buy a board that supports AMD Epyc processors, you can get boards with basically as many PCIe slots as you could ever hope for. But that is almost certainly overkill for this task.

[–] [email protected] 4 points 3 weeks ago (3 children)

Aren't Epyc boards really expensive? I was going to buy 3-4 used computers and stuff a GPU in each.

Are there motherboards on the used market that can run the E5-2600 V4 series CPUs and have multiple PCIe x16 slots? The only ones I found were super expensive or esoteric.

[–] [email protected] 4 points 3 weeks ago* (last edited 3 weeks ago) (1 children)

Prior-gen Epyc boards show up on eBay from time to time, often as CPU+mobo bundles from Chinese datacenters that are upgrading to latest gen. These can be had for a deal, if they're still available, and would provide PCIe lanes for days.

[–] [email protected] 3 points 3 weeks ago* (last edited 3 weeks ago) (2 children)

Yeah, adding to your post, Threadripper also has lots of PCIe lanes. Here is one that has 4 x16 slots. And, note, I am not endorsing that specific listing. I did very minimal research on that listing, just using it as an example.

Edit: Marauding_gibberish, if you need/want AM5: X670E motherboards have a good number of PCIe lanes and can be bought used now (X870E is the newest AM5 generation with lots of lanes as well, but both pale in comparison to what you can get with Epyc or Threadripper).

[–] [email protected] 2 points 3 weeks ago

Thanks for the tip on x670, I'll take a look

[–] [email protected] 2 points 3 weeks ago (1 children)

I see. I must be doing something wrong because the only ones I found were over $1000 on eBay. Do you have any tips/favoured listings?

[–] [email protected] 2 points 3 weeks ago* (last edited 3 weeks ago) (1 children)

Hey, I built a micro-ATX EPYC system for work that has tons of PCIe slots. Pretty sure it was an ASRock (or ASRock Rack) board. I can find the details tomorrow if you'd like. Just let me know!

E: well, it looks like I remembered wrong and it was ATX, not micro-ATX. I think it is the ASRock Rack ROMED8-2T, and it has 7 PCIe 4.0 x16 slots (I needed a lot). Unfortunately I don't think it's sold anymore, other than at really high prices on eBay.

[–] [email protected] 3 points 3 weeks ago (1 children)

Thank you, and that highlights the problem: I don't see any affordable options (around $200 or so for a motherboard + CPU combo) for a lot of PCIe lanes, other than purchasing Frankenstein boards from AliExpress, which isn't going to be a thing for much longer with tariffs. So I'm looking elsewhere.

[–] [email protected] 4 points 3 weeks ago

Yes, I inadvertently emphasized your challenge :-/

[–] [email protected] 1 points 3 weeks ago (2 children)

Wow, so you want to use inefficient models super cheap. I guarantee nobody has ever thought of this before. Good move coming to Lemmy for tips on how to do so. I bet you're the next Sam Altman 🤣

[–] [email protected] 5 points 3 weeks ago

I don't understand your point, but I was going to use 4 GPUs (something like used 3090s when they get cheaper or the Arc B580s) to run the smaller models like Mistral small.

[–] [email protected] 8 points 3 weeks ago (1 children)

consumer motherboards don’t have that many PCIe slots

The number of PCIe slots isn't the most limiting factor when it comes to consumer motherboards. It's the number of PCIe lanes your CPU supports and the motherboard gives you access to.

It's difficult to find non-server-focused hardware that can do something like this, because you need a significant number of PCIe lanes from your CPU to feed several GPUs at full speed. Using an M.2 SSD as well? Even more difficult.

Your 1-GPU-per-machine idea is a decent approach. Using a Kubernetes cluster with device plugins is likely the best way to accomplish what you want here. It would involve setting up your cluster, then installing the GPU drivers and device plugin on each node, which exposes the device to the system. Then, when you create your Ollama container, ensure (in the prestart hook) that your GPUs are exposed to the container for usage.
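A minimal sketch of the GPU-request part, using the Kubernetes Python client to schedule an Ollama pod that claims one GPU (this assumes the NVIDIA device plugin is installed on each node; the names, image tag, and namespace are placeholders):

```python
# Sketch: schedule an Ollama pod that claims one GPU via the device plugin.
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when running inside the cluster

pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "ollama-gpu"},
    "spec": {
        "containers": [{
            "name": "ollama",
            "image": "ollama/ollama:latest",
            "ports": [{"containerPort": 11434}],   # Ollama's default API port
            # The device plugin exposes GPUs as a schedulable resource; requesting
            # one pins this pod to a node with a free GPU and wires it through.
            "resources": {"limits": {"nvidia.com/gpu": 1}},
        }],
    },
}

client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```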

The issue with doing this is that 10GbE is very slow compared to a GPU on PCIe. You're networking all these GPUs to do some cool stuff, but then severely bottlenecking yourself with your network. All in all, it's not a very good plan.

[–] [email protected] 2 points 3 weeks ago

I agree with your assessment. I was indeed going to run k8s, just hadn't figured out what you told me. Thanks for that.

And yes, I realised that 10GbE is just not enough for this stuff. But another commenter told me to look for used Threadripper and EPYC boards (which are extremely expensive for me), which gave me the idea of looking for older Intel CPU + motherboard combos. Maybe I'll have some luck there. I was going to use Talos in a VM with all the GPUs passed through to it.

[–] [email protected] 7 points 3 weeks ago (2 children)

You're entering the realm of enterprise AI horizontal scaling which is $$$$

[–] [email protected] 4 points 3 weeks ago (3 children)

I'm not going to do anything enterprise. I'm not sure why people keep framing it that way when I didn't even mention it.

I plan to use 4 GPUs with 16-24GB VRAM each to run smaller 24B models.

[–] [email protected] 7 points 3 weeks ago (1 children)

I didn't say you were, I said you were asking about a topic that enters that area.

[–] [email protected] 2 points 3 weeks ago

I see. Thanks

[–] [email protected] 4 points 3 weeks ago (2 children)

well that looks like small enterprise scale

[–] [email protected] 2 points 3 weeks ago (1 children)

I’m not going to do anything enterprise.

You are, though. You're creating a GPU cluster for generative AI which is an enterprise endeavor...

[–] [email protected] 2 points 3 weeks ago

Specifically because PCIe slots go for a premium on motherboards and CPU architectures. If I didn't have to worry about PCIe I wouldn't care about a networked AI cluster. But yes, I accept what you say

[–] [email protected] 7 points 3 weeks ago (2 children)
[–] [email protected] 3 points 3 weeks ago

Thank you, I'll take a look

[–] [email protected] 2 points 3 weeks ago

That looks interesting. Might have to check it out.

[–] [email protected] 5 points 3 weeks ago (1 children)

Maybe you want something like a Beowulf Cluster?

[–] [email protected] 3 points 3 weeks ago (2 children)

Never heard of it. What is it about?

[–] [email protected] 3 points 3 weeks ago (1 children)

It's a way to do distributed parallel computing using consumer-grade hardware. I don't actually know a ton about them, so you'd be better served by looking up information about them.

https://en.wikipedia.org/wiki/Beowulf_cluster
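To make the idea a bit more concrete: the classic building block on a Beowulf cluster is MPI, with one process per node. A minimal mpi4py sketch (assuming mpi4py and an MPI implementation such as OpenMPI are installed on every machine):

```python
# hello.py -- minimal MPI "hello" across a cluster of cheap nodes.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()   # this process's ID within the cluster
size = comm.Get_size()   # total number of processes across all nodes

print(f"Hello from rank {rank} of {size} on {MPI.Get_processor_name()}")
```

You'd launch it across the nodes with something like `mpirun -np 4 --hostfile hosts python hello.py`.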

[–] [email protected] 4 points 3 weeks ago* (last edited 3 weeks ago) (5 children)

If you want to use supercomputer software, set up the SLURM scheduler on those machines. There are many tutorials on how to do distributed GPU computing with SLURM. I have it on my to-do list.
https://github.com/SchedMD/slurm
https://slurm.schedmd.com/
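As a rough illustration of the shape of a SLURM job for this (the node count, GRES setup, and worker command are placeholders):

```python
# Sketch: write a minimal SLURM batch script and submit it with sbatch.
import subprocess, textwrap

job_script = textwrap.dedent("""\
    #!/bin/bash
    #SBATCH --job-name=llm-infer
    #SBATCH --nodes=4              # one cheap PC per node
    #SBATCH --ntasks-per-node=1
    #SBATCH --gres=gpu:1           # one GPU per node (requires GRES configured)
    srun ./run_inference.sh        # placeholder worker command
""")

with open("llm_job.sbatch", "w") as f:
    f.write(job_script)

# sbatch queues the job; SLURM decides placement across the nodes.
subprocess.run(["sbatch", "llm_job.sbatch"], check=True)
```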

[–] [email protected] 4 points 3 weeks ago (1 children)

Ignorant here. Would mining rigs work for this?

[–] [email protected] 2 points 3 weeks ago

I think yes

[–] [email protected] 3 points 3 weeks ago* (last edited 3 weeks ago) (1 children)

Why?

You're trying to run a DC setup in your home for AI bullshit?

[–] [email protected] 6 points 3 weeks ago (5 children)

It is because modern consumer GPUs do not have enough VRAM to load the 24B models. I want to run Mistral small locally.

[–] [email protected] 4 points 3 weeks ago* (last edited 3 weeks ago) (4 children)

Maybe take a look at systems with the newer AMD SoCs first. They utilize the system's RAM and come with a proper NPU; once ollama or mistral.rs support those, they might give you sufficient performance for your needs at way lower cost (incl. power consumption). Depending on how NPU support gets implemented, it might even become possible to use NPU and GPU in tandem, which would probably enable pretty powerful models to run on consumer-grade hardware at reasonable speed.

[–] [email protected] 2 points 3 weeks ago (1 children)

Thanks, but will NPUs integrated along with the CPU ever match the performance of a discrete GPU?

[–] [email protected] 2 points 3 weeks ago

Depends on which GPU you compare it with, what model you use, what kind of RAM it has to work with, et cetera. NPUs are purpose-built chips, after all. Unfortunately the whole tech is still very young, so we'll have to wait for stuff like ollama to introduce native support before we get an apples-to-apples comparison. The raw numbers do look promising, however.

[–] [email protected] 3 points 3 weeks ago* (last edited 3 weeks ago) (1 children)

This is false: Mistral Small 24B at q4_K_M quantization is 15 GB; q8 is 26 GB. A 3090/4090/5090 with 24 GB, or two cards with 16 GB each (I recommend the 4060 Ti 16 GB), will work fine with this model, and will work in a single computer. Like others have said, 10GbE will be a huge bottleneck, plus it's simply not necessary to distribute a 24B model across multiple machines.
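For scale: once the model fits on a single box, running it through the ollama Python bindings is only a few lines (the model tag below is an assumption; check the Ollama model library for the exact name):

```python
# Sketch: chat with a locally served Mistral Small via the ollama Python package.
import ollama

response = ollama.chat(
    model="mistral-small:24b",  # assumed tag; verify against `ollama list`
    messages=[{"role": "user", "content": "Summarize why PCIe lanes matter for multi-GPU."}],
)
print(response["message"]["content"])
```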

[–] [email protected] 2 points 3 weeks ago (1 children)

Thank you, but which consumer motherboard + CPU combo is giving me 32 lanes of PCIe Gen 4 neatly divided into 2 x16 slots for me to put 2 GPUs in? I only asked this question because I was going to buy used computers and stuff a GPU in each.

Your point about networking is valid, and I'll be hesitant to invest in 25GbE right now.

[–] [email protected] 4 points 3 weeks ago* (last edited 3 weeks ago) (1 children)

You don't need the cards to have full bandwidth; the only time it will matter is when you're loading the models onto the card. You need a motherboard with x16 slots, but even x4 connections would be good enough. Running the model doesn't need a lot of bandwidth. Remember, you only load the model once and then reuse it.

An x4 PCIe Gen 4 slot has a ~7.8 GB/s theoretical transfer rate (after overhead); an x16 slot has ~31.5 GB/s. So disk I/O is likely your limit even for an x4 slot. (Protocol overhead is already accounted for in those numbers.)
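The arithmetic behind those figures, and what it means for model load time (rough numbers; real-world throughput will be somewhat lower):

```python
# PCIe Gen 4: 16 GT/s per lane with 128b/130b encoding.
per_lane_GBps = 16 * (128 / 130) / 8      # ~1.97 GB/s per lane
x4_GBps = 4 * per_lane_GBps               # ~7.9 GB/s
x16_GBps = 16 * per_lane_GBps             # ~31.5 GB/s

model_GB = 15                             # Mistral Small 24B at q4_K_M (figure from above)
print(round(x4_GBps, 1), round(x16_GBps, 1))
print(round(model_GB / x4_GBps, 1))       # ~1.9 s to push the weights over an x4 link
```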
[–] [email protected] 2 points 3 weeks ago

I see. That solves a lot of the headaches I imagined I would have. Thank you so much for clearing that up
