Guide to Self Hosting LLMs Faster/Better than Ollama

brucethemoose@lemmy.world · 1 day ago

Could this ever be “self hosted” on a phone in the future? E.g., run as a web app, basically?

That would get around the issue of rate limiting for those of us with no home server.

That’s just a far flung idea though. Either way, this is amazing.

brucethemoose@lemmy.world · 1 day ago

Eh, most of the poison is the dark patterns in the UI, the relentless engagement optimization, algorithmic recommendations, the tracking, the ads, and so on.

This short circuits all of that.

You could still watch toxic influencers, but it’s not funneling you towards that anymore.

brucethemoose@lemmy.world · 12 days ago

In practice, they’re not very good because of broken FP16, broken kernels, high idle usage and a bunch of other things.

Same with the AMD MI50 and MI100. Looks great on paper, not practical IRL, unless you want to pay a whole team of software devs to fix them for you.

Better to just save up for a 2080 TI or 3090, sadly.

brucethemoose@lemmy.world · 12 days ago

No.

Even the biggest open weights models are trained on pennies compared to OpenAI and Claude. They just don’t have the hardware to be so wasteful.

In fact, the Nvidia GPU ban was the best thing to ever happen to “small” AI devs. It made them thrifty.

brucethemoose@lemmy.world · 12 days ago

MoEs can be very fast with hybrid inference. I run Xiaomi Mimo 2.5 (a 310B model, 116GB weights) on my single 3090 + 7800 CPU, and it outputs faster than I can read it.

It’s also easier to fit long context, if you need that.

It’s best to use the ik_llama.cpp fork for that, though. It gives a huge boost to hybrid MoE speeds.
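
A rough sketch of why hybrid inference works for a model this size. The 116GB of weights comes from the comment above and 24GB is the 3090's VRAM; the VRAM reserved for context, the RAM bandwidth, and the amount of active expert weight read per token are all assumptions for illustration, not measured values.

```python
def cpu_resident_gb(weights_gb, vram_gb, vram_reserved_gb):
    """Weights that do not fit on the GPU and must stay in system RAM."""
    usable_vram = max(vram_gb - vram_reserved_gb, 0.0)
    return max(weights_gb - usable_vram, 0.0)


def decode_ceiling_tok_s(active_gb_from_ram, ram_bandwidth_gb_s):
    """Crude upper bound on decode speed: every token has to stream the
    active CPU-side weights out of RAM once, so bandwidth is the limit."""
    return ram_bandwidth_gb_s / active_gb_from_ram


weights_gb = 116.0       # from the comment
vram_gb = 24.0           # RTX 3090
vram_reserved_gb = 4.0   # assumption: room for KV cache and buffers

spill = cpu_resident_gb(weights_gb, vram_gb, vram_reserved_gb)
print(f"Weights left in system RAM: {spill:.0f} GB")

# Assumptions: ~5 GB of active expert weights touched per token,
# ~60 GB/s of usable RAM bandwidth. Swap in your own numbers.
print(f"Decode ceiling: {decode_ceiling_tok_s(5.0, 60.0):.0f} tok/s")
```

The point is that an MoE only reads a small slice of its weights per token, so the RAM-bound ceiling depends on the active size rather than the full 116GB.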

brucethemoose@lemmy.world · 12 days ago

Probably Qwen 35B then. ~9GB of free VRAM + (let’s say) ~16GB of free CPU RAM is a good size for that, and squeezing bigger models in would be hard unless it’s a headless Linux server.
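
As a back-of-the-envelope check on that sizing, here is a minimal sketch. The ~9GB and ~16GB figures are from the comment; the 4.5 bits per weight is an assumed average for a 4-bit-class quantization, and the estimate ignores context cache and runtime overhead.

```python
def quantized_weights_gb(params_billion, bits_per_weight):
    """Approximate weight size in GB: parameters times bits, divided by 8."""
    return params_billion * bits_per_weight / 8


free_vram_gb = 9.0     # from the comment
free_ram_gb = 16.0     # from the comment
budget_gb = free_vram_gb + free_ram_gb

# Assumption: ~4.5 bits per weight on average for a 4-bit-class quant.
size_gb = quantized_weights_gb(35, 4.5)

print(f"Estimated weights: {size_gb:.1f} GB of a {budget_gb:.0f} GB budget")
print(f"Headroom for context and overhead: {budget_gb - size_gb:.1f} GB")
```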

brucethemoose@lemmy.world · 13 days ago

Depends on how much CPU RAM you have, and how fast it is.

As others said, Qwen 35B at the very least.

brucethemoose@lemmy.world · 13 days ago

Yeah.

It’s not even about efficiency, really, but independence from corporations, privacy, and principle. Kind of like Lemmy.

brucethemoose@lemmy.world · edit-2 13 days ago

This is a “feel guilty about missing recycling” kind of complaint.

Having a server run for an hour or two (?) a day is negligible. You use more energy running a fridge, or leaving a few lights on, or browsing Lemmy for a while. Or running a docker container for other services. You release more greenhouse gasses eating beef, or driving anywhere, or even opening your front door a few times, and individual industries are going to use vastly more electricity than a few self hosters ever would. If you own an EV, you’ve probably blown out your entire zip code of self hosters.

…But if it still bothers you, you can find an e-waste smartphone (or several) and host on that. This is actually a very neat use case IMO.

However, if you get to the homelab scale of “an EPYC + 3090s running all the time”, that electricity use does start to add up. But that’s quite a rare hobbyist tier, I’d say, and it really shouldn’t be running 24/7.
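
To put rough numbers on the difference between “an hour or two a day” and “running all the time”, here is a small sketch. Both wattages are made-up placeholders, not measurements of any real machine; plug in readings from your own wall meter.

```python
def daily_kwh(average_watts, hours_per_day):
    """Energy used per day in kWh: watts times hours, divided by 1000."""
    return average_watts * hours_per_day / 1000


# Assumption: a single-GPU box drawing ~300 W while in use, 2 hours a day.
occasional = daily_kwh(300, 2)

# Assumption: a multi-GPU homelab averaging ~500 W around the clock.
always_on = daily_kwh(500, 24)

print(f"Occasional use: {occasional:.1f} kWh/day")
print(f"Always-on homelab: {always_on:.1f} kWh/day")
print(f"Ratio: {always_on / occasional:.0f}x")
```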

brucethemoose@lemmy.world · 19 days ago

As an observer in these comments, this is a great answer. Thanks for typing it out.

It does seem like some “cuts” could be ironed out reasonably quickly, like the file naming issue or UI lag.

brucethemoose@lemmy.world · 1 month ago

Partially configured some parts via LLM but please don’t crucify me for that.

Slap in a spare GPU, and self-host one!

The 30B-class models are unbelievably good now, for being so small. They’re kinda where Claude was like a year ago, if not less. And (with the right backend) they aren’t expensive to host.

brucethemoose@lemmy.world · edit-2 2 years ago

It’s less optimal.

On a 3090, I simply can’t run Command-R or Qwen 2.5 32B well at 64K-80K context with ollama. It’s slow even at lower context, and the lack of DRY sampling and some other things majorly hit quality.

Ollama is meant to be turnkey, and that’s fine, but LLMs are extremely resource intensive. Sometimes the manual setup/configuration is worth it to squeeze out every ounce of extra performance and quantization quality.

Even on CPU-only setups, you are missing out on (for instance) the CPU-optimized quantizations llama.cpp offers now, or the more advanced sampling kobold.cpp offers, or more fine grained tuning of flash attention configs, or batched inference, just to start.

And as I hinted at, I don’t like some other aspects of ollama, like how they “leech” off llama.cpp and kinda hide the association without contributing upstream, some hype and controversies in the past, and hints that they may be cooking up something commercial.
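
For a sense of why 64K-80K context is tight on a 24GB card, here is a sketch of the usual KV cache size estimate for a standard attention layout. The layer count, KV head count, and head dimension below are hypothetical placeholders, not the real configuration of Command-R or Qwen 2.5; read the actual values from the model's config file.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem):
    """Keys plus values (the factor of 2), stored for every layer,
    every KV head, and every token in the context."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem


GIB = 1024 ** 3

# Hypothetical model shape, for illustration only.
n_layers, n_kv_heads, head_dim = 64, 8, 128
context_len = 64 * 1024  # 64K tokens

fp16 = kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, 2)
q8 = kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, 1)

print(f"FP16 cache at 64K: {fp16 / GIB:.0f} GiB")
print(f"~8-bit cache at 64K: {q8 / GIB:.0f} GiB")
```

With these placeholder numbers, the cache alone would take most of a 24GB card at FP16, which is why cache quantization and flash attention settings matter so much at long context.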

brucethemoose@lemmy.world · edit-2 2 years ago

Guide to Self Hosting LLMs Faster/Better than Ollama