Guide to Self Hosting LLMs Faster/Better than Ollama

brucethemoose@lemmy.world · edit-2 3 days ago

One can’t offload “usable” LLMs without tons of memory bandwidth and plenty of RAM. It’s just not physically possible.

You can run small models like Phi pretty quick, but I don’t think people will be satisfied with that for copilot, even as basic autocomplete.

About 2x faster than Intel’s current IGPs is the threshold where the offloading can happen, IMO. And that’s exactly what AMD/Apple are producing.

brucethemoose@lemmy.world · edit-2 3 days ago

The localllama crowd is supremely unimpressed with Intel, not just because of software issues but because they just don’t have beefy enough designs, like Apple does, and AMD will soon enough. Even the latest chips are simply not fast enough for a “smart” model, and the A770 doesn’t have enough VRAM to be worth the trouble.

They made some good contributions to runtimes, but seeing how they fired a bunch of engineers, I’m not sure that will continue.

brucethemoose@lemmy.world · edit-2 3 days ago

I wouldn’t call that “large.”

Strix Halo (256 bit LPDDR5X, 40 AMD CUs) is where I’d start calling integrated graphics “large.” Intel is going to remain a laughing stock in the gaming world without bigger designs than their little 128-bit IGPs.

brucethemoose@lemmy.world · edit-2 3 days ago

If they wanna abandon discrete GPUs… OK.

But they need graphics. They should make M Pro/Max-ish integrated GPUs like AMD is already planning on doing, with wide busses, instead of topping out at bottom-end configs.

They could turn around and sell them as GPU-accelerated servers too, like the market is begging for right now.

brucethemoose@lemmy.world · edit-2 6 days ago

I’m not sure how you’d solve the problem of big corpos becoming cheap content farms while avoiding harming the people who use these tools to make something rich and beautiful, but I have to believe there’s a way to thread that needle.

Easy, local AI.

Keep generative AI locally runnable instead of corporate hosted. Make it free, open and accessible. This gives the little guys the cost advantage, and takes away the scaling advantages of mega publishers. Lemmy users should be familiar with this concept.

Whenever I hear people rail against AI, I tell them they are handing the world to Sam Altman and his dystopia, who do not care about stealing content, equality, or them. I get a lot of hate for it. But they need to be fighting the corporate vs open AI battle instead.

brucethemoose@lemmy.world · edit-2 8 days ago

Sounds like it’d be nice if you had real control over the car’s software, and you could roll it back.

This… also makes me a little more weary driving around Teslas in traffic.

brucethemoose@lemmy.world · edit-2 11 days ago

The localllama people are feeling quite mixed about this, as they’re still charging through the nose for more RAM. Like, orders of magnitude more than the bigger ICs actually cost.

It’s kinda poetic. Apple wants to go all in on self-hosted AI now, yet their incredible RAM stinginess over the years is derailing that.

brucethemoose@lemmy.world · 11 days ago

I remember when SBF news was peaking right around the time Stable Diffusion 1.5 came out, and thinking of how fundamentally gutted the entire premise of an NFT was in like a month.

brucethemoose@lemmy.world · edit-2 11 days ago

Evangelists of the stuff will tell you that you can own your own digital corner of the information highway (Second Life came out in 2003, and most MMOs have housing), or that you can trade rare items with your fellow players (TF2 and Counter-Strike have been doing this forever). Then there’s this idea that you “own the item” in question more than you would otherwise (you don’t, you own a certificate that’s associated with it, and the item will vanish if the infrastructure does). Then there’s the whole “you could use a sword from one game in another game!” nonsense, which I think we can all agree was cooked up by people who don’t understand how game design works on even a fundamental level.

This is so on point for the web3 space, and parts of the AI space too.

Evangelists waltz in and berate you for not understanding how gloriously awesome their system is… without even making a cursory effort to check if it already exists, much less accumulate a deep understanding and appreciation like they expect you to do.

brucethemoose@lemmy.world · edit-2 11 days ago

There is a breaking point, eventually. YouTube’s trajectory is gonna make next quarter’s revenue great, but eventually something else will pick up user’s attention instead.

brucethemoose@lemmy.world · 11 days ago

I don’t even look at the algo anymore, I just go out and search for content externally.

brucethemoose@lemmy.world · edit-2 11 days ago

Maybe I am just out of touch, but I smell another bubble bursting when I look at how enshittified all major web services are simultaneously becoming.

It feels like something has to give, right?

We have YouTube, Reddit, Twitter, and more just racing to enshittify like I can’t even believe, Google Search is racing to destroy the internet, yet they’re also at the ‘critical mass’ of ‘too big to fail’ and shoved out all their major competitors already (other than Discord I guess).

brucethemoose@lemmy.world · 12 days ago

There are already open source/self hosted alternatives, like Perplexica.

brucethemoose@lemmy.world · 12 days ago

brucethemoose@lemmy.world · 12 days ago

CEO Tony Stubblebine says it “doesn’t matter” as long as ~~nobody reads it.~~ they keep generating sign-ups and selling ads… till next quarter, at least.

brucethemoose@lemmy.world · edit-2 12 days ago

Soldered is better! It’s sometimes faster, definitely faster if it happens to be lpddr.

But TBH the only thing that really matters his “how much VRAM do you have,” and Qwen 32B slots in at 24GB, or maybe 16GB if the GPU is totally empty and you tune your quantization carefully. And the cheapest way to that (until 2025) is a used MI60, P40 or 3090.

brucethemoose@lemmy.world · 12 days ago

TSMC doesn’t really have official opinions, they take silicon orders for money and shrug happily. Being neutral is good for business.

Altman’s scheme is just a whole other level of crazy though.

brucethemoose@lemmy.world · edit-2 12 days ago

It’s useful.

I keep Qwen 32B loaded on my desktop pretty much whenever its on, as an (unreliable) assistant to analyze or parse big texts, to do quick chores or write scripts, to bounce ideas off of or even as a offline replacement for google translate (though I specifically use aya 32B for that).

It does “feel” different when the LLM is local, as you can manipulate the prompt syntax so easily, hammer it with multiple requests that come back really fast when it seems to get something wrong, not worry about refusals or data leakage and such.

brucethemoose@lemmy.world · 12 days ago

the model seems ok for tasks like summarisation though

That and retrieval and the business use cases so far, but even then only if the results can be wrong somewhat frequently.

brucethemoose@lemmy.world · edit-2 12 days ago

the term AI will become every bit as radioactive to investors in the future as it is lucrative right now.

Well you say that, but somehow crypto is still around despite most schemes being (IMO) a much more explicit scam. We have politicans supporting it.

brucethemoose@lemmy.world · edit-2 27 days ago

Guide to Self Hosting LLMs Faster/Better than Ollama