Yeah, that’s problematic, heh.
To be fair, I do wish more privacy-friendly browsers took DDG mobile’s approach, namely torch all sites but make it really easy to (and prompt you to) whitelist frequently used ones.
You can permanently disable the chatbot in full DDG search. Click the little gear.
It does make me wonder what API they use. I thought it was huggingface (which would be less bad), but they don’t say it explicitly.
Yeah. But it also messes stuff up from the llama.cpp baseline, and hides or doesn’t support some features/optimizations, and definitely doesn’t support the more efficient iq_k quants of ik_llama.cpp and its specialized MoE offloading.
And that’s not even getting into the various controversies around ollama (like broken GGUFs or indications they’re going closed source in some form).
…It just depends on how much performance you want to squeeze out, and how much time you want to spend on the endeavor. Small LLMs are kinda marginal though, so IMO the optimization is important if you really want to give it a try; otherwise one is probably better off spending a few bucks on an API that doesn’t log requests.
In case I miss your reply, assuming a 3080 + 64 GB of RAM, you want the IQ4_KSS (or IQ3_KS, for more RAM for tabs and stuff) version of this:
https://huggingface.co/ubergarm/GLM-4.5-Air-GGUF
Part of it will run on your GPU, part will live in system RAM, but ik_llama.cpp handles the quantization, the CPU/GPU split, and the offloading in a particularly efficient way for these kinds of ‘MoE’ models. Follow the instructions on that page.
If you ‘only’ have 32GB of RAM or less, that’s trickier, and the next question is what kind of speeds you want. But it’s probably best to wait a few days and see how Qwen3 80B looks when it comes out. Or just go with the IQ4_K version of this: https://huggingface.co/ubergarm/Qwen3-30B-A3B-Thinking-2507-GGUF
And you don’t strictly need the hyper-optimization of ik_llama.cpp for a small model like Qwen3 30B. Something easier like LM Studio or the llama.cpp Docker image would be fine.
Alternatively, you could try to squeeze Gemma 27B into that 11GB VRAM, but it would be tight.
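If it helps, here’s the napkin math I use to judge fit; the parameter count is nominal, and the bits-per-weight and overhead figures are rough assumptions, not exact GGUF sizes:

```python
# Napkin math: does a dense model fit in a given VRAM budget?
# Assumptions: nominal parameter count, rough bits-per-weight for the quant,
# and a flat allowance for KV cache / runtime buffers.

def quantized_footprint_gb(params_billions: float, bits_per_weight: float,
                           overhead_gb: float = 1.5) -> float:
    weights_gb = params_billions * 1e9 * bits_per_weight / 8 / 1e9
    return weights_gb + overhead_gb

# Gemma 27B at ~4.5 bits/weight (a typical Q4-ish GGUF): ~16.7 GB -> no chance in 11 GB
print(quantized_footprint_gb(27, 4.5))
# At ~3 bits/weight: ~11.6 GB -> still doesn't quite fit, hence "tight"
print(quantized_footprint_gb(27, 3.0))
```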
How much system RAM, and what kind? DDR5?
ik doesn’t have great documentation, so it’d be a lot easier for me to just point you places, heh.
At risk of getting more technical, ik_llama.cpp has a good built in webui:
https://github.com/ikawrakow/ik_llama.cpp/
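And if you’d rather script against it than click around the webui, the same llama-server process also exposes an OpenAI-style HTTP endpoint (that’s how upstream llama.cpp’s server works, and I believe the fork matches, but double-check the repo). The host, port, and model name below are just whatever you launched it with, not fixed values:

```python
# Minimal sketch of hitting a local llama-server from a script instead of the webui.
# Assumptions: the server is listening on 127.0.0.1:8080 and exposes the usual
# OpenAI-compatible /v1/chat/completions route (true of upstream llama.cpp's server).
import requests

resp = requests.post(
    "http://127.0.0.1:8080/v1/chat/completions",
    json={
        "model": "local",  # llama-server typically serves whatever model it was launched with
        "messages": [
            {"role": "user", "content": "Explain what an MoE model is in two sentences."}
        ],
        "max_tokens": 200,
    },
    timeout=300,
)
print(resp.json()["choices"][0]["message"]["content"])
```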
Getting more technical, it’s also way better than ollama. You can run way smarter models on the same hardware than ollama can manage.
For reference, I’m running GLM-4 (667 GB of raw weights) on a single RTX 3090/Ryzen gaming rig, at reading speed, with pretty low quantization distortion.
And if you want a ‘look this up on the internet for me’ assistant (which you need for them to be truly useful), you need another docker project as well.
…That’s just how LLM self-hosting is now. It’s simply too hardware-intensive and ad hoc to be easy and smart and cheap. You can indeed host a small ‘default’ LLM without much tinkering, but it’s going to be pretty dumb, and pretty slow on ollama defaults.
Mobile 5090 would be an underclocked, binned desktop 5080, AFAIK:
https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units#GeForce_50_series
In KCD2 (a fantastic CryEngine game, a great benchmark IMO) at QHD, the APU is a hair less than half as fast: 39 FPS vs 84 FPS for the mobile 5090:
https://www.notebookcheck.net/Nvidia-GeForce-RTX-5090-Laptop-Benchmarks-and-Specs.934947.0.html
https://www.notebookcheck.net/AMD-Radeon-8060S-Benchmarks-and-Specs.942049.0.html
Synthetic benchmark comparisons between the two are on those pages as well.
But these are both presumably running at high TDP (150W for the 5090). Also, the mobile 5090 is catastrophically overpriced and inevitably tied to a weaker CPU, whereas the APU is a monster of a CPU. So make of that what you will.
Oh wow, that’s awesome! I didn’t know folks ran TDP tests like this, just that my old 3090 seems to have a minimum sweet spot around that same ~200W based on my own testing, but I figured the 4000 or 5000 series might go lower. Apparently not, at least for the big die.
I also figured the 395 would draw more than 55W! That’s also awesome! I suspect newer, smaller GPUs like the 9000 or 5000 series still make the value proposition questionable, but you still make an excellent point.
And for reference, I just checked, and my dGPU hovers around 30W idle with no display connected.
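(For anyone wondering how I check that: I just poll nvidia-smi’s power readout. A rough sketch; the 10 × 1s sampling loop is arbitrary:)

```python
# Rough idle-power spot check on an NVIDIA dGPU: poll nvidia-smi's power.draw field.
# The query flags are standard nvidia-smi; the sampling interval is arbitrary.
import subprocess
import time

samples = []
for _ in range(10):
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=power.draw", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    samples.append(float(out.stdout.strip().splitlines()[0]))  # first GPU only
    time.sleep(1)

print(f"average draw: {sum(samples) / len(samples):.1f} W")
```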
Eh, but you’d be way better off with an X3D CPU in that scenario: it’s significantly faster in games, about as fast outside them (unless you’re DRAM bandwidth limited), and more power efficient (because X3D chips clock relatively low).
You’re right about the 395 being a fine HTPC machine by itself.
But I’m also saying even an older 7900, 4090 or whatever would be way lower power at the same performance as the 395’s IGP, and whisper quiet in comparison. Even if cost is no object. And if that’s the case, why keep a big IGP at all? It just doesn’t make sense to pair them without some weirdly specific use case that can use both at once, or one that a discrete GPU literally can’t handle because it doesn’t have the VRAM the 395 does.
Eh, actually that’s not what I had in mind:
Discrete desktop graphics idle hot. I think my 3090 uses at least 40W doing literally nothing.
It’s always better to run big dies slower than small dies at high clockspeeds. In other words, if you underclocked a big desktop GPU to 1/2 its peak clockspeed, it would use less than a fourth of the energy and run basically inaudible… and still be faster than the iGPU. So why keep a big iGPU around?
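The reasoning is the usual dynamic-power approximation, P roughly proportional to C·V²·f, plus the fact that voltage has to come down with frequency. A toy sketch; the linear voltage-with-frequency scaling is a simplification, real V/f curves vary by chip:

```python
# Toy model of why half the clock is way less than half the power:
# dynamic power ~ C * V^2 * f, and voltage drops along with frequency.
# The linear V-with-f scaling below is a simplification, not a measured curve.

def relative_dynamic_power(clock_ratio: float, voltage_ratio: float) -> float:
    return (voltage_ratio ** 2) * clock_ratio

print(relative_dynamic_power(0.5, 0.5))  # 0.125 -> ~1/8 of peak if voltage scales linearly
print(relative_dynamic_power(0.5, 0.7))  # 0.245 -> still under 1/4 with a modest voltage drop
```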
My use case was multitasking and compute stuff. EG game/use the discrete GPU while your IGP churns away running something. Or combine them in some workloads.
Even the 395 by itself doesn’t make a ton of sense for an HTPC because AMD slaps so much CPU on it, which makes it way too expensive and power thirsty. A single CCD (8 cores instead of 16) + the full integrated GPU would be perfect and lower power, but AMD inexplicably does not offer that.
Also, I’ll add that my 3090 is basically inaudible next to a TV… the key is to cap its clocks, and then the fans barely even spin up.
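For the curious, this is the kind of thing I mean, done via nvidia-smi. I believe both flags exist on recent drivers, but the clock range and power cap below are purely illustrative numbers, not a recommendation for your card, and it needs root/admin:

```python
# Sketch of capping a GeForce card for quiet HTPC duty via nvidia-smi.
# Assumption: recent driver with --lock-gpu-clocks and --power-limit support.
# The numbers below are illustrative only.
import subprocess

# Lock the graphics clock to a modest min,max range (MHz):
subprocess.run(["nvidia-smi", "--lock-gpu-clocks=210,1400"], check=True)
# And/or cap board power (watts):
subprocess.run(["nvidia-smi", "--power-limit=200"], check=True)
```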
Rumor is its successor is 384-bit, and after that their designs are even more modular:
https://www.techpowerup.com/340372/amds-next-gen-udna-four-die-sizes-one-potential-96-cu-flagship
Hybrid inference prompt processing actually is pretty sensitive to PCIe bandwidth, unfortunately, but again I don’t think many people intend on hanging an AMD GPU off these Strix Halo boards, lol.
It’s PCIe 4.0 :(
but these laptop chips are pretty constrained lanes wise
Indeed. I read Strix Halo only has 16 PCIe 4.0 lanes in addition to its USB4, which is reasonable given this isn’t supposed to be paired with discrete graphics. But I’d happily trade an NVMe slot (still leaving one) for x8.
One of the links to a CCD could theoretically be wired to a GPU, right? Kinda like how EPYC can switch its IO between infinity fabric for 2P servers, and extra PCIe in 1P configurations. But I doubt we’ll ever see such a product.
Nah, unfortunately it is only PCIe 4.0 x4. That’s a bit slim for a dGPU, especially in the future :(
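For a sense of scale, rough per-direction bandwidth by generation and link width (approximate usable GB/s per lane after encoding overhead):

```python
# Rough per-direction PCIe bandwidth by generation and link width
# (approximate usable GB/s per lane after encoding overhead).
GBS_PER_LANE = {3: 0.985, 4: 1.969, 5: 3.938}

def link_bandwidth_gbs(gen: int, lanes: int) -> float:
    return GBS_PER_LANE[gen] * lanes

print(link_bandwidth_gbs(4, 4))   # ~7.9 GB/s  -> the x4 link in question
print(link_bandwidth_gbs(4, 8))   # ~15.8 GB/s -> the x8 I'd trade an NVMe slot for
print(link_bandwidth_gbs(4, 16))  # ~31.5 GB/s -> a normal desktop x16 slot
```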
If you can swing $2K, get one of the new mini PCs with an AMD 395 and 64GB+ RAM (ideally 128GB).
They’re tiny, low power, and the absolute best way to run the new MoEs like Qwen3 or GLM Air for coding. TBH they would blow a 5060 Ti out of the water, as having a ~100GB VRAM pool is a total game changer.
I would kill for one on an ITX mobo with an x8 slot.
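Rough numbers on why that pool matters so much for MoEs; the parameter counts and bits-per-weight here are ballpark assumptions, just to show the shape:

```python
# Why a ~100 GB unified pool is a game changer for MoEs: you have to hold *all* the
# expert weights somewhere, but only a few experts' worth are touched per token.
# Parameter counts and bits-per-weight are ballpark assumptions.

total_params_b, active_params_b = 106, 12   # roughly GLM-4.5 Air (~106B total / ~12B active)
bits_per_weight = 4.0

total_gb = total_params_b * 1e9 * bits_per_weight / 8 / 1e9
active_gb = active_params_b * 1e9 * bits_per_weight / 8 / 1e9

print(total_gb)   # ~53 GB just to hold the weights -> hopeless on a 16 GB card, easy in ~100 GB
print(active_gb)  # ~6 GB actually exercised per token -> why it still runs at usable speeds
```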
I’m a massive fan of CachyOS, personally! Installed it years ago, kept the same image since then and haven’t even considered switching.
Different philosophies, I suppose. I suspect Bazzite may work better if you want stuff to just work, while Cachy is more tweaking focused and gets quite rapid updates, though it’s still quite well set up out of the box.
Yeah. Distros are basically just preconfigured sets of Linux, with the communities focusing on what they are interested in.
For gaming? You need a distro that does stuff for you!
To elaborate, if you’re using Wine bottles, you’ve gone waaay into the land of manual from-scratch configuration, when you should just use stuff from a community that has spent thousands of man-hours figuring it out and packaging it.
Try CachyOS or Bazzite! They have a bunch of packages, like advanced preconfigured Proton builds, one install away.
For Docker… yeah, it’s a crazy learning curve if you just want to try one small thing. It’s honestly annoying to go through all the setup and download like 100 gigabytes of files just to run a Python script or whatever.
You can often set up the environment yourself without docker, though.
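For a lot of small tools, a plain virtual environment gets you there. A minimal sketch, with Linux-style paths, and the package and script names being placeholders for whatever you’re actually running:

```python
# Minimal "no Docker" setup: an isolated venv for one script.
# Package name and script name are placeholders.
import subprocess
import venv

venv.create("tool-env", with_pip=True)                         # creates ./tool-env
subprocess.run(["tool-env/bin/pip", "install", "requests"], check=True)
subprocess.run(["tool-env/bin/python", "some_script.py"], check=True)
```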
And to reiterate, I’m very much against the ethos of “you should learn how to do everything yourself!” I get the sentiment, but honestly, this results in suboptimal configurations for most people vs simply using the packages others have spent thousands of hours refining.
Yep.
FYI, rumors suggest the AI Max/Strix Halo successor won’t be coming out till H2 2027, aka nearly 2028 (as Strix Halo technically launched in January this year, but as you can see it takes time to actually make it into laptops):
Anyway, what I’m saying is it won’t go obsolete anytime soon, and it will be quite strong for many years to come if you get one.
There is a 14" HP laptop with the same chip:
https://www.ultrabookreview.com/70442-amd-strix-halo-laptops/
And a handheld, heh: https://gpdstore.net/gpd-handheld-gaming-pcs/gpd-win-5/
There may be more.
TBH, it may be prudent to wait a month or two for more “AI Max” chips to show up in laptops. It’s pretty new; Asus is just super early with it like they usually are.
+1 for immutable in general.