Local LLMs Leave the Lab
General
People Are Switching OpenClaw to GLM-5.1 for Everyday Use
Many OpenClaw users are moving away from expensive default cloud setups and using GLM-5.1 for day-to-day tasks instead. The prevailing approach: reserve the pricier models for hard prompts, and use something cheaper or local for everything else. Meanwhile, plenty of people still run OpenClaw fully on their own machines with Ollama or LM Studio, especially with Qwen-based models on consumer RTX GPUs, so they can code without token costs or sending data anywhere.
Link: https://www.reddit.com/r/better_claw/comments/1skc2us/everyone_is_switching_to_glm51_after_the/?share_id=vKOzhTyuU9r8qM-68Z9gD&utm_content=1&utm_medium=android_app&utm_name=androidcss&utm_source=share&utm_term=1
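The "cheap by default, expensive for hard prompts" pattern can be sketched as a trivial router. The model names, markers, and length threshold below are illustrative assumptions, not OpenClaw's actual configuration:

```python
# Illustrative router: send long or complex prompts to a pricier cloud model,
# everything else to a cheap local model. Names and heuristics are made up.
def pick_model(prompt: str) -> str:
    hard_markers = ("refactor", "debug", "prove", "architecture")
    is_hard = len(prompt) > 2000 or any(m in prompt.lower() for m in hard_markers)
    return "cloud/frontier-model" if is_hard else "local/glm-5.1"

print(pick_model("rename this variable"))   # routine edit -> local model
print(pick_model("debug this deadlock"))    # hard marker -> cloud model
```

In practice the routing signal would come from the agent framework (task type, failure count), but the shape of the decision is the same.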
Agent Computers: DGX Spark, NemoClaw, and AI PCs
NVIDIA positions DGX Spark and RTX AI PCs as "agent computers" with up to 128 GB unified memory for running OpenClaw and large models fully locally. NemoClaw bundles a hardened OpenClaw stack that installs in one command, now marketed by OEMs like ASUS as "NemoClaw-ready" AI PCs.
Link:
https://blogs.nvidia.com/blog/rtx-ai-garage-gtc-2026-nemoclaw/
Google TurboQuant Slashes KV Cache Memory
TurboQuant is a training-free compression method that quantizes the LLM KV cache to 3–4 bits, cutting memory usage by 4–6x with nearly unchanged accuracy. Benchmarks show up to 8x faster attention computation on H100s, making long-context local inference far more feasible.
Link: https://research.google/blog/turboquant-redefining-ai-efficiency-with-extreme-compression/
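TurboQuant's exact algorithm isn't detailed here, but the basic mechanics of low-bit KV-cache quantization can be sketched generically: map each cached vector onto 16 levels (4 bits), and store the codes plus fp16 scales and offsets. A minimal sketch, not TurboQuant itself:

```python
import numpy as np

# Generic per-vector 4-bit quantization of a KV cache (not TurboQuant's
# actual method): 16 levels per vector, uint8 codes, fp16 scale/offset.
def quantize_4bit(kv: np.ndarray):
    lo = kv.min(axis=-1, keepdims=True)
    hi = kv.max(axis=-1, keepdims=True)
    scale = np.maximum((hi - lo) / 15.0, 1e-8)
    codes = np.round((kv - lo) / scale).astype(np.uint8)  # values in 0..15
    return codes, scale.astype(np.float16), lo.astype(np.float16)

def dequantize_4bit(codes, scale, lo):
    return codes.astype(np.float32) * scale + lo

rng = np.random.default_rng(0)
kv = rng.normal(size=(8, 128, 64)).astype(np.float32)  # (heads, tokens, dim)
codes, scale, lo = quantize_4bit(kv)
recon = dequantize_4bit(codes, scale, lo)

# fp16 baseline: 2 bytes/value; 4-bit codes pack 2 per byte (0.5 bytes/value).
ratio = (kv.size * 2) / (kv.size * 0.5 + scale.nbytes + lo.nbytes)
print(f"compression vs fp16: {ratio:.1f}x, "
      f"max abs error: {np.abs(kv - recon).max():.3f}")
```

The reported 4-6x savings presumably come from grouping and packing choices beyond this naive scheme, and the 8x attention speedup from operating directly on the compressed cache.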
New Models
Zhipu GLM 5.1: 744B MoE, Coding First
GLM 5.1 is a post-training upgrade of GLM 5 with 744 billion parameters in a Mixture-of-Experts setup and about 40 billion active per token, tuned specifically for code and software engineering. It reached the top spot among open models on SWE-Bench Pro at release and keeps an MIT-style open-weight license, making it attractive for serious local coding and agent workflows.
Link: https://github.com/zai-org/GLM-5
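The 744B-total / 40B-active split is what makes a model this size plausible locally: all parameters must be held (or streamed) somewhere, but per-token compute only touches the active subset. Quick back-of-envelope arithmetic, with illustrative bytes-per-parameter assumptions:

```python
# Figures from the GLM 5.1 announcement; precision choices are illustrative.
total_params = 744e9
active_params = 40e9

for name, bytes_per_param in [("fp16", 2.0), ("4-bit", 0.5)]:
    weights_gb = total_params * bytes_per_param / 1e9
    print(f"{name}: ~{weights_gb:.0f} GB of weights to hold")

# Per-token compute scales with the active subset, not the total.
print(f"active fraction per token: {active_params / total_params:.1%}")
```

Even at 4-bit, the full weight set is far beyond a single consumer GPU, which is why MoE streaming approaches (see Flash-MoE below) matter for this class of model.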
Google Gemma 4: small, open, agent ready
Gemma 4 is Google’s most capable open model family so far, released under Apache 2.0 and targeted at advanced reasoning and agentic workflows rather than simple chat. It comes in efficient 2B and 4B variants for edge and mobile, plus 26B Mixture-of-Experts and 31B dense models that reach top positions on the Arena AI leaderboard while remaining practical for local deployment when quantized.
Link: https://blog.google/innovation-and-ai/technology/developers-tools/gemma-4/
NVIDIA Nemotron 3 Super: Open Latent MoE for Agents
A 120B-parameter hybrid of Mamba, Transformer, and MoE with only 12B active parameters per token, targeting coding and multi-step agentic reasoning. It offers a 1M-token context and open weights, and it leads agentic benchmarks like PinchBench while remaining deployable on DGX Spark and high-end RTX workstations.
Link: https://developer.nvidia.com/blog/introducing-nemotron-3-super-an-open-hybrid-mamba-transformer-moe-for-agentic-reasoning/
Flash-MoE: "397B on a Laptop"
Flash-MoE streams expert weights from SSD instead of loading all parameters into RAM, enabling Qwen3.5-397B on a MacBook Pro with only ~5.5 GB of active memory. Full tool-calling works, showing how MoE and streaming are redefining what "local" means.
Link:
https://github.com/osayamenja/FlashMoE
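The core trick of streaming expert weights from SSD can be sketched with a memory-mapped weight file: only the experts a token actually routes to get paged into RAM. This is a toy illustration of the general idea, assuming numpy and a dummy checkpoint, not Flash-MoE's actual implementation:

```python
import os
import tempfile
import numpy as np

# Dummy "checkpoint": all expert weights written to disk up front.
n_experts, d_in, d_out = 64, 256, 256
path = os.path.join(tempfile.mkdtemp(), "experts.npy")
np.save(path, np.random.default_rng(0)
        .normal(size=(n_experts, d_in, d_out)).astype(np.float32))

# Memory-map the file: nothing is loaded into RAM until a slice is touched.
experts = np.load(path, mmap_mode="r")

def moe_forward(x: np.ndarray, top_k: int = 2) -> np.ndarray:
    # Toy router with a fixed random gate (real routers are learned).
    gate = np.random.default_rng(1).normal(size=n_experts)
    chosen = np.argsort(gate)[-top_k:]
    out = np.zeros(d_out, dtype=np.float32)
    for e in chosen:
        # Reading experts[e] pages just that expert's weights in from disk.
        out += x @ np.asarray(experts[e])
    return out / top_k

y = moe_forward(np.ones(d_in, dtype=np.float32))
print(y.shape)  # only top_k of 64 experts were ever read from disk
```

With top-k routing, resident memory scales with active experts rather than total parameters, which is how a 397B MoE can run in a few gigabytes of active memory.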
Additional Links
Local LLMs + OpenClaw in Practice
- Video Guide: OpenClaw on Qwen 2.5 Coder 7B via Ollama, full local coding workflow.
  https://www.youtube.com/watch?v=82TrKuAl7Ic
- Blog Post: Qwen 3.5 35B via llama.cpp in OpenClaw, fully local and API-free.
  https://sonusahani.com/blogs/qwen-35b-openclaw
- GitHub Repo: Qwen3.5-35B-A3B on DGX Spark with OpenClaw, scripts included.
  https://github.com/ZengboJamesWang/Qwen3.5-35B-A3B-openclaw-dgx-spark
NemoClaw, DGX Spark, and AI PCs
- NVIDIA Blog: RTX AI PCs, DGX Spark, and NemoClaw overview.
  https://blogs.nvidia.com/blog/rtx-ai-garage-gtc-2026-nemoclaw/
- DGX Spark Product Page: 128 GB unified memory, large local model support.
  https://www.nvidia.com/en-us/products/workstations/dgx-spark/
- Hardwareinside (DE): OpenClaw running fully locally on RTX PCs and DGX Spark.
  https://www.hardwareinside.de/openclaw-laeuft-vollstaendig-lokal-auf-nvidia-rtx-pcs-und-dgx-spark-101203/
Nemotron, Flash-MoE, and Streaming
- Tweet (Shubham Saboo): Nemotron and Flash-MoE in the context of running massive models locally.
  https://x.com/Saboo_Shubham_/status/2035760668641742953
- HuggingFace Model Card: Nemotron 3 Super architecture, variants, and hardware requirements.
  https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4
TurboQuant and KV Cache Compression
- Golem (DE): TurboQuant and its impact on LLM RAM requirements.
  https://www.golem.de/news/turboquant-googles-kompression-soll-ram-bedarf-von-llms-extrem-senken-2603-206927.html
- Ars Technica: 6x memory reduction, up to 8x speedup, no quality loss.
  https://arstechnica.com/ai/2026/03/google-says-new-turboquant-compression-can-lower-ai-memory-usage-without-sacrificing-quality/
- Dev.to: TurboQuant explained for developers, with practical deployment notes.
  https://dev.to/arshtechpro/turboquant-what-developers-need-to-know-about-googles-kv-cache-compression-eeg