The agentic development workflow has shifted to closed-loop verification: Claude Code now natively integrates computer use, while Hermes Agent establishes itself as the leading open agent OS abstraction. Amid this infrastructure maturation, market tension is visible in high-profile leaks of Anthropic's 'Mythos' and strategic cancellations at OpenAI, while local inference scales up with llama.cpp crossing 100k stars and Flash-MoE enabling massive MoE models on consumer silicon.
Theme 1. Agentic Workflows: Closed-Loop Verification & Tooling Composition
The core value driver in agent frameworks is transitioning from raw model capacity to the reliability of the execution harness. Anthropic's update to Claude Code and the emergence of interoperable tooling define the technical baseline for reliable devops.
Anthropic Releases 'Computer Use' for Claude Code
The feature allows Pro/Max users to open apps, click through UIs, and test builds directly from the CLI, enabling a closed-loop workflow:
code → run → inspect UI → fix → re-test.
Early engineering feedback from @Yuchenj_UW and @omarsar0 highlights this as the "missing piece" for reliable app iteration compared to open-ended desktop agents.
The "What": Integrated UI manipulation capabilities previously exclusive to separate research previews.
Community Sentiment: High enthusiasm for the practical application, viewing the tool as critical for reducing hallucination in deployment testing.
Cross-Agent Composition & Interoperability
OpenAI shipped a Codex plugin for Claude Code, enabling the trigger of reviews, adversarial reviews, and "rescue" flows from within Anthropic's toolchain.
Implementation relies on a ChatGPT subscription rather than custom glue code, signaling a shift toward composable coding harnesses rather than monolithic products.
Technical Observation: Jobs started late at night (around 11pm) are 60% more likely to run longer than 3 hours, fitting the pattern of delegating refactors and planning to background agents.
Community Critique: Theo noted a performance delta where Opus scores ~20% higher in Cursor than in Claude Code, reinforcing the thesis that tooling/prompt orchestration currently outweighs raw model capability gaps.
Hermes Agent Breakout & Ecosystem Growth
Nous released a major Hermes Agent update driving migrations from OpenClaw setups. Users cite better compaction, less bloat, and stronger adaptability as the drivers.
Multi-Agent Profiles: Introduced per-bot memory, skills, histories, and gateway connections, moving the abstraction from "personal assistant" to a reusable Agent OS.
Ecosystem Projects: @jayfarei’s opentraces.ai provides a CLI/schema for sanitizing agent traces to Hugging Face for evals/SFT. @kaiostephens uploaded ~4,000 GLM-5 Hermes traces to HF. @winglian’s ARC adds remote browser-based monitoring with E2E encryption.
Open vs Proprietary: @ClementDelangue argued for open-source agent tools defaulting to open models for privacy and durability. Conversely, @fcholle pitched PokeeClaw (secure sandboxing/RBAC) and Z AI launched AutoClaw (local runtime, optional GLM-5-Turbo).
Theme 2. Model Frontiers: Multimodal Capabilities & The Local Inference Milestone
Simultaneously, frontier release cycles are focusing on specialized multimodal inputs (audio/video) and the democratization of massive models on consumer hardware.
Qwen3.5-Omni: Multimodal Specialization
Alibaba launched Qwen3.5-Omni, featuring native text/image/audio/video understanding and built-in function calling/web search.
Performance Metrics: Supports 10 hours of audio / 400s of 720p video. Claims 113 speech-recognition languages and 36 spoken languages.
Benchmark Claims: Alibaba claims it outperforms Gemini 3.1 Pro on audio benchmarks and matches it on audio-visual understanding in specific settings.
Technical Caveat: @kimmonismus noted "Omni" refers to interpreting multimodal inputs, not arbitrary multimodal generation.
Use Case: "Audio-visual vibe coding" demo where the model builds websites/games from spoken visual instructions.
Local Inference: Massive-Parameter Models on Mac
Flash-MoE on Apple Silicon: Claims circulating indicate Qwen3.5-397B could run on a 48GB MacBook Pro at 4.4 tok/s.
Implementation: Pure C + Metal engine streaming weights from SSD, loading only active experts (~5.5GB RAM usage during inference).
Related Tooling: emll-flash-mlx optimizing MoE paths; AIToolkit added Apple Silicon support.
llama.cpp Milestone: @ggerganov celebrated crossing 100k GitHub stars, framing 2026 as the breakout year for local agentic workflows where portable runtimes matter more than scale.
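The expert-streaming idea behind Flash-MoE can be illustrated with a toy sketch: keep the full expert table on disk and map only the router-selected experts into memory. The file layout, block size, and function names below are invented for illustration, not the actual engine's format.

```python
import mmap, os, tempfile

EXPERT_BYTES = 4096  # pretend each expert's weights are one 4 KiB block

def write_fake_weights(path, n_experts):
    # Write n_experts fixed-size blocks so experts can be indexed by offset.
    with open(path, "wb") as f:
        for i in range(n_experts):
            f.write(bytes([i % 256]) * EXPERT_BYTES)

def load_active_experts(path, active_ids):
    # mmap keeps the file out of RAM until a page is touched; only the
    # blocks for the routed experts are ever read.
    with open(path, "rb") as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        experts = {}
        for eid in active_ids:
            off = eid * EXPERT_BYTES
            experts[eid] = mm[off : off + EXPERT_BYTES]  # slice copies out
        mm.close()
        return experts

path = os.path.join(tempfile.mkdtemp(), "experts.bin")
write_fake_weights(path, n_experts=64)
active = load_active_experts(path, active_ids=[3, 17])
print(sorted(active))  # [3, 17]
```

Resident memory stays proportional to the active experts per token rather than total parameters, which is the mechanism behind the reported ~5.5GB RAM footprint.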
Emerging Open Weights: GLM-5 & Qwen Variants
Z AI’s GLM-5-Turbo: Evaluated by Artificial Analysis with a score of 47 on the AA Intelligence Index. Notably scored 1503 on GDPval-AA (vs 1408 for GLM-5), positioning it for real-world agent workflows over benchmark maximalism.
Community Trend: A distilled Qwen3.5-27B (from Claude 4.6 Opus) is trending on HF, reportedly fitting on 16GB VRAM in 4-bit (via Unsloth).
Qwen 3.6 Specs: Community members spotted "Qwen 3.6 Plus" with a reported 1,000,000-token context window and iterative learning capabilities.
Theme 3. Corporate Strategy, Reliability, and Trust Signals
Infrastructure reliability and corporate transparency remain contentious. Leaks regarding "Mythos" and usage limits on Claude vs OpenAI stability highlight the volatility of the current subscription model landscape.
Anthropic: 'Mythos' Leak & Capybara Tier
Leaked draft materials (CMS misconfiguration) describe 'Claude Mythos' as Anthropic's "most powerful AI model ever developed," part of a new 'Capybara' tier surpassing Opus.
Strategic Focus: Improved reasoning, coding, and cybersecurity tasks. Rollout is cautious, initially targeting orgs with cybersecurity defenses to mitigate misuse risk.
Sentiment: The leak generated significant discourse; some users joked about the "Capybara" naming convention compared to previous "Opus/Sonnet" elegance, while others questioned the security of the leak itself.
OpenAI: Strategic Cuts & Product Controversies
Project Cancellations: An Atlantic article and community discussion highlight the shelving of Sora (allegedly costing $15M/day) and Stargate, plus delays to promised hardware.
Business Pivot: Analysis suggests a shift toward profitable enterprise solutions due to compute shortages, rather than consumer projects like video generation.
Adult Mode Pause: Development paused due to advisory board/employee concerns. The age verification system incorrectly identified minors as adults in 12% of cases.
Community Reaction: Mixed. Some view the cancellations as a necessary financial move; others see the paused consumer products and the pivot to military/enterprise contracts as an ethical red flag.
Usage Limits & Consumer Trust (Anthropic vs OpenAI)
Claude Pro Session Limits: Anthropic quietly adjusted 5-hour session limits during peak hours (5am–11am PT). Users report exhausting limits faster than expected.
Usage Efficiency Complaints: Multiple users reported burning 7–50% of their quota on minimal tasks (e.g., editing Word docs). A user on the $100/mo Max plan hit limits after 3 hours of active use.
Refund Issues: Several subscribers canceled subscriptions citing the Pro plan's unpredictability for coding assistance and difficulty in securing refunds.
Transparency Critique: Users criticize the lack of clear communication about the new limits, citing degraded value compared with alternatives like Gemini.
Theme 4. Systems Engineering: Training Optimization & Quantization Friction
Under the hood, technical friction points in quantization and training stability are becoming as significant as model capabilities, with open-source researchers defending academic integrity against rapid release cycles.
TurboQuant vs RaBitQ: Academic Integrity Dispute
Jianyang Gao (RaBitQ co-author) published a critique of TurboQuant, citing three failures: incomplete description (omitting Johnson-Lindenstrauss transform), unsupported theoretical claims, and misleading empirical comparisons.
Claimed Performance: TurboQuant's claims were heavily promoted ahead of ICLR 2026; Gao urged the authors to issue a public clarification.
Empirical Verification: Community testing of llama.cpp TurboQuant showed asymmetric q8_0-K + turbo4-V is nearly lossless (+0.0–0.2% perplexity increase) with 4.57x KV memory compression (handling 4,000+ tokens on an 8GB MacBook Air).
Counter-Pain Point: Symmetric turbo quantization on Qwen Q4_K_M caused catastrophic failure (perplexity 3,400+), recoverable only via asymmetric settings or KV rotation.
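The symmetric-vs-asymmetric distinction matters because KV activations are often not zero-centered. A minimal sketch (plain Python, not the llama.cpp kernels) showing why an asymmetric scale + zero-point grid loses less on a shifted distribution:

```python
def quantize_sym(xs, bits=4):
    # Symmetric: single scale, zero maps to zero; grid spans [-qmax, qmax].
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit
    scale = max(abs(x) for x in xs) / qmax or 1.0
    q = [max(-qmax, min(qmax, round(x / scale))) for x in xs]
    return [v * scale for v in q]

def quantize_asym(xs, bits=4):
    # Asymmetric: scale + zero-point, so the grid covers [min, max] exactly.
    levels = 2 ** bits - 1  # 15 for 4-bit
    lo, hi = min(xs), max(xs)
    scale = (hi - lo) / levels or 1.0
    q = [round((x - lo) / scale) for x in xs]
    return [v * scale + lo for v in q]

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

# A shifted, all-positive distribution hurts the symmetric grid, which
# wastes half its levels on the unused negative range.
xs = [0.5 + 0.03 * i for i in range(16)]  # values in [0.5, 0.95]
print(mse(xs, quantize_sym(xs)) > mse(xs, quantize_asym(xs)))  # True
```

This is only the round-to-nearest core of the idea; real KV-cache quantizers additionally work per-block and per-head.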
KV Cache Optimization & Rotation
Rotation Utility: A PR in llama.cpp found existing q8 KV quants tanked on AIME25 (scoring 31.7%), but rotation recovered the score to 37.1%.
Comparison: Q4_0 without rotation scored 0%, jumping to 21.7% with rotation.
Technical Impact: Rotation is now viewed as a crucial factor for maintaining performance in lower precision formats, mitigating degradation relative to F16 baselines.
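The intuition behind rotation: an orthogonal transform (a normalized Hadamard below) spreads an outlier coordinate across all dimensions, so a single large value no longer dominates the quantizer's scale. This toy sketch quantizes a vector with one outlier both ways; the construction and sizes are illustrative, not the llama.cpp PR's implementation.

```python
def hadamard(n):
    # Recursive Sylvester construction; n must be a power of two.
    if n == 1:
        return [[1.0]]
    h = hadamard(n // 2)
    return ([row + row for row in h] +
            [row + [-x for x in row] for row in h])

def matvec(m, v):
    return [sum(r[j] * v[j] for j in range(len(v))) for r in m]

def quantize(v, bits=4):
    # Simple symmetric round-to-nearest quantizer.
    qmax = 2 ** (bits - 1) - 1
    s = max(abs(x) for x in v) / qmax or 1.0
    return [max(-qmax, min(qmax, round(x / s))) * s for x in v]

def err(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

n = 8
scale = n ** -0.5                       # makes H/sqrt(n) orthogonal
H = [[x * scale for x in row] for row in hadamard(n)]

v = [0.5] * n
v[0] = 8.0                              # one outlier dominates the scale

plain = quantize(v)
rotated = matvec(H, v)                  # rotate, quantize, rotate back
back = matvec(H, quantize(rotated))     # H is symmetric orthogonal: H⁻¹ = H
print(err(v, back) < err(v, plain))     # True
```

Because the rotation is orthogonal, it preserves the L2 error introduced by quantization, so reducing error in the rotated space directly reduces it in the original space.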
Training Ops & Efficiency Fixes
Muon Optimization: Gram Newton-Schulz is a drop-in replacement for Muon's orthogonalization step, operating on symmetric XXᵀ Gram matrices. Reportedly 2x faster than the rectangular-matrix version with identical validation perplexity.
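For context, a hedged sketch of the Newton-Schulz orthogonalization at the heart of Muon-style optimizers: each step already passes through the symmetric Gram matrix A = XXᵀ, which is what the Gram variant exploits. The quintic coefficients are the ones popularized for Muon; everything here is illustrative, not the cited implementation.

```python
import math, random

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(r) for r in zip(*a)]

def add_scaled(a, b, sa, sb):
    return [[sa * x + sb * y for x, y in zip(ra, rb)]
            for ra, rb in zip(a, b)]

def newton_schulz(g, steps=5):
    # Normalize so singular values are <= 1, then iterate
    # X <- aX + (bA + cA²)X with A = X Xᵀ (the symmetric Gram matrix).
    fro = math.sqrt(sum(v * v for row in g for v in row)) or 1.0
    x = [[v / fro for v in row] for row in g]
    ca, cb, cc = 3.4445, -4.7750, 2.0315
    for _ in range(steps):
        a_mat = matmul(x, transpose(x))               # Gram matrix X Xᵀ
        poly = add_scaled(a_mat, matmul(a_mat, a_mat), cb, cc)
        x = add_scaled(x, matmul(poly, x), ca, 1.0)
    return x

def ortho_error(m):
    # Squared Frobenius distance of M Mᵀ from the identity.
    mmT = matmul(m, transpose(m))
    return sum((mmT[i][j] - (1.0 if i == j else 0.0)) ** 2
               for i in range(len(m)) for j in range(len(m)))

random.seed(0)
g = [[random.gauss(0.0, 1.0) for _ in range(4)] for _ in range(4)]
fro = math.sqrt(sum(v * v for row in g for v in row))
g_n = [[v / fro for v in row] for row in g]
o = newton_schulz(g)
print(ortho_error(o) < ortho_error(g_n))  # True: iterates move toward an orthogonal factor
```

The iteration only ever multiplies matrices, which is why it maps well to GPU kernels; the claimed 2x speedup comes from restructuring these products around the symmetric Gram matrix.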
PyTorch Bug: @ross_wightman flagged a trunc_normal_ misuse pattern where the default truncation bounds are absolute values rather than multiples of the standard deviation, effectively disabling truncation in many codebases.
Cost Reduction Case Study: Shopify reported cutting DSPy costs from $5.5M/year to $73K/year by decomposing business logic, modeling intent, and switching to smaller optimized models.
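The trunc_normal_ pitfall above can be demonstrated without any framework, under the assumed semantics that the bounds a=-2, b=2 are absolute cutoffs rather than multiples of the standard deviation. With a typical init std of 0.02, an absolute ±2 cutoff never fires:

```python
import random

random.seed(0)

std, n = 0.02, 100_000
samples = [random.gauss(0.0, std) for _ in range(n)]

# Fraction of samples a cutoff at absolute ±2 would actually clip:
clipped_abs = sum(1 for x in samples if abs(x) > 2.0) / n

# Fraction a cutoff at ±2·std (the intended semantics) would clip:
clipped_rel = sum(1 for x in samples if abs(x) > 2.0 * std) / n

print(clipped_abs)            # 0.0 — the absolute bounds never trigger
print(clipped_rel > 0.03)     # True — ~5% of mass lies outside ±2σ
```

The fix in affected codebases is to scale the bounds by the requested std, so the sampler actually rejects the intended tail mass.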
New Evals & Benchmarks
World Reasoning Arena: Highlights a gap in hypothetical/world-model reasoning vs humans.
Tau Bench: A new banking domain with a 698-document support corpus where the best models solve only ~25% of tasks.
Sycophancy: A Stanford-led paper finds that as AI certainty increases, users' willingness to repair relationships drops, suggesting "helpfulness" metrics can obscure harmful alignment shifts.
Theme 5. Agentic Research: Harnesses & Distributed Execution
Research into how agents execute tasks is maturing into formalized engineering fields, focusing on long-context filesystems and asynchronous delegation.
Natural-Language Harnesses
Tsinghua/Shenzhen Paper: Proposed letting LLMs execute orchestration logic from an SOP (Standard Operating Procedure) rather than hard-coded rules.
Meta-Harness: Optimizes the harness end-to-end over code, traces, and scores rather than the base model. Achieves #1 on TerminalBench-2 and improves text classification/transfer.
Async/Multi-Agent SWE Design
CMU CAID Paper: Argues for centralized asynchronous isolated delegation using manager agents, dependency graphs, and isolated git worktrees.
Performance Gains: Reported +26.7 absolute gain on PaperBench and +14.3 on Commit0 vs single-agent baselines.
Philosophy: Concurrency and isolation beat simply adding iterations to one agent.
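The centralized-asynchronous-isolated pattern can be sketched as a manager resolving a dependency graph and dispatching each ready task into its own scratch directory (standing in for an isolated git worktree). All names and the scheduling loop below are illustrative, not the paper's system; the deps graph is assumed to be a valid DAG.

```python
import os, tempfile
from concurrent.futures import ThreadPoolExecutor

def run_task(name, workdir):
    # Each worker writes only inside its own isolated directory.
    path = os.path.join(workdir, f"{name}.out")
    with open(path, "w") as f:
        f.write(f"result of {name}")
    return name, path

def manager(tasks, deps):
    # deps maps task -> set of prerequisite tasks; dispatch a task as soon
    # as its prerequisites are done, running ready tasks concurrently.
    done, results = set(), {}
    with ThreadPoolExecutor(max_workers=4) as pool:
        while len(done) < len(tasks):
            ready = [t for t in tasks
                     if t not in done and deps.get(t, set()) <= done]
            futures = [pool.submit(run_task, t, tempfile.mkdtemp())
                       for t in ready]
            for fut in futures:
                name, path = fut.result()
                done.add(name)
                results[name] = path
    return results

tasks = ["plan", "impl_a", "impl_b", "merge"]
deps = {"impl_a": {"plan"}, "impl_b": {"plan"}, "merge": {"impl_a", "impl_b"}}
results = manager(tasks, deps)
print(len(results))  # 4
```

Here impl_a and impl_b run concurrently once plan finishes, while isolation guarantees their outputs cannot clobber each other — the concurrency-plus-isolation combination the paper argues beats iterating longer with one agent.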
Long-Context as Filesystems
Reframing: Papers highlight treating huge corpora as directory trees, allowing agents to navigate via shell commands/Python instead of stuffing text into windows or retrieval.
Metrics: 88.5% on BrowseComp-Plus (750M tokens) vs the previous best of 80%; scales to operations over corpora of up to 3T tokens.
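The corpus-as-directory-tree framing reduces, in miniature, to giving the agent a search tool over files instead of stuffing text into the window. A toy sketch with an invented layout and query (not any paper's harness):

```python
import os, tempfile

# Write a few "documents" as files in a directory tree.
root = tempfile.mkdtemp()
corpus = {
    "papers/quant.txt": "rotation recovers accuracy under 4-bit kv cache",
    "papers/agents.txt": "manager agents delegate to isolated worktrees",
    "notes/todo.txt": "benchmark the new harness on TerminalBench",
}
for rel, text in corpus.items():
    path = os.path.join(root, rel)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w") as f:
        f.write(text)

def grep(root, needle):
    # The agent's tool call: walk the tree, return files mentioning needle.
    hits = []
    for dirpath, _dirs, files in os.walk(root):
        for fn in files:
            p = os.path.join(dirpath, fn)
            if needle in open(p).read():
                hits.append(os.path.relpath(p, root).replace(os.sep, "/"))
    return sorted(hits)

print(grep(root, "worktrees"))  # ['papers/agents.txt']
```

The window only ever holds tool outputs proportional to the query, which is how such systems claim to operate over corpora far larger than any context length.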
You just read issue #35 of TLDR of AI news. You can also browse the full archives of this newsletter.