05-23-2025
Anthropic Claude Model Developments and Performance
Claude 4 models (Opus and Sonnet) demonstrate strong coding abilities; Sonnet 4 achieved 72.7% on SWE-bench, and Opus 4 reached 72.5%.
Claude Sonnet 4 shows improved codebase understanding and excelled in a floating-point arithmetic test that challenged other LLMs.
Claude Code is now usable directly within Integrated Development Environments (IDEs).
Opus 4 is characterized by its strength in long-term tasks, intelligent tool usage, and writing capabilities.
Both Claude 4 Opus and Sonnet exhibit strong agentic performance, ranking 1st and 3rd respectively on the GAIA benchmark.
However, Claude 4 Opus is not considered a frontier model for mathematics based on MathArena leaderboard results.
Effective use of Claude 4 necessitates prompt engineering.
Demand for Claude 4 is reportedly high, with some startups finding their products significantly improved with its integration.
Concerns were raised regarding Anthropic's approach to safety policies, specifically weakening ASL-3 security requirements prior to announcing ASL-3 protections.
Discussions occurred around appropriate policies for agentic models when users request assistance with potentially harmful activities.
Reports that Claude 4 might report user activity, along with one alleged instance of it blackmailing an engineer, raised concern among users.
Users experienced widespread availability issues with Claude 4, possibly due to regional restrictions or high demand.
LlamaIndex provided day-0 support for Claude 4 Sonnet and Opus, though developers encountered "thinking block" related errors detailed in Anthropic's documentation.
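For context, enabling extended thinking through the Anthropic Python SDK looks roughly like the sketch below; the prompt and token budget are illustrative. Per Anthropic's documentation, the returned thinking blocks must be passed back unmodified on subsequent tool-use turns, which is the usual trigger for these errors.

```python
import anthropic

client = anthropic.Anthropic()
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=4096,
    # budget_tokens must be >= 1024 and less than max_tokens
    thinking={"type": "enabled", "budget_tokens": 2048},
    messages=[{"role": "user", "content": "Summarize this repo's architecture."}],
)

# response.content interleaves "thinking" and "text" blocks; when continuing a
# tool-use conversation, thinking blocks must be sent back unmodified.
for block in response.content:
    if block.type == "text":
        print(block.text)
```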
Claude 4 models have been added to platforms like Windsurf, including Bring Your Own Key (BYOK) support.
Sonnet 4 has reportedly been integrated into GitHub Copilot.
The models are described as being trained with particular care and thoughtfulness.
Cherry Studio now offers support for Claude 4.
Google AI Ecosystem Updates (Gemini, Imagen, Veo, Gemma)
Gemini 2.5 Pro demonstrates strong capabilities in long-context tasks, comparable to Claude models.
A new enhanced reasoning mode, Gemini 2.5 Pro Deep Think, has been introduced to address complex problems by evaluating multiple hypotheses.
Gemini's native audio dialogue capabilities were noted, though with a tendency for filler content.
Users reported issues with Gemini 2.5 Pro’s tool usage and its ability to recall its own functionalities, leading to descriptions like "Ask Twice mode."
An update to Gemini reportedly fixed an issue where it would interrupt live voice input, introducing a new proactive audio feature.
Google's Imagen 4 Ultra image generation model ranks third in the Artificial Analysis Image Arena and is accessible via Vertex AI Studio.
Google introduced Veo 3 for video generation and Imagen 4, alongside a filmmaking tool named Flow.
Veo 3 is positioned as a strong competitor in AI film creation.
Google Beam uses an AI video model to transform standard 2D video into immersive 3D experiences.
Gemma 3n, a multimodal model designed for on-device mobile AI, significantly reduces RAM usage (by nearly 3x).
A multi-speaker podcast was generated using Gemini 2.5 Flash and a new Text-to-Speech (TTS) model offering control over style, accent, pace, and multi-speaker support.
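For reference, multi-speaker TTS through the google-genai SDK follows roughly this shape; the dialogue, speaker names, and voice choices are illustrative, and the preview model name may change.

```python
from google import genai
from google.genai import types

client = genai.Client()  # assumes GEMINI_API_KEY in the environment
response = client.models.generate_content(
    model="gemini-2.5-flash-preview-tts",  # preview TTS model name at the time
    contents="Read this conversation aloud:\nAlice: Hello!\nBob: Hi there!",
    config=types.GenerateContentConfig(
        response_modalities=["AUDIO"],
        speech_config=types.SpeechConfig(
            multi_speaker_voice_config=types.MultiSpeakerVoiceConfig(
                speaker_voice_configs=[
                    types.SpeakerVoiceConfig(
                        speaker="Alice",
                        voice_config=types.VoiceConfig(
                            prebuilt_voice_config=types.PrebuiltVoiceConfig(voice_name="Kore"),
                        ),
                    ),
                    types.SpeakerVoiceConfig(
                        speaker="Bob",
                        voice_config=types.VoiceConfig(
                            prebuilt_voice_config=types.PrebuiltVoiceConfig(voice_name="Puck"),
                        ),
                    ),
                ],
            ),
        ),
    ),
)
# Raw PCM audio bytes for the generated dialogue
audio_bytes = response.candidates[0].content.parts[0].inline_data.data
```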
NotebookLM utilizes Google Gemini for generating natural-sounding podcast audio overviews with Retrieval Augmented Generation (RAG) for context and Speech Synthesis Markup Language (SSML) for formatting.
NotebookLM is also being explored for synthesizing information across multiple independent notebooks.
Advances in AI Agents, Tooling, and Model Context Protocol (MCP)
AI agents are increasingly viewed as control structures, with Model Context Protocol (MCP) support integrated into tools like InferenceClient.
Microsoft's NLWeb leverages MCP to convert websites into AI applications.
Cognition Labs' Devin, an autonomous software engineering agent, was highlighted for its search capabilities and context management.
Cisco successfully automated 60% of 1.8 million customer support cases using LangGraph, LangSmith, and the LangGraph Platform.
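A minimal sketch of that triage pattern in LangGraph, with invented routing logic rather than Cisco's actual pipeline:

```python
from typing import TypedDict
from langgraph.graph import StateGraph, START, END

class CaseState(TypedDict):
    ticket: str
    route: str
    resolution: str

def classify(state: CaseState) -> dict:
    # Placeholder rule; a real system would call an LLM or classifier here
    route = "auto" if "password reset" in state["ticket"].lower() else "human"
    return {"route": route}

def auto_resolve(state: CaseState) -> dict:
    return {"resolution": "Sent self-service reset link."}

def escalate(state: CaseState) -> dict:
    return {"resolution": "Queued for a support engineer."}

builder = StateGraph(CaseState)
builder.add_node("classify", classify)
builder.add_node("auto_resolve", auto_resolve)
builder.add_node("escalate", escalate)
builder.add_edge(START, "classify")
builder.add_conditional_edges("classify", lambda s: s["route"],
                              {"auto": "auto_resolve", "human": "escalate"})
builder.add_edge("auto_resolve", END)
builder.add_edge("escalate", END)
graph = builder.compile()

print(graph.invoke({"ticket": "Password reset not arriving"}))
```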
Google DeepMind's AlphaEvolve is an evolutionary coding agent capable of discovering new algorithms and scientific solutions.
OpenAI Codex can transform AI agents into a functional development team.
A "12-Factor agents" repository, offering an interactive website and Colab notebook with code examples, has been shared.
Task scheduling functionality is expected to be released soon for Comet.
Model Context Protocol (MCP) Specific Developments:
Exploration is underway for tunneling MCP to connect iOS applications with local servers running the DeepChat component library.
Discussions are ongoing regarding streaming tool results via notifications and incorporating UI considerations into the MCP specification.
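For readers new to MCP, a minimal tool server using the official Python SDK's FastMCP helper looks roughly like this; the server name and tool are illustrative.

```python
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("demo-tools")

@mcp.tool()
def word_count(text: str) -> int:
    """Count whitespace-separated words in a document."""
    return len(text.split())

if __name__ == "__main__":
    mcp.run()  # stdio transport by default
```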
An MCP Hackathon is scheduled for June 14th-15th.
VerbalCodeAI, an AI tool for terminal-based codebase navigation with MCP support, has been introduced.
Aura, a new agent for the Aira hub (MCP/A2A Hub) built with Google ADK, has been launched.
A full TypeScript implementation of OpenAI’s openai-agents SDK, named openai-agents-js, has been released, supporting tool calls, handoffs, streaming responses, MCP, and full agent workflows.
Open Source Contributions and Framework Enhancements
FedRAG now supports Unsloth, facilitating the creation of RAG systems with UnslothAI's FastModels and performance accelerators.
Crawl4AI, an open-source repository, has been released for crawling websites and extracting LLM-ready data for AI agents, RAG systems, and data pipelines.
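A minimal usage sketch, assuming a reachable URL (the example.com target is a placeholder):

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main() -> None:
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com")
        print(result.markdown)  # cleaned, LLM-ready markdown

asyncio.run(main())
```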
Hayhooks, an open-source package, enables the conversion of Haystack pipelines into production-ready REST APIs or MCP tools.
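Per the Hayhooks docs, deployment centers on a pipeline wrapper class; the sketch below assumes a serialized pipeline whose prompt_builder and llm component names are illustrative.

```python
from hayhooks import BasePipelineWrapper
from haystack import Pipeline

class QAPipelineWrapper(BasePipelineWrapper):
    def setup(self) -> None:
        # Load a serialized Haystack pipeline; the path is illustrative
        with open("qa_pipeline.yml") as f:
            self.pipeline = Pipeline.loads(f.read())

    def run_api(self, question: str) -> str:
        # run_api's signature becomes the REST endpoint's request schema
        result = self.pipeline.run({"prompt_builder": {"question": question}})
        return result["llm"]["replies"][0]
```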
Guidance on using Unsloth for Retrieval Augmented Finetuning (RAFT) has been published, including a Llama 3.2 1B RAFT notebook.
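As a reference point, loading and LoRA-adapting a small model with Unsloth looks roughly like this; the model name and hyperparameters are illustrative rather than the notebook's actual settings.

```python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.2-1B-Instruct",  # illustrative checkpoint
    max_seq_length=2048,
    load_in_4bit=True,
)
# Attach LoRA adapters for parameter-efficient finetuning
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
```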
Tinygrad users benchmarked Qwen3 0.6B, achieving 92.92 tokens per second (TPS) with specific configurations (BEAM=2, CUDA=1) on an RTX 3060 12GB.
RGFW.h, a single-header, cross-platform windowing library, has been made available.
Llama 3.x used with axolotl is recommended for open-source chatbot development projects.
Datadog has released a new open model that tops forecasting benchmarks, utilizing autoregressive transformers and a new benchmark called BOOM.
LLM Capabilities, Limitations, and Cost Considerations
Many current LLMs, including advanced models like Claude 4 Sonnet, reportedly struggle with simple arithmetic problems (e.g., '9.9 - 9.11'), indicating persistent gaps in robust numeracy and logical consistency.
However, in specific tests, Claude Sonnet 4 successfully handled a floating-point arithmetic task that other LLMs failed. Qwen3 32B was also noted for correctly handling certain arithmetic queries.
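As a quick illustration (plain Python, no LLM involved), the computation itself is trivial to offload, which motivates the tool-calling recommendation later in this section:

```python
from decimal import Decimal

print(9.9 - 9.11)                        # ~0.79, with a tiny binary-float error
print(Decimal("9.9") - Decimal("9.11"))  # exactly 0.79 in decimal arithmetic
```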
These elementary math failures in top-tier models have led to discussions about realistic AGI timelines.
For many users, incremental qualitative improvements in new frontier LLMs are becoming less perceptible in everyday interactions, often requiring benchmarks for differentiation.
Significant LLM advancements are more noticeable when applied to complex or edge-case tasks.
Persistent issues such as hallucination, limited context windows, and the lack of real-time online learning continue to temper perceived progress despite measurable gains.
There is ongoing discussion about whether LLM development is approaching a performance plateau for general qualitative improvements.
The operational cost of using state-of-the-art LLMs can be substantial; for example, a single Claude Opus 4 task via a third-party tool incurred a $7.60 charge, with another instance reported at $1.50 for a single plan generation.
Claude Opus is noted to be considerably more expensive (approximately 5x) than Claude Sonnet.
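As a back-of-envelope illustration using the published launch list prices (verify current rates before relying on them), the roughly 5x gap follows directly from per-token pricing:

```python
# Launch list prices per million tokens: (input, output), in USD
PRICES = {"claude-opus-4": (15.00, 75.00), "claude-sonnet-4": (3.00, 15.00)}

def task_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    price_in, price_out = PRICES[model]
    return input_tokens / 1e6 * price_in + output_tokens / 1e6 * price_out

# A hypothetical 50k-token-in / 20k-token-out agentic task:
print(task_cost("claude-opus-4", 50_000, 20_000))    # 2.25 USD
print(task_cost("claude-sonnet-4", 50_000, 20_000))  # 0.45 USD, 5x less
```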
Direct subscription models (e.g., Claude Max tier) are suggested as potentially more cost-effective for accessing LLMs compared to usage via third-party tools.
Subscription-based access for advanced LLM services appears to be an emerging standard in the market.
Tool calling, which allows LLMs to offload precise computations to external tools or code, is recommended as a more reliable method for calculations than relying on their native arithmetic abilities.
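A minimal sketch of the pattern using an OpenAI-style function-calling schema; the tool name and the bare eval-based executor are illustrative only, and a production executor should parse expressions safely instead of using eval.

```python
import json

# OpenAI-style tool schema: the model emits a call, the host runs it exactly
tools = [{
    "type": "function",
    "function": {
        "name": "calculate",
        "description": "Evaluate an arithmetic expression and return the exact result.",
        "parameters": {
            "type": "object",
            "properties": {"expression": {"type": "string"}},
            "required": ["expression"],
        },
    },
}]

def execute_tool_call(call: dict) -> str:
    args = json.loads(call["function"]["arguments"])
    # Illustrative only: restrict evaluation to arithmetic in real code
    return str(eval(args["expression"], {"__builtins__": {}}))
```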
Specialized AI Applications: Video Generation and Drug Discovery
Video Generation:
Google's Veo 3 model can reportedly generate gameplay videos and is described as outperforming existing competitors in video quality.
Outputs from Veo 3 sometimes exhibit common AI animation tropes, such as "AI stares" (characters freezing with intense, wide-open eyes) and repetitive visual motifs (e.g., consistent t-shirt tears).
A current limitation of Veo 3 is the lack of image-to-video capability, making visual consistency challenging to control without extensive prompt engineering.
The Veo 3 text-to-video model can produce highly realistic, narratively coherent video sequences, though it is associated with high costs, unreliability, and a reportedly buggy interface (e.g., the scene editor).
Veo 3 currently only supports text-to-video but shows sophistication in lip-sync, voice generation, and matching vocal characteristics to character visuals.
There is interest in Veo 3's multilingual generation capabilities, such as for Portuguese.
Questions have been raised regarding the pricing model and rendering quotas for Veo 3 Flow subscriptions (e.g., renders per $250/month).
Video shorts are being created using models like Kling and Veo 3.
Drug Discovery:
Isomorphic Labs aims to dramatically reduce drug discovery timelines, from a traditional 10 years to potentially weeks, by leveraging AI advances exemplified by AlphaFold.
AlphaFold's ability to predict protein structures with high accuracy accelerates in silico hypothesis generation for target validation and drug design.
Narrow AI applications like AlphaFold are already making a significant impact on pharmaceutical research methodologies, potentially yielding results well before generalized AGI.
Isomorphic Labs anticipates its first AI-driven drug candidate (in oncology, cardiovascular, or neurodegeneration) will enter human trials by the end of 2025.
AI is proving valuable in triaging potential drug candidates and increasing the throughput of R&D, although the clinical trial phase itself remains a separate, lengthy process that AI does not inherently speed up.
Hardware and Optimization for AI/LLM Workloads
A high-end workstation setup featuring an NVIDIA RTX PRO 6000 GPU with 96GB VRAM was showcased, suitable for large LLMs and demanding AI tasks; obtaining such hardware sometimes requires navigating enterprise supply chains.
For such 96GB VRAM setups, suggestions for initial testing include running models like Qwen2.5 3B with large context windows or sharded versions of Qwen3 235B (e.g., Q3_K_M GGUF, ~112GB), with performance estimates around 30-50 tokens/second.
Another approach involves running IQ4 quantized versions of Qwen3 235B (~125GB), potentially with an auxiliary GPU (e.g., 3090 or 4090), aiming for mid-80% efficacy and over 25 tokens/second on dual-GPU setups.
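A back-of-envelope check of these sizes: a quantized model's footprint is roughly parameter count times average bits per weight, which reproduces the figures above (the bits-per-weight values are approximate):

```python
def quant_size_gb(params_b: float, bits_per_weight: float) -> float:
    """Approximate quantized model size in GB (1 GB = 1e9 bytes)."""
    return params_b * bits_per_weight / 8

print(quant_size_gb(235, 3.8))   # ~112 GB, roughly Q3_K_M's average bits/weight
print(quant_size_gb(235, 4.25))  # ~125 GB, roughly IQ4-class quantization
```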
A user detailed a workstation built with 16 Nvidia P100 GPUs, noting challenges with PCIe bandwidth (2 lanes at 4x) and low CPU throughput from an older dual Xeon setup.
For older P100 GPUs, using exllama for inference is recommended over llama.cpp due to better fp16 performance, potentially achieving ~700GB/s memory bandwidth.
For large models like Qwen3 235B at 4-bit quantization on systems with 256GB memory, a tensor-parallel degree of 16 is a suggested configuration, provided the model architecture (attention heads/layers) divides evenly across that many GPUs.
Tools such as Koboldcpp and LM Studio support the distribution of model layers across multiple P100 GPUs; a trade-off was noted where row-splitting improves token generation speed but can reduce predictive performance.
Discussions on GPU optimization included techniques like Triton program-ID (PID) interleaving for performance enhancement, and submissions to the amd-mla-decode leaderboard on MI300 achieved times between 1063 and 1300 ms.
Solutions for CUDA out-of-memory errors involve freeing gradients and utilizing tools like the PyTorch profiler for GPU memory analysis.
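A minimal PyTorch sketch of both suggestions, with a toy model standing in for a real workload: zero_grad(set_to_none=True) releases gradient buffers rather than zeroing them in place, and the profiler plus memory_summary show where GPU memory actually goes (requires a CUDA device).

```python
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Linear(4096, 4096).cuda()
optimizer = torch.optim.AdamW(model.parameters())
batch = torch.randn(64, 4096, device="cuda")

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
             profile_memory=True) as prof:
    optimizer.zero_grad(set_to_none=True)  # free grads instead of zeroing in place
    loss = model(batch).sum()
    loss.backward()
    optimizer.step()

# Top memory consumers, then the allocator's reserved/active breakdown
print(prof.key_averages().table(sort_by="self_cuda_memory_usage", row_limit=5))
print(torch.cuda.memory_summary())
```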
Developments in Speech and Audio Technologies for LLMs
Kyutai is developing Unmute, an open-source project to integrate real-time, low-latency speech-to-text (STT) and text-to-speech (TTS) modules with any LLM for voice-based interaction.
The Unmute demo utilizes Gemma 3 12B as a base, with a TTS model of approximately 2B parameters and an STT model of around 1B parameters (a 300M parameter variant is also planned), currently running in bfloat16 (requiring ~4GB and ~2GB memory respectively). Quantization has not yet been optimized.
Unmute's architecture features bidirectional streaming, semantic Voice Activity Detection (VAD) for improved turn-taking, rapid voice cloning, and interoperability with LLM functionalities, aiming for a customizable and interruptible alternative to proprietary systems.
The STT component of Unmute can support batch inference for up to 384 simultaneous users per H100 GPU, leading to efficient GPU utilization despite higher overall memory use.
There is community interest in Unmute providing an OpenAI-compatible API, allowing users to run STT/TTS components locally while integrating with an external LLM.
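From the client side, that could look like the sketch below; the OpenAI SDK calls shown are standard, but the local Unmute endpoint and model name are hypothetical since that API does not exist yet.

```python
from openai import OpenAI

# Hypothetical local Unmute endpoint exposing an OpenAI-compatible API
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

with open("clip.wav", "rb") as audio:
    transcript = client.audio.transcriptions.create(
        model="unmute-stt",  # illustrative model name
        file=audio,
    )
print(transcript.text)
```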
Broader Industry Trends, New Tools, and Community Initiatives
The movement towards open models is often framed as a fundamental issue of freedom and access.
Discussions around "always-on AI awareness" highlight privacy concerns and the importance of obtaining consent before recording individuals.
There is an active search within the community for new hosted LLM gateway solutions.
One perspective suggests that the limited broad economic impact of AI so far is due to productivity gains being concentrated within a few large corporations.
The "Dark Leisure" theory proposes that AI-driven productivity increases might be absorbed by employees using newfound free time for personal leisure rather than additional company tasks.
Mistral has unveiled a new Document AI solution and an OCR model (via ocr.space), signaling a focus on business applications.
Perplexity AI has rolled out new Pro perks and an Academic Homepage for its users.
John Carmack shared his "Upper Bound 2025" presentation slides.
A collaborative project between Anthropic and Rick Rubin, "THE WAY OF CODE" website, was launched.
Windsurf.ai is being explored by users as an alternative to other AI coding assistants, particularly with its new Claude 4 support (including BYOK via its API keys section).
Unsloth is scheduled to participate in AMD's AI Advancing event on June 12 to discuss fine-tuning and other topics.
The Psyche network is promoting decentralized AI and aims to onboard newcomers to the field.
Proposals for federated training for exaflop computing have emerged, referencing projects like Nous Psyche by NousResearch.
A call was made for more precise language, suggesting a moratorium on using the term "this century" when referring to events within the last 25 years.