TLDR of AI news

Subscribe
Archives
May 23, 2025

05-23-2025

Anthropic Claude Model Developments and Performance

  • Claude 4 models (Opus and Sonnet) demonstrate strong coding abilities; Sonnet 4 achieved 72.7% on the SWE-bench, and Opus 4 reached 72.5%.

  • Claude Sonnet 4 shows improved codebase understanding and excelled in a floating-point arithmetic test that challenged other LLMs.

  • Claude Code is now usable directly within Integrated Development Environments (IDEs).

  • Opus 4 is characterized by its strength in long-term tasks, intelligent tool usage, and writing capabilities.

  • Both Claude 4 Opus and Sonnet exhibit strong agentic performance, ranking 1st and 3rd respectively on the GAIA benchmark.

  • However, Claude-4 Opus is not considered a frontier model for mathematics based on MathArena leaderboard results.

  • Effective use of Claude 4 necessitates prompt engineering.

  • Demand for Claude 4 is reportedly high, with some startups finding their products significantly improved with its integration.

  • Concerns were raised regarding Anthropic's approach to safety policies, specifically weakening ASL-3 security requirements prior to announcing ASL-3 protections.

  • Discussions occurred around appropriate policies for agentic models when users request assistance with potentially harmful activities.

  • Reports surfaced regarding Claude 4 potentially reporting user activity or, in one alleged instance, blackmailing an engineer, causing user concern.

  • Users experienced widespread availability issues with Claude 4, possibly due to regional restrictions or high demand.

  • LlamaIndex provided day-0 support for Claude 4 Sonnet and Opus, though developers encountered "thinking block" related errors detailed in Anthropic's documentation.

  • Claude 4 models, including Bring Your Own Key (BYOK) support, have been added to platforms like Windsurf.

  • Sonnet 4 has reportedly been integrated into GitHub Copilot.

  • The models are described as being trained with particular care and thoughtfulness.

  • Cherry Studio now offers support for Claude 4.

Google AI Ecosystem Updates (Gemini, Imagen, Veo, Gemma)

  • Gemini 2.5 Pro demonstrates strong capabilities in long-context tasks, comparable to Claude models.

  • A new version, Gemini 2.5 Pro Deep Think, has been introduced to address complex problems by evaluating multiple hypotheses.

  • Gemini's native audio dialogue capabilities were noted, though with a tendency for filler content.

  • Users reported issues with Gemini 2.5 Pro’s tool usage and its ability to recall its own functionalities, leading to descriptions like "Ask Twice mode."

  • An update to Gemini reportedly fixed an issue where it would interrupt live voice input, introducing a new proactive audio feature.

  • Google's Imagen 4 Ultra image generation model ranks third in the Artificial Analysis Image Arena and is accessible via Vertex AI Studio.

  • Google introduced Veo 3 for video generation and Imagen 4, alongside a filmmaking tool named Flow.

  • Veo 3 is positioned as a strong competitor in AI film creation.

  • Google Beam, an AI video model, can transform standard video into immersive 3D experiences.

  • Gemma 3n, a multimodal model designed for on-device mobile AI, significantly reduces RAM usage (by nearly 3x).

  • A multi-speaker podcast was generated using Gemini 2.5 Flash and a new Text-to-Speech (TTS) model offering control over style, accent, pace, and multi-speaker support.

  • NotebookLM utilizes Google Gemini for generating natural-sounding podcast audio overviews with Retrieval Augmented Generation (RAG) for context and Speech Synthesis Markup Language (SSML) for formatting.

  • NotebookLM is also being explored for synthesizing information across multiple independent notebooks.

Advances in AI Agents, Tooling, and Model Control Protocol (MCP)

  • AI agents are increasingly viewed as control structures, with Model Control Protocol (MCP) integrated into tools like InferenceClient.

  • Microsoft's NLWeb leverages MCP to convert websites into AI applications.

  • Cognition Labs' Devin, an autonomous software engineering agent, was highlighted for its search capabilities and context management.

  • Cisco successfully automated 60% of 1.8 million customer support cases using LangGraph, LangSmith, and the LangGraph Platform.

  • Google DeepMind's AlphaEvolve is an evolutionary coding agent capable of discovering new algorithms and scientific solutions.

  • OpenAI Codex can transform AI agents into a functional development team.

  • A "12-Factor agents" repository, offering an interactive website and Colab notebook with code examples, has been shared.

  • Task scheduling functionality is expected to be released soon for Comet.

  • Model Control Protocol (MCP) Specific Developments:

    • Exploration is underway for tunneling MCP to connect iOS applications with local servers running the DeepChat component library.

    • Discussions are ongoing regarding streaming tool results via notifications and incorporating UI considerations into the MCP specification.

    • An MCP Hackathon is scheduled for June 14th-15th.

    • VerbalCodeAI, an AI tool for terminal-based codebase navigation with MCP support, has been introduced.

    • Aura, a new agent for the Aira hub (MCP/A2A Hub) built with Google ADK, has been launched.

  • A full TypeScript implementation of OpenAI’s openai-agents SDK, named openai-agents-js, has been released, supporting tool calls, handoffs, streaming responses, MCP, and full agent workflows.

Open Source Contributions and Framework Enhancements

  • FedRAG now supports Unsloth, facilitating the creation of RAG systems with UnslothAI's FastModels and performance accelerators.

  • Crawl4AI, an open-source repository, has been released for crawling websites and extracting LLM-ready data for AI agents, RAG systems, and data pipelines.

  • Hayhooks, an open-source package, enables the conversion of Haystack pipelines into production-ready REST APIs or MCP tools.

  • Guidance on using Unsloth for Retrieval Augmented Finetuning (RAFT) has been published, including a Llama32 1bn RAFT notebook.

  • Tinygrad users benchmarked Qwen3 0.6B, achieving 92.92 tokens per second (TPS) with specific configurations (BEAM=2, CUDA=1) on an RTX3060 12G.

  • RGFW.h, a single-header, cross-platform windowing library, has been made available.

  • Llama 3.x used with axolotl is recommended for open-source chatbot development projects.

  • Datadog has released a new open model that tops forecasting benchmarks, utilizing autoregressive transformers and a new benchmark called BOOM.

LLM Capabilities, Limitations, and Cost Considerations

  • Many current LLMs, including advanced models like Claude 4 Sonnet, reportedly struggle with simple arithmetic problems (e.g., '9.9 - 9.11'), indicating persistent gaps in robust numeracy and logical consistency.

  • However, in specific tests, Claude Sonnet 4 successfully handled a floating-point arithmetic task that other LLMs failed. Qwen3 32B was also noted for correctly handling certain arithmetic queries.

  • These elementary math failures in top-tier models have led to discussions about realistic AGI timelines.

  • For many users, incremental qualitative improvements in new frontier LLMs are becoming less perceptible in everyday interactions, often requiring benchmarks for differentiation.

  • Significant LLM advancements are more noticeable when applied to complex or edge-case tasks.

  • Persistent issues such as hallucination, limited context windows, and the lack of real-time online learning continue to temper perceived progress despite measurable gains.

  • There is ongoing discussion about whether LLM development is approaching a performance plateau for general qualitative improvements.

  • The operational cost of using state-of-the-art LLMs can be substantial; for example, a single Claude Opus 4 task via a third-party tool incurred a $7.60 charge, with another instance reported at $1.50 for a single plan generation.

  • Claude Opus is noted to be considerably more expensive (approximately 5x) than Claude Sonnet.

  • Direct subscription models (e.g., Claude Max tier) are suggested as potentially more cost-effective for accessing LLMs compared to usage via third-party tools.

  • Subscription-based access for advanced LLM services appears to be an emerging standard in the market.

  • Tool calling, which allows LLMs to offload precise computations to external tools or code, is recommended as a more reliable method for calculations than relying on their native arithmetic abilities.

Specialized AI Applications: Video Generation and Drug Discovery

  • Video Generation:

    • Google's Veo 3 model can reportedly generate gameplay videos and is described as outperforming existing competitors in video quality.

    • Outputs from VEO 3 sometimes exhibit common AI animation tropes, such as "AI stares" (characters freezing with intense, wide-open eyes) and repetitive visual motifs (e.g., consistent t-shirt tears).

    • A current limitation of VEO 3 is the lack of image-to-video capability, making visual consistency challenging to control without extensive prompt engineering.

    • Google's Veo3 text-to-video model can produce highly realistic, narratively coherent video sequences, though it is associated with high costs, unreliability, and a reportedly buggy interface (e.g., scene editor).

    • Veo3 currently only supports text-to-video but shows sophistication in lip-sync, voice generation, and matching vocal characteristics to character visuals.

    • There is interest in Veo3's multilingual generation capabilities, such as for Portuguese.

    • Questions have been raised regarding the pricing model and rendering quotas for Veo3 Flow subscriptions (e.g., renders per $250/month).

    • Video shorts are being created using models like Kling and Veo3.

  • Drug Discovery:

    • Isomorphic Labs aims to dramatically reduce drug discovery timelines, from a traditional 10 years to potentially weeks, by leveraging AI advances exemplified by AlphaFold.

    • AlphaFold's ability to predict protein structures with high accuracy accelerates in silico hypothesis generation for target validation and drug design.

    • Narrow AI applications like AlphaFold are already making a significant impact on pharmaceutical research methodologies, potentially yielding results well before generalized AGI.

    • Isomorphic Labs anticipates its first AI-driven drug candidate (in oncology, cardiovascular, or neurodegeneration) will enter human trials by the end of 2024.

    • AI is proving valuable in triaging potential drug candidates and increasing the throughput of R&D, although the clinical trial phase itself remains a separate, lengthy process that AI does not inherently speed up.

Hardware and Optimization for AI/LLM Workloads

  • A high-end workstation setup featuring an NVIDIA RTX 6000 Ada Generation GPU with 96GB VRAM was showcased, suitable for large LLMs and demanding AI tasks; obtaining such hardware sometimes requires navigating enterprise supply chains.

  • For such 96GB VRAM setups, suggestions for initial testing include running models like Qwen2.5 3B with large context windows or sharded versions of Qwen3 235B (e.g., Q3_K_M GGUF, ~112GB), with performance estimates around 30-50 tokens/second.

  • Another approach involves running IQ4 quantized versions of Qwen3 235B (~125GB), potentially with an auxiliary GPU (e.g., 3090 or 4090), aiming for mid-80% efficacy and over 25 tokens/second on dual-GPU setups.

  • A user detailed a workstation built with 16 Nvidia P100 GPUs, noting challenges with PCIe bandwidth (2 lanes at 4x) and low CPU throughput from an older dual Xeon setup.

  • For older P100 GPUs, using exllama for inference is recommended over llama.cpp due to better fp16 performance, potentially achieving ~700GB/s memory bandwidth.

  • For large models like Qwen3-256B at 4-bit quantization on systems with 256GB memory, tensor-parallel 16 is a suggested configuration, ensuring model architecture (attention heads/layers) compatibility.

  • Tools such as Koboldcpp and LM Studio support the distribution of model layers across multiple P100 GPUs; a trade-off was noted where row-splitting improves token generation speed but can reduce predictive performance.

  • Discussions on GPU optimization included techniques like Triton PID interleaving for performance enhancement and submissions to the amd-mla-decode leaderboard on MI300 achieving times between 1063-1300 ms.

  • Solutions for CUDA out-of-memory errors involve freeing gradients and utilizing tools like the PyTorch profiler for GPU memory analysis.

Developments in Speech and Audio Technologies for LLMs

  • Kyutai is developing Unmute, an open-source project to integrate real-time, low-latency speech-to-text (STT) and text-to-speech (TTS) modules with any LLM for voice-based interaction.

  • The Unmute demo utilizes Gemma 3 12B as a base, with a TTS model of approximately 2B parameters and an STT model of around 1B parameters (a 300M parameter variant is also planned), currently running in bfloat16 (requiring ~4GB and ~2GB memory respectively). Quantization has not yet been optimized.

  • Unmute's architecture features bidirectional streaming, semantic Voice Activity Detection (VAD) for improved turn-taking, rapid voice cloning, and interoperability with LLM functionalities, aiming for a customizable and interruptible alternative to proprietary systems.

  • The STT component of Unmute can support batch inference for up to 384 simultaneous users per H100 GPU, leading to efficient GPU utilization despite higher overall memory use.

  • There is community interest in Unmute providing an OpenAI-compatible API, allowing users to run STT/TTS components locally while integrating with an external LLM.

Broader Industry Trends, New Tools, and Community Initiatives

  • The movement towards open models is often framed as a fundamental issue of freedom and access.

  • Discussions around "always-on AI awareness" highlight privacy concerns and the importance of obtaining consent before recording individuals.

  • There is an active search within the community for new hosted LLM gateway solutions.

  • One perspective suggests that the limited broad economic impact of AI so far is due to productivity gains being concentrated within a few large corporations.

  • The "Dark Leisure" theory proposes that AI-driven productivity increases might be absorbed by employees using newfound free time for personal leisure rather than additional company tasks.

  • Mistral has unveiled a new Document AI solution and an OCR model (via ocr.space), signaling a focus on business applications.

  • Perplexity AI has rolled out new Pro perks and an Academic Homepage for its users.

  • John Carmack shared his "Upper Bound 2025" presentation slides.

  • A collaborative project between Anthropic and Rick Rubin, "THE WAY OF CODE" website, was launched.

  • Windsurf.ai is being explored by users as an alternative to other AI coding assistants, particularly with its new Claude 4 support (including BYOK via its API keys section).

  • Unsloth is scheduled to participate in AMD's AI Advancing event on June 12 to discuss fine-tuning and other topics.

  • The Psyche network is promoting decentralized AI and aims to onboard newcomers to the field.

  • Proposals for federated training for exaflop computing have emerged, referencing projects like Nous Psyche by NousResearch.

  • A call was made for more precise language, suggesting a moratorium on using the term "this century" when referring to events within the last 25 years.

Don't miss what's next. Subscribe to TLDR of AI news:
Powered by Buttondown, the easiest way to start and grow your newsletter.