
06-05-2025

TLDR of AI news

Major Model Updates and Performance

  • Gemini 2.5 Pro:

    • Google's Gemini 2.5 Pro (preview 06-05) achieved the top spot on the LMArena leaderboard with a score of 1470.

    • The update demonstrates improvements in coding, reasoning, and math, scoring 82.2% on Aider Polyglot at a lower cost than some alternatives.

    • The model can convert images into Excalidraw charts and shows strong performance in factual answer generation.

    • Its Aider Polyglot performance has shown significant improvement since March.

    • Benchmark results indicate Gemini 2.5 Pro leads in 'Science' (86.4%) and is competitive in 'Reasoning & Knowledge' and 'Coding' against other major models.

    • A comprehensive benchmark table for Gemini 2.5 Pro 06-05 details comparisons across various tasks, including reasoning, science, coding, factuality, visual understanding, long context, and multilingual capabilities, alongside pricing metrics.

    • A rapid update cadence is observed, potentially enabled by control over the full infrastructure stack.

    • Ambiguity in naming conventions for Gemini preview models (e.g., gemini-2.5-pro-preview-06-05 vs. 05-06) caused confusion, since the date suffix does not make clear which part is the month and which the day.

    • It achieved a top score of 1443 in Chatbot Arena's web-development category.

    • Some users reported Gemini 2.5 Pro as less effective for complex coding tasks, preferring Claude Opus.

    • Gemini 2.5 Flash was perceived by some as inferior, with users anticipating o3-pro.

    • High Aider benchmark scores (e.g., a reported 86% on the polyglot test) prompted discussion of benchmark validity and potential overfitting.

    • The chat mode for Gemini 2.5 Pro was noted for duplicating entire files instead of providing concise diffs.

    • Gemini 2.5 Flash reportedly experienced issues with infinite loops in structured responses.

    • Gemini Pro API users encountered new rate limits (e.g., 100 messages per 24 hours).

    • Gemini API capabilities were observed to sometimes lag behind its online interface performance.

    • Discrepancies were noted in some reported Gemini 2.5 Pro benchmark scores, such as on SWE-bench.

  • Qwen Models:

    • The Qwen team released open-weight embedding and reranking models described as state-of-the-art and free.

    • Qwen3-Embedding-8B achieved the #1 rank on the MTEB multilingual leaderboard.

    • The new Qwen embedding/reranking models are supported by vLLM, suggesting potential for widespread RAG system upgrades.

    • DeepSeek's R1-0528-Qwen3-8B model reportedly achieves top scores among 8B models, marginally outperforming Alibaba's Qwen3 8B on one "Intelligence Index."

    • User experience suggests Qwen3 8B offers superior multilingual performance compared to DeepSeek R1 8B.

    • The Qwen3-Embedding-0.6B-GGUF model was released as part of a broader Qwen Embedding Collection.

    • A collection of specialized Qwen embedding and reranking models was released in formats including safetensors and GGUF.

    • The Qwen3-Embedding and Qwen3-Reranker series (0.6B, 4B, and 8B sizes) support 119 languages, claim strong performance on MMTEB, MTEB, and MTEB-Code, and are available via Hugging Face and the Alibaba Cloud API (a minimal usage sketch follows this list).
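
A minimal usage sketch for the new embedding models, assuming the Hugging Face model ID from the announcement and the standard sentence-transformers API (the vLLM route and the model card's query-prompt conventions are omitted here):

```python
# Hedged sketch: embed documents and a query with Qwen3-Embedding-0.6B,
# then rank documents by cosine similarity. The model ID comes from the
# announcement; everything else is illustrative.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")
docs = [
    "Gemini 2.5 Pro tops the LMArena leaderboard.",
    "Qwen releases open-weight embedding and reranking models.",
]
query = "Which team released new embedding models?"

doc_emb = model.encode(docs)         # shape: (len(docs), hidden_dim)
q_emb = model.encode([query])        # shape: (1, hidden_dim)
print(util.cos_sim(q_emb, doc_emb))  # higher score = closer semantic match
```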

  • Other Notable Model Releases:

    • OpenThinker3-7B was announced as a new state-of-the-art 7B open-data reasoning model.

    • OpenThinker3-7B, trained on the OpenThoughts3-1.2M dataset, reportedly improves over DeepSeek-R1-Distill-Qwen-7B by 33% on a key benchmark. It is available in standard and GGUF formats, with a 32B model planned.

    • DeepSeek-R1-0528-Qwen3-8B is reported to score significantly higher than OpenThinker3-7B on some benchmarks.

    • Arcee AI's Homunculus-12B, distilled from Qwen3-235B onto a Mistral-Nemo backbone, maintains Qwen’s two-mode interaction style (/think, /nothink) and can run on a single consumer GPU. GGUF versions are available.

    • Shisa.ai released Shisa v2, a full fine-tune of Llama 3.1 405B, positioned as Japan's highest-performing model and competitive with GPT-4o on Japanese tasks.

    • A model named Kingfall was released and subsequently removed, leading to speculation about its capabilities.

    • The DeepHermes 24B API and Chat Product experienced an outage but was restored.

Advancements in AI Specializations and Research

  • Embedding and Reranking Technologies:

    • The Qwen team released SOTA open-weight embedding (Qwen3-Embedding-8B ranked #1 on MTEB multilingual) and reranking models.

    • Discussions highlighted the distinction between specialized embedding models optimized for semantic tasks and general LLMs' token representations.

    • Concerns were noted regarding the interoperability of embeddings across different model architectures and training methodologies.

    • There is interest in Qwen's reranker models for multilingual Semantic Textual Similarity (STS) tasks; a reranking sketch follows this list.
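
To make the embedding-versus-reranking distinction above concrete: an embedding model scores query and document independently (as in the earlier sketch), while a reranker reads each query-document pair jointly. A minimal sketch using a generic public cross-encoder, since the Qwen3-Reranker models use their own prompt format documented on their model cards:

```python
# Hedged sketch: rerank candidate passages with a cross-encoder. The model
# ID is a common public baseline, not one of the Qwen rerankers.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
query = "Which team released new embedding models?"
candidates = [
    "Qwen releases open-weight embedding and reranking models.",
    "A Blackwell B200 GPU demonstrated high GEMM throughput.",
]
# The cross-encoder attends over each (query, passage) pair jointly, so it
# captures interactions that independently computed embeddings miss.
scores = reranker.predict([(query, c) for c in candidates])
print(max(zip(scores, candidates)))  # best-scoring passage
```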

  • Voice Synthesis:

    • Bland AI introduced Bland TTS, claiming it is the first voice AI to cross the uncanny valley.

    • ElevenLabs released Eleven v3 (alpha), an expressive Text-to-Speech model supporting over 70 languages, with demonstrations of highly realistic speech.

    • Eleven v3 showed significant improvements in naturalness, emotional expressiveness, prosody, breath control, and nuanced intonation.

    • Higgsfield AI launched Higgsfield Speak for creating motion-driven talking videos.

    • Despite high quality, ElevenLabs v3's proprietary nature and cost were noted, with open-weight alternatives like ChatterboxTTS emerging for consumer GPU use.

  • Reasoning and Agentic Capabilities:

    • OpenThinker3-7B was released as a leading open reasoning model.

    • A 100-game Town of Salem simulation using various LLMs tested contextual reasoning, deception, and multi-agent strategy; DeepSeek and Qwen performed well.

    • Research presented self-challenging LLM agents as a potential path toward self-improving AI.

    • A study found Supervised Fine-tuning (SFT) can achieve gains similar to Reinforcement Learning (RL) for specific problems, suggesting RL benefits might stem from repeated problem exposure.

    • Claude Code, now on the Pro tier, received praise for coding tasks, though it sometimes provides human-like project time estimates (e.g., 5-8 days) before delivering code rapidly.

    • Gemini 2.5 Pro achieved 82.2% on Aider Polyglot (an 86% figure was also reported for the polyglot test), indicating strong coding ability.

  • Model Architecture and Optimization:

    • LightOn introduced FastPlaid, a new architecture for late-interaction models, offering significant speedup for ColBERT models.

    • The Mixture-of-Transformers (MoT) architecture, using decoupled transformers for different modalities, allows modality-specific training within an autoregressive LLM framework, seen in models like BAGEL and Mogao.

    • NimbleEdge released fused operator kernels for structured contextual sparsity in transformers, yielding faster MLP inference, reduced memory use, lower time-to-first-token (TTFT), and higher throughput in Llama 3.2 3B benchmarks.

    • Meta-learning was described as training a model to quickly adapt to new tasks from limited examples via a base-learner and a meta-learner (a minimal sketch follows this list).
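
Since the item above only names the base-learner/meta-learner split, here is a minimal first-order (Reptile-style) sketch on a toy sine-regression task family; the task design and hyperparameters are illustrative assumptions, and MAML would instead differentiate through the inner loop:

```python
# Hedged sketch of first-order meta-learning (Reptile-style); everything
# here is illustrative, not from the newsletter.
import copy

import torch
import torch.nn as nn

def sample_task(n: int = 20):
    # Each "task" is a sine wave with random amplitude and phase.
    amp = torch.rand(1) * 4.0 + 0.1
    phase = torch.rand(1) * 3.14
    x = torch.rand(n, 1) * 10.0 - 5.0
    return x, amp * torch.sin(x + phase)

meta_model = nn.Sequential(nn.Linear(1, 64), nn.Tanh(), nn.Linear(64, 1))
loss_fn = nn.MSELoss()
meta_lr, inner_lr, inner_steps = 0.1, 0.01, 5

for meta_step in range(1000):
    # Base-learner: copy the shared initialization and adapt to one task.
    learner = copy.deepcopy(meta_model)
    opt = torch.optim.SGD(learner.parameters(), lr=inner_lr)
    x, y = sample_task()
    for _ in range(inner_steps):
        opt.zero_grad()
        loss_fn(learner(x), y).backward()
        opt.step()
    # Meta-learner: nudge the initialization toward the adapted weights,
    # so a few gradient steps suffice on future tasks.
    with torch.no_grad():
        for p_meta, p_task in zip(meta_model.parameters(),
                                  learner.parameters()):
            p_meta += meta_lr * (p_task - p_meta)
```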

  • Robotics:

    • BB-ACT (3.1B parameters), described as the first robotics vision-language-action (VLA) model to be made publicly available via API, was released.

    • Amazon is reportedly testing humanoid delivery bots.

    • Hugging Face released a robotics AI model efficient enough to operate on a MacBook.

  • Visual Generation Evaluation:

    • A "pelican SVG benchmark" was introduced for evaluating LLM visual generation capabilities.

Developer Ecosystem: Tools, Frameworks, and Platforms

  • Development Frameworks and Libraries:

    • LlamaIndex launched a Spreadsheet Agent for data transformation and Q&A over Excel sheets using RL-based semantic structure parsing.

    • LlamaIndex demonstrated using LlamaExtract to automate data extraction from SEC Form 4 filings.

    • LangChain partnered with Microsoft to enhance AI security on Azure and is soliciting user feedback for LangGraph.

    • UnslothAI released a repository with over 100 fine-tuning notebooks and workshop materials on advanced topics like GRPO, kernels, and quantization.

    • DSPy was likened to "Rails for AI."

    • Langfuse launched as a full-featured open-source platform for LLM application observability.

    • The Model Context Protocol (MCP) was actively discussed, with developers building related tools such as a local sequential-thinking enhancer.

  • IDE and Coding Assistants:

    • Cursor 1.0 was launched with features like background agents and enhanced code review, receiving mixed user feedback on functionality and performance.

    • Many users found Claude Code to be a superior coding tool compared to Cursor 1.0.

    • Aider was preferred by some for its intuitive AI editing and terminal-driven workflow.

    • Anthropic's Claude Code became available for Pro tier users via a JetBrains IDE plugin, with reports of generous usage quotas.

  • Platform Enhancements:

    • OpenAI enabled ChatGPT to connect to workplace applications including Gmail and Google Calendar.

    • Anthropic's Claude 'Projects' feature increased its content capacity tenfold, incorporating a retrieval-augmented generation (RAG) mode for larger datasets, facilitating work with extensive documents like semiconductor datasheets.

  • Open Source Contributions & Access:

    • Baidu joined Hugging Face, leading to discussions about potential open-source releases of its Ernie models.

    • The OpenThinker3 team emphasized the impact of openly sharing datasets like OpenThoughts3-1.2M.

Industry Dynamics, Ethical Considerations, and Market Trends

  • Platform Risk and Competition:

    • Concerns were raised about risks for startups building on large AI platforms, citing instances of "Sherlocking" (e.g., OpenAI replicating Granola-style functionality) and terminated model access (e.g., Anthropic cutting off Windsurf).

    • It was noted that AI platform dynamics may differ from traditional OS platforms, with AI companies potentially having fewer incentives to avoid competing with developers.

    • The competitive dynamic between OpenAI, Google, and Anthropic (the "AI Wars") was a frequent topic.

    • Google's Veo 3 video model release was viewed as a competitive move in response to OpenAI's Sora.

  • Data Privacy, Regulation, and Trust:

    • A court order now requires OpenAI to preserve all ChatGPT logs, including previously temporary chats and API requests, sparking privacy concerns.

    • OpenRouter is re-evaluating its data retention policies for OpenAI models due to this mandate.

    • OpenAI published a statement on its user privacy protection measures.

    • Concerns exist that LLMs might learn to generate unfalsifiable narratives if human feedback primarily corrects them on familiar topics.

    • The importance of using apply_chat_template() when prompting instruction-tuned models was emphasized, to prevent out-of-distribution inputs (see the sketch after this list).

    • IBM Research's Responsible Prompting API aims to enhance LLM outputs by suggesting prompt improvements.
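
On the apply_chat_template() point above, a minimal sketch of the recommended pattern; the model ID is illustrative, and any chat-tuned checkpoint that ships a chat template behaves the same way:

```python
# Hedged sketch: build a prompt with the model's own chat template rather
# than hand-concatenating role strings. The model ID is illustrative.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")
messages = [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "Summarize today's AI news in one line."},
]
# apply_chat_template inserts the special tokens the model was tuned on;
# skipping it feeds the model out-of-distribution text.
prompt = tok.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)
```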

  • Cost of AI:

    • LLMs are widely perceived to be getting more affordable.

    • One cited example was processing an entire insurance policy with Gemini for approximately $0.01 (see the back-of-envelope sketch below).
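
A back-of-envelope check of that figure; both numbers below are assumptions for illustration, not official Gemini pricing or the actual document size:

```python
# Hedged arithmetic: assumed per-token rate and document length only.
price_per_million_input_tokens = 0.15  # $ (assumed flash-tier rate)
policy_tokens = 60_000                 # a long insurance policy (assumed)

cost = policy_tokens / 1_000_000 * price_per_million_input_tokens
print(f"${cost:.3f}")  # ≈ $0.009, i.e. on the order of one cent
```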

  • Human-AI Interaction:

    • OpenAI articulated its goal as building tools rather than creatures, highlighting the growing importance of public perception of AI.

    • Research identified an "uncanny valley effect," where users may dislike LLMs that appear overly human-like.

  • Benchmark Integrity:

    • Skepticism regarding benchmark-based leaderboards is prevalent due to concerns about benchmark saturation, overfitting to test data, and inconsistent real-world applicability.

    • Some benchmark aggregators were criticized for relying on outdated or overused datasets.

    • The "leaderboard illusion" – where only the best-performing private model variants are publicly released after internal testing – was identified as a practice that could distort perceptions of progress.

    • There is a demand for more reliable evaluation methodologies.

Hardware and Performance Optimization

  • GPU Advancements and Benchmarking:

    • A Blackwell B200 GPU demonstrated high performance, achieving nearly 1 PFLOP/s on FP16 GEMM, 1.97 PFLOP/s on FP8 GEMM, and 3.09 PFLOP/s on the nvfp4_nvfp4_gemm kernel (see the arithmetic sketch after this list).

    • The cuDNN backend was noted for delivering optimal performance on Blackwell architecture.

    • Performance on mixed_mxfp8_bf16_gemm for the Blackwell B200 was comparatively lower.
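
For context on how such figures are derived: an M×N×K GEMM performs 2·M·N·K floating-point operations (one multiply and one add per accumulation), and throughput is that count divided by measured kernel time. A sketch with assumed numbers:

```python
# Hedged arithmetic: the kernel time below is assumed for illustration,
# not a measured Blackwell value.
M = N = K = 8192                # illustrative square GEMM
flops = 2 * M * N * K           # one multiply + one add per accumulation
seconds = 5.6e-4                # assumed kernel time
print(flops / seconds / 1e15, "PFLOP/s")  # ≈ 1.96 PFLOP/s at these numbers
```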

  • Low-Level Optimization and Challenges:

    • A learning exercise proposed writing a CUDA matrix-multiplication kernel from scratch that reaches 85% of cuBLAS throughput in bf16/fp16 using tensor cores.

    • The AMD FP8 GEMM Challenge saw active development, with optimized kernels achieving high rankings.

    • torch.compile is now generally recommended over AITemplate (which is in maintenance mode) for potentially better performance, with AOTInductor suggested as a C++ runtime alternative (a minimal torch.compile sketch follows this list).

    • The importance of profiling workloads to identify optimization opportunities was stressed.
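
A minimal torch.compile sketch as referenced above; the function and shapes are illustrative, and the first call triggers compilation while later calls reuse the compiled kernel:

```python
# Hedged sketch: JIT-compile a function with torch.compile and check that
# it matches eager execution.
import torch

def fused_op(x: torch.Tensor) -> torch.Tensor:
    # A small chain of pointwise ops that the compiler can fuse.
    return (torch.sin(x) * torch.sigmoid(x)).sum(dim=-1)

compiled_op = torch.compile(fused_op)  # compiles on first call

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(4096, 512, device=device)
print(torch.allclose(fused_op(x), compiled_op(x), atol=1e-4))
```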

AI Community and Events

  • AI Engineer World's Fair:

    • A fireside chat between OpenAI's Greg Brockman and @swyx, joined by NVIDIA CEO Jensen Huang, was a key event.

    • Brockman emphasized the return of "Basic research" for scaling future models and advised structured coding practices ("make your modules small and your tests fast").

    • Key conference themes included AI product management and strategies for running small AI teams.

    • Docker creator Solomon Hykes offered a definition of an AI agent: "an LLM wrecking its environment in a loop."

    • Nathan Lambert presented a taxonomy for next-generation reasoning models.

    • Simon Willison introduced a "pelican SVG benchmark" for visual generation.

    • The event was commended for its high energy, quality of presentations, and engaged attendees.

