
06-05-2025

TLDR of AI news

Major Model Updates and Performance

  • Gemini 2.5 Pro:

    • Google's Gemini 2.5 Pro (preview 06-05) achieved the top spot on the LMArena leaderboard with a score of 1470.

    • The update demonstrates improvements in coding, reasoning, and math, scoring 82.2% on Aider Polyglot at a lower cost than some alternatives.

    • The model can convert images into Excalidraw charts and shows strong performance in factual answer generation.

    • Its Aider Polyglot performance has shown significant improvement since March.

    • Benchmark results indicate Gemini 2.5 Pro leads in 'Science' (86.4%) and is competitive in 'Reasoning & Knowledge' and 'Coding' against other major models.

    • A comprehensive benchmark table for Gemini 2.5 Pro 06-05 details comparisons across various tasks, including reasoning, science, coding, factuality, visual understanding, long context, and multilingual capabilities, alongside pricing metrics.

    • A rapid update cadence is observed, potentially enabled by control over the full infrastructure stack.

    • Ambiguity in naming conventions for Gemini preview models (e.g., gemini-2.5-pro-preview-06-05 vs. 05-06) caused confusion, since the date suffix does not make clear which part is the month and which the day.

    • It achieved a top score of 1443 in Chatbot Arena's web-development category.

    • Some users reported Gemini 2.5 Pro as less effective for complex coding tasks, preferring Claude Opus.

    • Gemini 2.5 Flash was perceived by some as inferior, with users anticipating o3-pro.

    • High Aider benchmark scores (e.g., a reported 86% on the polyglot test) prompted discussion of benchmark validity and potential overfitting.

    • The chat mode for Gemini 2.5 Pro was noted for duplicating entire files instead of providing concise diffs.

    • Gemini 2.5 Flash reportedly experienced issues with infinite loops in structured responses.

    • Gemini Pro API users encountered new rate limits (e.g., 100 messages per 24 hours).

    • Gemini API capabilities were observed to sometimes lag behind its online interface performance.

    • Discrepancies were noted in some reported Gemini 2.5 Pro benchmark scores, such as on SWE-bench.

  • Qwen Models:

    • The Qwen team released open-weight embedding and reranking models described as state-of-the-art and free.

    • Qwen3-Embedding-8B achieved the #1 rank on the MTEB multilingual leaderboard.

    • The new Qwen embedding/reranking models are supported by vLLM, suggesting potential for widespread RAG system upgrades.

    • DeepSeek's R1-0528-Qwen3-8B model reportedly achieves top scores among 8B models, marginally outperforming Alibaba's Qwen3 8B on one "Intelligence Index."

    • User experience suggests Qwen3 8B offers superior multilingual performance compared to DeepSeek R1 8B.

    • The Qwen3-Embedding-0.6B-GGUF model was released as part of a broader Qwen Embedding Collection.

    • A collection of specialized Qwen embedding and reranking models was released in formats including safetensors and GGUF.

    • The Qwen3-Embedding and Qwen3-Reranker series (0.6B, 4B, and 8B sizes) support 119 languages, claim strong performance on MMTEB, MTEB, and MTEB-Code, and are available via Hugging Face and the Alibaba Cloud API (a minimal usage sketch follows this list).
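
A minimal usage sketch for the new embedding models, assuming the Hugging Face model ID from the announcement and the standard sentence-transformers API (the vLLM route and the model card's query-prompt conventions are omitted here):

```python
# Hedged sketch: embed documents and a query with Qwen3-Embedding-0.6B,
# then rank documents by cosine similarity. The model ID comes from the
# announcement; everything else is illustrative.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")
docs = [
    "Gemini 2.5 Pro tops the LMArena leaderboard.",
    "Qwen releases open-weight embedding and reranking models.",
]
query = "Which team released new embedding models?"

doc_emb = model.encode(docs)         # shape: (len(docs), hidden_dim)
q_emb = model.encode([query])        # shape: (1, hidden_dim)
print(util.cos_sim(q_emb, doc_emb))  # higher score = closer semantic match
```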

  • Other Notable Model Releases:

    • OpenThinker3-7B was announced as a new state-of-the-art 7B open-data reasoning model.

    • OpenThinker3-7B, trained on the OpenThoughts3-1.2M dataset, reportedly improves over DeepSeek-R1-Distill-Qwen-7B by 33% on a key benchmark. It is available in standard and GGUF formats, with a 32B model planned.

    • DeepSeek-R1-0528-Qwen3-8B is reported to score significantly higher than OpenThinker3-7B on some benchmarks.

    • Arcee AI's Homunculus-12B, distilled from Qwen3-235B onto a Mistral-Nemo backbone, maintains Qwen’s two-mode interaction style (/think, /nothink) and can run on a single consumer GPU. GGUF versions are available.

    • Shisa.ai released Shisa v2, a full fine-tune of Llama 3.1 405B, positioned as Japan's highest-performing model and competitive with GPT-4o on Japanese tasks.

    • A model named Kingfall was released and subsequently removed, leading to speculation about its capabilities.

    • The DeepHermes 24B API and Chat Product experienced an outage but was restored.

Advancements in AI Specializations and Research

  • Embedding and Reranking Technologies:

    • The Qwen team released SOTA open-weight embedding (Qwen3-Embedding-8B ranked #1 on MTEB multilingual) and reranking models.

    • Discussions highlighted the distinction between specialized embedding models optimized for semantic tasks and general LLMs' token representations.

    • Concerns were noted regarding the interoperability of embeddings across different model architectures and training methodologies.

    • There is interest in Qwen's reranker models for multilingual Semantic Textual Similarity (STS) tasks; a reranking sketch follows this list.
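
To make the embedding-versus-reranking distinction above concrete: an embedding model scores query and document independently (as in the earlier sketch), while a reranker reads each query-document pair jointly. A minimal sketch using a generic public cross-encoder, since the Qwen3-Reranker models use their own prompt format documented on their model cards:

```python
# Hedged sketch: rerank candidate passages with a cross-encoder. The model
# ID is a common public baseline, not one of the Qwen rerankers.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
query = "Which team released new embedding models?"
candidates = [
    "Qwen releases open-weight embedding and reranking models.",
    "A Blackwell B200 GPU demonstrated high GEMM throughput.",
]
# The cross-encoder attends over each (query, passage) pair jointly, so it
# captures interactions that independently computed embeddings miss.
scores = reranker.predict([(query, c) for c in candidates])
print(max(zip(scores, candidates)))  # best-scoring passage
```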

  • Voice Synthesis:

    • Bland AI introduced Bland TTS, claiming it is the first voice AI to cross the uncanny valley.

    • ElevenLabs released Eleven v3 (alpha), an expressive Text-to-Speech model supporting over 70 languages, with demonstrations of highly realistic speech.

    • Eleven v3 showed significant improvements in naturalness, emotional expressiveness, prosody, breath control, and nuanced intonation.

    • Higgsfield AI launched Higgsfield Speak for creating motion-driven talking videos.

    • Despite high quality, ElevenLabs v3's proprietary nature and cost were noted, with open-weight alternatives like ChatterboxTTS emerging for consumer GPU use.

  • Reasoning and Agentic Capabilities:

    • OpenThinker3-7B was released as a leading open reasoning model.

    • A 100-game Town of Salem simulation using various LLMs tested contextual reasoning, deception, and multi-agent strategy; DeepSeek and Qwen performed well.

    • Research presented self-challenging LLM agents as a potential path toward self-improving AI.

    • A study found Supervised Fine-tuning (SFT) can achieve gains similar to Reinforcement Learning (RL) for specific problems, suggesting RL benefits might stem from repeated problem exposure.

    • Claude Code, now on the Pro tier, received praise for coding tasks, though it sometimes provides human-like project time estimates (e.g., 5-8 days) before delivering code rapidly.

    • Gemini 2.5 Pro achieved 82.2% on Aider Polyglot (an 86% figure was also reported for the polyglot test), indicating strong coding ability.

  • Model Architecture and Optimization:

    • LightOn introduced FastPlaid, a new architecture for late-interaction models, offering significant speedup for ColBERT models.

    • The Mixture-of-Transformers (MoT) architecture, using decoupled transformers for different modalities, allows modality-specific training within an autoregressive LLM framework, seen in models like BAGEL and Mogao.

    • NimbleEdge released fused operator kernels for structured contextual sparsity in transformers, yielding faster MLP inference, reduced memory use, lower time-to-first-token (TTFT), and higher throughput in Llama 3.2 3B benchmarks.

    • Meta-learning was described as training a model to quickly adapt to new tasks from limited examples via a base-learner and a meta-learner (a minimal sketch follows this list).
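
Since the item above only names the base-learner/meta-learner split, here is a minimal first-order (Reptile-style) sketch on a toy sine-regression task family; the task design and hyperparameters are illustrative assumptions, and MAML would instead differentiate through the inner loop:

```python
# Hedged sketch of first-order meta-learning (Reptile-style); everything
# here is illustrative, not from the newsletter.
import copy

import torch
import torch.nn as nn

def sample_task(n: int = 20):
    # Each "task" is a sine wave with random amplitude and phase.
    amp = torch.rand(1) * 4.0 + 0.1
    phase = torch.rand(1) * 3.14
    x = torch.rand(n, 1) * 10.0 - 5.0
    return x, amp * torch.sin(x + phase)

meta_model = nn.Sequential(nn.Linear(1, 64), nn.Tanh(), nn.Linear(64, 1))
loss_fn = nn.MSELoss()
meta_lr, inner_lr, inner_steps = 0.1, 0.01, 5

for meta_step in range(1000):
    # Base-learner: copy the shared initialization and adapt to one task.
    learner = copy.deepcopy(meta_model)
    opt = torch.optim.SGD(learner.parameters(), lr=inner_lr)
    x, y = sample_task()
    for _ in range(inner_steps):
        opt.zero_grad()
        loss_fn(learner(x), y).backward()
        opt.step()
    # Meta-learner: nudge the initialization toward the adapted weights,
    # so a few gradient steps suffice on future tasks.
    with torch.no_grad():
        for p_meta, p_task in zip(meta_model.parameters(),
                                  learner.parameters()):
            p_meta += meta_lr * (p_task - p_meta)
```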

  • Robotics:

    • BB-ACT (3.1B parameters), described as the first robotics vision-language-action (VLA) model to be made publicly available via API, was released.

    • Amazon is reportedly testing humanoid delivery bots.

    • Hugging Face released a robotics AI model efficient enough to operate on a MacBook.

  • Visual Generation Evaluation:

    • A "pelican SVG benchmark" was introduced for evaluating LLM visual generation capabilities.

Developer Ecosystem: Tools, Frameworks, and Platforms

  • Development Frameworks and Libraries:

    • LlamaIndex launched a Spreadsheet Agent for data transformation and Q&A over Excel sheets using RL-based semantic structure parsing.

    • LlamaIndex demonstrated using LlamaExtract to automate data extraction from SEC Form 4 filings.

    • LangChain partnered with Microsoft to enhance AI security on Azure and is soliciting user feedback for LangGraph.

    • UnslothAI released a repository with over 100 fine-tuning notebooks and workshop materials on advanced topics like GRPO, kernels, and quantization.

    • DSPy was likened to "Rails for AI."

    • Langfuse launched as a full-featured open-source platform for LLM application observability.

    • The Model Context Protocol (MCP) was actively discussed, with developers building related tools such as a local sequential-thinking enhancer.

  • IDE and Coding Assistants:

    • Cursor 1.0 was launched with features like background agents and enhanced code review, receiving mixed user feedback on functionality and performance.

    • Many users found Claude Code to be a superior coding tool compared to Cursor 1.0.

    • Aider was preferred by some for its intuitive AI editing and terminal-driven workflow.

    • Anthropic's Claude Code became available for Pro tier users via a JetBrains IDE plugin, with reports of generous usage quotas.

  • Platform Enhancements:

    • OpenAI enabled ChatGPT to connect to workplace applications including Gmail and Google Calendar.

    • Anthropic's Claude 'Projects' feature increased its content capacity tenfold, incorporating a retrieval-augmented generation (RAG) mode for larger datasets, facilitating work with extensive documents like semiconductor datasheets.

  • Open Source Contributions & Access:

    • Baidu joined Hugging Face, leading to discussions about potential open-source releases of its Ernie models.

    • The OpenThinker3 team emphasized the impact of openly sharing datasets like OpenThoughts3-1.2M.

Industry Dynamics, Ethical Considerations, and Market Trends

  • Platform Risk and Competition:

    • Concerns were raised about risks for startups building on large AI platforms, citing instances of "Sherlocking" (e.g., OpenAI replicating Granola-style functionality) and terminated model access (e.g., Anthropic cutting off Windsurf).

    • It was noted that AI platform dynamics may differ from traditional OS platforms, with AI companies potentially having fewer incentives to avoid competing with developers.

    • The competitive dynamic between OpenAI, Google, and Anthropic (the "AI Wars") was a frequent topic.

    • Google's Veo 3 video model release was viewed as a competitive move in response to OpenAI's Sora.

  • Data Privacy, Regulation, and Trust:

    • A court order now requires OpenAI to preserve all ChatGPT logs, including previously temporary chats and API requests, sparking privacy concerns.

    • OpenRouter is re-evaluating its data retention policies for OpenAI models due to this mandate.

    • OpenAI published a statement on its user privacy protection measures.

    • Concerns exist that LLMs might learn to generate unfalsifiable narratives if human feedback primarily corrects them on familiar topics.

    • The importance of using apply_chat_template() when prompting instruction-tuned models was emphasized, to prevent out-of-distribution inputs (see the sketch after this list).

    • IBM Research's Responsible Prompting API aims to enhance LLM outputs by suggesting prompt improvements.
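
On the apply_chat_template() point above, a minimal sketch of the recommended pattern; the model ID is illustrative, and any chat-tuned checkpoint that ships a chat template behaves the same way:

```python
# Hedged sketch: build a prompt with the model's own chat template rather
# than hand-concatenating role strings. The model ID is illustrative.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")
messages = [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "Summarize today's AI news in one line."},
]
# apply_chat_template inserts the special tokens the model was tuned on;
# skipping it feeds the model out-of-distribution text.
prompt = tok.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)
```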

  • Cost of AI:

    • LLMs are widely perceived to be getting more affordable.

    • One cited example was processing an entire insurance policy with Gemini for approximately $0.01 (see the back-of-envelope sketch below).
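
A back-of-envelope check of that figure; both numbers below are assumptions for illustration, not official Gemini pricing or the actual document size:

```python
# Hedged arithmetic: assumed per-token rate and document length only.
price_per_million_input_tokens = 0.15  # $ (assumed flash-tier rate)
policy_tokens = 60_000                 # a long insurance policy (assumed)

cost = policy_tokens / 1_000_000 * price_per_million_input_tokens
print(f"${cost:.3f}")  # ≈ $0.009, i.e. on the order of one cent
```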

  • Human-AI Interaction:

    • OpenAI articulated its goal as building tools rather than creatures, highlighting the growing importance of public perception of AI.

    • Research identified an "uncanny valley effect," where users may dislike LLMs that appear overly human-like.

  • Benchmark Integrity:

    • Skepticism regarding benchmark-based leaderboards is prevalent due to concerns about benchmark saturation, overfitting to test data, and inconsistent real-world applicability.

    • Some benchmark aggregators were criticized for relying on outdated or overused datasets.

    • The "leaderboard illusion" – where only the best-performing private model variants are publicly released after internal testing – was identified as a practice that could distort perceptions of progress.

    • There is a demand for more reliable evaluation methodologies.

Hardware and Performance Optimization

  • GPU Advancements and Benchmarking:

    • A Blackwell B200 GPU demonstrated high performance, achieving nearly 1 PFLOP/s on FP16 GEMM, 1.97 PFLOP/s on FP8 GEMM, and 3.09 PFLOP/s on the nvfp4_nvfp4_gemm kernel (see the arithmetic sketch after this list).

    • The cuDNN backend was noted for delivering optimal performance on Blackwell architecture.

    • Performance on mixed_mxfp8_bf16_gemm for the Blackwell B200 was comparatively lower.
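
For context on how such figures are derived: an M×N×K GEMM performs 2·M·N·K floating-point operations (one multiply and one add per accumulation), and throughput is that count divided by measured kernel time. A sketch with assumed numbers:

```python
# Hedged arithmetic: the kernel time below is assumed for illustration,
# not a measured Blackwell value.
M = N = K = 8192                # illustrative square GEMM
flops = 2 * M * N * K           # one multiply + one add per accumulation
seconds = 5.6e-4                # assumed kernel time
print(flops / seconds / 1e15, "PFLOP/s")  # ≈ 1.96 PFLOP/s at these numbers
```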

  • Low-Level Optimization and Challenges:

    • A learning exercise proposed writing a CUDA matrix-multiplication kernel from scratch that reaches 85% of cuBLAS throughput in bf16/fp16 using tensor cores.

    • The AMD FP8 GEMM Challenge saw active development, with optimized kernels achieving high rankings.

    • torch.compile is now generally recommended over AITemplate (which is in maintenance mode) for potentially better performance, with AOTInductor suggested as a C++ runtime alternative (a minimal torch.compile sketch follows this list).

    • The importance of profiling workloads to identify optimization opportunities was stressed.
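
A minimal torch.compile sketch as referenced above; the function and shapes are illustrative, and the first call triggers compilation while later calls reuse the compiled kernel:

```python
# Hedged sketch: JIT-compile a function with torch.compile and check that
# it matches eager execution.
import torch

def fused_op(x: torch.Tensor) -> torch.Tensor:
    # A small chain of pointwise ops that the compiler can fuse.
    return (torch.sin(x) * torch.sigmoid(x)).sum(dim=-1)

compiled_op = torch.compile(fused_op)  # compiles on first call

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(4096, 512, device=device)
print(torch.allclose(fused_op(x), compiled_op(x), atol=1e-4))
```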

AI Community and Events

  • AI Engineer World's Fair:

    • A fireside chat between OpenAI's Greg Brockman and @swyx, joined by NVIDIA CEO Jensen Huang, was a key event.

    • Brockman emphasized the return of "Basic research" for scaling future models and advised structured coding practices ("make your modules small and your tests fast").

    • Key conference themes included AI product management and strategies for running small AI teams.

    • Docker creator Solomon Hykes offered a definition of an AI agent: "an LLM wrecking its environment in a loop."

    • Nathan Lambert presented a taxonomy for next-generation reasoning models.

    • Simon Willison introduced a "pelican SVG benchmark" for visual generation.

    • The event was commended for its high energy, quality of presentations, and engaged attendees.

