LLM Daily: June 04, 2025

Your Daily Briefing on Large Language Models
HIGHLIGHTS
• Yoshua Bengio has launched LawZero, a nonprofit AI safety lab with $30 million in funding from tech leaders including Eric Schmidt and organizations like Open Philanthropy, focusing on building safer AI systems.
• Google has open-sourced its DeepSearch stack to help developers build agents using Gemini, reflecting a recent trend of Google releasing high-quality open-source tools that can potentially work with local models.
• A UC Berkeley paper introduces "circuit stability" as an alternative to traditional benchmarks for evaluating LLMs, measuring how consistently a model applies its reasoning process across varied inputs.
• Microsoft's educational repository "ai-agents-for-beginners" has gained over 24,000 GitHub stars, offering a structured 11-lesson course to help newcomers understand and build functional AI agents.
• The "LLMs-from-scratch" repository has emerged as a popular educational resource with over 50,500 GitHub stars, guiding users through implementing ChatGPT-like models in PyTorch as a companion to the book "Build a Large Language Model (From Scratch)."

BUSINESS
Yoshua Bengio Launches $30M AI Safety Lab LawZero
Turing Award winner Yoshua Bengio has launched LawZero, a nonprofit AI safety lab focused on building safer AI systems. The initiative has secured $30 million in philanthropic contributions from prominent tech figures and organizations including Skype founding engineer Jaan Tallinn, former Google chief Eric Schmidt, Open Philanthropy, and the Future of Life Institute. (2025-06-03) - TechCrunch
Funding & Investment

Console Raises $6.2M: The IT automation startup secured funding from Thrive Capital to help IT teams reduce mundane tasks using AI, freeing up help desk professionals for more strategic work. (2025-06-02) - TechCrunch

Creatify Secures $15.5M Series A: Former DreamWorks CEO Jeffrey Katzenberg co-led a $15.5 million Series A round for the AI video ad platform. Creatify's AdMax platform uses AI to quickly generate video advertisements optimized for social media marketing. (2025-06-02) - TechCrunch

Sequoia Capital Invests in Rillet: Sequoia announced its partnership with Rillet, described as "The Financial ERP for the AI Age," though specific investment details weren't disclosed. (2025-05-28) - Sequoia Capital

M&A and Partnerships

Snowflake to Acquire Crunchy Data: Cloud data platform Snowflake announced its intent to acquire Postgres database partner Crunchy Data, enhancing its data management capabilities relevant to AI agents. (2025-06-02) - TechCrunch

OpenAI Reportedly Acquiring Windsurf: Reports indicate that OpenAI is in the process of acquiring Windsurf, the popular "vibe coding" startup. This development comes as Windsurf announced that Anthropic has significantly reduced its first-party access to Claude AI models. (2025-06-03) - TechCrunch

Company Updates

Microsoft Launches Free Sora-Powered Video Generator: Microsoft Bing has introduced the Bing Video Creator to its app, leveraging OpenAI's Sora model to generate videos from text prompts. This marks a significant expansion of Sora's availability to general users. (2025-06-02) - TechCrunch

Anthropic Launches AI-Generated Blog: Anthropic has quietly launched "Claude Explains," a new blog primarily generated by its AI model family, Claude. The content focuses on technical topics related to various Claude use cases, with human oversight in the production process. (2025-06-03) - TechCrunch

Phonely's AI Agents Achieve 99% Accuracy: Phonely, in collaboration with MaiTai and Groq, reports sub-second response times and 99.2% accuracy for its AI phone support agents. The company positions this as human-level conversational AI for call centers, with customers reportedly unable to distinguish the AI from human operators. (2025-06-03) - VentureBeat

Google Launches AI Edge Gallery: Google has quietly released AI Edge Gallery, an experimental Android app that runs AI models offline without internet connectivity. The app brings Hugging Face models directly to smartphones with enhanced privacy features. (2025-06-02) - VentureBeat

Market Analysis

OpenAI Board Drama to Become Movie: The five-day saga of Sam Altman's firing and rehiring at OpenAI is reportedly being developed into a movie titled "Artificial" at Amazon MGM Studios, highlighting the cultural impact of major AI companies and their leadership. (2025-06-03) - TechCrunch

Allen Institute Updates RewardBench: The Allen Institute for AI has updated its reward model evaluation tool, RewardBench, to better reflect real-life enterprise scenarios, addressing challenges that arise when AI models run in production environments. (2025-06-03) - VentureBeat

PRODUCTS
Google DeepSearch Stack Now Open Source (2025-06-03)
Google has open-sourced its DeepSearch stack, designed to help developers build agents using Gemini. The author of the announcement clarified that while this isn't the exact stack used in the Gemini app, it's aimed at helping developers get started with agent development using LangGraph. The community reception has been enthusiastic, with users noting that Google has been releasing high-quality open-source tools and weights recently. The stack is reportedly compatible with Gemini and Google Search, with potential for adaptation to work with local models and SearXNG.
Flux.1 Kontext Demonstrates Impressive Photo Colorization (2025-06-03)
A Reddit user has used Flux.1 Kontext to colorize and restore World War I photographs. With the simple prompt "Turn this into a color photograph," the model produced convincing colorizations while largely preserving the faces in the original black-and-white photos. Users noted that some colors might not be historically accurate and that some results have the typical "colorized" look rather than that of true color photography, but the community was generally positive about the restoration quality and the model's ability to keep the original image details.
Study Reveals Vision Language Model Biases (2025-06-03)
A new study highlighted on Reddit reveals significant biases in state-of-the-art Vision Language Models (VLMs). According to the findings, these models achieve 100% accuracy when counting elements in popular or common subjects (such as the three stripes in the Adidas logo or four legs on a dog) but drop to only around 17% accuracy for less common subjects. This research underscores the ongoing challenges in building truly generalizable AI vision systems and suggests that current VLMs may be relying heavily on memorization of common patterns rather than developing genuine counting abilities.
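The gap described above is a per-group accuracy comparison. As a minimal sketch of that arithmetic, the snippet below splits counting results by subject group and computes accuracy for each. The records, the group names `common` and `uncommon`, and the function `accuracy_by_group` are all hypothetical illustrations, not data or code from the study.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """Return {group: fraction of records where the predicted count equals the true count}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, predicted, truth in records:
        total[group] += 1
        correct[group] += int(predicted == truth)
    return {group: correct[group] / total[group] for group in total}


# Hypothetical records (made up for illustration): (group, predicted count, true count)
records = [
    ("common", 3, 3),    # e.g. stripes in a familiar logo
    ("common", 4, 4),    # e.g. legs on a dog
    ("uncommon", 4, 5),  # model falls back on the familiar answer
    ("uncommon", 3, 4),
    ("uncommon", 6, 6),
]

scores = accuracy_by_group(records)
for group, acc in sorted(scores.items()):
    print(f"{group}: {acc:.0%}")
```

Reporting the two groups separately, rather than one pooled accuracy, is what exposes this kind of memorization: a pooled number would hide the drop on uncommon subjects.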

TECHNOLOGY
Open Source Projects
rasbt/LLMs-from-scratch
A comprehensive educational repository that guides users through implementing a ChatGPT-like LLM in PyTorch from scratch. The project serves as the official code companion to the book "Build a Large Language Model (From Scratch)" and has gained significant traction with over 50,500 stars on GitHub. Recent commits show active development, including DeBERTa-v3 baseline implementations and BPE tokenization improvements.
microsoft/ai-agents-for-beginners
An educational course by Microsoft offering 11 lessons to help beginners get started with building AI agents. With over 24,000 stars and recent translation updates, this project provides structured learning materials for those looking to understand the fundamentals of AI agent development. The repository includes practical examples and hands-on exercises to build functional AI agents.
Models & Datasets
Models
deepseek-ai/DeepSeek-R1-0528
DeepSeek's latest model release with significant adoption (nearly 48,000 downloads) and 1,687 likes. Built on the DeepSeek V3 architecture, it's optimized for text generation and conversational tasks with FP8 precision support and TGI compatibility.
ResembleAI/chatterbox
A text-to-speech model from ResembleAI that has quickly gained popularity (547 likes) for its voice cloning capabilities. Released under MIT license, Chatterbox focuses on high-quality English speech generation with personalized voice cloning features.
osmosis-ai/Osmosis-Structure-0.6B
A compact 0.6B parameter model specialized in structural understanding with both safetensors and GGUF formats available. Despite being relatively new with just 554 downloads, it has attracted 226 likes, suggesting strong interest in lightweight structural understanding models.
ByteDance-Seed/BAGEL-7B-MoT
ByteDance's any-to-any model based on Qwen2.5-7B-Instruct, described in arXiv:2505.14683. With nearly 9,000 downloads and 948 likes, it implements a Mixture of Transformers (MoT) approach for versatile content transformation tasks.
Datasets
yandex/yambda
A large-scale dataset from Yandex with over 18,600 downloads, designed for recommendation systems and retrieval tasks. With both tabular and text modalities, this dataset of more than 1B samples is described in arXiv:2505.22238 and supports multiple data processing libraries including Pandas, Polars, and MLCroissant.
open-r1/Mixture-of-Thoughts
A text generation dataset with over 20,600 downloads and 177 likes. Containing between 100K and 1M English-language samples in Parquet format, it's designed to support diverse reasoning paths as described in two recent arXiv papers (2504.21318 and 2505.00949).
MiniMaxAI/SynLogic
A specialized dataset for logical reasoning with 66 likes. Recently published (May 2025) and described in arXiv:2505.19641, this dataset contains between 10K and 100K text samples in Parquet format to help models develop better logical reasoning capabilities.
Developer Tools & Infrastructure
ResembleAI/Chatterbox
A Gradio-based demo space for ResembleAI's Chatterbox TTS technology that has garnered 660 likes. This interactive demo allows users to test the voice cloning and text-to-speech capabilities of the Chatterbox model directly through a user-friendly interface.
alexnasa/Chain-of-Zoom
A Gradio application with 136 likes that implements a "Chain of Zoom" approach, likely for progressive image detail enhancement or focused reasoning. The space is tagged as exposing an MCP server.
Kwai-Kolors/Kolors-Virtual-Try-On
An extremely popular virtual try-on application with nearly 9,000 likes. Built by Kwai-Kolors using Gradio, this space allows users to virtually try on clothing items using AI, demonstrating practical applications of computer vision in e-commerce.
lmarena-ai/chatbot-arena-leaderboard
A comprehensive leaderboard for LLM performance comparisons with 4,440 likes. This Gradio-based space provides up-to-date rankings of various chatbots, helping developers and researchers track the state-of-the-art in conversational AI models.

RESEARCH
Paper of the Day
Circuit Stability Characterizes Language Model Generalization (2025-05-30)
Author: Alan Sun
Institution: University of California, Berkeley
This paper proposes evaluating language models through "circuit stability" rather than traditional benchmarks. As benchmark saturation becomes a growing problem and creating new challenging datasets remains labor-intensive, the work offers a mathematical framework for assessing how consistently a model applies its reasoning process (its "circuit") across varied inputs.
The research demonstrates that circuit stability strongly correlates with generalization performance, providing a more robust evaluation metric than conventional benchmarks. By analyzing a model's internal reasoning pathways rather than just outputs, this approach offers deeper insights into model capabilities and could significantly change how we evaluate and develop future language models.
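To make the idea of "consistency across varied inputs" concrete, here is a toy proxy, not the paper's definition (which this briefing does not reproduce). It assumes each circuit can be summarized as a set of component names and scores stability as the mean pairwise Jaccard overlap between the circuits found on different inputs. The component labels such as `L0.H1` and both function names are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def mean_pairwise_overlap(circuits):
    """Average Jaccard similarity over all pairs of circuits (1.0 if fewer than two)."""
    pairs = list(combinations(circuits, 2))
    if not pairs:
        return 1.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical circuits: the components a model relies on for three input variants.
stable = [
    {"L0.H1", "L2.H3", "L5.MLP"},
    {"L0.H1", "L2.H3", "L5.MLP"},
    {"L0.H1", "L2.H3", "L5.MLP"},
]
drifting = [
    {"L0.H1", "L2.H3"},
    {"L0.H1", "L4.H0"},
    {"L3.H2", "L4.H0"},
]

stable_score = mean_pairwise_overlap(stable)
drifting_score = mean_pairwise_overlap(drifting)
print(f"stable:   {stable_score:.3f}")
print(f"drifting: {drifting_score:.3f}")
```

Under this simplified proxy, a model that reuses the same components on every variant scores 1.0, while one whose components shift from input to input scores closer to 0.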
Notable Research
HELM: Hyperbolic Large Language Models via Mixture-of-Curvature Experts (2025-05-30)
Authors: Neil He, Rishabh Anand, Hiren Madhu, et al.
This paper introduces HELM, a novel architecture that incorporates hyperbolic geometry into large language models through a mixture-of-curvature experts approach, showing significant improvements in hierarchical reasoning tasks and overall performance while maintaining computational efficiency.
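For readers new to hyperbolic geometry, the sketch below computes the standard geodesic distance in a Poincaré ball of curvature -c. It is a textbook formula offered as background only: this briefing does not say which hyperbolic model or parameterization HELM uses, and the function name `poincare_distance` is an illustration. The point is that the same pair of vectors sits at different distances under different curvatures, which is the quantity a "mixture-of-curvature" design would vary across experts.

```python
import math


def poincare_distance(x, y, c=1.0):
    """Geodesic distance between x and y in the Poincare ball of curvature -c (c > 0).

    Points must lie strictly inside the ball of radius 1/sqrt(c).
    """
    if c <= 0:
        raise ValueError("c must be positive")
    sq_x = sum(t * t for t in x)
    sq_y = sum(t * t for t in y)
    sq_diff = sum((a - b) ** 2 for a, b in zip(x, y))
    fx = 1.0 - c * sq_x
    fy = 1.0 - c * sq_y
    if fx <= 0 or fy <= 0:
        raise ValueError("points must lie strictly inside the ball of radius 1/sqrt(c)")
    arg = 1.0 + 2.0 * c * sq_diff / (fx * fy)
    return math.acosh(arg) / math.sqrt(c)


origin = [0.0, 0.0]
point = [0.25, 0.0]

# Same two points, two curvature settings.
d_c1 = poincare_distance(origin, point, c=1.0)
d_c4 = poincare_distance(origin, point, c=4.0)
print(f"c=1: {d_c1:.4f}")
print(f"c=4: {d_c4:.4f}")
```

Distances grow rapidly as points approach the boundary of the ball, which is why hyperbolic spaces are often used to embed tree-like, hierarchical structure.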
TiRex: Zero-Shot Forecasting Across Long and Short Horizons with Enhanced In-Context Learning (2025-05-29)
Authors: Andreas Auer, Patrick Podest, Daniel Klotz, et al.
TiRex adapts in-context learning for time series forecasting, enabling zero-shot prediction with past values serving as context, overcoming the inherent limitations of transformer architectures when handling long sequences through innovative position embedding techniques.
Open CaptchaWorld: A Comprehensive Web-based Platform for Testing and Benchmarking Multimodal LLM Agents (2025-05-30)
Authors: Yaxin Luo, Zhaoyi Li, Jiacheng Liu, et al.
The researchers introduce the first web-based benchmark platform specifically designed to evaluate multimodal LLM agents' ability to solve CAPTCHAs, addressing a critical bottleneck in deploying web agents for real-world applications.
FinMME: Benchmark Dataset for Financial Multi-Modal Reasoning Evaluation (2025-05-30)
Authors: Junyu Luo, Zhizhuo Kou, Liming Yang, et al.
This research introduces a comprehensive financial multimodal evaluation dataset featuring over 11,000 high-quality samples across 18 financial domains and 6 asset classes, filling a significant gap in specialized evaluation resources for multimodal LLMs in the financial sector.

LOOKING AHEAD
As Q2 2025 winds down, the emergence of multimodal reasoning systems that can seamlessly integrate information across text, video, and physical sensors is poised to transform enterprise applications. The recent breakthrough in long-context windows extending to millions of tokens suggests that by Q4, we'll see the first wave of "institutional memory" models that can reason across an organization's entire knowledge base in real-time.
Looking toward 2026, the intersection of specialized domain models with general reasoning capabilities is gaining momentum. The regulatory frameworks taking shape in the EU and Asia will likely accelerate the development of explainable AI systems, while the competitive landscape shifts toward models optimized for energy efficiency rather than just scale. Expect significant announcements from smaller labs leveraging these trends to challenge the established leaders.
