TLDR of AI news

May 13, 2025

05-12-2025

Decentralized AI and Distributed Systems

  • Prime Intellect's INTELLECT-2, a 32B-parameter language model, was trained using globally distributed reinforcement learning (RL).

  • The model builds on the QwQ-32B base and uses the prime-rl asynchronous distributed RL framework, with verifiable reward signals for math and coding tasks (see the reward sketch at the end of this list).

  • Architectural changes were made for training stability and adaptive length control, with an optimal generation length between 2k and 10k tokens.

  • INTELLECT-2 performs comparably to QwQ-32B on benchmarks such as AIME24, LiveCodeBench, and GPQA-Diamond, with slight underperformance on IFEval; its real significance is the demonstration that decentralized RL training is viable.

  • The project also explores post-training techniques and inference-during-training.

  • The work suggests potential for P2P or blockchain-inspired distributed compute and credit systems for AI training and inference.
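
The reward sketch referenced above: a toy illustration of "verifiable rewards," where the RL signal comes from programmatically checking an output rather than from a learned reward model. The helper names and the boxed-answer format are illustrative assumptions, not prime-rl's actual interfaces.

```python
# Toy "verifiable reward" functions for RL on math and coding tasks:
# the reward is computed by checking the output, not by a reward model.
import re

def math_reward(completion: str, ground_truth: str) -> float:
    """1.0 if the final boxed answer matches the reference, else 0.0."""
    m = re.search(r"\\boxed\{([^}]*)\}", completion)
    return float(m is not None and m.group(1).strip() == ground_truth.strip())

def code_reward(solution: str, tests: str) -> float:
    """1.0 if the generated code passes the unit tests, else 0.0.
    (A real system runs this inside a sandbox, never bare exec.)"""
    scope: dict = {}
    try:
        exec(solution, scope)  # define the candidate function(s)
        exec(tests, scope)     # run assertions against them
        return 1.0
    except Exception:
        return 0.0

print(math_reward(r"... so the answer is \boxed{42}.", "42"))  # 1.0
```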

New Model Releases and Significant Updates

  • ByteDance released DreamO on Hugging Face, a unified framework for image customization supporting ID, IP, Try-On, and Style tasks.

  • Alibaba's Qwen team officially released quantized versions of Qwen3 (GGUF, AWQ, GPTQ, and INT8), deployable via Ollama, LM Studio, SGLang, and vLLM, with open weights and a permissive license (see the example at the end of this list).

  • Gemma surpassed 150 million downloads and 70,000 variants on Hugging Face.

  • Meta released model weights for its 8B-parameter Dynamic Byte Latent Transformer (BLT), aimed at improving language model efficiency and reliability, along with the Collaborative Reasoner framework to enhance collaborative reasoning. BLT, first presented in a late-2024 paper, replaces fixed tokenization with dynamic byte-level patches.

  • RunwayML’s Gen-4 References model was launched, described as offering infinite workflows without fine-tuning for near-realtime creation.

  • Mistral AI released Mistral Medium 3, a multimodal AI model, and Le Chat Enterprise, an agentic AI assistant for businesses with tools like Google Drive integration and agent building.

  • Google updated Gemini 2.5 Pro Preview with video understanding and improvements for UI, code, and agentic workflows. Gemini 2.0 Flash image generation received improved quality and text rendering.

  • DeepSeek, the Chinese lab known for its open-weight models, has reportedly come close to closing the performance gap with its US peers in two years.

  • f-lite 7B, a distilled diffusion model, was released.

  • Microsoft updated Copilot with a “Pages” feature, similar to ChatGPT Canvas, but reportedly without coding capabilities.

  • Manus AI publicly launched, offering users free daily tasks and credits. The platform focuses on educational or content generation tasks. Some users reported regional availability issues.

  • JoyCaption Beta One, a free, open-source, uncensored Vision Language Model (VLM) for image captioning, was released with doubled training data, a new 'Straightforward Mode', improved booru tagging, and better watermark annotation. It achieved 67% normalized accuracy on human-benchmarked validation sets.

  • Sakana AI introduced Continuous Thought Machines (CTM), a neural architecture where reasoning is driven by neuron-level timing and synchronization. CTM neurons encode signal history and timing, aiming for complex, temporally-coordinated behaviors.

  • A new model, Drakesclaw, appeared on the LM Arena, with initial impressions suggesting performance comparable to Gemini 2.5 Pro.

  • The Absolute Zero Reasoner (AZR) paper details a model achieving state-of-the-art results on coding/math tasks via self-play with zero external data.

  • Mellum-4b-sft-rust, a CodeFIM (Fill-In-The-Middle) model for Rust, trained using Unsloth, was released on Hugging Face.

  • The release of Grok 3.5 is on hold pending integration with X and another recently acquired company.
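
The Qwen3 deployment example referenced above: a minimal sketch of querying a locally served quantized Qwen3 through Ollama's REST API. The model tag is an assumption; substitute whichever quantization you pulled.

```python
# Query a locally deployed Qwen3 via Ollama's REST API (default port 11434).
import requests

r = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "qwen3:32b",  # assumed tag: use whatever quant you pulled
        "prompt": "In two lines: AWQ vs. GPTQ?",
        "stream": False,       # return one JSON object instead of a stream
    },
    timeout=300,
)
print(r.json()["response"])
```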

AI Engineering, Frameworks, and Tooling

  • Cline v3.15 introduced AI-assisted commit messages, UI for Windsurf & Cursor Rules, and batch history deletion. It also integrates Google's Gemini Implicit Caching for potential token discounts.

  • Dolphin-MCP, an open-source client that connects any AI model to MCP servers, received significant updates.

  • Anthropic's API now includes web search, enabling grounded answers with citations; an MCP server using it for agentic search has already been built (see the example at the end of this list).

  • OpenAI enabled Plus, Team, and Pro users to export deep research reports as formatted PDFs.

  • A tutorial demonstrated building an AI Research Agent using LangGraph and Ollama for web search and cited summaries.

  • OpenAI launched a GitHub connector for ChatGPT, allowing Deep Research to read and search source code and PRs.

  • Microsoft researchers introduced the ARTIST framework, which enhances LLMs like Qwen2.5 (7B and 14B) on mathematical reasoning benchmarks by combining agentic reasoning, dynamic tool use (including web search), and reinforcement learning (GRPO).

  • The MCP (Glama) ecosystem saw new tools: AiraHub's streamable HTTP protocol for MCP/A2A tools, DiffCalculia_MCP for AI-assisted large file editing, and fabric-mcp-server integrating Fabric patterns with Cline in VS Code.

  • Agentle, a Python framework for building type-safe AI agents, is slated for a May 16, 2025 release, featuring Streamlit chat, Langfuse tracing, and BlackSheep API docs.

  • Cyberdesk, a service enabling AI agents to control a virtual desktop, was open-sourced.

  • Unsloth AI's Dynamic 2.0 GGUF quants reportedly enable more human-like conversations. The community is simplifying tool calling using triple-quoted Python-style multi-line strings.

  • LM Studio API users noted a lack of documented methods for tool call determination, requiring workarounds.

  • The Modular (Mojo) community discussed removing autotuning due to complexity, planning for post-hoc trait conformance via extensions, and enabling bare metal programming by exposing compiler flags for no-stdlib binaries. A shared memory allocation bug was identified.

  • GPU MODE users discussed torch.export specializing on batch sizes and how to debug it, as well as the performance pitfalls of array-of-structs designs.

  • The DSPy community discussed a "DSPy Doctrine" outlining its design philosophy, progress on async LLM call support, and a presentation on using DSPy for optimizing correspondence templates in insurance.

  • LlamaIndex launched PapersChat for Arxiv/PubMed interaction, a Multilingual, Multimodal RAG System, a Deep Research Agent tutorial, and updated LlamaParse with new models and auto orientation detection. A tutorial on building an invoice reconciliation agent with LlamaIndex.TS and LlamaCloud was also released.

  • Aider v0.83.0 added support for gemini-2.5-pro-preview-05-06 and qwen3-235b models, with automatic parameter fetching for OpenRouter models. Its "architect mode" aids in planning multi-step code edits. On Azure, model routing similar to FrugalGPT is being explored.

  • tinypilot, a chatbot agent for learning the tinygrad framework, was introduced.

  • The Perplexity API now supports specifying subdirectories within domains for more precise search filtering (see the sketch at the end of this list).

  • Users are pointing coding tools like Cursor at local LLMs by overriding the OpenAI API base URL with the LM Studio server URL, as in the sketch at the end of this list.

  • The OptinBwd optimizer was rewritten as a drop-in replacement, and its author is seeking feedback ahead of further testing in Torchtune.
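
The web search example referenced above: a minimal sketch of enabling Anthropic's server-side web search tool on the Messages API. The tool type string reflects the launch-era documentation and may change.

```python
# Enable Anthropic's server-side web search tool for grounded, cited answers.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
response = client.messages.create(
    model="claude-3-7-sonnet-latest",
    max_tokens=1024,
    tools=[{"type": "web_search_20250305", "name": "web_search", "max_uses": 3}],
    messages=[{"role": "user", "content": "What changed in llama.cpp this week?"}],
)
print(response.content)  # content blocks include citation metadata
```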
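
The Perplexity filtering sketch referenced above, assuming the existing search_domain_filter field is what accepts subdirectory paths:

```python
# Restrict Perplexity's retrieval to one subdirectory of a domain.
import requests

resp = requests.post(
    "https://api.perplexity.ai/chat/completions",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "model": "sonar-pro",
        "messages": [{"role": "user", "content": "Summarize recent releases."}],
        # Path-scoped filter (assumed syntax): domain plus subdirectory.
        "search_domain_filter": ["huggingface.co/blog"],
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])
```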
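
And the base-URL override referenced above: pointing the standard OpenAI client at LM Studio's local server is enough for many tools. The model name is a placeholder for whatever you have loaded.

```python
from openai import OpenAI

# LM Studio serves an OpenAI-compatible API, by default on port 1234.
# The api_key just needs to be non-empty; LM Studio ignores its value.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")
resp = client.chat.completions.create(
    model="local-model",  # placeholder: use the identifier LM Studio shows
    messages=[{"role": "user", "content": "Write a haiku about local inference."}],
)
print(resp.choices[0].message.content)
```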

Advancements in Agent-Based Systems

  • Langchain highlighted examples like a company researcher agent and a deep research framework using coordinated LangGraph agents.

  • The Turing Post provided a deep dive into Multi-Agent Systems (MAS), covering architectures, types, and trends.

  • FutureHouse released five ‘AI Scientist’ agents for research, chemistry, and biology discovery.

  • Microsoft announced adoption of Google’s Agent2Agent (A2A) framework for Azure AI Foundry and Copilot Studio.

  • Unsloth AI community anticipates future fine-tuning focused on agentic behavior and autonomy.

  • NotebookLM users are generating agents using Library and Information Science techniques for study and content creation.

  • CraigBot, a self-hosted Discord bot, is being integrated with NotebookLM for TTRPG session recording and transcript generation, creating searchable campaign archives.

  • Users expressed a desire for GitHub repository integration with NotebookLM to generate code base overviews.

LLM Performance, Evaluation, and Alignment Challenges

  • OpenAI launched HealthBench, a new evaluation benchmark for AI in healthcare, developed with input from over 250 physicians.

  • Latest models like Gemini 2.5 Pro and GPT-4.1 show advanced document parsing capabilities, though human review is still needed.

  • Tencent's Hunyuan-Turbos model reportedly improved significantly, ranking #8 on the LMArena leaderboard.

  • METR's "task length doubling every ~7 months" trend primarily measures self-contained coding and ML tasks.

  • Users reported a perceived decline in GPT-4's reasoning and adaptability, with newer versions (GPT-4o, mini-iterations) seen as more formal, prone to hallucinations, and less adept for technical tasks. Some noted ChatGPT becoming less reliable, with errors and memory issues. Hypothesized causes include MoE architectures, cost optimizations, and safety fine-tuning.

  • A former OpenAI researcher detailed persistent alignment issues in ChatGPT, particularly sycophancy and overcorrection, suggesting that automated checks for these behaviors are ineffective.

  • Gemini 2.5 Pro users reported tool call failures, file reading problems, BYOK billing issues on Google AI Studio, and new rate limits.

  • Qwen3 models reportedly generated invalid JSON for tool calls in LM Studio and had incompatibility with tool-calling in Unsloth AI.

  • Claude 3.7 experienced caching failures on Vertex AI for some users.

  • Gemini Exp 1206's context window reportedly fluctuated, raising questions about the utility of large context windows if not effectively used.

  • The validity of the LMArena leaderboard and inconsistencies in Global MMLU benchmark answers across languages were topics of discussion.

  • LLM hallucinations, particularly with historical facts, remain a challenge. Users reported Claude.ai's web UI losing work due to internal server errors.

  • Perplexity API users reported a bug with image URL formats.

  • Issues with Gemini 2.5 Pro, including failures in tool calls, inability to read files, and generation of empty diffs, were discussed by Cursor Community users.

  • Some discussions touched on the potential for public, task-specific benchmarks to enforce transparency and accountability in model safety.

Key Research Directions and Academic Highlights

  • A suggestion was made for a new LLM learning paradigm called "system prompt learning," which resembles RL but updates a persistent system prompt via direct textual edits rather than gradient descent (a toy sketch follows this list).

  • Observations were shared that AI models are improving on IQ tests but may not feel significantly smarter, with arguments that useful originality emerges steeply at very high intelligence levels.

  • Sakana AI's Continuous Thought Machines (CTM) introduce a novel architecture using neuron-level timing for reasoning.

  • Top AI/ML research papers summarized included Absolute Zero, RM-R1, Seed-Coder, Flow-GRPO, ZeroSearch, Discuss-RAG, Llama-Nemotron, The Leaderboard Illusion, and Reward Modeling as Reasoning.

  • A guide on turning research into high-quality ML papers with scientific integrity was shared.

  • Discussions in the Yannick Kilcher Discord covered the need for new AGI architectures, the Turing completeness of neural networks, and using RL for improving coding/math skills without external data (similar to AlphaZero).

  • A preprint showed that steering vectors can transfer effectively between different language models, thanks to similarities in their token embedding spaces (a toy sketch of the basic mechanic follows this list).

  • The issue of ReLU activation functions breaking manifold continuity, often patched empirically rather than resolved geometrically, was discussed as an ongoing challenge.

  • A user in the Nous Research AI Discord explored applying Daoist principles to machine learning, creating a neural network inspired by the Lo Shu magic square, reporting improved accuracy and training speed on a specific task.
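
The toy sketch of "system prompt learning" referenced above; llm, check, and the edit policy shown are hypothetical stand-ins for one plausible reading of the idea, not a specified algorithm.

```python
# Toy sketch of "system prompt learning": the "weight update" is a direct
# textual edit of the system prompt, made by a model after each failure.

def llm(system: str, user: str) -> str:
    """Placeholder: wire this to whichever chat model you like."""
    raise NotImplementedError

def check(answer: str, expected: str) -> bool:
    """Toy verifier: does the expected answer appear in the output?"""
    return expected in answer

tasks = [("What is 17 * 24?", "408")]  # (question, reference answer) pairs
system_prompt = "You are a careful math assistant."

for task, expected in tasks:
    answer = llm(system_prompt, task)
    if not check(answer, expected):
        # Learning step: rewrite the prompt instead of taking a gradient step.
        system_prompt = llm(
            "Revise this system prompt so the assistant avoids the mistake "
            "shown. Return only the new prompt.",
            f"Prompt:\n{system_prompt}\n\nTask:\n{task}\n\nWrong answer:\n{answer}",
        )
```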
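
The steering-vector sketch referenced above, showing the basic extract-and-apply mechanic with PyTorch hooks. The layer index and scale are arbitrary choices, and the paper's cross-model transfer step is not reproduced here.

```python
# Extract a steering vector as a difference of mean activations on
# contrastive prompts, then add it back in with a forward hook.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
layer = model.transformer.h[6]  # a mid-depth block, chosen arbitrarily

def mean_residual(texts):
    acts = []
    def hook(module, inputs, output):
        acts.append(output[0].mean(dim=1))  # average hidden states over tokens
    handle = layer.register_forward_hook(hook)
    with torch.no_grad():
        for t in texts:
            model(**tok(t, return_tensors="pt"))
    handle.remove()
    return torch.cat(acts).mean(dim=0)

steer = mean_residual(["I love this."]) - mean_residual(["I hate this."])

def add_steer(module, inputs, output):
    return (output[0] + 4.0 * steer,) + output[1:]  # 4.0: arbitrary strength

handle = layer.register_forward_hook(add_steer)
with torch.no_grad():
    out = model.generate(**tok("The movie was", return_tensors="pt"),
                         max_new_tokens=12)
print(tok.decode(out[0]))
handle.remove()
```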

Vision Language Models (VLMs) and Multimodality

  • Gemini 2.5 Pro's video understanding update allows processing up to 6 hours of video in a 2-million-token context (at 'low resolution'), combining audio-visual understanding with code processing (see the sketch below).

  • llama.cpp now has VLM support, including for Gemma 3, Qwen2.5-VL, and InternVL3.

  • JoyCaption Beta One, an open-source, uncensored VLM for image captioning useful for training diffusion models, was released.
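
The sketch referenced in the Gemini item above, using the google-genai SDK; the preview model name is taken from items earlier in this issue, and upload/processing details may differ in practice.

```python
# Ask Gemini 2.5 Pro to reason over an uploaded video file.
from google import genai

client = genai.Client()  # reads GEMINI_API_KEY from the environment
video = client.files.upload(file="lecture.mp4")
# Long uploads may need to finish server-side processing before use.
resp = client.models.generate_content(
    model="gemini-2.5-pro-preview-05-06",
    contents=[video, "Summarize this video and list any code shown on screen."],
)
print(resp.text)
```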

Hardware and Infrastructure Developments

  • The concept of globally distributed GPUs for training, like Prime Intellect's approach, contrasts with the current dominance of large co-located GPU clusters.

  • NVIDIA RTX 5090 driver updates (version 576.02) reportedly brought significant inference speedups, with models like Qwen3 30B MoE at Q4 exceeding 170 tokens/s.

  • Speculation surrounded the AMD Ryzen AI Max 395 Mini PC, which could offer 4-6 tokens/s on 70B models with quad-channel DDR5.

  • Discussions on Triton performance on the NVIDIA 50 series and the lack of nvbench alternatives in ROCm took place.

  • Hugging Face now offers serverless Spaces using H200 GPUs, albeit with session time limits.

  • LM Studio continues to be recommended for running local LLMs like Llama and DeepSeek; meanwhile, GPT4All users reported startup failures on CPUs lacking AVX/AVX2 support.

  • Users encountered a bug in Mojo with puzzles 8 and 9 where raw memory approach variants seemed to allocate too much shared memory due to a misinterpretation of stack_allocation parameters.

  • Matrix multiplication in tinygrad on a T4 GPU was reported to be significantly slower than in PyTorch, prompting investigation (a rough microbenchmark sketch follows).
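
The microbenchmark sketch referenced above, under stated assumptions: sizes and warm-up counts are arbitrary, and a fair comparison needs repeated runs and matched dtypes.

```python
# Time one large matmul in tinygrad and in PyTorch after a warm-up pass.
import time
import torch
from tinygrad import Tensor

N = 4096
a, b = Tensor.rand(N, N), Tensor.rand(N, N)
(a @ b).realize()                     # warm-up: triggers kernel compilation
t0 = time.perf_counter()
(a @ b).realize().numpy()             # .numpy() forces the result off-device
print("tinygrad:", time.perf_counter() - t0, "s")

x, y = (torch.rand(N, N, device="cuda") for _ in range(2))
_ = x @ y                             # warm-up
torch.cuda.synchronize()
t0 = time.perf_counter()
_ = x @ y
torch.cuda.synchronize()              # wait for the async GPU kernel
print("torch   :", time.perf_counter() - t0, "s")
```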

AI Ethics, Policy, and Societal Impact

  • A US Copyright Office pre-publication report suggested that training generative AI on copyrighted commercial content likely exceeds fair use, especially if competing with existing markets or involving illegal access. The head of the office was later fired.

  • The use of AI for grading student work sparked debate, with some seeing it as teacher obsolescence and others as a tool to free up teacher time, noting automated grading has existed for decades.

  • Discussions occurred around a report of OpenAI potentially providing LLMs for military drones, with some community members expressing significant concern.

  • Cursor's pricing model, especially a 20% API markup in "Max mode," caused confusion among users.

