05-15-2025

TLDR of AI news

Technological Advancements & Model Releases

  • Google's AlphaEvolve: This Gemini-powered coding agent is designed for algorithm discovery. It has demonstrated faster matrix multiplication algorithms (including a kernel that speeds up Gemini training by 23%, cutting total training time by 1%), new results on open mathematical problems (surpassing the state of the art on 20% of the problems it was applied to, including improved bounds for the Minimum Overlap Problem and the kissing number in 11 dimensions), and efficiency gains across Google in data center scheduling, chip and hardware design, and AI training. AlphaEvolve operates as an agent with multiple components in a loop, modifying, evaluating, and optimizing code (text) rather than model weights; a sketch of that loop follows this list.

  • GPT-4.1 Availability: GPT-4.1 is now available in ChatGPT for Plus, Pro, and Team users, with Enterprise and Education access coming soon. It specializes in coding tasks and instruction following, positioned as a faster alternative to OpenAI o3 & o4-mini for daily coding. GPT-4.1 mini is also replacing GPT-4o mini for all ChatGPT users and is reported to be a significant upgrade.

  • AM-Thinking-v1 Reasoning Model: This 32B parameter model, built on the open-source Qwen2.5-32B base and publicly available queries, is reported to outperform DeepSeek-R1 and rival the performance of larger models like Qwen3-235B-A22B and Seed1.5-Thinking in reasoning tasks.

  • Salesforce BLIP3-o Multimodal Models: Salesforce has released the BLIP3-o family of fully open unified multimodal models on Hugging Face. These models utilize a diffusion transformer to generate semantically rich CLIP image features.

  • Nous Decentralized Pretraining: Nous has initiated a decentralized pretraining run for a dense, DeepSeek-like model with 40B parameters, aiming to train it on over 20T tokens and incorporating multi-head latent attention (MLA) for long-context efficiency.

  • Gemini Implicit Caching: Google DeepMind's Gemini now supports implicit caching, which can yield up to 75% cost savings when requests hit the cache. It is particularly beneficial for queries sharing a common prefix, such as repeated questions against a large PDF document; a prompt-structuring sketch follows this list.

  • New Model Announcements & Sightings: DeepSeek V3 (an MoE model), Qwen3 (noted for translating Mandarin datasets), and Samsung's MuTokenZero2-32B have been subjects of discussion. Samsung also inadvertently published, and then removed, the MythoMax-L2-13B roleplay model on Hugging Face.

  • OpenAI Safety & Evaluation Tools: OpenAI introduced the Safety Evaluations Hub to share safety results for their models and added Responses API support to their Evals API and dashboard, allowing comparison of model responses.
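
The evolutionary loop behind AlphaEvolve can be pictured as mutate-evaluate-select over program text. Below is a minimal, self-contained sketch of that pattern; the objective, the mutation operator, and all names are illustrative stand-ins (AlphaEvolve uses Gemini models to propose code edits and automated benchmarks to score them), not Google's implementation.

```python
import random

def evaluate(program: str) -> float:
    """Score a candidate program. Toy objective: prefer shorter code.
    AlphaEvolve plugs in real automated evaluators (e.g. kernel benchmarks)."""
    return -len(program)

def mutate(program: str) -> str:
    """Propose an edited candidate. AlphaEvolve asks an LLM to rewrite
    promising regions of the code; here we just squeeze whitespace."""
    return program.replace("  ", " ", 1)

# Start from a working (but unoptimized) program and iterate.
population = ["def add(a, b):    return    a + b"]
for _ in range(50):
    parent = random.choice(population)
    child = mutate(parent)
    # Keep candidates that score at least as well as their parent.
    if evaluate(child) >= evaluate(parent):
        population.append(child)

print(max(population, key=evaluate))
```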
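
Because implicit caching keys on shared request prefixes, the practical rule from the Gemini item above is: put the large, stable content first and the varying question last. A hedged sketch using the google-genai Python SDK (the model name and file path are placeholder assumptions):

```python
from google import genai

client = genai.Client()  # picks up GEMINI_API_KEY from the environment

# The large, stable content (e.g. text extracted from a big PDF) goes first,
# so consecutive requests share a common prefix and can hit the implicit cache.
document = open("contract.txt").read()  # placeholder document

for question in ["Who are the parties?", "What is the termination clause?"]:
    response = client.models.generate_content(
        model="gemini-2.5-flash",  # placeholder; use a model with implicit caching
        contents=document + "\n\nQuestion: " + question,
    )
    print(response.text)
```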

AI Engineering, Tooling, and Frameworks

  • LangChain Updates: The LangGraph Platform is now generally available for deploying, scaling, and managing agents with stateful workflows (a minimal example of such a graph follows this list). LangChain also introduced the Open Agent Platform (OAP), an open-source, no-code agent builder that connects to MCP Tools, LangConnect for RAG, and other LangGraph Agents. At LangChain Interrupt 2025, OpenEvals, a set of utilities for simulating conversations and evaluating LLM application performance, was launched.

  • Model Context Protocol (MCP): Hugging Face has released an MCP course covering its usage. MCP is also being integrated into tools like LangChain's OAP.

  • FedRAG Framework: An open-source framework called FedRAG has been introduced for fine-tuning RAG systems across both centralized and federated architectures.

  • Unsloth TTS Fine-tuning: Unsloth now supports efficient Text-to-Speech (TTS) model fine-tuning, claiming ~1.5x faster training and 50% less VRAM usage. Supported models include Sesame/csm-1b and Transformer-based models, with workflows for emotion-annotated datasets. A new Qwen3 GRPO method is also supported.

  • llama.cpp PDF Input: Native PDF input support has been added to the llama.cpp web UI via an external JavaScript library, allowing users to toggle between text extraction and image rendering without affecting the C++ core.

  • AI-Powered "8 Ball" Device: A local, offline AI "8 Ball" has been implemented on an Orange Pi Zero 2W, using whisper.cpp for speech-to-text and llama.cpp for LLM inference (a Gemma 3 1B model), showcasing offline AI hardware capabilities.

  • Transformers + MLX Integration: Deeper integrations between Hugging Face Transformers and Apple's MLX are anticipated, highlighting the importance of Transformers to the open-source AI ecosystem.

  • Atropos and Axolotl AI: Training using Atropos can now be done via Axolotl AI.

  • Quantization Performance: The Unsloth AI community reports that QNL quantization offers faster performance than standard GGUFs, and that keeping a model entirely in VRAM is critical for optimal performance (see the offloading sketch after this list).

  • Framework Usage: Developers are using DSPy for structured outputs with Pydantic models (a sketch follows this list) and LlamaIndex for event-driven agent workflows, such as a multi-agent Docs Assistant. The Shortwave email client has also added Model Context Protocol (MCP) support.

  • Hardware Optimizations: Multi-GPU fine-tuning with tools like Accelerate and Unsloth is a popular topic. Active benchmarking of MI300 cards and discussions on TritonBench errors on AMD GPUs are ongoing.

  • OpenMemory MCP: Mem0.ai introduced OpenMemory MCP, a unified memory management layer for AI applications.
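
For the LangGraph Platform item above: the deployable unit is a compiled graph over typed state. A minimal, runnable sketch of that stateful-workflow pattern (the node is a stub standing in for real LLM or tool calls):

```python
from typing import TypedDict
from langgraph.graph import StateGraph, START, END

class State(TypedDict):
    question: str
    answer: str

def answer_node(state: State) -> dict:
    # A production agent would call an LLM or tools here.
    return {"answer": f"You asked: {state['question']}"}

builder = StateGraph(State)
builder.add_node("answer", answer_node)
builder.add_edge(START, "answer")
builder.add_edge("answer", END)

graph = builder.compile()  # a compiled graph like this is what the platform hosts
print(graph.invoke({"question": "What is stateful about this?", "answer": ""}))
```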
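
On the quantization item: "entirely in VRAM" means offloading every layer to the GPU. With the llama-cpp-python bindings that is controlled by n_gpu_layers; the model path below is a placeholder, and -1 requests full offload:

```python
from llama_cpp import Llama

# n_gpu_layers=-1 offloads all layers to the GPU. If a quantized model
# spills into system RAM instead, token throughput drops sharply.
llm = Llama(
    model_path="models/example-q4_k_m.gguf",  # placeholder path
    n_gpu_layers=-1,
    n_ctx=8192,
)

out = llm("Q: Why keep the whole model in VRAM?\nA:", max_tokens=64)
print(out["choices"][0]["text"])
```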
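
And for the DSPy-plus-Pydantic pattern: a typed signature lets the model's output parse straight into a validated Pydantic object. A hedged sketch (the model id and fields are illustrative):

```python
import dspy
from pydantic import BaseModel

class Ticket(BaseModel):
    title: str
    priority: int  # 1 (low) to 5 (urgent)

class ExtractTicket(dspy.Signature):
    """Extract a support ticket from a raw email."""
    email: str = dspy.InputField()
    ticket: Ticket = dspy.OutputField()

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))  # any supported model id
extract = dspy.Predict(ExtractTicket)

result = extract(email="Subject: Login broken!! Nothing works since the update.")
print(result.ticket.title, result.ticket.priority)  # validated Ticket instance
```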

Reasoning, Agentic Systems, and Evaluation Challenges

  • LLMs in Multi-Turn Conversations: Research indicates that LLMs (both open- and closed-source) show a significant performance drop (39% on average) in multi-turn, underspecified conversational settings compared to single-turn, fully specified instructions. Common failure modes include making premature assumptions and failing to recover from early misinterpretations. Restarting the conversation with the full context consolidated into the first prompt can mitigate this (see the sketch after this list).

  • Chain-of-Thought (CoT) Reasoning: CoT is described by some as a simplified way to sample meaningful latent variables in LLMs.

  • Reinforcement Learning for Search-Efficient LLMs: A new post-training RL framework aims to explicitly train LLMs to optimize their search usage.

  • Runway References for Zero-Shot Testing: Runway's References feature is being used for zero-shot testing of creative assets like clothes, locations, and poses in generative video.

  • ARI Beats OpenAI's Deep Research: The Advanced Research & Insights (ARI) agent reportedly outperformed OpenAI's Deep Research on two benchmarks.

  • Limitations of Current Evaluations: It's noted that current evaluation loops for AI models often don't hold up well when interacting with real human users, and multiple-choice evaluations like MMLU have been observed incorrectly marking model outputs as wrong.

  • Coding Skills of GPT-4.1: GPT-4.1 is reported to have excellent coding skills and instruction-following capabilities.

  • OpenAI to Z Challenge: OpenAI announced a challenge using its o3/o4-mini and GPT-4.1 models to discover previously unknown archaeological sites.
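
The mitigation in the multi-turn item above is mechanical to apply: gather every constraint that dribbled in across turns and restate it as one fully specified first prompt in a fresh conversation. A hedged sketch with the OpenAI Python SDK (the model name is a placeholder):

```python
from openai import OpenAI

client = OpenAI()

# Constraints that arrived piecemeal over an underspecified conversation...
turns = [
    "Write a Python function to deduplicate a list.",
    "It should preserve the original order.",
    "It also has to handle unhashable items.",
]

# ...consolidated into a single fully specified instruction and retried fresh,
# so the model cannot anchor on an early misinterpretation.
full_spec = " ".join(turns)

response = client.chat.completions.create(
    model="gpt-4.1",  # placeholder; any chat model works
    messages=[{"role": "user", "content": full_spec}],
)
print(response.choices[0].message.content)
```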

Industry Developments and Community

  • OpenAI's Tech Stack: OpenAI reportedly uses FastAPI to serve ChatGPT.

  • Creative AI Projects: The community is showcasing open-source projects like Jinko MCP (AI agents for hotel sales), Tig coding agent (built with LlamaIndex), and AsianMOM (a WebGPU Vision-LLM app for humorous "roasting").

  • Learning Opportunities: The Nous Research and Solana Foundation Decentralized AI event, Lambda's workshops on agentic applications (with API credits), and the BlackboxNLP 2025 shared tasks are providing learning and collaboration avenues.

  • Grok Model Controversies: Elon Musk's Grok model reportedly made controversial statements, leading to user distrust.

  • TypeScript Developer Departure: The departure of a key TypeScript developer from Microsoft has caused some concern in the developer community.

  • User Workarounds for Platform Issues: Users are sharing solutions for issues like restricted OpenAI access (using proxies), managing token usage in tools like Aider, and finding alternatives to services like GPT4All (e.g., Jan.ai, LM Studio). Perplexity AI users have faced issues with Pro role assignment and Deep Research mode performance. Llama 3.1/3.3 models reportedly had issues with fantasy prompts, and Claude 3.5 Sonnet was observed getting stuck in loops.
