TLDR of AI news

May 19, 2025

AI Model Releases and Performance

  • Meta KernelLLM 8B: This model reportedly outperformed GPT-4o and DeepSeek V3 in single-shot performance on KernelBench-Triton Level 1. With multiple inferences, it also surpassed DeepSeek R1.

  • Mistral Medium 3: Made a strong debut, ranking #11 overall in chat, #5 in Math, #7 in Hard Prompts & Coding, and #9 in WebDev Arena.

  • Qwen3 Models: This new series includes dense and Mixture-of-Experts (MoE) models ranging from 0.6B to 235B parameters, featuring a unified framework and expanded multilingual support.

  • DeepSeek-V3: This model utilizes hardware-aware co-design and addresses scaling challenges in AI architectures.

  • BLIP3-o: A family of fully open unified multimodal models using a diffusion transformer has been released, demonstrating superior performance on image understanding and generation tasks.

  • Salesforce xGen-Small: This family of small AI models includes a 9B parameter model showing strong performance on long-context understanding and on math and coding benchmarks.

  • Bilibili AniSORA: An anime video generation model has been released.

  • Stability AI Stable Audio Open Small: This open-sourced text-to-audio AI model generates 11-second audio clips and is optimized for Arm-based consumer devices.

  • Google AlphaEvolve: This coding agent uses LLM-guided evolution to discover new algorithms and optimize computational systems. It reportedly found the first improvement on Strassen's matrix multiplication algorithm since 1969.

  • Qwen 2.5 Mobile Integration: Qwen 2.5 models (1.5B Q8 and 3B Q5_0) are now available in the PocketPal mobile app for iOS and Android.

  • Marigold IID: A new state-of-the-art open-source depth estimation model, Marigold IID, has been released, capable of generating normal maps and depth maps for scenes and faces.

  • Salesforce Lumina-Next: Released on a Qwen base, this model is reported to slightly surpass Janus-Pro.

  • Gemini Model Performance: Users have observed mixed performance with Gemini models. Gemini 2.5 Pro 0506 is noted as better for coding, while older versions (like 03-25) are reportedly better for math. The deprecation of Gemini 2.5 Pro Experimental has caused some user dissatisfaction due to filtering issues in newer versions.

  • GPT/o-Series Speculation: There is speculation that GPT-5 might adopt a structure similar to Gemini 2.5 Pro, combining LLM and reasoning models, with a potential summer release. The delay of o3-pro has led to some user frustration.

AI Safety, Reasoning, and Instruction Following

  • Chain-of-Thought (CoT) and Instruction Following: Research suggests that CoT reasoning can surprisingly harm a model’s ability to follow instructions. Mitigation strategies like few-shot in-context learning, self-reflection, self-selective reasoning, and classifier-selective reasoning (the most robust) can counteract these failures.

  • Generalization of Reasoning: Reasoning capabilities reportedly fail to generalize well across different environments, and prompting strategies can yield high variance, undermining the reliability of advanced reasoning techniques. Larger models benefit less from strategic prompting, and excessive reasoning can negatively impact smaller models on simple tasks.

  • AI Safety Paradox: It's argued that as the marginal cost of intelligence falls, defenders in biological or cyber warfare could gain the advantage, since cheap intelligence makes it feasible to identify and patch far more attack vectors.

  • LLM Performance in Multi-Turn Conversations: A new study found that LLM performance degrades sharply in multi-turn conversations, driven largely by increased unreliability: models make premature assumptions early in a conversation and struggle to recover from them.

  • J1 Incentivizing Thinking in LLM-as-a-Judge: Research is exploring RL techniques to incentivize "thinking" in LLM-as-a-Judge systems.

  • Predicting Reasoning Strategies: A Qwen study found a strong correlation between question similarity and strategy similarity, enabling the prediction of optimal reasoning strategies for unseen questions.

  • Fine-tuning for Reasoning: Researchers significantly improved an LLM's reasoning by fine-tuning it on just 1,000 examples.

  • Spontaneous Social Conventions in LLMs: A study revealed that universally adopted social conventions can spontaneously emerge in decentralized LLM populations through local interactions, leading to strong collective biases even without initial individual agent biases. Committed minority groups of adversarial LLM agents can reportedly drive social change.
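
The classifier-selective reasoning mitigation mentioned above can be illustrated with a toy router: a lightweight classifier decides per prompt whether chain-of-thought is likely to help or to hurt instruction following. The keyword heuristic below is a stand-in for a trained classifier, and `llm` is a hypothetical callable; this is a minimal sketch, not the setup from the research.

```python
def needs_reasoning(prompt: str) -> bool:
    # Hypothetical lightweight router: in practice this would be a trained
    # classifier predicting whether CoT helps or hurts on this prompt.
    math_markers = ("solve", "compute", "how many", "prove")
    return any(m in prompt.lower() for m in math_markers)

def answer(prompt: str, llm) -> str:
    if needs_reasoning(prompt):
        # Reasoning-heavy prompts get an explicit CoT trigger.
        return llm(prompt + "\nLet's think step by step.")
    # Constraint-heavy instructions are sent without CoT, since chain-of-thought
    # can cause the model to drift from explicit formatting instructions.
    return llm(prompt)
```

A real deployment would replace the keyword check with a small trained classifier, which the research reports as the most robust of the mitigation strategies.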
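
The question-similarity finding above suggests a simple retrieval scheme: for an unseen question, reuse the best-known strategy of the most similar previously seen question. The bag-of-words cosine similarity and the tiny strategy bank below are illustrative stand-ins; a real system would use learned embeddings and a larger strategy library.

```python
from collections import Counter
import math

def bow(text: str) -> Counter:
    # Toy bag-of-words representation; real systems would use embeddings.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def predict_strategy(question: str, bank: list) -> str:
    # bank: (question, best-known strategy) pairs from prior evaluations.
    q = bow(question)
    best = max(bank, key=lambda qs: cosine(q, bow(qs[0])))
    return best[1]

bank = [
    ("how many apples are left after eating three", "step-by-step arithmetic"),
    ("write a haiku about autumn", "direct generation"),
]
print(predict_strategy("how many oranges are left after selling five", bank))
# → step-by-step arithmetic
```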

AI Tools and Applications

  • Microsoft Adds Grok to Azure: Grok 3 is now available on Microsoft Azure through their foundry model collection.

  • GitHub Copilot Enhancements: GitHub Copilot now supports the entire software development lifecycle, including agent mode, team support, app modernization, and an SRE Agent.

  • OpenAI Codex Launch: OpenAI launched Codex, a new coding agent that can build features and fix bugs autonomously, available for Pro, Enterprise, and Team users.

  • Alibaba Qwen Chat Deep Research: The Qwen team made Deep Research for Qwen Chat available to all users, enabling the preparation of detailed reports.

  • Notion "AI for Work" Suite: Notion launched an "AI for Work" suite for business plan subscribers, offering AI meeting notes, access to various AI models, enterprise search, and a research mode.

  • Alibaba Wan2.1-VACE: A unified AI for video creation and editing, Wan2.1-VACE, is now available in 1.3B and 14B sizes.

  • MLX-Powered LLMs on Hugging Face Hub: MLX-powered LLMs can now be accessed directly from Hugging Face Hub for fast terminal-based intelligence.

  • Modal Labs Dicts Serverless KV Store: New features include no scale limits, LRU-cache semantics, distributed locking, and durability.

  • LangChain LangGraph Node-Level Caching: LangGraph now supports node-level caching for faster iteration.

  • Genspark AI Sheets: An application that allows users to interact with their spreadsheets using natural language.

  • Ollama v0.7 Multimodal Support: Ollama now natively supports multimodal models.

  • Clara AI Workspace: An open-source, fully offline, modular AI workspace aiming to unify LLMs, agents, automation (via n8n), and local image generation (Stable Diffusion/ComfyUI).

  • Kokoro-JS Text-to-Speech: An open-source, client-side TTS system that runs entirely in the browser using an ONNX model and local resources (WebGPU/WebGL).

  • OuteTTS 1.0: A 0.6B parameter multilingual TTS model based on Qwen-3, optimized for efficient batch inference and supporting multiple inference backends.

  • Model Context Protocol (MCP): This protocol standardizes how AI applications connect models to external tools and data sources, even across different machines. Qwen 3 235B reportedly supports it natively. The MCPControl server v0.2.0 now has SSE support.

  • DSPy and LlamaIndex for Agent Workflows: DSPy 2.6 has updated its suggestion/assertion mechanisms. LlamaIndex Agents feature improved memory and support for multi-agent workflows with Weaviate.

  • AWS Strands Agents SDK: Amazon released this open-source SDK to streamline agent creation.

  • Sherlog Canvas: An AI-powered debugging interface integrating MCP-powered cells for logs and metrics.

  • MCP UI SDK: Adds rich web interaction capabilities to MCP servers.

  • NotebookLM Mobile App and Video Uploads: The NotebookLM mobile app is available on iOS and Android. The web version now supports video uploads with automatic transcription, though some users report changes in writing style and potential censorship issues.

  • OpenRouter API Changes: Google is deprecating Gemini 2.5 Pro Experimental on OpenRouter. The free DeepSeek V3 0324 is undergoing maintenance. Issues with the Kluster provider for Qwen3 235B required OpenRouter to switch providers.

  • Hugging Face Tool Updates: LlamaIndex announced LlamaParse updates and first-class support in Azure AI Foundry Agent Service. Vitalops released datatune, an open-source tool for data transformations via natural language.

AI Business and Strategy

  • Sakana AI and MUFG Bank Partnership: Sakana AI and MUFG Bank have signed a comprehensive partnership to integrate AI into MUFG's systems, potentially making Sakana AI profitable within a year.

  • Cohere and Dell Partnership: Cohere is partnering with Dell to offer secure, agentic enterprise AI solutions on-premises.

  • Perplexity on WhatsApp: Perplexity's WhatsApp integration is reportedly snappier and more conversational, leading to increased usage.

  • New Product Development Law: Teams are encouraged to cultivate a culture of experimentation with generative models to discover new product experiences rapidly.

  • Setting Company Values: The importance of establishing company values early is emphasized, as later course correction is difficult.

  • AI-Driven Layoffs and Workforce Restructuring: Several major tech companies have announced layoffs attributed to AI-focused restructuring or the elimination of roles perceived as less relevant in an AI-driven context. However, there is skepticism about attributing all layoffs solely to AI, with suggestions that broader economic factors and routine workforce adjustments also play a role.

Infrastructure, Tools, and Datasets

  • NVIDIA Physical AI Models: NVIDIA has open-sourced Physical AI models, which are reasoning models designed to understand physical common sense and generate appropriate embodied decisions.

  • Meta KernelLLM 8B Release: Meta released KernelLLM 8B on Hugging Face.

  • SaharaLabsAI SIWA Testnet: The SIWA Testnet is live, powering scalable compute for their development platform.

  • Marin Open Lab for AI: Marin, an open lab for AI, was established to promote open-source AI with open development practices.

  • Open Molecules 2025 (OMol25) and Meta UMA: OMol25, a new Density Functional Theory (DFT) dataset for molecular chemistry, and Meta's Universal Model for Atoms (UMA), a machine learning interatomic potential, have been released.

  • Tsinghua University HuB Framework: Researchers detailed HuB, a unified framework to help humanoids handle extreme balancing tasks.

  • Intel Arc Pro GPUs: Intel launched the Arc Pro B50 (16GB VRAM, ~$299) and Arc Pro B60 (24GB VRAM, ~$500) GPUs, targeting professional and AI workstation markets, particularly for memory-intensive LLM/AI workflows. "Project Battlematrix" workstations will feature the B60. A dual-GPU B60 configuration could offer 48GB VRAM for under $1,000.

  • ParScale Model and Paper: Qwen released the ParScale model and a corresponding paper detailing a parallel scaling method for transformers. This method uses P parallel streams and suggests that scaling with P streams is theoretically comparable to increasing parameter count by O(log P), potentially offering better efficiency than MoE models.

  • GPU Hardware Performance: The Intel Arc Pro B60 is noted for its VRAM capacity and price point. MacBooks are praised for efficient local LLM execution. Enabling Resizable BAR on GPUs can significantly boost LM Studio performance.

  • Triton, CUDA, and ROCm Challenges: Users are encountering challenges integrating FSDP and Flash Attention 2 with trl, and debugging CUDA errors. Debates continue regarding ROCm Triton's kpack argument impact on performance.

  • Quantization and Kernel Optimization: Discussions focus on FP8-MM and MoE performance on MI300 hardware. Users are experimenting with Quantization Aware Training (QAT) and tools like CuTeDSL for low-level kernel optimization.
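
The ParScale computation pattern described above can be sketched in a few lines: P learned transforms of the input produce P parallel streams through the same shared backbone, and a learned weighting aggregates the outputs. The random transforms, uniform weights, and toy backbone below are placeholders, assuming NumPy; this illustrates only the data flow, not the paper's training recipe.

```python
import numpy as np

rng = np.random.default_rng(0)

def shared_model(x: np.ndarray) -> np.ndarray:
    # Stand-in for the shared backbone: a fixed nonlinear map.
    W = np.array([[0.5, -0.2], [0.1, 0.9]])
    return np.tanh(x @ W)

def parscale_forward(x: np.ndarray, P: int = 4) -> np.ndarray:
    # P input transforms (learned in the paper; random here) create
    # P parallel streams through the SAME backbone weights.
    transforms = [rng.normal(size=(2, 2)) for _ in range(P)]
    streams = [shared_model(x @ T) for T in transforms]
    # Learned aggregation weights (uniform here) combine the P outputs.
    w = np.full(P, 1.0 / P)
    return sum(wi * s for wi, s in zip(w, streams))

y = parscale_forward(np.ones((1, 2)))
print(y.shape)  # (1, 2)
```

The paper's claim is that compute spent on P streams buys capability comparable to growing the parameter count by O(log P), since the backbone weights are reused across streams.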
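
Quantization Aware Training, mentioned above, simulates low-precision arithmetic during training by inserting a quantize-dequantize ("fake quant") step in the forward pass, while gradients flow through unchanged (the straight-through estimator). A minimal scalar sketch, with an assumed fixed scale:

```python
def fake_quant(x: float, bits: int = 8, scale: float = 0.1) -> float:
    # Round to the integer grid, clamp to the representable signed range,
    # then map back to floats. During QAT the forward pass sees these
    # quantized values, so the model learns to tolerate quantization error.
    qmax = 2 ** (bits - 1) - 1
    q = round(x / scale)
    q = max(-qmax - 1, min(qmax, q))
    return q * scale

print(fake_quant(0.123))  # 0.1 (nearest multiple of the scale)
```

In practice frameworks apply this per tensor or per channel with learned or calibrated scales; the fixed `scale=0.1` here is purely illustrative.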
