TLDR of AI news

May 13, 2025

05-09-2025

Large Language Models (LLMs) and Model Performance

  • Gemini 2.5 Flash: Reported to be 150x more expensive than Gemini 2.0 Flash due to higher output token costs and increased token usage for reasoning. Despite this, a 12-point increase in an intelligence index may justify its use. Reasoning models are generally pricier per token due to longer outputs.
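
  The multiplicative effect described above (a higher per-token output price times many more output tokens) can be sketched with hypothetical numbers; the prices and token counts below are illustrative, not Google's actual rates.

  ```python
  # Hypothetical illustration of why reasoning models can cost far more per
  # query: the per-token output price and the number of output tokens both
  # rise, and the two factors multiply. All numbers here are made up.

  def query_cost(output_tokens: int, price_per_million: float) -> float:
      """Cost in dollars for one query's output tokens."""
      return output_tokens / 1_000_000 * price_per_million

  # Baseline model: short answers at a low output price.
  base = query_cost(output_tokens=700, price_per_million=0.40)

  # Reasoning model: ~8.75x the per-token price and ~17x the output tokens
  # (long chain-of-thought), so the cost ratio is the product of the two.
  reasoning = query_cost(output_tokens=700 * 17, price_per_million=0.40 * 8.75)

  ratio = reasoning / base
  print(ratio)
  ```

  The point is that neither factor alone explains a ~150x gap; per-token price and verbosity compound.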

  • Mistral Medium 3: Performance rivals Llama 4 Maverick, Gemini 2.0 Flash, and Claude 3.7 Sonnet, showing gains in coding and math. It is priced lower than Mistral Large 2 ($0.40/$2.00 vs. $2.00/$6.00 per 1M input/output tokens), though it may use more tokens due to more verbose responses.

  • Qwen3 Model Family: Alibaba's Qwen3 includes eight open LLMs supporting an optional reasoning mode and multilingual capabilities across 119 languages. It performs well in reasoning, coding, and function-calling, and features a Web Dev tool for building webpages/apps from prompts.

  • DeepSeek and Pangu: Huawei’s Pangu Ultra MoE reportedly achieved performance comparable to DeepSeek R1 while training on 6K Ascend NPUs. DeepSeek is suggested to have set a new default for open LLMs, with reports of newly acquired compute resources, potentially for V4 training.

  • Reinforcement Fine-Tuning (RFT) on o4-mini: OpenAI announced RFT availability for o4-mini, using chain-of-thought reasoning and task-specific grading to improve performance, aiming for flexible and accessible RL.

  • X-REASONER: Microsoft’s vision-language model, X-REASONER, is post-trained solely on general-domain text for generalizable reasoning across modalities and domains.

  • Scalability of Reasoning Training: The rapid scaling of reasoning training is expected to slow down within approximately a year.

  • HunyuanCustom: Tencent released weights for their HunyuanCustom model on Hugging Face. The FP8 weights alone are 24GB, a size many users consider large.

  • Advanced Local LLM Inference Optimization: Offloading individual FFN tensors (e.g., ffn_up weights) instead of entire GGUF model layers in llama.cpp/koboldcpp can reportedly increase generation speed by over 2.5x at the same VRAM usage for large models. This granular approach keeps only the largest tensors in CPU memory, so every layer still technically executes on the GPU.
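
  As a rough sketch of the technique, assuming a llama.cpp build that supports the --override-tensor flag (the model path and regex below are placeholders, not from the original post):

  ```shell
  # Offload all layers to the GPU (-ngl 99), then pin only the large
  # ffn_up tensors back to CPU memory by regex. Flag names assume a
  # recent llama.cpp build with --override-tensor support.
  ./llama-server -m model.gguf \
    -ngl 99 \
    --override-tensor "ffn_up=CPU"
  ```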

  • Qwen3 Reasoning Emulation: A method was described to make the Qwen3 model produce step-by-step reasoning by prefacing outputs with a template, mimicking Gemini 2.5 Pro's style, though this doesn't inherently improve the model's intelligence.
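
  A minimal sketch of the trick, assuming an OpenAI-style chat message format; the preface text itself is invented for illustration:

  ```python
  # Sketch of the "reasoning preface" trick described above: pre-fill the
  # start of the assistant turn with a step-by-step template so the model
  # continues in that format. The template wording is illustrative.

  REASONING_PREFACE = (
      "Before answering, reason step by step:\n"
      "1. Restate the problem.\n"
      "2. List known facts.\n"
      "3. Derive the answer.\n\n"
  )

  def with_reasoning_preface(user_prompt: str) -> list[dict]:
      """Build a chat message list that primes a structured reasoning reply."""
      return [
          {"role": "user", "content": user_prompt},
          # Pre-filling the assistant turn steers the output format;
          # it does not make the model itself any smarter.
          {"role": "assistant", "content": REASONING_PREFACE},
      ]

  messages = with_reasoning_preface("What is 17 * 24?")
  ```

  As the original note points out, this shapes the style of the output rather than improving the underlying model.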

  • Gemini 2.5 Pro Performance Issues: Users across various platforms (LMArena, Cursor, OpenAI) reported that Gemini 2.5 Pro (especially version 0506) exhibits a ‘thinking bug,’ memory loss, slow request processing, and chain-of-thought failures after approximately 20k tokens.

  • Upcoming OpenAI Open-Source Model: OpenAI plans to release an open-source model in summer 2025, though it will be a generation behind their current frontier models. This is intended to balance competitiveness and limit rapid adoption by potential adversaries. Skepticism exists regarding its true openness and competitiveness.

AI Applications and Tools

  • Deep Research and GitHub Integration: ChatGPT can now connect to GitHub repos for deep research, allowing it to read and search source code and PRs, generating detailed reports with citations.

  • Agent2Agent (A2A) Protocol: Google’s A2A protocol aims to be a common language for AI agent collaboration.
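
  For context, A2A agents advertise themselves to peers via a JSON "agent card"; the field names below roughly follow the published draft, and all values are hypothetical:

  ```json
  {
    "name": "weather-agent",
    "description": "Answers weather questions for other agents.",
    "url": "https://example.com/a2a",
    "capabilities": { "streaming": true },
    "skills": [
      {
        "id": "forecast",
        "name": "Forecast",
        "description": "Daily forecast lookup."
      }
    ]
  }
  ```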

  • Web Development with Qwen Chat: Qwen Chat includes a "Web Dev" tool for building webpages and applications from simple prompts.

  • LocalSite Tool: An open-source local alternative to "DeepSite" called "LocalSite" allows creating web pages and UI components using local LLMs (via Ollama, LM Studio) or cloud LLMs.

  • Vision Support in llama-server: llama.cpp’s server component now has unified vision support, processing image tokens alongside text within a single pipeline using libmtmd.
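
  In practice this means multimodal requests can go through the server's OpenAI-compatible chat endpoint; a hypothetical request might look like the following (host, port, and payload are illustrative, and the image placeholder is not a real payload):

  ```shell
  # Hypothetical request against llama-server's OpenAI-compatible endpoint;
  # with unified vision support, image content rides alongside text in a
  # single chat completion, following the OpenAI chat message schema.
  curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "messages": [{
        "role": "user",
        "content": [
          {"type": "text", "text": "Describe this image."},
          {"type": "image_url",
           "image_url": {"url": "data:image/png;base64,..."}}
        ]
      }]
    }'
  ```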

  • Unsloth AI Tooling: Users resolved tokenizer embedding mismatches and finetuned a 4B model on 11GB of VRAM using DFloat11. A synthetic data notebook collaboration with Meta was also highlighted.

  • Aider Updates: Aider now supports gemini-2.5-pro-preview-05-06 and qwen3-235b. It features a new spinner animation and a workaround for Linux users connecting to LM Studio’s API.

  • Mojo Language: Discussions around Mojo included efficient memory handling with the out argument and a move to explicit trait conformance in the next release. A static Optional type was proposed.

  • Torchtune: Community members highlighted the importance of apply_chat_template for tool use and debated the trade-offs of its optimizer-in-backward feature.

  • Perplexity API: Users discussed costs of the Deep Research API and noted image quality caps, suspecting cost-saving measures. Domain filters now support subdirectories for more granular control.

  • LM Studio API: Users find that LM Studio's API lacks clear methods for determining tool calls with model.act. The community awaits a full LM Studio Hub for presets.

  • Cohere API: Users reported payment issues and an Azure AI SDK issue where extra parameters for Cohere embedding models were disregarded.

  • NotebookLM: Praised for its new mind map feature, but criticized for not parsing handwritten notes or annotated PDFs. Reports of hallucinated answers persist. A mobile app beta is upcoming.

  • VoyageAI & MongoDB: A new notebook demonstrated combining VoyageAI’s multi-modal embeddings with MongoDB’s multi-modal indexes for image and text retrieval.

  • LLM Ad Injection Threat: Concerns were raised that ads injected into LLM training data could corrupt recommendations.

AI Safety and Alignment

  • Scientist AI: Yoshua Bengio presented "Scientist AI" as a practical and more secure alternative to current agency-driven AI development trajectories.

  • AI Control and Usability: Discussions emphasized pairing AI-control schemes with secure usability, so that safety measures remain practical for operators rather than being bypassed.

  • Prompt Disambiguation Failure: An interaction with the Hugging Face smolagents computer-agent resulted in the AI displaying inappropriate content in response to a benign prompt ("ball bouncing inside the screen"), highlighting issues with natural language understanding and content filtering.

  • AI Model Guardrails: Aggressive censorship measures in AI models can sometimes lead to unintended behaviors, such as surfacing inappropriate outputs for benign prompts.

  • Data Leakage and Prompt Injection: LLMs can sometimes output portions of their training data, including explicit content, under certain prompt injection attempts.

AI-Generated Content Trends

  • Analysis suggests a sharp increase in em dash usage in entrepreneurship-related subreddits in 2024, potentially indicating a rise in ChatGPT-generated or -assisted content, as em dashes are a linguistic hallmark of some OpenAI models.

  • Discussions highlighted the difficulty in distinguishing between fully AI-written, AI-polished, or AI-edited content based solely on stylistic markers like punctuation.

  • Some users noted that their own writing styles are being influenced by frequent use of AI tools like ChatGPT.
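
The frequency analysis described above can be sketched as follows; the sample posts are invented, and as the second point notes, a rate difference alone cannot prove AI authorship:

```python
# Minimal sketch of an em-dash frequency analysis across a corpus of
# posts: em dashes per 1,000 characters, compared between two years.
# The sample texts are invented for illustration.

def em_dash_rate(text: str) -> float:
    """Em dashes (U+2014) per 1,000 characters; 0.0 for empty text."""
    if not text:
        return 0.0
    return text.count("\u2014") / len(text) * 1000

posts_2022 = ["Just shipped my MVP. Feedback welcome!"]
posts_2024 = ["Growth isn\u2019t linear\u2014it compounds\u2014so start today."]

avg_2022 = sum(map(em_dash_rate, posts_2022)) / len(posts_2022)
avg_2024 = sum(map(em_dash_rate, posts_2024)) / len(posts_2024)

print(avg_2022, avg_2024)
```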

Robotics and Embodied AI

  • NVIDIA reportedly trained humanoid robots to move like humans using zero-shot sim-to-real transfer, condensing "10 years of training in only 2 hours" with a policy of only 1.5 million parameters.

  • An image depicted an early-stage assembly area for Tesla Optimus humanoid robots, suggesting development or pilot production rather than full-scale automated manufacturing.

  • A "balance test" associated with Figure 02 was mentioned, though the linked video was inaccessible. Community comments emphasized the importance of practical, everyday manipulation tasks as robotics benchmarks.

  • Users anticipate that store associate jobs involving online order assembly are likely to be automated by humanoid robots in the near future.

Hardware Developments

  • MI300 GPUs featured prominently on GPU MODE’s amd-fp8-mm leaderboard, with one submission achieving 122 µs.

  • Hugging Face upgraded its ZeroGPU offering for Pro accounts from A100s to 10 H200s.

  • CUDA users discussed torch.compile performance issues and CUDA memcpy errors, and debated efficient data structures, favoring HPC formats like COO.
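
  For reference, the COO (coordinate) format mentioned above stores a sparse matrix as parallel (row, column, value) lists; a dependency-free sketch:

  ```python
  # Coordinate (COO) sparse format: keep only the nonzero entries of a
  # matrix as parallel (row, col, value) lists.

  dense = [
      [0.0, 5.0, 0.0],
      [0.0, 0.0, 0.0],
      [3.0, 0.0, 1.0],
  ]

  rows, cols, vals = [], [], []
  for i, row in enumerate(dense):
      for j, v in enumerate(row):
          if v != 0.0:
              rows.append(i)
              cols.append(j)
              vals.append(v)

  # Sparse matrix-vector product touching only the stored nonzeros.
  x = [1.0, 2.0, 3.0]
  y = [0.0] * len(dense)
  for i, j, v in zip(rows, cols, vals):
      y[i] += v * x[j]

  print(rows, cols, vals, y)
  ```

  Libraries like scipy.sparse provide the same layout natively; the point of COO is that storage and compute scale with the nonzero count rather than the full matrix size.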

  • Memory optimization techniques discussed included DFloat11 (Unsloth users finetuning 4B models on 11GB of VRAM), FSDP CPU offload, and the high memory bandwidth of Intel’s Data Center GPU Max 1550.

  • Nvidia’s Thor architecture (Compute Capability 10.1) supports Linux/Windows and x86_64/ARM CPUs. RTX Pro Blackwell is believed to be SM_120.

