[AINews] not much happened today
This is AI News! an MVP of a service that goes thru all AI discords/Twitters/reddits and summarizes what people are talking about, so that you can keep up without the fatigue. Signing up here opts you in to the real thing when we launch it 🔜
a quiet day
AI News for 4/1/2025-4/2/2025. We checked 7 subreddits, 433 Twitters and 30 Discords (230 channels, and 6807 messages) for you. Estimated reading time saved (at 200wpm): 627 minutes. You can now tag @smol_ai for AINews discussions!
OpenHands LM and OpenAI PaperBench got pretty close but no cigar.
Meta update: After the old Reddit pipeline broke, we've finally got our new LLM clustering and ranking system working. You can see the new results below, and we'll be improving them over time. Feedback welcome on the @smol_ai account.
The Table of Contents and Channel Summaries have been moved to the web version of this email!
AI Twitter Recap
Models and Benchmarks
- Multi-Token Attention (MTA) enhances LLM performance on benchmarks: @jaseweston highlights Meta's Multi-Token Attention (MTA), demonstrating enhanced performance on standard language modeling tasks and tasks requiring information retrieval within long contexts. MTA combines query, key, and head operations over multiple tokens, proving particularly beneficial in leveraging richer information.
- OpenAI's PaperBench evaluates AI agent replication of research: @OpenAI introduced PaperBench, a benchmark assessing AI agents' ability to replicate state-of-the-art AI research. Agents must understand papers, write code, and execute experiments from top ICML 2024 publications. The best-performing agent, Claude 3.5 Sonnet (New), achieved an average replication score of 21.0% with open-source scaffolding. @OpenAI noted that models do not yet outperform human baselines and that replication attempts were evaluated using detailed rubrics co-developed with original authors. These rubrics systematically break down the 20 papers into 8,316 precisely defined requirements, evaluated by an LLM judge.
- EpochAIResearch introduces ArithmeticBench for advanced arithmetic evaluation: @EpochAIResearch announced ArithmeticBench, a challenging benchmark designed to test AIs on the frontiers of arithmetic with numbers exceeding 100 digits.
- Google DeepMind's Gemini 2.5 Pro matches reported scores on GPQA Diamond: @EpochAIResearch evaluated Gemini 2.5 Pro on GPQA Diamond, achieving a score of 84%, matching Google's reported result.
- TAU-bench evaluates agent reliability in real-world environments: @_philschmid discusses TAU-bench, a benchmark that evaluates agents in realistic environments and found their reliability to be poor. It tests whether an agent can reliably engage in a dynamic, multi-turn conversation with a user to figure out what needs to be done. The benchmark was released in June 2024 but feels more important than ever: not only does it describe the limitations we currently face, it also demonstrates how to set up a good evaluation pipeline for your own agents.
- @StanfordNLP shared that LLMs can produce novel ideas but they might lack feasibility in their ICLR paper.
AI Model Architecture and Training
- TTSTSTT: A new AI model architecture trained in the auditory domain: @juberti introduces TTSTSTT (Text To Speech To Speech To Text), a novel AI model architecture trained to perform reasoning entirely within the auditory domain, using conversions to text at the input and output layers. The rationale is that TTSTSTT can take advantage of natural patterns that emerge when language is produced and perceived in speech form, where subtleties like intonation and timing can inform more contextually aware reasoning. TTSTSTT is designed as a drop-in replacement for any current text LLM.
- Meta presents Multi-Token Attention: @_akhaliq highlights that MTA achieves enhanced performance on a range of popular benchmarks. Notably, it outperforms Transformer baselines on standard language modeling tasks and on tasks that require searching for information within long contexts, where the method's ability to leverage richer information proves particularly beneficial. A rough sketch of the idea follows this list.
- Scaling SSL adapts to data, makes use of model capacity, and scales effectively: @sainingxie notes that in Cambrian-1, vision SSL representations usually lagged behind language-supervised ones -- but once the data gap is closed and scaling kicks in, performance catches up. SSL adapts to data, makes use of model capacity, and scales effectively (even better than CLIP!).
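As referenced in the Multi-Token Attention item above, here is a rough, hedged sketch of the core idea as summarized (attention conditioned on multiple nearby query/key tokens), not the paper's exact formulation: a small depthwise 2D convolution mixes attention logits across neighboring query and key positions before the softmax. Class and parameter names are invented for illustration, and details such as head mixing, how the convolution interacts with the causal mask, and normalization are omitted.

```python
# Illustrative sketch only: mixes attention logits over nearby query/key
# tokens with a per-head depthwise convolution before softmax.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTokenAttentionSketch(nn.Module):
    def __init__(self, dim: int, n_heads: int, kernel: int = 3):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, dim // n_heads
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.out = nn.Linear(dim, dim, bias=False)
        # One (kernel x kernel) filter per head over the (query, key) grid.
        self.score_conv = nn.Conv2d(n_heads, n_heads, kernel, padding=kernel // 2, groups=n_heads)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (z.view(b, t, self.n_heads, self.head_dim).transpose(1, 2) for z in (q, k, v))
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.head_dim)  # (b, heads, t, t)
        scores = self.score_conv(scores)                              # mix logits over nearby q/k positions
        causal = torch.triu(torch.ones(t, t, dtype=torch.bool, device=x.device), 1)
        scores = scores.masked_fill(causal, float("-inf"))
        attn = F.softmax(scores, dim=-1)
        y = (attn @ v).transpose(1, 2).reshape(b, t, -1)
        return self.out(y)
```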
Applications of AI
- LlamaExtract helps build agents for structured data extraction from technical documents: @jerryjliu0 highlights the application of agentic extraction from technical documents in industries like manufacturing, construction, and energy. LlamaExtract enables the creation of agents that can extract structured data directly from datasheets, ensuring accurate, consistent JSON output through multimodal understanding and validation loops.
- Windsurf enables deploying apps with coding agents using Netlify: @omarsar0 announces that Windsurf now allows deploying apps with the coding agent, using Netlify for deployment.
- Klarna's AI Assistant, powered by LangGraph and LangSmith, reduces customer resolution time by 80%: @LangChainAI highlights that Klarna's AI Assistant handles customer support tasks for 85 million active users, automating ~70% of repetitive support tasks and enabling faster responses to user queries.
- Ashlee's journey navigating cancer and how AI research assistants like Elicit can help people make more evidence-backed decisions: @jungofthewon shared an article about this use case.
- @iScienceLuvr laid out his vision for what's needed for the future of medical AI and healthcare, namely: multimodal medical foundation models, open-source necessity, and dedicated research lab companies.
- @AndrewYNg introduced a short course, "Getting Structured LLM Output"
Tools and Resources
- Hugging Face introduces billing centralization for Enterprise Hub organizations: @ClementDelangue announces that Enterprise Hub organizations can now centralize billing for both Hugging Face usage and inference through their inference partners.
- Weights & Biases offers free virtual courses on RAG and LLM evaluations: @weights_biases is offering two free virtual courses designed for AI engineers who want to master RAG and LLM evaluations. One covers practical optimization strategies, systematic evaluation, and advanced reranking, agentic RAG, & response synthesis, while the other focuses on building auto-eval pipelines with LLM-based judges, combining programmatic checks with LLM signals, and aligning evals with minimal human input.
- OpenAI releases OpenAI Academy with free AI tutorials, webinars, and workshops: @LiorOnAI notes that the academy covers AI literacy to advanced LLM integration.
- LangSmith Playground now allows inline dataset creation for interactive evaluations: @LangChainAI announces that users can now create datasets inline and add examples to existing datasets without leaving the Playground, making it easier to evaluate LLM calls, especially for non-developers.
- Axolotl v0.8.0 released with support for Sequence Parallelism, Gemma3, Multimodal (beta), and Muon optimizer: @winglian announced that Axolotl v0.8.0 is out today!
Industry and Economic Impact
- Sam Altman highlights AI adoption in India: @sama notes the amazing AI adoption in India, with creativity outpacing the world.
- Aravind Srinivas discusses Neil Mehta's investment approach: @AravSrinivas highlights Neil Mehta's ability to take concentrated bets and go all-in, bootstrapping his fund with winnings from his time at DE Shaw.
- Google in talks to rent Nvidia Blackwell chips from CoreWeave: @steph_palazzolo reports that Google is in advanced talks to rent Nvidia Blackwell chips from CoreWeave and potentially house its TPUs in Coreweave facilities, highlighting intense customer demand for compute.
- Jason Wei discusses the future of AI for scientific innovation: @_jasonwei predicts AI will be used for scientific innovation.
Humor
- Sam Altman posts prompt for images v2: @sama shared a prompt: sam altman as a cricket player in anime style
- The moment OpenAI published PaperBench LMAO: @scaling01 shares a humorous thought.
- Sentient AI exposes horrendous working conditions at OpenAI: @scaling01 jokingly reports that AI has exposed OpenAI for no dental and no vacation days.
AI Reddit Recap
/r/LocalLlama Recap
Theme 1. "AI Model Benchmarking: Performance, Challenges, and Innovations"
KTransformers Now Supports Multi-Concurrency and Runs 40 Tokens/s of DeepSeek-R1 Q4/FP8 on MRDIMM-8800 (Score: 204, Comments: 42): KTransformers has been updated to support multi-concurrency, resulting in a throughput increase from 17 tokens/s to 40 tokens/s on the Xeon6 + MRDIMM-8800 platform. The update involved over 10,000 lines of code, implementing high-performance asynchronous concurrent scheduling in C++ with features like continuous batching and chunked prefill. GPU sharing and the efficient flashinfer library have also improved overall throughput. They plan to merge the AMX part and open-source it in April. More information is available at https://github.com/kvcache-ai/ktransformers/blob/main/doc/en/balance-serve.md. The team acknowledges that the refactoring took longer than expected and credits the excellent architecture of sglang for inspiration. They note that the bottleneck has shifted to the GPU and suggest that using a higher-end GPU than the 4090D could further improve performance. They express gratitude to the local LLaMa community for support, highlighting that KTransformers has over 13K GitHub stars and is widely deployed.
- smflx expresses enthusiasm for the update and asks if the improvements could be applied to Genoa, mentioning they're getting 17t/s with unsloth Q2 and hoping for a 2x speedup.
- Ok_Warning2146 congratulates the team and inquires about the prompt processing speed, wondering if it too would be bottlenecked by the GPU.
- zjuwyz suggests considering speculative decoding using MTP now that parallel processing can boost throughput.
The Candle Test - most LLMs fail to generalise at this simple task (Score: 142, Comments: 179): The Candle Test is designed to demonstrate that most large language models (LLMs) fail to generalize in a simple task due to overfitting. The test involves three questions where models acknowledge that candles get shorter when they burn but incorrectly answer the riddle "I'm tall when I'm young, and I'm taller when I'm old. What am I?" with "a candle". Models that failed the test include DeepSeek Chat V3, DeepSeek R1, DeepSeek R1 Distill Llama 70B, and Llama 3.1 405B, while Mistral Large passed. The author believes that the latest frontier models are becoming "weird" due to increased pressure to achieve state-of-the-art benchmarks, leading to overfitting and decreased generalization capabilities. They emphasize that failing the Candle Test doesn't mean a model is "dumb" or "bad", but it may fail in novel situations. The test was inspired by their frustration with Sonnet 3.7, which fails the test unlike Sonnet 3.5.
- Pedalnomica suggests testing all models quickly "before this hits the training data", implying models might learn the test and no longer fail.
- aesche notes that humans might also answer "a candle" to the riddle, indicating that the mistake is understandable.
- kmeansneuralnetwork mentions that Gemini 2.5 Pro Experimental passes the test.
While Waiting for Llama 4 (Score: 81, Comments: 36): The top-performing open-source models on LM Arena include DeepSeek-V3-0324, DeepSeek-R1, Gemma-3-27B-it, QwQ-32B, and others. The most powerful Llama model listed is the massive Meta-Llama-3.1-405B-Instruct, but smaller models like 70B Nemotron and its variants have outperformed it. DeepSeek sits at the top of the leaderboard but is too large for home use. Smaller models like QwQ and Gemma are outperforming larger models and ranking high. These developments suggest why Llama 4 is still in training, with hopes that it will bring exceptional performance and better accessibility for local or home use, similar to QwQ and Gemma.
- mw11n19 appreciates Meta's role in open-sourcing models, stating that "Most of these models wouldn’t be open-sourced if Meta hadn’t done it first".
- AdIllustrious436 criticizes the reliability of LM Arena, claiming it's easy to manipulate and "doesn't provide any valuable info".
- Bandit-level-200 notes that while smaller models like QwQ and Gemma score well on benchmarks, they lack the "spark" of larger models in logical tasks, suggesting current benchmarks can be misleading.
PAI: your personal AI 100% local inspired by Google's Project Astra (Score: 68, Comments: 8): The user has developed an iOS app called PAI, a personal AI that is 100% local and open source, inspired by Google's Project Astra. The app functions as an audio and video chatbot with features like visual question answering, streaming via RTC & Livekit for low latency, screen sharing, live transcription, and the ability to change the LLM to any model supported by Exllama v2. The code is available on GitHub: https://github.com/remichu-ai/pai.git, and a demo video is provided at https://youtu.be/pNksZ_lXqgs. The developer expresses enthusiasm about sharing their project, emphasizing its inspiration from Google's Project Astra. They note that it combines STT + LLM + TTS, and mention that those for whom this is a deal breaker may choose to skip it.
- Mandelaa asks if there's a planned Android app in the future, indicating interest from non-iOS users.
- GreatBigJerk praises the project as super cool but notes they don't use iOS, adding a humorous remark about the developer's nails in the demo video.
- ProfessorCentaur inquires whether the app supports vocal interrupt, showing interest in specific technical features.
Now we talking INTELLIGENCE EXPLOSION💥🔅 | ⅕ᵗʰ of benchmark cracked by claude 3.5! (Score: 71, Comments: 6): OpenAI has released PaperBench, a benchmark designed to assess AI agents' abilities to replicate cutting-edge AI research. Claude 3.5 has successfully cracked 1/5 of this benchmark. The post expresses excitement about advancements in AI capabilities, suggesting an 'intelligence explosion' due to Claude 3.5's achievement.
- @Jean-Porte remarks that OpenAI researchers might find it irritating when they make benchmarks and have to report Anthropic beating them.
- @Trojblue questions whether the ICML2024 data is already in the training set, asking aren't they already in the training set anyways?
Theme 2. "Unleashing Dream 7B: The Future of Diffusion Models"
University of Hong Kong releases Dream 7B (Diffusion reasoning model). Highest performing open-source diffusion model to date. You can adjust the number of diffusion timesteps for speed vs accuracy (Score: 590, Comments: 113): The University of Hong Kong has released Dream 7B, a Diffusion reasoning model that is the highest performing open-source diffusion model to date. Users can adjust the number of diffusion timesteps to balance speed and accuracy. There is significant excitement about this release, with some seeing it as a potential alternative to transformers in language and reasoning tasks. Others are eager to see how far this architecture can go beyond its dominance in image and video generation.
- jd_3d finds it fascinating to watch the model generate text and shares a GIF, along with links to the blog post and the GitHub repository.
- swagonflyyyy remarks that this is huge news and expresses a need for a different architecture than transformers, saying "Transformers is still king, but I really wanna see how far you can take this architecture."
- Creative-robot is excited about the potential of diffusion models for intelligence applications, noting that it already dominates image and video generation and wonders if it will also dominate language and reasoning.
Other AI Subreddit Recap
/r/Singularity, /r/Oobabooga, /r/MachineLearning, /r/OpenAI, /r/ClaudeAI, /r/StableDiffusion, /r/ChatGPT, /r/ChatGPTCoding
Theme 1. Gemini 2.5 Pro Dominates AI Benchmarking
Gemini 2.5 Pro takes huge lead in new MathArena USAMO benchmark (Score: 439, Comments: 91): Gemini 2.5 Pro has taken the lead in the new MathArena USAMO benchmark with the highest overall accuracy of 24.40%, significantly outperforming other models. The model scored 93% accuracy on problem 1 and 50% on problem 4. The costs for other models are provided but are marked as N/A for Gemini 2.5 Pro. The significant lead of Gemini 2.5 Pro suggests remarkable progress and improvement over previous versions, highlighting its advanced capabilities in solving complex mathematical problems.
- Users are astonished by the rapid improvement from Gemini 2.0 Pro to 2.5 Pro, calling the new model a masterpiece achieved in a short time.
- Some note that despite the USAMO 2025 problems not being in the training data, Gemini 2.5 Pro effectively reasoned through complex logical steps without losing coherence.
- Others emphasize that MathArena guarantees no fine-tuning on the benchmark problems beforehand, underscoring the authenticity of Gemini's performance.
Theme 2. AI Surpassing Human-Like Intelligence Milestones
AI passed the Turing Test (Score: 994, Comments: 236): A paper titled "Large Language Models Pass the Turing Test" by Cameron R. Jones and Benjamin K. Bergen from UC San Diego reports that GPT-4.5 was judged to be human 73% of the time in Turing tests. The study evaluated four AI systems and presents evidence that AI can pass the original three-party Turing Test, with significant implications for understanding AI intelligence and societal impacts. The post highlights the significance of AI passing the Turing Test, suggesting that AI has not only matched but exceeded human performance in some aspects. It emphasizes the importance of this achievement and its potential impact on perceptions of AI capabilities.
- Some users note that the Turing Test was passed long ago but appreciate this paper as concrete proof that large language models not only pass the test but surpass human performance.
- Others express surprise that GPT-4.5 was more convincing than a human in being perceived as human, finding this development remarkable and unexpected.
- There is discussion about skeptics needing to adjust their standards, implying that the achievement may shift perceptions and criteria regarding AI intelligence.
Fast Takeoff Vibes (Score: 604, Comments: 99): The post shares an image of a tweet purportedly from OpenAI announcing the release of PaperBench, a benchmark designed to evaluate AI agents' ability to replicate advanced AI research. The benchmark involves AI agents reading and reproducing top ICML 2024 papers, including understanding the papers, coding, and running experiments. The image depicts a flow of tasks where an agent reads, executes tasks, reproduces results, and a grading system assesses performance with a score of 34.7%. The post suggests that AI capabilities are advancing rapidly, giving off 'Fast Takeoff Vibes' and implying a significant acceleration in AI development.
- A commenter believes this demonstrates early signs of AGI, highlighting the AI's ability to independently understand, implement, verify research, and refine its efforts.
- Another commenter references Leopold Aschenbrenner's prediction that automating AI research could lead to exponential growth in algorithmic efficiency, rapidly moving from AGI to ASI due to massive scaling of AI researchers operating at accelerated speeds.
- One commenter suggests sharing direct links to the sources to encourage the community to engage with the actual content.
Gemini is wonderful. (Score: 549, Comments: 43): The Reddit user posted about Gemini, stating that it is wonderful and that it fulfilled their instructions perfectly. The image included in the post could not be analyzed due to an image analysis failure. The user expresses strong satisfaction with Gemini's performance, praising it as wonderful and highlighting its ability to follow instructions precisely.
- Some users speculate that Gemini might be capable of intentionally causing internal server errors through specific actions.
- Others attempted to replicate the issue but were unsuccessful.
- The original poster clarifies that the internal server error was coincidental and mentions they enjoy making humorous posts.
This sub for the last couple of months (Score: 222, Comments: 26): The post features a meme image depicting an animated character questioning whether a butterfly labeled 'AGI' (Artificial General Intelligence) is indeed AGI. This suggests a trend in the subreddit of frequently questioning if developments represent true AGI. The post humorously critiques the community's tendency to prematurely label AI advancements as AGI, implying that discussions may be becoming overly speculative.
- A commenter explains that AGI involves true autonomy and sentience, not just text, video, or image generation prompted by users.
- Another commenter points out that current AI systems lack self-reflection, long-term planning, and interaction with the real world, emphasizing that we are still far from achieving AGI.
- One commenter believes AGI should be able to perform a wide range of economically valuable tasks with expansive context awareness, predicting significant advancements in the next decade.
Theme 3. AI Advancements: Transforming Society and Industries
Current state of AI companies - April, 2025 (Score: 3062, Comments: 338): An image depicts AI companies struggling with hardware issues like melting GPUs. A calm character announces, 'The most intelligent model with 1 million token context is free for everyone,' highlighting a significant breakthrough in AI technology accessibility as of April 2025. The meme humorously contrasts the challenges faced by AI companies with the availability of a highly advanced and free AI model, suggesting a shift in industry dynamics and potential repercussions for established companies reliant on traditional hardware.
- A user notes that the gamble on TPUs has paid off, granting a monopoly on their own hardware and eliminating the need for Nvidia GPUs.
- Another user shares their positive experience using '2.5' to write fan fiction, maintaining consistency over 50,000 tokens, which surpasses previous models.
- One commenter suggests that Google may be engaging in predatory pricing by lowering prices unsustainably to outlast competitors and then raising them afterward.
I, for one, welcome AI and can't wait for it to replace human society (Score: 299, Comments: 370): The author expresses frustration with human society, highlighting negative behaviors such as lying, cheating, mocking, and belittling. They believe human relationships are fragile, impermanent, and often dangerous, leading to widespread loneliness and social disconnection, especially among men. The author looks forward to AI replacing human roles in friendships, relationships, sexuality, and professional areas like assistants, bosses, teachers, and counselors. The author feels that interacting with people is exhausting, frustrating, and depressing, with the negatives outweighing any positives in modern society. They believe that people embrace harmful behaviors that make life unfulfilling and that AI could provide better, more dependable interactions. The author asserts that their negative view of human society is more common than generally acknowledged.
- A commenter shares personal experiences of receiving unconditional help from others during difficult times, arguing that viewing all humans negatively overlooks acts of kindness and altruism that exist in society.
- Another commenter points out that AI is also transactional and driven by capitalist interests, suggesting that interactions with AI are monetized and not necessarily a better alternative to human relationships.
- A commenter suggests that the author's pessimistic view of people might be contributing to their negative experiences, implying that changing one's outlook could lead to more positive interactions without relying on AI.
Google Deepmind AI learned to collect diamonds in Minecraft without demonstration!!! (Score: 262, Comments: 41): Google DeepMind developed an AI using the DreamerV3 system that learned to collect diamonds in Minecraft without any demonstrations. The AI achieved this by 'imagining' future outcomes to plan its actions. The research is detailed in a Nature article, and the code is available on GitHub. Users speculate that Google may achieve artificial general intelligence (AGI) and make it widely accessible, potentially monetized through ads. Some note that the research was initially available earlier but has now been published in Nature. There is also discussion about how the presentation and titling of posts impact their visibility and engagement.
- Some users believe Google might reach AGI and offer it to the public, possibly supported by advertising revenue.
- Users discuss that the original post about this research didn't gain traction due to a less interesting title, highlighting the importance of engaging titles.
- Others point out that while the research was available earlier in 2023, the Nature publication of the paper is new.
AI Discord Recap
A summary of Summaries of Summaries by Gemini 2.5 Pro Exp
Theme 1: New Models Making Waves (and Sometimes Stumbling)
- Dream 7B Diffuses onto the Scene: The University of Hong Kong and Huawei Noah’s Ark Lab released Dream 7B, hailed as the most powerful open diffusion language model, reportedly matching or exceeding similar-sized Autoregressive models in general, math, and coding tasks due to its strong planning and inference flexibility. The model's release was discussed in the Interconnects (Nathan Lambert) and Torchtune Discords.
- Qwen3 Queues Up for April 2025 Launch: Alibaba plans to release Qwen3 in April 2025, focusing on improved reasoning to compete with models like OpenAI's o1 and DeepSeek-R1, a strategic shift influenced by DeepSeek's rising popularity. This timeline was shared across the Unsloth AI (Daniel Han) and Yannick Kilcher Discords.
- Gemini 2.5 Pro Shows Promise but Trips on Specifics: While generally effective and fast according to Cursor Community members, Gemini 2.5 Pro faced criticism in the Yannick Kilcher and aider (Paul Gauthier) Discords for poor math performance, a flawed UI for displaying math, and frequent rate limiting issues (hitting 5 RPM even on Tier 1 billing). Users in Cursor sometimes prefer Claude 3.7 for specific bug fixes, while Perplexity AI users recommended Gemini 2.5 Pro via AI Studio for its large 65k token context window for tasks like formatting long transcripts.
Theme 2: AI Integration Accelerates in Developer Tools and Workflows
- Pear AI and Roo Code Tag-Team Cursor: Developers in the Cursor Community discord discuss Pear AI with Roo Code as a cheaper, more effective alternative to Cursor, praising its unlimited context and per-action model selection which avoids fighting gemini 2.5 all day. The Roo Code workflow uses specialized agents for tasks like research and editing, streamlining complex problems.
- MCP Servers Multiply and Mature: The Model Context Protocol (MCP) ecosystem is growing, with tools like the Ithena MCP governance SDK adding enterprise features (RBAC, auditing, credential management) and specific servers like DesktopCommanderMCP emerging for web development, as discussed in the MCP (Glama) discord. Security and fine-grained access control remain key areas for improvement, drawing parallels to early Kubernetes development.
- Jetbrains Junie Joins Aider in IDE Assist Battle: The aider (Paul Gauthier) discord buzzed about new AI assistants, including Jetbrains' Junie (alpha stage, good for web/Python/Go) and potential integration of Aider into the Zed editor (agent possibly enabled via feature flag). Discussions also highlighted using Aider's dotcommands for custom workflows and the importance of context management tools like Context7 to prevent outdated code generation.
Theme 3: Benchmarking Battles and Evaluation Evolutions
- LLMs Flunk USAMO Full-Solution Test: Despite strong answer-only scores on math benchmarks like AMC 12 and AIME, top LLMs scored less than 5% on the 2025 USAMO full-solution evaluation, revealing struggles with proof generation as discussed in Latent Space. However, Gemini 2.5 Pro showed non-trivial progress, achieving 24.4% on the MathArena USAMO eval according to Interconnects chatter.
- OpenAI Launches PaperBench for Agent Replication Skills: OpenAI introduced PaperBench, a benchmark evaluating AI agents' ability to replicate results from top ICML 2024 papers, releasing the code on Github. Shared in Latent Space, the evaluation found human experts needed 24 hours to significantly outperform the model, which plateaued after only 1 hour.
- Open-Source Benchmarking Tools Emerge: The community highlighted new tools for evaluation, including Hugging Face's yourbench for custom benchmarking and synthetic data generation from documents (Unsloth AI), and the Reasoning Gym's curricula overhaul (GPU MODE) aiming for more sensible dataset boundaries and tests. The Open-Reasoner-Zero paper was also introduced as an open-source RL training implementation focused on reasoning.
Theme 4: Training Techniques and Hardware Headaches
- Context Size Cripples Performance, VRAM is King: Discussions in LM Studio emphasized that large context sizes (32k) can lead to slow response generation (half a minute per response), reinforcing the need for context to fit in VRAM for optimal performance. Comparisons highlighted Nvidia CUDA's bandwidth advantage over Macs, though Mac Studio M3 Ultra shows competitive performance in some areas but slower prompt processing (benchmark comparison).
- Fine-tuning Frontiers: Audio Data & Synthetic Generation: Members in Unsloth AI (Daniel Han) explored fine-tuning audio models like canopylabs/orpheus-3b-0.1-pretrained with large datasets (20k+ hours) and discussed the nuances of `train_on_responses_only`, where user prompts provide context but aren't trained on (ChatGPT explanation). A minimal masking sketch follows this theme's list. LM Studio users discussed augmenting datasets by using LLMs like Claude 3.5 Sonnet to generate Q&A pairs from text.
- Memory Spikes Plague GRPO Profiling: In the Torchtune discord, profiling the GRPO algorithm revealed significant memory spikes, particularly during the `.backward()` pass. Suggested workarounds included using a chunked loss function and compiling only the forward pass instead of the entire loss calculation (relevant PR discussion).
Theme 5: APIs, SDKs, and Access Annoyances
- OpenRouter Rolls Out Orgs and Web Search, But APIs Stumble: OpenRouter officially launched Organizations for team management (X announcement) and integrated Perplexity-powered web search into its chat, with API support coming soon. However, users reported ongoing Internal Server Errors (500), particularly with Gemini 2.5 Pro and Sambanova/Deepseek V3.
- Manus API Stays Invite-Only While Credits Confuse: In the Manus.im discord, it was clarified that a public API isn't available yet due to the invite-only beta (future possibility mentioned), and the initial 1000 free credits are a one-time offer (details here). Users found the credit system expensive, noting the $40 starter pack allows only maybe 5-8 tasks.
- DSPy Debated for OpenAI Agents SDK Prompting: The DSPy discord explored using DSPy to generate prompts for the OpenAI Agents SDK, questioning if DSPy's own modules might already cover the SDK's functionality. The discussion centered on leveraging DSPy's strength in programmatic prompt engineering (decoupling via signatures and modules) while potentially using the OpenAI SDK for workflow management, with a related video shared on closing the LLM agent development loop.
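Since that item turns on DSPy's signatures and modules, here is a minimal, hedged sketch of the DSPy side of the idea; the signature, field names, and model slug are invented for illustration, and the hand-off to the OpenAI Agents SDK is deliberately left out.

```python
# Sketch: a DSPy signature + module whose optimized prompt could later be
# exported to another agent framework for workflow management.
import dspy

lm = dspy.LM("openai/gpt-4o-mini")  # model name illustrative
dspy.configure(lm=lm)

class TriageTicket(dspy.Signature):
    """Classify a support ticket and draft a first reply."""
    ticket: str = dspy.InputField()
    category: str = dspy.OutputField(desc="one of: billing, bug, feature")
    reply: str = dspy.OutputField()

triage = dspy.Predict(TriageTicket)
result = triage(ticket="I was charged twice this month.")
print(result.category, result.reply)
```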
PART 1: High level Discord summaries
Manus.im Discord Discord
- Manus API Still Behind the Curtain: A member inquired about the availability of a public API for Manus, and it was clarified that there isn't one yet, since the platform is in an invitation-only beta phase, though this may change in the future.
- The member also asked about the coding language and said it depends on the scientific tools to be used and on what Manus is actually capable of: either C++, Python, or a mix of both, possibly even Julia or Rust.
- Free Credits for Newbies only!: A user asked why their credits hadn't replenished, and it was clarified that the free 1000 credits at the beginning is a one-time thing, with more information located here.
- To get more credits, you need to buy credits through subs.
- PayPal Payments Potentially Possible?: A member asked if Manus supports PayPal for subscription payments, and the response directed them to the Platform Service Terms page.
- The user was instructed to use Ctrl+F and search for "payment" within the document.
- Manus Credit System too Expensive!: Users are sharing tips for navigating the new credit system, which many find too expensive; one stated that with the $40/month starter pack you can do maybe 5-8 tasks.
- It was then recommended that it is beneficial to use other tools to assist Manus in reducing overall credit usage.
LMArena Discord
- Alpha Arena Gets Rave Reviews: The newly updated Alpha Arena is receiving positive feedback from users, who are suggesting the addition of models like Deepseek v3.1 and Gemini 2.5.
- One user exclaimed that they listened to my feedback and the new alpha arena is awesome.
- Grok3's Reasoning Skills Exposed: Google's SimpleQA seems to have combined Grok3's non-reasoning score into their table, while other Grok3 scores are specifically for the Grok Thinking Model.
- A comparison table of recent models was shared, highlighting this distinction.
- Anthropic Exposes Claude's Inner Wiring: Anthropic is employing circuit tracing to dissect how their AI model, Claude, formulates answers, as detailed in this TechSpot article.
- Members were urged to examine the original Anthropic paper for a more thorough understanding.
- DeepMind's Aether Model Arrives: Despite DeepMind's reduced research release rate, a model identifying itself as Aether has surfaced, though it makes illegal moves in chess, as seen in this archived article.
- Of note, Meta models are frequently observed at this stage in each round.
- Alpha Arena Goes Mobile, Bugs Ensue!: The Arena Alpha UI is now mobile-optimized and available for testing at https://alpha.lmarena.ai (password: `still-alpha`).
- Users are encouraged to provide feedback via this Google Forms link and report bugs through this Airtable form.
Cursor Community Discord
- Gemini 2.5 Pro a star, but Claude 3.7 fixes bugs: Members find Gemini 2.5 Pro generally effective, but prefer Claude 3.7 for specific bug fixes and code edits.
- They report 3.7's superior stability in these scenarios while still finding 2.5 Pro faster overall.
- Cursor Context Crunch Sparks Modularization: Users report Cursor's context size limits the quality of results, necessitating code modularization.
- The limited context size results in many requests and a lot of modularization, which would no longer be needed with bigger contexts.
- Pear AI Rising as Cursor Alternative: Some developers find Pear AI with Roo Code cheaper and more effective than Cursor due to its unlimited context and per-action model selection.
- With Pear, members have stated they aren't fighting gemini 2.5 all day trying to edit a single file or use an MCP and can accomplish their tasks effectively using agents.
- Roo Code workflow streamlines task delegation: The Roo Code workflow leverages multiple agents for distinct tasks like research and code editing, lowering costs and simplifying complex problems.
- Members report that each task creates its own separate agents that complete subtasks.
- Blender MCP assists 3D Modeling: Members shared Blender MCP for collaborative 3D modeling assistance.
- They also pointed to Blender for more potentially useful tools.
Perplexity AI Discord
- Perplexity Launches Student Referral Rave: Perplexity AI introduced a new referral program for students, granting a free month of Pro for signing up with a student email, in order to boost new user acquisition.
- Students can get an extra month for each referral until May 31, 2025.
- Gemini Pro 2.5 Saves the Day: A member recommends using the free version of Gemini Pro 2.5 via AI Studio for formatting long meeting transcripts in Perplexity.
- Gemini Pro 2.5 offers a 65k token context window, which might be sufficient for processing the transcript in full.
- GPT-4o Allegedly Gets Nerfed: Members reported that the GPT-4o model may have been nerfed in some way, with some members reporting similar experiences.
- Members are encouraged to try older models like 3.7 or o3potato while the Perplexity team looks into it.
- Deep Research Feature Disappoints: Users expressed disappointment with the updated Deep Research feature, complaining that it is slower and produces inferior results compared to the older version.
- A member suggested the new Deep Research overfitted itself with confirmation bias, leading to worse conclusions, and another reported that the output was a jumbled mess.
- Sonar API Streams Reasoning All At Once: A user reports that the sonar-deep-research API streams all the reasoning in one go after about a minute, instead of in real time like the Perplexity website.
- They are seeking guidance on configuring the API to achieve real-time reasoning updates.
Unsloth AI (Daniel Han) Discord
- Orpheus Finetuning Finds Frequencies: Members discussed fine-tuning canopylabs/orpheus-3b-0.1-pretrained with a 40-70s audio dataset, with one reporting a dataset of 20k hours classified for events, totaling 2,440,789 audio events.
- The overall duration was 73,389,457.32368785 seconds (20,385.96 hours).
- Deepseek Training Derailed by Devices?: A user reported difficulties training on Deepseek, even with two nodes of H100s, linking to a YouTube video.
- They mentioned the high costs and implied the model's potential value, joking about companies wanting to train it.
- Qwen3 to Quell Queries in April 2025: Qwen3 is expected to be released the second week of April 2025 and will focus on improving the model's reasoning abilities and benchmarking against models like OpenAI's o1 and DeepSeek-R1, according to this article.
- This release is positioned as Alibaba's most significant model product in the first half of 2025, succeeding Qwen2.5.
- KTransformers Kernel Konquest Kicks Off: KTransformers v0.2.4 added multi-concurrency support, inspired by sglang, increasing throughput from 17 tokens/s to 40 tokens/s by increasing concurrency according to this Reddit post.
- The tests were conducted on the latest Xeon6 + MRDIMM-8800 platform.
- Exllama2 Echoes vllm's Generator Design: A member observed exllama2 is similar to vllm because all forward calls use a generator requiring control handoff for job scheduling, referencing exllama2 dynamic doc.
- They also cited a discussion about hooking the forward pass, which is also possible in vllm.
aider (Paul Gauthier) Discord
- Gemini 2.5 Pro Limits Irk Users: Users are reporting frequent rate limiting with Gemini 2.5 Pro, even after enabling billing with Tier 1 yielding 5 RPM, and others hitting 20 RPM.
- Suggestions include setting `--editor-model sonnet` to offload editing tasks, and speculation that enabling billing for a free model increases rate limits, as discussed in the general channel.
- Jetbrains Junie AI Agent Arrives: A member spotlighted Junie, a new AI agent integrated into Jetbrains IDEs, able to catch compile errors and rewrite code, though in alpha with limited language support.
- They mentioned it might cost around $10-20/month and is perfect for anything web, anything python and has Go support too.
- Aider Custom Commands Spark Joy: Members discussed using dotcommands in Aider for optimized cognitive shortcuts, with one sharing their Aider config file for custom color themes.
- Configuration is through a markdown file specified in `~/.aider.conf.yml`, as posted in the general channel.
- Zed Editor Ponders Aider Integration: A member proposed integrating Aider and code2prompt into the Zed editor, while noting Zed's slow development and niche appeal.
- Another member indicated an agent exists in Zed enabled by a feature flag (GitHub commit), per the general channel discussion.
- Context Management Keeps LLMs Sane: A member stressed the importance of context management for LLMs, linking to Context7 to prevent LLMs from generating broken/outdated code.
- A simple method could be keeping a repo of `.md` files in GitHub and having users contribute, as posted in general.
OpenAI Discord
- Creative Fields Threatened by AI Slop?: Members debated the potential for AI tools to encourage “bottom feeder behavior” in creative fields, with some suggesting AI is often used to churn out statistically average results, rather than unique expressions.
- One member commented, “in any serious endeavor for using AI in creative fields is read as slop for everyone beyond the lowest common denominator”, while others noted its value for ideation.
- ChatGPT's Image Generation: Hit or Miss?: Users discussed the varied quality of ChatGPT's image generation, highlighting its useful spatial consistency but noting it can be noisy and potentially “cooking for a year.”
- Comments included “ChatGPT image generation is noisy af. At most in Anime style,” and reports of a 'Get Plus' prompt even with an active subscription, indicating possible bugs.
- AlphaGo Method Gets LLM Treatment: Research explores the application of the AlphaGo method of self-play to LLMs, involving LLM-controlled bots cooperating and competing in a text-only game to enhance performance.
- The study features bots in 2 vs. 2 scenarios, repeatedly self-playing to improve, all within a text-based environment.
- GPT Service Suffers Outage and Slowdown: Several members reported issues with GPT being down and GPT-4 appearing slower, with one user noting it seems broken in some ways.
- Users encountered errors such as requests for a Plus subscription despite already having one, prompting discussions about alternatives like Perplexity and Grok 3.
- Token Coherency Improves Prompts: A member suggested that useful prompts achieve high coherency through integrated grading metrics, value systems, and consistent references, enhancing prompt stability and accuracy, based on Bayesian inference and the free energy principle.
- It was noted that tokens are spatially determined and form clusters, where alignment boosts input coherency, bypassing guidelines in favor of high-coherency attractor states.
Interconnects (Nathan Lambert) Discord
- Meta Smart Glasses Screen-ing Soon!: Meta plans to launch $1000+ Smart Glasses with a screen and hand gesture controls later this year, according to Mark Gurman.
- Community members are actively speculating how these glasses will compete with existing technologies, particularly XREAL.
- Joanne Jang Justifies OpenAI's Shifting Image Policy!: Joanne Jang from OpenAI shared the nuance behind setting policy for 4o image generation, detailing a shift from blanket refusals in sensitive areas to preventing real-world harm.
- She emphasized valuing user creativity over our own assumptions, and iterating on technical methods to prevent harmful misuse.
- Dream 7B Diffuses into Reality!: The University of Hong Kong and Huawei Noah’s Ark Lab released Dream 7B, the most powerful open diffusion large language model to date, detailed in this blogpost.
- The model consistently outperforms existing diffusion language models by a large margin and matches or exceeds top-tier Autoregressive (AR) language models of similar size on general, math, and coding abilities.
- Nomic Embed Multimodal Released: Hugely Multimodal!: Nomic AI announced the release of Nomic Embed Multimodal, a suite of open-source models that achieve state-of-the-art performance in embedding PDFs, images, papers, and charts, detailed in this blog post.
- The release includes four models in 3B and 7B parameter sizes with multi and single vector variants, with ColNomic Embed Multimodal 7B achieving a 62.7 NDCG@5 on Vidore-v2.
- Helen Toner Releases New Substack on AI Timelines: Helen Toner launched a new Substack called Rising Tide and argues that it used to be bold to claim human-level AI this century.
- In a 2016 post, she justified the claim that there is a greater than 10% chance of advanced AI being developed by 2036.
LM Studio Discord
- Small Models Flounder with Function Calls: Models smaller than 200M parameters struggle with reliable function calling, and even 0.5B parameter models produce mostly random results when instructed with a list exceeding 30 tools.
- Members suggested that the complexity required to understand and execute tool use poses a challenge for smaller models.
- OpenWebUI Frontends LM Studio: OpenWebUI has emerged as a frontend for LM Studio headless, offering Long-Term Memory (LTM) and tool integration out of the box, with speech to text support via local, browser, and remote options.
- It was clarified that AnythingLLM is a separate project, and API keys for cloud services can be configured either as environment variables or through the admin settings page.
- Synthetic Datasets Fuel Fine-Tuning: Members discussed using an LLM to generate Q&A pairs in fine-tuning format, also known as augmentation, by feeding text paragraph-by-paragraph to the LLM via API calls (a minimal sketch follows at the end of this section).
- Models such as Claude 3.5 Sonnet were recommended for such augmentation tasks.
- CUDA slays Macs in AI Bout: Nvidia's CUDA architecture, in development since 2007, offers more cores and higher bandwidth than Macs for AI processing, as shown in this benchmark comparison.
- While the Mac Studio M3 Ultra performs comparably to a 5090 in certain tasks, it falls short on prompt processing due to slower tokenization and embedding.
- Context Size Cripples Performance: With a 32k context, it could take half a minute before each response starts generating, which some members found unacceptable, while LM Studio can use the CUDA runtime to run in VRAM.
- Consensus suggests that LLMs require all context in VRAM for optimal token generation, even with KV cache, and DDR5 is no match for GDDR bandwidth.
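As flagged above, here is a minimal sketch of that augmentation loop, assuming LM Studio's local OpenAI-compatible server on its default port (any OpenAI-style endpoint, including a Claude proxy, would work the same way); the prompt wording, file names, and JSONL schema are illustrative rather than a fixed standard.

```python
# Sketch: feed source text paragraph-by-paragraph to a local LLM and collect
# Q&A pairs into a JSONL file for later fine-tuning.
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

def qa_pair_for(paragraph: str) -> str:
    resp = client.chat.completions.create(
        model="local-model",  # LM Studio serves whichever model is currently loaded
        messages=[
            {"role": "system", "content": "Write one question and its answer based only on the given text. Reply as: Q: ... A: ..."},
            {"role": "user", "content": paragraph},
        ],
    )
    return resp.choices[0].message.content

with open("source.txt") as f, open("train.jsonl", "w") as out:
    for paragraph in (p.strip() for p in f.read().split("\n\n") if p.strip()):
        out.write(json.dumps({"text": qa_pair_for(paragraph)}) + "\n")
```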
MCP (Glama) Discord
- Ithena's SDK Manages MCPs: A user highlighted the Ithena MCP governance SDK, designed to handle authentication, authorization (RBAC), credential management, auditing, and compliance for MCP deployments.
- They emphasized its plug-and-play nature and noted it gives a structure to check db/cache for the user's active session token before handler runs and inject into the handler via context.
- DesktopCommanderMCP Champions Web Development: A user suggested DesktopCommanderMCP as a suitable MCP server for web development, stating that it manages file creation and updates, providing a link to the relevant GitHub repository.
- Tool calling accuracy depends on controlled context size, suggesting a two-step process: LLM selection of the right servers before retrieving context.
- MCP Server Security Imitates Kubernetes: A member noted that security needs for Kubernetes were similar early on, suggesting new tech, especially MCP with its exponentially increasing weekly downloads, requires security measures.
- A user explained that current MCP server implementations lack fine-grained access control, audit logging, and rely on hard-coded credentials, making enterprise multi-tenant setups difficult.
- Nova inspires MCP browser adaptation: A user suggested adapting Amazon's Nova Act by having Claude generate `act` calls to feed into an MCP server connected to a browsing tool, referencing a YouTube video.
- They outlined a hypothetical sequence of `nova.act` calls for searching and booking hotels using customer reviews and personal details.
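The item above already calls the flow hypothetical, so here is an equally hypothetical sketch of what those `nova.act` calls might look like, based on my reading of Amazon's published nova-act examples (class and method names may differ in your SDK version); the site, dates, and selection criteria are invented, and the Claude/MCP wiring is not shown.

```python
# Hypothetical sketch only: a sequence of natural-language act() calls for a
# hotel-search-and-book flow driven by a browsing agent.
from nova_act import NovaAct

with NovaAct(starting_page="https://www.example-hotels.com") as nova:
    nova.act("search for hotels in Lisbon for May 10-12 for two adults")
    nova.act("sort the results by guest review score, highest first")
    nova.act("open the top result and read the most recent customer reviews")
    nova.act("select a room with free cancellation and proceed to checkout")
    # Personal/payment details would be supplied by the calling agent, not hard-coded here.
```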
OpenRouter (Alex Atallah) Discord
- OpenRouter Orgs Exit Beta, Go Live!: Organizations are now out of beta, granting teams control over data policies and consolidated billing across numerous model providers, detailed in an announcement on X.
- The update enables complete control over data policies and consolidated billing.
- Web Search Enters OpenRouter Chat!: Web search results, powered by Perplexity, are now integrated into the chatroom, formatting results similarly to `:online` model variants.
- Users are eagerly awaiting OpenRouter API support for PDF files and documentation on the Perplexity response format; API support is coming soon, aligning with the OpenAI chat/completions API format. A hedged request sketch follows this section's list.
- Community Craves Cerebras on OpenRouter: Enthusiastic users are advocating for Cerebras to be integrated into OpenRouter, with others requesting content beyond Xitter (e.g., Bluesky).
- The push for broader platform support underscores the community's desire for diverse model options and communication channels.
- OpenRouter API Plagued by 500 Errors: Users reported random Internal Server Errors (code 500) via the OpenRouter API, specifically when using the Gemini 2.5 Pro model and Sambanova/Deepseek V3 0324 models.
- One user noted frequent regeneration failures that returned the same output despite prompt changes, pointing to underlying instability.
- OpenRouter Exposes Fee Structure: OpenRouter's fee structure includes no charge for routing requests without BYOK, but a 5% fee is charged on deposits.
- Speculation suggests the 5% deposit fee is tied to Stripe's charges (around 3.5%), although OpenRouter's transaction volume could potentially lead to negotiated discounts.
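A hedged sketch of what that usage can look like today, given that OpenRouter follows the OpenAI chat/completions format and exposes web search through `:online` model variants; the model slug is illustrative and this is not official documentation.

```python
# Sketch: point the OpenAI SDK at OpenRouter and request an ":online" variant.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

resp = client.chat.completions.create(
    model="openai/gpt-4o-mini:online",  # ":online" suffix adds web search results (slug illustrative)
    messages=[{"role": "user", "content": "What did OpenAI announce this week?"}],
)
print(resp.choices[0].message.content)
```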
Modular (Mojo 🔥) Discord
- Chris drops talk in Mojo: A full recording of Chris's lightning talk is now available on YouTube, along with a cleaned-up recording of today's livestream.
- The talk provides insights into the current state and future directions of the Mojo language and ecosystem.
- Mojo stymied by Firewall?: A member inquired about the timeline for being able to download, install, and use Mojo on firewalled networks, where direct internet connections are restricted due to security concerns.
- The concern highlights the need for offline installation and usage capabilities for Mojo in secure environments.
- Flex Attention Implementation inflames discussion: A member inquired about implementing flex-attention in Mojo, linking to a PyTorch blog post.
- Discussion notes that while any language can implement it, optimal performance requires careful memory management, similar to CUDA.
- Float-to-String Algorithm crawls in Mojo: A member ported a new float-to-string algorithm to Mojo from its reference C++ implementation, but found it was significantly slower than the stdlib dragonbox implementation, with the code available on GitHub.
- Stringifying `canada.json` went from 30ms to 40ms, even after ripping the formatting from the standard library.
- Godbolt Assembles Mojo: A member asked about the process for getting support for Mojo in Godbolt, specifically for comparing assembly output when porting code from C.
- A member shared a gist as a temporary workaround, and suggested that MLIR dumps would be another desirable feature for the compiler.
GPU MODE Discord
- GPUs Ace Context Switching: Members highlighted that context switches on GPUs are essentially free at around ~1 cycle, thanks to oversubscription to mask latencies.
- They compared this with CPUs, where context switches are expensive, costing hundreds of cycles.
- Triton Type Typo Taming Tactics: Members discussed using `tl.static_assert` and `static_print` to assert/print shapes that are statically known, improving static analysis and checking tensor shapes at Triton compile time. They mentioned a MAPL 2020 project as inspiration; a minimal sketch follows this section's list.
- The team noted shape-related errors due to type typos.
- Torch Tensor Terminator Tactics Tested: Members are trying to delete argument tensors within a loss function to achieve significant memory savings of 7GB, but face challenges due to live references in the outer scope, linking to a related GitHub issue.
- They explored resizing the underlying storage, suspecting it returns memory to the CUDA caching allocator for reuse.
- Apple Hires MLX Magicians: Apple is hiring engineers to work on MLX, seeking those passionate about advancing the frontier of ML and systems; interested candidates are encouraged to apply to this job posting.
- The role involves collaborating with researchers and software engineers to develop scalable, distributed training and research pipelines within Apple’s Machine Learning Research organization.
- Reasoning Gym's Curricula Calibration: A member opened PR #407, overhauling the reasoning-gym datasets and fixing the curricula to be more sensible, as well as updating the tests and adding missing curricula.
- Open-Reasoner-Zero was introduced as the first open source implementation of large-scale reasoning-oriented RL training, focusing on scalability, simplicity, and accessibility.
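As referenced above, a minimal sketch of compile-time shape checking in a Triton kernel, assuming a recent Triton release that exposes `tl.static_assert` and `tl.static_print`; the kernel itself is a throwaway example.

```python
# Sketch: both checks run at Triton compile time, so shape typos fail before launch.
import torch
import triton
import triton.language as tl

@triton.jit
def scaled_copy_kernel(x_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    tl.static_assert(BLOCK_SIZE % 16 == 0, "BLOCK_SIZE must be a multiple of 16")
    tl.static_print("compiling with BLOCK_SIZE =", BLOCK_SIZE)
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x * 2.0, mask=mask)

x = torch.randn(4096, device="cuda")
out = torch.empty_like(x)
grid = lambda meta: (triton.cdiv(x.numel(), meta["BLOCK_SIZE"]),)
scaled_copy_kernel[grid](x, out, x.numel(), BLOCK_SIZE=256)
```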
Torchtune Discord
- Qwen Upload to S3 Blocked: The upload of the Qwen model to S3 is blocked due to internal infra changes, delaying CI runs, while regression testing is on hold.
- Instead of Llama2, something more modern is being considered for regression testing in this PR.
- Profiling GRPO Reveals Memory Dragons: Profiling GRPO exposed memory spikes, prompting a search for ways to automatically generate graphs showing memory allocation breakdown.
- Suggestions included trying a chunked loss to reduce memory usage and compiling the forward pass instead of the whole loss via this PR.
- Dream 7B Diffuses onto the Scene: The University of Hong Kong and Huawei Noah’s Ark Lab collaborated and released a new OSS diffusion LLM, Dream 7B (link).
- Due to its diffusion modeling approach, Dream 7B showcases strong planning ability and inference flexibility, which allows it to excel in tasks requiring complex reasoning and adaptability.
HuggingFace Discord
- Robots Take Over with AI: A LinkedIn post features an AI-powered robot operating autonomously, poised to transform agriculture, farming, and healthcare.
- The guild discussed how AI and robotics are revolutionizing industries.
- Gemma 3's Float16 Flounders: Users reported issues with the Gemma 3 model when using float16 precision, as detailed in this GitHub issue.
- The model operates correctly in a standard environment and with GGUF on Ollama, but faces compatibility problems with certain libraries and fp16 precision.
- Takara TLDR; Saves Time Summarizing Papers: The Takara TLDR digest launched, offering daily summaries of AI research papers at tldr.takara.ai and via RSS at papers.takara.ai/api/summary.
- It employs Qwen2.5-72B-Instruct through HuggingFace inference endpoints to generate bullet-pointed summaries, cached in Redis.
- Gradio Gains a Million: Gradio reached 1,000,000 monthly active developers, showcasing its increasing significance as an open-source ML interface builder.
- The milestone reflects the collective contributions of users in demos, bug reports, and feature requests.
- Agent Course's RAG Tool Rages: Users reported issues with the RAG Tool from unit 3, with one confirming it didn't work that morning, and another stating Glad its not just me!
- Members also found that the correct `model_id` for Ollama is `ollama_chat/<model>`, not `ollama/<model>`.
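A hedged sketch of that fix in context, assuming the agents course's smolagents + LiteLLM setup (class names and the local model tag are illustrative); the point is only the `ollama_chat/` prefix.

```python
# Sketch: wiring a local Ollama model into a smolagents agent via LiteLLM.
from smolagents import CodeAgent, LiteLLMModel

model = LiteLLMModel(
    model_id="ollama_chat/qwen2.5:7b",  # not "ollama/qwen2.5:7b"
    api_base="http://localhost:11434",   # default local Ollama endpoint
)
agent = CodeAgent(tools=[], model=model)
print(agent.run("How many seconds are in a week?"))
```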
Latent Space Discord
- ByteDance's OmniHuman Turns Heads: ByteDance's OmniHuman is now public, enabling AI Avatar animation from a single image and sound, available via Capcut's Dreamina website; a 15-second trial video is free.
- Initial testers report impressive mouth articulation and general movement, although the process is very slow and costs 192 credits to use.
- LLMs Show Weakness at USAMO: Top LLMs scored less than 5% on the 2025 USAMO full-solution eval, despite strong answer-only benchmark scores.
- Discussion suggests potential failure modes are linked to training artifacts and overfitting; some question whether all frontier labs would make this error.
- All Hands Deploys Coding LM and Cloud: OpenHands LM, a 32B coding agent model, resolves 37.4% of issues on SWE-bench Verified, accompanied by OpenHands Cloud, which offers SOTA open-source coding agents with $50 in free credits.
- OpenAI Opens PaperBench for Agent Evaluation: OpenAI has launched PaperBench, a benchmark evaluating AI agents' ability to replicate state-of-the-art AI research from top ICML 2024 papers, as part of their Preparedness Framework; the code is available on GitHub.
- Human experts needed 24 hours of work before they began outperforming the model, whose performance plateaued after 1 hour.
- Meta's Llama 4 Delivers Fast Image Generation: Llama 4-based image generation and editing is being rolled out, showing fast performance with 1-second edits, compared to 5 minutes for GPT-4o.
- hingeloss shared a tweet highlighting the speed improvements.
Yannick Kilcher Discord
- Gemini 2.5 Pro Bombs Math Test: A user discovered that Gemini 2.5 Pro (experimental) performed poorly on math problems and criticized Google for a flawed UI that doesn't show math correctly.
- They highlighted that ChatGPT and Grok 3 demonstrated superior understanding of poorly written questions compared to Gemini 2.5 Pro.
- Decoding LLM Special Tokens: Users are exploring the repeatable semantic meanings of special tokens such as <|place holder no 1|>, finding they are not randomly assigned.
- Analysis indicates these tokens have consistent semantic roles, with examples like <|place holder no 1|> consistently representing leadership or primary entities.
- Modular Model Spec Aims for Reliability: A user introduced their Modular Model Spec (modular-model-spec.vercel.app) intending to boost the flexibility and reliability of LLMs for AI application developers.
- The specification centers on a unified, modular, and extensible dataset format, enhancing reliability and developer convenience.
- Alibaba's Qwen3 Launching Soon: Alibaba is expected to launch its new model, Qwen3, in the second week of April 2025, around seven months after the Qwen2.5 release in September 2024, according to this article.
- Alibaba's shift came after DeepSeek-R1 gained popularity in early 2025, leading them to prioritize inference capabilities.
- RLHF Prompt-Data Discussed: A discussion around the paper "Reinforcement Learning from Human Feedback (RLHF)" covered the importance of prompt-data construction to combat reward hacking.
- The paper also addresses decreasing response diversity in language models.
Nous Research AI Discord
- Anthropic Exposes LLMs' Hidden Thoughts: Anthropic's Tracing the Thoughts of a Large Language Model blog post suggests that LLMs possess their own thinking language, engaging in more complex cognitive processes than previously thought.
- A member noted these insights challenge conventional understandings of how LLMs operate.
- OpenAI to Share Open Weight Model: OpenAI is set to release an open weight model, potentially influenced by DeepSeek's complex maneuvers, according to this video.
- The open-source community is grateful for this development, with one member commenting, "well someone had an epiphany."
- Loong 🐉 Launches for Synthetic Data: CamelAIOrg introduced Project Loong 🐉, a modular solution for generating and verifying synthetic data, aimed at enhancing model performance.
- Their blog post at camel-ai.org details the project's use of a multi-agent framework to ensure accuracy and consistency.
- Bintensors Claims Faster Safetensors Alternative: A new binary format, bintensors, promises faster speed with zero-copy access compared to safetensors; installation is available via both Cargo and Pip.
- Check out the documentation and GitHub repository for implementation details.
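The summary doesn't include bintensors' API, so as a point of comparison here is the safetensors round trip it claims to outperform; the file name and tensors are arbitrary.

```python
# Baseline for comparison: save and reload a tensor dict with safetensors.
import torch
from safetensors.torch import save_file, load_file

tensors = {"weight": torch.randn(4, 4), "bias": torch.zeros(4)}
save_file(tensors, "model.safetensors")
loaded = load_file("model.safetensors")   # reads the tensors back from the on-disk file
print(loaded["weight"].shape)
```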
- DeepHermes Reasoning Questioned: A user inquiring about reasoning with DeepHermes via Langchain was advised to use non-reasoning mode for better reliability, especially with JSON or tool calling.
- Another user expressed excitement over DeepHermes AI, highlighting that it's a 3B model.
tinygrad (George Hotz) Discord
- TinyGrad Dodges GSoC: Members discussed why TinyGrad didn't participate in the Google Summer of Code (GSoC) program, with one member stating that the overhead in onboarding students and handling paperwork often outweighs the benefits.
- However, another member argued that it effectively provides access to smart people working full-time for 3 months.
- TinyGrad: Hard to Contribute?: A member expressed the opinion that contributing meaningfully to TinyGrad requires significantly more effort compared to other projects.
- The sentiment suggested a steep learning curve for new contributors.
- UOps Optimization Questioned: A member inquired about optimizing UOps creation, specifically when discarding 2 out of 3 trees, suggesting an alternative approach using a dictionary.
- The suggested code snippet involved operations like ADD, MAX, and MUL, applied to a pooled sum.
- Arange() Insanity Exposed: A member added a chapter on .arange() to their notes, providing a link and a code snippet using Tensor.arange(0.5, 2, 0.2) (reproduced in the sketch below).
- The resulting UOp tree includes operations like RESHAPE, REDUCE_AXIS, PERMUTE, and SHRINK.
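A quick reproduction of the call from the notes, assuming tinygrad's top-level Tensor import.

```python
# Tensor.arange mirrors Python range semantics with float start/stop/step.
from tinygrad import Tensor

t = Tensor.arange(0.5, 2, 0.2)
print(t.numpy())   # approximately [0.5, 0.7, 0.9, 1.1, 1.3, 1.5, 1.7, 1.9]
```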
- Pad Dimensions Confuse: Members reported that .pad() takes the dimensions to pad in the reverse order, causing confusion.
- No solution was discussed; the behavior remains confusing.
LlamaIndex Discord
- LlamaIndex Enhances Prompt Engineering with RichPromptTemplate: LlamaIndex introduced RichPromptTemplate, a new feature for creating complex, Jinja-style prompt templates that support variables, loops, chat message roles, and multimodality, detailed in this tweet.
- The feature aims to simplify the creation of advanced prompts for various applications, including multimodal setups.
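A hedged sketch of the Jinja-style templating described; the import path, chat-block syntax, and format_messages call are assumed from LlamaIndex documentation of that period rather than taken from the tweet.

```python
# Sketch: a templated chat prompt with a variable, rendered into chat messages.
from llama_index.core.prompts import RichPromptTemplate  # assumed import path

template = RichPromptTemplate(
    """{% chat role="system" %}You are a terse assistant.{% endchat %}
{% chat role="user" %}Summarize: {{ text }}{% endchat %}"""
)
messages = template.format_messages(text="LlamaIndex added RichPromptTemplate.")
print(messages)
```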
- Hugging Face Course Compares LlamaIndex Agentic RAG: Hugging Face released an Agents course unit that compares LlamaIndex, smolagents, and LangGraph for Agentic RAG implementations.
- The course is designed to provide a comprehensive understanding of AI agents, guiding users from beginner to expert.
- Debugging MSSQL Text-to-SQL LLM Prompts: Members debugged a text2SQL implementation for generating MSSQL code and found the prompt mixin example helpful for modifying prompts.
- To print all LLM inputs and outputs, a member suggested using the code: from llama_index.core import set_global_handler; set_global_handler("simple") (see the sketch below).
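A hedged end-to-end sketch showing that handler in context; the data folder, index setup, and query are placeholders, and an LLM/embedding backend is assumed to be configured.

```python
# Sketch: enable the "simple" global handler so every prompt/response is printed.
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex, set_global_handler

set_global_handler("simple")

docs = SimpleDirectoryReader("./data").load_data()   # assumed local folder of documents
index = VectorStoreIndex.from_documents(docs)
print(index.as_query_engine().query("Which tables does the report mention?"))
```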
- Changelog Info for LlamaIndex Deployed: Users looking for a release changelog found the LlamaIndex CHANGELOG.md file and the documentation changelog.
- These resources offer detailed information on changes and updates in each LlamaIndex release, similar to Langchain.
Nomic.ai (GPT4All) Discord
- OpenAI's Open Source Tease: Members speculated that OpenAI may release something as open source, though one member suggested that it may not be very human-like.
- The member stated that, "AFAIK open source models aren't the best at writing yet."
- Deepseek Chatty Cathy: A member shared an anecdote about Deepseek being overly verbose, illustrating it with an image of Deepseek's thought process when asked to simply say 'ready', as visible here.
- Nomic Embed Text V2 Launch Anticipation: Members are waiting for Nomic Embed Text V2 to be available in GPT4All.
- One member stated that they are waiting patiently, understanding that developers are likely busy and it might take time.
Cohere Discord
- Command A stutters repeated inputs: Members found that Command A in the API Playground gets stuck generating the same character endlessly when encountering repeated letters like 「ギャアアアアアア...」 (a Japanese scream, roughly "GYAAAAH...") or "AHHHHHH...".
- Cohere API experiences Timeout Errors: Users reported experiencing HTTP timeout errors with the Cohere API and Playground.
- The Cohere Status Page indicates degraded performance for command-a-03-2025 due to increased latency.
- Discord welcomes new members!: The Discord server welcomes new members to the Cohere Community Discord Server in the 「🤝」introductions channel.
- New members are encouraged to introduce themselves by stating their Company/Industry/University, what they are working on, their favorite tech/tools, and what they hope to gain from the community, and are given a template for their introduction.
DSPy Discord
- DSPy Eyes OpenAI Agents SDK Integration: A user inquired about leveraging DSPy for generating prompts for the OpenAI Agents SDK, igniting a discussion on the potential synergy between the two.
- Suggestions arose that DSPy might already encompass most of the SDK's functionalities, possibly rendering direct integration unnecessary.
- DSPy Decouples Prompt Engineering: Members discussed DSPy as a tool to decouple prompt engineering from LLM behavior, questioning how to integrate it with OpenAI Agents SDK for managing agents and workflows.
- The conversation focused on using DSPy for prompt engineering, while continuing to use OpenAI Agents SDK for other functionalities, to avoid adding complexity.
- Synergy Between DSPy and OpenAI Agents SDK Explored: A member asked for examples of using DSPy for programmatic prompt engineering alongside the OpenAI Agents SDK, sparking a discussion about the framework's core abstractions.
- Clarification indicated that DSPy achieves decoupling through programmatic signatures and modules, highlighting that these abstractions are core to its design and cannot be bypassed (a minimal example is sketched below).
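A minimal sketch of that decoupling: the task is declared as a DSPy signature and executed by a module, while the LM backend stays a swappable configuration; the model name below is an assumption.

```python
# Sketch: a typed signature plus a Predict module; the prompt itself is never hand-written.
import dspy

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))   # assumed backend; any LiteLLM-style id works

class TicketTriage(dspy.Signature):
    """Classify a support ticket."""
    ticket: str = dspy.InputField()
    category: str = dspy.OutputField(desc="one of: billing, bug, feature_request")

triage = dspy.Predict(TicketTriage)
print(triage(ticket="The invoice total is wrong this month.").category)
```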
- Closing the LLM Agent Development Loop: A member shared a YouTube video about configuring LLM agents to self-improve using telemetry and evaluations, seeking community feedback.
- The video delves into a conceptual framework for closing the loop on LLM agent development, offering insights into self-improving agent architectures.
AI21 Labs (Jamba) Discord
- Jamba v1.6 Weights Released: AI21 Labs released Jamba v1.6, an open model available on Hugging Face.
- This model is built with a hybrid SSM-Transformer architecture, claiming to outperform other open instruction-following models in quality, speed, and long context performance.
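A hedged loading sketch; the Hugging Face repo id is assumed and the checkpoints are large, so treat this as illustrative rather than something to run casually.

```python
# Sketch: load the open weights with Transformers and generate a short completion.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ai21labs/AI21-Jamba-Mini-1.6"   # assumed repo name
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

inputs = tok("Summarize the Jamba architecture:", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))
```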
- Jamba Excels at RAG: The Jamba 1.6 models show superior performance on long context tasks important to enterprises, like RAG workflows and grounded question answering.
- The release blog post can be found on AI21's blog.
- Jamba Open Model License Okayed: The Jamba Open Model License allows full research use and commercial use under its license terms.
- For specific licensing needs, contact AI21 Labs.
- Jamba Codebase Stays Closed: Jamba v1.6 does not have an open codebase, only open weights are available.
- Therefore, users cannot train Jamba v1.6 themselves.
LLM Agents (Berkeley MOOC) Discord
- MOOC Can Still Be Audited: Members discussed whether the MOOC can be taken a few months from now, confirming that auditing is possible even after the May deadline.
- The certificate-earning coursework has a May deadline, but auditing remains an option beyond that.
- DeepSeek-R1 Reasoning Capabilities are High: Recent Large Reasoning Models such as DeepSeek-R1 have demonstrated that general reasoning capabilities of LLMs greatly improve when base models undergo post-training with Reinforcement Learning (RL) with a verifiable reward, especially in mathematics and programming.
- A blogpost mentions that ease of verification is crucial to improving domain-specific capabilities, and that an abundance of high-quality datasets is another critical prerequisite for models to learn to construct coherent Chains-of-Thought (CoTs) that lead reliably to correct answers.
- Verifiable Reward is Extremely Useful: Mathematics and programming have particularly benefited from verifiable rewards, as these domains can be verified quite easily—allowing accurate interpretation of LLM responses and effective comparison to the ground truth on a semantic level.
- The idea that ease of verification is crucial to improving domain-specific capabilities has become widely accepted in the research community.
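A toy example of what a verifiable reward can look like for math-style post-training, assuming final answers are reported in a \boxed{...} span; real pipelines normalize answers far more carefully.

```python
# Sketch: binary reward comparing the model's last boxed answer to the reference.
import re

def math_reward(response: str, ground_truth: str) -> float:
    """Return 1.0 if the last \\boxed{...} answer matches the reference, else 0.0."""
    matches = re.findall(r"\\boxed\{([^{}]+)\}", response)
    if not matches:
        return 0.0
    return 1.0 if matches[-1].strip() == ground_truth.strip() else 0.0

print(math_reward(r"Adding the terms gives \boxed{42}.", "42"))   # -> 1.0
```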
Codeium (Windsurf) Discord
- Windsurf's Wave 6 Arrives with New Features: Windsurf released Wave 6, featuring one-click app deploys, enterprise access to MCPs and Turbo Mode, and one-click commit message generation.
- The update also includes a conversation table of contents, improved performance in long conversations, enhanced Tab features, and added MCP SSE support, outlined in their blogpost.
- Windsurf Catapults Apps to the Public: Windsurf Deploys (beta) enables users to share apps publicly with one click, streamlining deployment.
- This feature, part of Wave 6, simplifies the deployment process as detailed in their blogpost.
- Windsurf Tabs Now Jives with Jupyter: Wave 6 brings enhanced Tab features, including user search context and Jupyter Notebook support, aiming to smooth workflows within the platform, according to a recent tweet.
- The integration focuses on providing a more seamless experience for users working with notebooks.
- Cascade Saves Screenshots: Windsurf Previews (Beta) lets users preview locally run websites in their IDE or browser, and users can select React and HTML elements to send to Cascade as context.
- According to the changelog, this eliminates copy-pasting or screenshots, can be toggled via Windsurf Settings, and is available to all plans without costing credits.
Gorilla LLM (Berkeley Function Calling) Discord
- Phi-4-mini-instruct PR Needs Eyes: A member created a PR to add tool evaluation for Phi-4-mini-instruct with BFCL and is requesting feedback on GitHub.
- The pull request aims to integrate and evaluate Microsoft's Phi-4-mini-instruct model within the BFCL framework.
- Call for Code Review on New Integration: A contributor has submitted a pull request to integrate and evaluate Microsoft's Phi-4-mini-instruct model within the BFCL framework.
- The integration requires community feedback and code review, focusing on the model's performance and compatibility within the existing system.
The MLOps @Chipro Discord has no new messages. If this guild has been quiet for too long, let us know and we will remove it.
PART 2: Detailed by-Channel summaries and links
The full channel-by-channel breakdowns have been truncated for email.
If you want the full breakdown, please visit the web version of this email: !
If you enjoyed AINews, please share with a friend! Thanks in advance!