Nerra Network

June 5, 2026

Microsoft just dropped seven new MAI models… · M&A 🤖

Models & Agents — Daily AI models, agents, and practical developments.

Ep 71 · Jun 5, 2026

🎧 Today's episode
Episode 71 · Microsoft just dropped seven new MAI models purpose-built for reasoning, coding, image, voice, and transcription, all integrated into the Microsoft stack.

What You Need to Know: Microsoft released the MAI family today, headlined by the 1T-parameter MAI-Thinking-1 reasoning model and the 137B MAI-Code-1-Flash agentic coding model. Anthropic reported that its engineers, using Claude internally, now ship 8x more code per quarter, and that Claude succeeds on 76% of open-ended coding tasks. An OpenAI reasoning model found a counterexample to an 80-year-old Erdős conjecture, and OpenAI began rolling out a new memory system to ChatGPT Plus/Pro users.

Top Story

Microsoft released seven MAI models spanning reasoning, coding, image, voice, and transcription. MAI-Thinking-1 is a medium-sized model trained from scratch on clean data that matches leading models on software engineering benchmarks and is preferred to Sonnet 4.6 in blind evaluations. MAI-Code-1-Flash uses 5 billion active parameters, targets GitHub Copilot and VS Code integration, and runs cheaper than comparable models. Microsoft claims that MAI-Image-2.5 and its Flash variant surpass Nano Banana Pro on Arena scores for text-to-image and editing. MAI-Transcribe-1.5 is reported to deliver state-of-the-art accuracy across 43 languages at five times the speed of competitors, while MAI-Voice-2 supports 15 languages with voice adaptation from short samples. Builders working inside the Microsoft ecosystem should test MAI-Code-1-Flash in Copilot workflows this week to evaluate cost and latency against current Haiku-class options. The coding and reasoning variants remain under product terms rather than permissive licenses, so watch for open-weight releases. Source: reddit.com


Model Updates

Supra-50M-Reasoning: SupraLabs released Supra-50M-Reasoning, a 50M-parameter model fine-tuned for 6 epochs on a 500-sample synthetic dataset generated by Qwen3 1.7B. Every response follows a strict <|begin_of_thought|> … <|end_of_thought|> <|begin_of_solution|> … <|end_of_solution|> format. The model is fully open, experimental, and known to hallucinate. It is the first reasoning model in the Supra-50M collection under Project Chimera, with 124M and 350M variants planned next. Try the provided Python inference script on a local GPU to test structured reasoning chains on short factual or philosophical prompts. Source: SupraLabs
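If you want to check how reliably the model sticks to that format, a small parser helps. The sketch below is not part of the SupraLabs release; the names `ReasoningOutput` and `parse_reasoning` are made up for illustration, and the only thing taken from the announcement is the four delimiter tokens.

```python
import re
from typing import NamedTuple, Optional


class ReasoningOutput(NamedTuple):
    """Hypothetical container for the two sections of a response."""
    thought: str
    solution: str


# Matches the thought block followed by the solution block.
# re.DOTALL lets "." span newlines inside either block.
_PATTERN = re.compile(
    r"<\|begin_of_thought\|>(?P<thought>.*?)<\|end_of_thought\|>\s*"
    r"<\|begin_of_solution\|>(?P<solution>.*?)<\|end_of_solution\|>",
    re.DOTALL,
)


def parse_reasoning(text: str) -> Optional[ReasoningOutput]:
    """Return the thought and solution, or None if the format is broken."""
    match = _PATTERN.search(text)
    if match is None:
        return None
    return ReasoningOutput(match["thought"].strip(), match["solution"].strip())
```

Counting how often `parse_reasoning` returns `None` over a batch of prompts gives a rough format-adherence rate, which is a useful sanity check for a model this small.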

Claude code quality and velocity gains: Anthropic engineers now ship 8x more code per quarter than in 2021-2025. On open-ended coding problems, Claude's success rate reached 76%, a 50-point jump in six months. Many internal engineers report that Claude's code quality is now on par with human-written code, and they expect it to exceed that threshold within the year. One correction: the ~3x average speedup previously reported for Claude Opus 4 dates from May 2025, not 2024. Teams evaluating coding assistants should run the same open-ended problem set against Claude and current alternatives this week to measure the gap. Source: @AnthropicAI

OpenAI reasoning model disproves Erdős conjecture: An OpenAI reasoning model produced a counterexample to an 80-year-old Erdős conjecture. Researchers Alex Wei, Hongxun Wu, and wjmzbmr1 discussed the discovery on the OpenAI Podcast, highlighting collaboration between mathematicians and models. The episode is available on Spotify, Apple Podcasts, and YouTube. Watch the next frontier reasoning releases for similar mathematical discovery workflows. Source: @OpenAI

New ChatGPT memory system: OpenAI began rolling out a new memory system to Plus and Pro users in the US that automatically tracks important details and provides a memory summary for review and control. Users receive 2x more memory capacity and can revert to legacy saved memories in settings. The update requires the latest iOS or Android app and will expand to additional plans and countries soon. Test the memory summary interface on multi-session projects to see how context persistence changes daily workflows. Source: @OpenAI


Agent & Tool Developments

C3 AI agents for Shell predictive maintenance: Shell is extending its C3 AI Reliability Suite deployment to more than 30,000 pieces of equipment with new autonomous agents that move from anomaly detection to fully automated predictive maintenance. The agents operate across upstream and downstream operations. No install commands or licensing details were provided. Monitor how Shell measures reduction in unplanned downtime once the agents are live. Source: AI News

Perplexity hybrid local-server orchestrator: Perplexity AI announced a hybrid inference orchestrator for personal computers that automatically routes tasks between on-device and cloud models. The system decides routing without user intervention. No code or license details were released. Developers building local-first agents should watch for the open release or API to test automatic task splitting. Source: MarkTechPost
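Since Perplexity has not published how its orchestrator decides, the following is only a toy illustration of what an automatic local-versus-cloud routing rule can look like. Every name and threshold here (`Task`, `route`, the 8,192-token local limit) is an assumption made for this sketch, not a description of Perplexity's system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """Hypothetical description of a request to be routed."""
    prompt_tokens: int
    needs_web_access: bool


def route(task: Task, local_context_limit: int = 8192) -> str:
    """Pick "local" or "cloud" with two simple, assumed rules.

    1. Anything that needs live web data goes to the cloud.
    2. Otherwise stay on-device if the prompt fits the local model's
       context window, and fall back to the cloud if it does not.
    """
    if task.needs_web_access:
        return "cloud"
    if task.prompt_tokens <= local_context_limit:
        return "local"
    return "cloud"
```

A real orchestrator would likely weigh latency, battery, and model quality as well; the point of the sketch is only that the decision can be made from request metadata without asking the user.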

NVIDIA Dynamo Snapshot: NVIDIA released Dynamo Snapshot, a CRIU-based system that checkpoints and restores vLLM inference workers on Kubernetes using cuda-checkpoint tools. The goal is fast startup for AI inference workloads. The post provides no performance numbers or install commands. Teams running vLLM on Kubernetes should test snapshot restore times against cold starts in their current clusters. Source: MarkTechPost


Practical & Community

RTX Pro 4500 Blackwell benchmarks: A user posted detailed llama.cpp benchmarks comparing the RTX Pro 4500 Blackwell 32GB against the RTX 5060 Ti 16GB across dense and MoE models. Prompt processing improved 1.9–6x and token generation 1.6–2.6x depending on model size and quantization. The card runs at 200W versus the 5090's 400–600W while delivering professional features including ECC memory. Builders choosing between consumer and pro Blackwell cards for 24/7 local inference should review the full table for their target models. Source: r/LocalLLaMA

Advanced NVFP4/MXFP6 GGUF quantizer: A new open-source tool (MIT license) creates optimized NVFP4 and MXFP6 GGUF files directly from BF16 sources using imatrix and KLD data. It supports layer-by-layer candidate selection, RSF scale fitting, and tensor promotion between NVFP4 and MXFP6. The project includes a text UI wizard and SKILLS/AGENTS MD files for AI-assisted use. Clone the repo and run it on a single RTX 5090 to reproduce the latest Qwen3.6-27B NVFP4-MTP-GGUF v3. Source: r/LocalLLaMA
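For readers unfamiliar with KLD-driven selection, here is a rough sketch of the general idea: compare each quantized candidate's output distribution with the BF16 reference and keep the one that diverges least. This is not the tool's actual code. The function names and the toy logits are invented for illustration, and the real project may score candidates quite differently.

```python
import math
from typing import Dict, List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over a list of logits."""
    peak = max(logits)
    log_total = math.log(sum(math.exp(x - peak) for x in logits))
    return [x - peak - log_total for x in logits]


def kl_divergence(reference: Sequence[float], candidate: Sequence[float]) -> float:
    """KL(P_reference || P_candidate) computed from raw logits."""
    log_p = log_softmax(reference)
    log_q = log_softmax(candidate)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def pick_candidate(reference: Sequence[float],
                   candidates: Dict[str, Sequence[float]]) -> str:
    """Return the name of the candidate with the lowest divergence."""
    return min(candidates, key=lambda name: kl_divergence(reference, candidates[name]))


# Toy logits standing in for one layer's effect on the output distribution.
reference_logits = [2.0, 1.0, 0.1]
candidate_logits = {
    "nvfp4": [1.5, 1.2, 0.4],   # assumed: coarser format, larger drift
    "mxfp6": [2.0, 1.0, 0.2],   # assumed: finer format, smaller drift
}
best = pick_candidate(reference_logits, candidate_logits)
```

In a layer-by-layer scheme, a tensor would be promoted to the larger format only when the smaller one pushes divergence past some budget, which is how size and quality get traded off.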


Under the Hood: CRIU Checkpointing for Inference Workers

Everyone talks about “fast startup” for inference services as if it were just a matter of caching weights. In practice, restoring a full vLLM worker involves capturing GPU state, CUDA context, and KV cache contents that are tightly coupled to specific devices. CRIU plus cuda-checkpoint addresses this by freezing the process tree and GPU memory mappings into a portable snapshot that can be restored on the same or compatible hardware. The engineering tradeoff is that snapshot size grows with batch size and context length; a 128k-context worker can produce multi-gigabyte images even after compression. Restore latency drops from tens of seconds to sub-second in the best case, but only when the target node has identical driver versions and sufficient free VRAM. The gotcha that bites most teams is that any change to model weights or tokenizer invalidates prior snapshots, forcing a full cold start. Use this technique when you need rapid scale-up behind an autoscaler and can tolerate the storage overhead; fall back to standard container pulls when model iteration is frequent.
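Two of those points are easy to make concrete. The sketch below estimates KV cache size for an assumed model shape, and shows one way to decide whether a stored snapshot is still usable by fingerprinting the things that invalidate it. None of this comes from Dynamo Snapshot, CRIU, or cuda-checkpoint; `kv_cache_bytes`, `WorkerIdentity`, `snapshot_key`, and `choose_startup` are hypothetical helpers, and the model dimensions are illustrative.

```python
import hashlib
from dataclasses import dataclass


def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   tokens: int, bytes_per_value: int) -> int:
    """Size of keys plus values for one sequence (the factor 2 is K and V)."""
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_value


# Assumed shape: 32 layers, 8 KV heads, head_dim 128, FP16, 128k tokens.
full_context = kv_cache_bytes(32, 8, 128, 128 * 1024, 2)
full_context_gib = full_context / 1024**3


@dataclass(frozen=True)
class WorkerIdentity:
    """Hypothetical set of inputs that a snapshot depends on."""
    weights_digest: str
    tokenizer_digest: str
    driver_version: str


def snapshot_key(identity: WorkerIdentity) -> str:
    """Fingerprint that changes whenever weights, tokenizer, or driver change."""
    payload = "\n".join(
        [identity.weights_digest, identity.tokenizer_digest, identity.driver_version]
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def choose_startup(identity: WorkerIdentity, stored_key: str,
                   free_vram_gib: float, required_vram_gib: float) -> str:
    """Restore only if the snapshot still matches and the node has room."""
    if stored_key != snapshot_key(identity):
        return "cold_start"
    if free_vram_gib < required_vram_gib:
        return "cold_start"
    return "restore_snapshot"
```

With the assumed shape, a single full-length sequence already accounts for 16 GiB of KV cache before compression, which is why snapshot storage needs to be budgeted alongside restore time.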


Things to Try This Week

  • Run the Supra-50M-Reasoning inference script on a short factual prompt to see the explicit thought-then-solution format in action.
  • Test MAI-Code-1-Flash inside GitHub Copilot or VS Code if you have Microsoft stack access and compare cost per token against current Haiku-class options.
  • Use the new ChatGPT memory summary on a multi-day coding project and review what it automatically retains versus your manual saved memories.
  • Compare your current local model benchmarks against the RTX Pro 4500 Blackwell numbers shared in the LocalLLaMA post to decide whether the power-efficiency trade-off justifies an upgrade.

On the Horizon

  • OpenAI is expected to expand the new memory system beyond US Plus/Pro users in the coming weeks.
  • SupraLabs plans to release 124M and 350M reasoning, chat, and coding variants under Project Chimera.
  • NVIDIA may publish restore-time benchmarks for Dynamo Snapshot on production Kubernetes clusters.
  • Watch for any open-weight releases or model cards from the new Microsoft MAI family.

Nerra Network · AI-narrated voice (Grok TTS) · Editorial by Patrick
