A 4B Web Agent Catches Up to Closed CUAs on a Few Thousand Trajectories

June 3, 2026

PEFT isn't just cheap fine-tuning; it's per-user persistent state. A framing paper recasts small adapters as local state attached to a shared trillion-parameter base and argues, along three scaling axes, for a future of "millions of personal models."

RAG crosses from text into video generation. LongLive-RAG borrows retrieval augmentation to fix identity drift in long videos, retrieving earlier trustworthy clips as anchors and ranking first on average on VBench-Long across several AR backbones.

Online RL frees open web agents from trajectory dependence. OpenWebRL trains a 4B model that trades blows with OpenAI's and Gemini's closed computer-use agents (CUAs) using just 0.4K initial trajectories and 2.2K open-ended tasks, and the authors promise a full open-source release.

Concurrent streams are an evaluation blind spot. X-Stream is the first benchmark built for multi-stream understanding, and the strongest MLLM scores only about 50% on concurrent streams.

Also Notable

First Web-Browsing Agent Benchmark Grounded in Korean — K-BrowseComp puts frontier models like GPT-5.5, DeepSeek-V4-Pro, and GLM-5.1 head-to-head on a native-speaker-verified subset, pushing agent evaluation toward linguistic and cultural localization.
Testing Agents on Your Own Accounts and Local Databases — MCP-Persona uses environment simulation to evaluate an agent's real capability on personal social apps, covering a blind spot in general information-retrieval benchmarks.
Letting a VLM Tutor a Video Generation Model — Test-time adaptive optimization corrects the logical failures of video models that render realistically but break task rules.
A Training-Free PRM Substitute — Use an off-the-shelf LLM as a process scorer for chunk-level guided generation, skipping step-level annotation and reward-model training.
Fixing Distortion to Improve Visual Token Pruning — Eases the quadratic-complexity memory and latency bottleneck from the flood of visual tokens in MLLMs.
Novelty Signals as Training Supervision for Latent Memory — JAMEL jointly learns exploration and memory compression, solving the lack of reliable memory supervision over long trajectories.
Generating Physically Consistent, Collision-Free Interactive 3D Tabletop Scenes — Aimed at general robot learning, handling dense object hierarchies and irregular affordances.
Locating AI-Edited Forgeries by Catching Intrinsic Energy Anomalies — Bypasses the physical-noise cues that traditional methods rely on but synthetic data lacks.
Unified Co-Design of Proteins and Small-Molecule Ligands — Jointly models the coupled modalities of sequence and 3D structure through intrinsic geodesic coupling.
Initial Noise Is the Overlooked Source of Mode Collapse — Samples initial noise from a guided-potential posterior to improve diversity, rather than intervening only mid-trajectory.

