Open-Source Search Agent Wins With 12K Samples, Agent Skills Mostly Fail
- OpenSeeker, an open-source search agent trained on just 12K synthetic samples, beats its closed-source competitors, nearly doubling the second-best score on BrowseComp with fully open data and weights. Deep Research is no longer a big-lab monopoly.
- Cross-layer attention keeps deep signals from fading. MoDA lets each attention head attend to KV pairs from preceding layers, trading 3.7% extra FLOPs for +2.11% on downstream tasks. Open-sourced; a minimal sketch of the idea follows this list.
- Agent skill injection sounds great on paper, but 39 of 49 skills produce zero improvement. SWE-Skills-Bench is the first rigorous evaluation of agent skills on real-world SWE tasks. Average gain: +1.2%.
- A mathematician formalized a plasma physics theorem in Lean 4 in 10 days without writing any code by hand. The full AI-assisted workflow is publicly archived; total cost: $200.
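How might "attending to KV pairs from preceding layers" look in practice? Below is a minimal PyTorch sketch of the idea as described in the MoDA item above; the function name, tensor layout, and cache format are illustrative assumptions, not the paper's actual implementation (causal masking is omitted for brevity).

```python
import torch
import torch.nn.functional as F

def cross_layer_attention(q, kv_current, kv_previous):
    """Attend over the current layer's KV pairs plus KV cached from earlier layers.

    q:           (batch, heads, seq, d_head) queries of the current layer
    kv_current:  (k, v) tensors for the current layer, same layout as q
    kv_previous: list of (k, v) tuples cached from preceding layers
    """
    # Concatenating along the sequence axis lets each head reach back to
    # earlier-layer representations; masking is omitted for simplicity.
    k = torch.cat([k for k, _ in kv_previous] + [kv_current[0]], dim=2)
    v = torch.cat([v for _, v in kv_previous] + [kv_current[1]], dim=2)
    return F.scaled_dot_product_attention(q, k, v)

# Toy usage: queries attend over two cached layers plus the current one.
b, h, s, d = 1, 4, 16, 32
q = torch.randn(b, h, s, d)
kv_now = (torch.randn(b, h, s, d), torch.randn(b, h, s, d))
cache = [(torch.randn(b, h, s, d), torch.randn(b, h, s, d)) for _ in range(2)]
out = cross_layer_attention(q, kv_now, cache)  # shape (1, 4, 16, 32)
```

The extra FLOPs come from the longer key/value sequence each head attends over, which matches the paper's reported trade of a small compute overhead for accuracy.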
Also Notable
- Human-Scene Interaction Reconstruction Deploys Directly to Humanoid Robots. HSImul3R uses a physics simulator as a bidirectional optimization supervisor, bridging the gap between visual reconstruction and physics engines (141 HF upvotes). Source
- Video DiT Editing Trained on 2D Images Only. ViFeEdit achieves spatial decoupling through architectural reparameterization, requiring zero video training data. Source
- City-Scale World Model Grounded in Real Seoul Streets. SWM anchors video generation to retrieved street views, maintaining spatial consistency across hundreds of meters (121 HF upvotes). Source
- Code LLM and Test LLM Co-Evolve Through Adversarial Training. Code-A1's architectural separation eliminates the risk of self-collusion, making white-box test generation safe (toy training loop sketched after this list). Source
- "Wait" Tokens Aren't the Key to Reasoning; Uncertainty Externalization Is. Information-theoretic framework unifies explanations of LLM "Aha moments." Purely procedural reasoning stagnates informationally. Source
- 464-Person Red Team Competition: Every Frontier Model Falls to Indirect Prompt Injection. Claude Opus 4.5 is the most resistant (0.5% attack success rate), Gemini 2.5 Pro the most vulnerable (8.5%); capability and robustness are only weakly correlated. Source
- Hallucination Detection Recast as Geometric Anomaly in Cognitive Trajectories. Information-theoretic probes map VLM generation to a low-dimensional cognitive state space, reaching SOTA under weak supervision (toy probe sketched after this list). Source
- Aleph Alpha Releases 70B Tokenizer-Free Model. The HAT architecture operates at the byte level, reuses the Llama 3.1 backbone, and outperforms the original Llama in both German and English (byte-level encoding illustrated after this list). Source
- Unified Multimodal Model Inference Accelerated 1.78-2.01x, Training-Free. FlashU tailors optimization strategies separately for generation and understanding tasks (CVPR 2026). Source
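The toy loop promised in the Code-A1 item: a hedged sketch of adversarial code/test co-evolution. `code_llm`, `test_llm`, and `run_tests` are hypothetical stand-ins, and the zero-sum reward is the simplest possible choice, not the paper's actual objective.

```python
def coevolution_step(code_llm, test_llm, problem, run_tests):
    solution = code_llm.generate(problem)  # code model proposes a solution
    tests = test_llm.generate(problem)     # a *separate* model writes tests
    passed = run_tests(solution, tests)    # execute the tests in a sandbox

    # The test model is rewarded for exposing failures, the code model for
    # surviving them. Keeping the two models architecturally separate is what
    # the item credits with preventing self-collusion: a single model grading
    # its own work could learn to write tests its code trivially passes.
    code_reward = 1.0 if passed else 0.0
    return code_reward, 1.0 - code_reward
```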
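And the probe sketch promised in the hallucination-detection item. PCA plus Mahalanobis distance is one simple way to score a hidden-state trajectory as a geometric anomaly; the summary does not specify the paper's actual information-theoretic probes, so everything below is an assumption.

```python
from sklearn.covariance import EmpiricalCovariance
from sklearn.decomposition import PCA

def fit_trajectory_probe(hidden_states, n_components=8):
    """Fit a low-dimensional 'cognitive state space' from trusted generations.

    hidden_states: (n_tokens, d_model) array of decoder hidden states.
    """
    pca = PCA(n_components=n_components).fit(hidden_states)
    cov = EmpiricalCovariance().fit(pca.transform(hidden_states))
    return pca, cov

def anomaly_score(pca, cov, trajectory):
    """Mean squared Mahalanobis distance of a generation's trajectory.

    Higher scores flag geometrically anomalous, possibly hallucinated output.
    """
    return cov.mahalanobis(pca.transform(trajectory)).mean()
```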
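Finally, what "tokenizer-free" means in the Aleph Alpha item: at the byte level the vocabulary is just the 256 possible byte values, so encoding reduces to UTF-8 itself. A two-liner for illustration, not Aleph Alpha's actual pipeline (HAT presumably adds its own hierarchy on top of the raw byte stream).

```python
def byte_tokenize(text: str) -> list[int]:
    # Byte-level "tokenization" is just UTF-8 encoding: every token id is a
    # byte value in [0, 255], so no learned vocabulary is needed.
    return list(text.encode("utf-8"))

print(byte_tokenize("Grüße"))  # [71, 114, 195, 188, 195, 159, 101]
```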