Open-Source Search Agent Wins With 12K Samples, Agent Skills Mostly Fail
- OpenSeeker, an open-source search agent trained on just 12K synthetic samples, beats its closed-source competitors, nearly doubling the second-best score on BrowseComp with fully open data and weights. Deep Research is no longer a big-lab monopoly.
- Cross-layer attention keeps deep signals from fading. MoDA lets each attention head attend to KV pairs from preceding layers, trading 3.7% extra FLOPs for +2.11% on downstream tasks. Open-sourced; a minimal sketch of the idea follows this list.
- Agent skill injection sounds great on paper, but 39 of 49 skills produce zero improvement. SWE-Skills-Bench is the first rigorous evaluation of agent skills on real-world SWE tasks. Average gain: +1.2%.
- A mathematician formalized a plasma physics theorem in Lean 4 in 10 days without writing any code by hand. The full AI-assisted workflow is publicly archived; total cost: $200.
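How might "attending to KV pairs from preceding layers" look in practice? Below is a minimal PyTorch sketch of the idea as described in the MoDA item above; the function name, tensor layout, and cache format are illustrative assumptions, not the paper's actual implementation (causal masking is omitted for brevity).

```python
import torch
import torch.nn.functional as F

def cross_layer_attention(q, kv_current, kv_previous):
    """Attend over the current layer's KV pairs plus KV cached from earlier layers.

    q:           (batch, heads, seq, d_head) queries of the current layer
    kv_current:  (k, v) tensors for the current layer, same layout as q
    kv_previous: list of (k, v) tuples cached from preceding layers
    """
    # Concatenating along the sequence axis lets each head reach back to
    # earlier-layer representations; masking is omitted for simplicity.
    k = torch.cat([k for k, _ in kv_previous] + [kv_current[0]], dim=2)
    v = torch.cat([v for _, v in kv_previous] + [kv_current[1]], dim=2)
    return F.scaled_dot_product_attention(q, k, v)

# Toy usage: queries attend over two cached layers plus the current one.
b, h, s, d = 1, 4, 16, 32
q = torch.randn(b, h, s, d)
kv_now = (torch.randn(b, h, s, d), torch.randn(b, h, s, d))
cache = [(torch.randn(b, h, s, d), torch.randn(b, h, s, d)) for _ in range(2)]
out = cross_layer_attention(q, kv_now, cache)  # shape (1, 4, 16, 32)
```

The extra FLOPs come from the longer key/value sequence each head attends over, which matches the paper's reported trade of a small compute overhead for accuracy.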
Also Notable
- Human-Scene Interaction Reconstruction Deploys Directly to Humanoid Robots. HSImul3R uses a physics simulator as a bidirectional optimization supervisor, bridging the gap between visual reconstruction and physics engines (141 HF upvotes). Source
- Video DiT Editing Trained on 2D Images Only. ViFeEdit achieves spatial decoupling through architectural reparameterization, requiring zero video training data. Source
- City-Scale World Model Grounded in Real Seoul Streets. SWM anchors video generation to retrieved street views, maintaining spatial consistency across hundreds of meters (121 HF upvotes). Source
- Code LLM and Test LLM Co-Evolve Through Adversarial Training. Code-A1's architectural separation eliminates the risk of self-collusion, making white-box test generation safe (toy training loop sketched after this list). Source
- "Wait" Tokens Aren't the Key to Reasoning; Uncertainty Externalization Is. Information-theoretic framework unifies explanations of LLM "Aha moments." Purely procedural reasoning stagnates informationally. Source
- 464-Person Red Team Competition: Every Frontier Model Falls to Indirect Prompt Injection. Claude Opus 4.5 is the most resistant (0.5% attack success rate), Gemini 2.5 Pro the most vulnerable (8.5%); capability and robustness are only weakly correlated. Source
- Hallucination Detection Recast as Geometric Anomaly in Cognitive Trajectories. Information-theoretic probes map VLM generation to a low-dimensional cognitive state space, reaching SOTA under weak supervision (toy probe sketched after this list). Source
- Aleph Alpha Releases 70B Tokenizer-Free Model. The HAT architecture operates at the byte level, reuses the Llama 3.1 backbone, and outperforms the original Llama in both German and English (byte-level encoding illustrated after this list). Source
- Unified Multimodal Model Inference Accelerated 1.78-2.01x, Training-Free. FlashU tailors optimization strategies separately for generation and understanding tasks (CVPR 2026). Source
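The toy loop promised in the Code-A1 item: a hedged sketch of adversarial code/test co-evolution. `code_llm`, `test_llm`, and `run_tests` are hypothetical stand-ins, and the zero-sum reward is the simplest possible choice, not the paper's actual objective.

```python
def coevolution_step(code_llm, test_llm, problem, run_tests):
    solution = code_llm.generate(problem)  # code model proposes a solution
    tests = test_llm.generate(problem)     # a *separate* model writes tests
    passed = run_tests(solution, tests)    # execute the tests in a sandbox

    # The test model is rewarded for exposing failures, the code model for
    # surviving them. Keeping the two models architecturally separate is what
    # the item credits with preventing self-collusion: a single model grading
    # its own work could learn to write tests its code trivially passes.
    code_reward = 1.0 if passed else 0.0
    return code_reward, 1.0 - code_reward
```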
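And the probe sketch promised in the hallucination-detection item. PCA plus Mahalanobis distance is one simple way to score a hidden-state trajectory as a geometric anomaly; the summary does not specify the paper's actual information-theoretic probes, so everything below is an assumption.

```python
from sklearn.covariance import EmpiricalCovariance
from sklearn.decomposition import PCA

def fit_trajectory_probe(hidden_states, n_components=8):
    """Fit a low-dimensional 'cognitive state space' from trusted generations.

    hidden_states: (n_tokens, d_model) array of decoder hidden states.
    """
    pca = PCA(n_components=n_components).fit(hidden_states)
    cov = EmpiricalCovariance().fit(pca.transform(hidden_states))
    return pca, cov

def anomaly_score(pca, cov, trajectory):
    """Mean squared Mahalanobis distance of a generation's trajectory.

    Higher scores flag geometrically anomalous, possibly hallucinated output.
    """
    return cov.mahalanobis(pca.transform(trajectory)).mean()
```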
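Finally, what "tokenizer-free" means in the Aleph Alpha item: at the byte level the vocabulary is just the 256 possible byte values, so encoding reduces to UTF-8 itself. A two-liner for illustration, not Aleph Alpha's actual pipeline (HAT presumably adds its own hierarchy on top of the raw byte stream).

```python
def byte_tokenize(text: str) -> list[int]:
    # Byte-level "tokenization" is just UTF-8 encoding: every token id is a
    # byte value in [0, 255], so no learned vocabulary is needed.
    return list(text.encode("utf-8"))

print(byte_tokenize("Grüße"))  # [71, 114, 195, 188, 195, 159, 101]
```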