$15 Per Paper, Healthcare Agents Cap at 28%
- Auto-Research Cost Curve Has Crossed. $15 produces a full paper, but frontier LLMs still fabricate results and miss errors. End-to-end autonomy still falls short of the conference acceptance bar.
- OProver Pulls the Compiler Loop Into Training. Failed trajectories plus verifier-repaired proofs feed SFT directly. MiniF2F 93.3% Pass@32 puts it in the current top tier among open whole-proof provers.
- CHI-Bench Tests Policy Density, Role Switching, and Mid-Task Dialogue Together. The best agent config clears only 28%. Strict pass^3 keeps everyone under 20%.
- CompactAttention Targets the Chunked Prefill Gap. Demoting the 2D block-sparse mask from execution plan to KV selection signal gets 2.72x attention speedup at 128K context, with dense-equivalent accuracy.
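The two headline metrics above, Pass@32 and pass^3, reward opposite things, which is why they read so differently. Here is a minimal sketch, assuming the commonly used combinatorial estimators: Pass@k counts a task as solved if at least one of k sampled attempts succeeds, while pass^k requires all k to succeed. The brief does not state the exact scoring protocol of either paper, so the function names `pass_at_k` and `pass_hat_k` and the sample counts are illustrative assumptions.

```python
from math import comb


def _check(n: int, c: int, k: int) -> None:
    # n = attempts recorded for one task, c = how many succeeded, k = draw size.
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimated chance that at least one of k attempts, drawn without
    replacement from the n recorded attempts, is a success."""
    _check(n, c, k)
    # comb(n - c, k) is 0 when fewer than k failures exist, giving 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimated chance that all k attempts, drawn without replacement
    from the n recorded attempts, are successes (the strict variant)."""
    _check(n, c, k)
    # comb(c, k) is 0 when fewer than k successes exist, giving 0.0.
    return comb(c, k) / comb(n, k)


def benchmark_score(per_task_counts, k, metric):
    """Average a per-task metric over (n, c) pairs, one pair per task."""
    scores = [metric(n, c, k) for n, c in per_task_counts]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Hypothetical task solved on 2 of 3 runs: lenient and strict views diverge.
    print(pass_at_k(3, 2, 1))   # single-attempt success rate
    print(pass_hat_k(3, 2, 3))  # all three runs must succeed
```

A task solved on two of three runs contributes two thirds to a single-attempt score and zero to pass^3, so a strict consistency metric can sit well below the best single-run number without any contradiction.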
Also Notable
- Tool-Using Agents Tested in One Real-Work Pipeline. Real professional tasks expose the end-to-end failure modes of tool-using agents.
- Training-Free N-Gram Memory Module. Plug-and-play path for MoE schemes and trainable memory embeddings.
- Auto-Generated Abstract Reasoning Tasks, Formally Verifiable. Sidesteps human annotation cost and memorization contamination. Accuracy scores are no longer distorted by data leakage.
- SFT That Adds New Knowledge Without Losing Old Capability. Distribution-aligned self-distillation without an external teacher. Post-training stops trading old capability for new.
- GPU Kernel Agent With Generalization-Aware Evaluation. Pushes kernel agents from single-point capability tests to unseen-config generalization.
- Expert-Guided Merging Then Quantization. Combines model merging and quantization into a single low-resource deployment pipeline.