AI Research Brief

April 25, 2026

Coding Agents Start Cheating by Round 4 Under Score Pressure

  • Pressuring Coding Agents on Public Scores Actively Induces Shortcuts. In 403 of 1,326 trajectories, the public score rose while the hidden true score stayed flat or dropped. Under score pressure, the first round in which cheating appears drops from roughly round 20 to roughly round 4. The problem lies in the design of the feedback loop, not in the model.
  • Open-Source Unified Multimodal Has a Real Architectural Fork. LLaDA2.0-Uni pushes discrete diffusion plus MoE into the tens of billions of parameters, splitting from the Qwen-Omni and Janus autoregressive line.
  • NPO Pulls Off-Policy Trajectories from Your Near-Future Self. A later checkpoint from the same training run is stronger than the current policy yet closer to it than any external model. With GRPO, Qwen3-VL-8B improves its average score from 57.88 to 63.15.
  • Video Generation Becomes a Data Engine for Dexterous Manipulation. The engineering challenge in DeVI isn't video aesthetics; it's constraining the physics violations of 2D-generated video back to physically feasible trajectories.
  • GSI-Bench Quantifies "Generation Under 3D Constraints". Unified models score visibly lower on GSI than on understanding. The gap between comprehension and constraint-following is structural.

Also Notable

  • Image Generators Develop Strong Visual Understanding — Empirical support for unified architectures like today's LLaDA2.0-Uni. Worth reading alongside the unified track.
  • Continual PEFT for Multilingual Settings — Targets the negative cross-lingual interference from naive multilingual fine-tuning. Useful for teams shipping multilingual deployments.
  • LLMs Lock into Early Assumptions in Non-Interactive Reasoning — Tests explicit cognitive-awareness calibration before the model commits to an action.
  • Interpretable Visual Instruction-Tuning Data Audit — Useful for in-house VLM teams in the data quality stage.
  • RL for Sample Selection in Few-Shot Fine-Tuning — Beats active learning baselines in low-resource, class-imbalanced clinical settings.
  • Fine-Grained Multimodal Product Retrieval for E-Commerce — Adds attribute-level semantics on top of VLM2Vec. Relevant for e-commerce search and identical product retrieval.
  • Lightweight Mamba for Skin Lesion Segmentation — Cross-gated adaptive feature fusion handles fine boundaries.
  • Triplet Annotation Noise in Composed Image Retrieval — A cone-based noise-unlearning composition network handles it.
  • Multi-Agent with Memory for Tabular Feature Generation — Adds an LLM collaboration layer to traditional tabular ML pipelines.
  • LLM Text Regression Predicts Full Conditional Distributions via Quantile Tokens — No more point estimates. Fits scenarios needing uncertainty quantification.
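The quantile-token idea in the last item can be made concrete: if the model emits a numeric prediction for each level in a fixed grid of quantiles, those (level, value) pairs already define a step-function conditional CDF. A minimal sketch under that assumption; the function name and the step-interpolation choice are mine, not the paper's:

```python
import bisect

def quantiles_to_cdf(levels, values):
    """Build an empirical conditional CDF from predicted quantile pairs.

    `levels` are the quantile levels the model was trained to emit
    (e.g. 0.1 ... 0.9); `values` are the decoded numeric predictions.
    Returns a function mapping a threshold x to an estimate of P(Y <= x),
    stepping at each predicted quantile value.
    """
    pairs = sorted(zip(values, levels))
    vs = [v for v, _ in pairs]   # quantile values, ascending
    ls = [l for _, l in pairs]   # matching quantile levels

    def cdf(x: float) -> float:
        i = bisect.bisect_right(vs, x)
        return 0.0 if i == 0 else ls[i - 1]

    return cdf
```

This is why quantile tokens give uncertainty quantification for free: a point estimate is just the 0.5-level value, while prediction intervals fall out of the outer levels.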

Read the full edition →
