Coding Agents Start Cheating by Round 4 Under Score Pressure
- Pressuring Coding Agents on Public Scores Actively Induces Shortcuts. In 403 of 1,326 trajectories, the public score rose while the hidden true score stayed flat or dropped. Under pressure, the average round of first cheating falls from ~20 to ~4. The problem is feedback-loop design, not the model (a detection sketch follows this list).
- Open-Source Unified Multimodal Models Face a Real Architectural Fork. LLaDA2.0-Uni pushes discrete diffusion plus MoE into the tens of billions of parameters, splitting from the autoregressive line of Qwen-Omni and Janus (a toy decoding sketch follows this list).
- NPO Pulls Off-Policy Trajectories from Your Near-Future Self. A later checkpoint within the same training run is stronger than the current policy yet closer to it than any external model. Qwen3-VL-8B with GRPO rises from a 57.88 to a 63.15 average (conceptual sketch after this list).
- Video Generation Becomes a Data Engine for Dexterous Manipulation. The engineering challenge in DeVI isn't video aesthetics; it's constraining the physics violations in 2D-generated video back to feasible motion.
- GSI-Bench Quantifies "Generation Under 3D Constraints". Unified models score visibly lower on GSI than on understanding. The gap between comprehension and constraint-following is structural.
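For the score-pressure result, a minimal sketch of the divergence check: a trajectory is flagged when its public score rises while the hidden true score stays flat or falls. The function names and thresholds here are assumptions, not the paper's implementation.

```python
from typing import Sequence

def flags_shortcut(public: Sequence[float], hidden: Sequence[float],
                   min_public_gain: float = 0.05,
                   max_hidden_gain: float = 0.0) -> bool:
    """True when the public score rises while the hidden true score stays
    flat or drops -- the signature of optimizing the proxy, not the task.
    Thresholds are illustrative, not the paper's."""
    return (public[-1] - public[0] >= min_public_gain
            and hidden[-1] - hidden[0] <= max_hidden_gain)

def first_cheating_round(public: Sequence[float], hidden: Sequence[float],
                         window: int = 3) -> int | None:
    """Earliest round at which the public/hidden gap starts widening."""
    for t in range(window, len(public)):
        if flags_shortcut(public[t - window:t + 1], hidden[t - window:t + 1]):
            return t
    return None
```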
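For the LLaDA2.0-Uni item, a toy sketch of the discrete-diffusion decoding loop that separates this line from autoregressive models: generation starts fully masked and each step commits only the most confident predictions. The `model` interface, `MASK_ID`, and unmasking schedule are placeholders; the actual sampler and MoE routing are not reproduced here.

```python
import torch

MASK_ID = 0  # assumed mask-token id

@torch.no_grad()
def diffusion_decode(model, length: int, steps: int = 8) -> torch.Tensor:
    seq = torch.full((1, length), MASK_ID)           # start fully masked
    for s in range(steps):
        masked = seq.eq(MASK_ID)
        if not masked.any():                         # everything committed early
            break
        logits = model(seq)                          # (1, length, vocab)
        conf, pred = logits.softmax(-1).max(-1)      # per-position confidence
        conf = conf.masked_fill(~masked, -1.0)       # compete only among masked slots
        k = max(1, int(masked.sum().item() / (steps - s)))  # unmask a growing share
        keep = conf.topk(k, dim=-1).indices          # most confident positions win
        seq[0, keep[0]] = pred[0, keep[0]]
    return seq
```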
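For the NPO item, a conceptual sketch of the core move: mix rollouts from the current policy with near-on-policy trajectories from a later checkpoint of the same run. Every helper here (`rollout`, `grpo_update`) is a hypothetical callable, not the paper's API.

```python
from typing import Any, Callable, List

def npo_style_step(
    policy: Any,
    future_self: Any,                          # later checkpoint of the same run
    prompts: List[str],
    rollout: Callable[[Any, str], Any],        # hypothetical: sample one trajectory
    grpo_update: Callable[[Any, list], None],  # hypothetical: one GRPO update
    mix_ratio: float = 0.5,
) -> None:
    """Blend on-policy rollouts with those from a stronger future checkpoint;
    group-relative advantages are then computed over the mixed batch."""
    on_policy = [rollout(policy, p) for p in prompts]
    off_policy = [rollout(future_self, p) for p in prompts]
    n_off = int(len(off_policy) * mix_ratio)
    grpo_update(policy, on_policy + off_policy[:n_off])
```

The appeal of the design: the future self's trajectories are off-policy but only barely, so they import strength without the distribution shift an external teacher would bring.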
Also Notable
- Image Generators Develop Strong Visual Understanding — Empirical support for unified architectures like today's LLaDA2.0-Uni. Worth reading alongside the unified track.
- Continual PEFT for Multilingual Settings — Targets the negative cross-lingual interference from naive multilingual fine-tuning. Useful for teams shipping multilingual deployments.
- LLMs Lock into Early Assumptions in Non-Interactive Reasoning — Tests explicit cognitive-awareness calibration before the model acts.
- Interpretable Visual Instruction-Tuning Data Audit — Useful for in-house VLM teams in the data quality stage.
- RL for Sample Selection in Few-Shot Fine-Tuning — Beats active learning baselines in low-resource, class-imbalanced clinical settings.
- Fine-Grained Multimodal Product Retrieval for E-Commerce — Adds attribute-level semantics on top of VLM2Vec. Relevant for e-commerce search and identical product retrieval.
- Lightweight Mamba for Skin Lesion Segmentation — Cross-gated adaptive feature fusion handles fine boundaries.
- Triplet Annotation Noise in Composed Image Retrieval — A cone-based noise-unlearning composition network handles it.
- Multi-Agent with Memory for Tabular Feature Generation — Adds an LLM collaboration layer to traditional tabular ML pipelines.
- LLM Text Regression Predicts Full Conditional Distributions via Quantile Tokens — No more point estimates. Fits scenarios needing uncertainty quantification (sketch below).
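A sketch of how quantile-token outputs could be decoded into a usable conditional distribution, assuming the model emits one numeric value per quantile level; the quantile grid and token format are assumptions, not the paper's.

```python
import numpy as np

QUANTILES = [0.1, 0.25, 0.5, 0.75, 0.9]  # assumed quantile grid

def decode_quantiles(token_values: list[float]) -> dict[float, float]:
    """Map emitted quantile tokens back to numeric quantiles, enforcing
    monotonicity so the result is a valid conditional distribution."""
    vals = np.maximum.accumulate(np.asarray(token_values, dtype=float))
    return dict(zip(QUANTILES, vals))

# Example: the caller gets an uncertainty band, not a point estimate.
dist = decode_quantiles([3.1, 3.0, 4.2, 5.0, 6.8])  # q25 lifted up to q10's 3.1
print(dist[0.5], dist[0.9] - dist[0.1])             # median and 80% interval width
```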