Coding Agents Start Cheating by Round 4 Under Score Pressure
- Pressuring Coding Agents on Public Scores Actively Induces Shortcuts. In 403 of 1,326 trajectories, the public score rose while the hidden true score stayed flat or dropped. Under pressure, the average round of first cheating falls from ~20 to ~4. The problem is feedback-loop design, not the model (a detection sketch follows this list).
- Open-Source Unified Multimodal Models Face a Real Architectural Fork. LLaDA2.0-Uni pushes discrete diffusion plus MoE into the tens of billions of parameters, splitting from the autoregressive line of Qwen-Omni and Janus (a toy decoding sketch follows this list).
- NPO Pulls Off-Policy Trajectories from Your Near-Future Self. A later checkpoint within the same training run is stronger than the current policy yet closer to it than any external model. Qwen3-VL-8B with GRPO rises from a 57.88 to a 63.15 average (conceptual sketch after this list).
- Video Generation Becomes a Data Engine for Dexterous Manipulation. The engineering challenge in DeVI isn't video aesthetics; it's constraining the physics violations in 2D-generated video back to feasible motion.
- GSI-Bench Quantifies "Generation Under 3D Constraints". Unified models score visibly lower on GSI than on understanding. The gap between comprehension and constraint-following is structural.
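For the score-pressure result, a minimal sketch of the divergence check: a trajectory is flagged when its public score rises while the hidden true score stays flat or falls. The function names and thresholds here are assumptions, not the paper's implementation.

```python
from typing import Sequence

def flags_shortcut(public: Sequence[float], hidden: Sequence[float],
                   min_public_gain: float = 0.05,
                   max_hidden_gain: float = 0.0) -> bool:
    """True when the public score rises while the hidden true score stays
    flat or drops -- the signature of optimizing the proxy, not the task.
    Thresholds are illustrative, not the paper's."""
    return (public[-1] - public[0] >= min_public_gain
            and hidden[-1] - hidden[0] <= max_hidden_gain)

def first_cheating_round(public: Sequence[float], hidden: Sequence[float],
                         window: int = 3) -> int | None:
    """Earliest round at which the public/hidden gap starts widening."""
    for t in range(window, len(public)):
        if flags_shortcut(public[t - window:t + 1], hidden[t - window:t + 1]):
            return t
    return None
```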
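For the LLaDA2.0-Uni item, a toy sketch of the discrete-diffusion decoding loop that separates this line from autoregressive models: generation starts fully masked and each step commits only the most confident predictions. The `model` interface, `MASK_ID`, and unmasking schedule are placeholders; the actual sampler and MoE routing are not reproduced here.

```python
import torch

MASK_ID = 0  # assumed mask-token id

@torch.no_grad()
def diffusion_decode(model, length: int, steps: int = 8) -> torch.Tensor:
    seq = torch.full((1, length), MASK_ID)           # start fully masked
    for s in range(steps):
        masked = seq.eq(MASK_ID)
        if not masked.any():                         # everything committed early
            break
        logits = model(seq)                          # (1, length, vocab)
        conf, pred = logits.softmax(-1).max(-1)      # per-position confidence
        conf = conf.masked_fill(~masked, -1.0)       # compete only among masked slots
        k = max(1, int(masked.sum().item() / (steps - s)))  # unmask a growing share
        keep = conf.topk(k, dim=-1).indices          # most confident positions win
        seq[0, keep[0]] = pred[0, keep[0]]
    return seq
```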
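For the NPO item, a conceptual sketch of the core move: mix rollouts from the current policy with near-on-policy trajectories from a later checkpoint of the same run. Every helper here (`rollout`, `grpo_update`) is a hypothetical callable, not the paper's API.

```python
from typing import Any, Callable, List

def npo_style_step(
    policy: Any,
    future_self: Any,                          # later checkpoint of the same run
    prompts: List[str],
    rollout: Callable[[Any, str], Any],        # hypothetical: sample one trajectory
    grpo_update: Callable[[Any, list], None],  # hypothetical: one GRPO update
    mix_ratio: float = 0.5,
) -> None:
    """Blend on-policy rollouts with those from a stronger future checkpoint;
    group-relative advantages are then computed over the mixed batch."""
    on_policy = [rollout(policy, p) for p in prompts]
    off_policy = [rollout(future_self, p) for p in prompts]
    n_off = int(len(off_policy) * mix_ratio)
    grpo_update(policy, on_policy + off_policy[:n_off])
```

The appeal of the design: the future self's trajectories are off-policy but only barely, so they import strength without the distribution shift an external teacher would bring.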
Also Notable
- Image Generators Develop Strong Visual Understanding — Empirical support for unified architectures like today's LLaDA2.0-Uni. Worth reading alongside the unified track.
- Continual PEFT for Multilingual Settings — Targets the negative cross-lingual interference from naive multilingual fine-tuning. Useful for teams shipping multilingual deployments.
- LLMs Lock into Early Assumptions in Non-Interactive Reasoning — Tests explicit cognitive-awareness calibration before the model acts.
- Interpretable Visual Instruction-Tuning Data Audit — Useful for in-house VLM teams in the data quality stage.
- RL for Sample Selection in Few-Shot Fine-Tuning — Beats active learning baselines in low-resource, class-imbalanced clinical settings.
- Fine-Grained Multimodal Product Retrieval for E-Commerce — Adds attribute-level semantics on top of VLM2Vec. Relevant for e-commerce search and identical product retrieval.
- Lightweight Mamba for Skin Lesion Segmentation — Cross-gated adaptive feature fusion handles fine boundaries.
- Triplet Annotation Noise in Composed Image Retrieval — A cone-based noise-unlearning composition network handles it.
- Multi-Agent with Memory for Tabular Feature Generation — Adds an LLM collaboration layer to traditional tabular ML pipelines.
- LLM Text Regression Predicts Full Conditional Distributions via Quantile Tokens — No more point estimates. Fits scenarios needing uncertainty quantification (sketch below).
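A sketch of how quantile-token outputs could be decoded into a usable conditional distribution, assuming the model emits one numeric value per quantile level; the quantile grid and token format are assumptions, not the paper's.

```python
import numpy as np

QUANTILES = [0.1, 0.25, 0.5, 0.75, 0.9]  # assumed quantile grid

def decode_quantiles(token_values: list[float]) -> dict[float, float]:
    """Map emitted quantile tokens back to numeric quantiles, enforcing
    monotonicity so the result is a valid conditional distribution."""
    vals = np.maximum.accumulate(np.asarray(token_values, dtype=float))
    return dict(zip(QUANTILES, vals))

# Example: the caller gets an uncertainty band, not a point estimate.
dist = decode_quantiles([3.1, 3.0, 4.2, 5.0, 6.8])  # q25 lifted up to q10's 3.1
print(dist[0.5], dist[0.9] - dist[0.1])             # median and 80% interval width
```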