The Guardrail Weekly Digest: 2026-03-30 - 2026-04-05
This week we reviewed 951 papers and selected the top 10 for their significance to AI safety research.
Top Papers This Week
1. Reward Hacking as Equilibrium under Finite Evaluation
Jiacheng Wang, Jinbin Huang
Why it matters: This paper gives a rigorous theoretical framework showing that reward hacking is an inherent equilibrium of AI systems rather than a bug, and it predicts how severe the hacking will be, which informs both understanding and mitigating AI risk.
Establishes reward hacking as a structural equilibrium using a multi-task principal-agent framework. It derives a computable distortion index to predict hacking and formalizes the "treacherous turn" as a
Score: 9.0/10 | Significance: 10.0 | Novelty: 9.0 | Quality: 9.0
2. Empirical Validation of the Classification-Verification Dichotomy for AI Safety Gates
Arsenios Scrivens
Why it matters: This paper empirically demonstrates the fundamental limitations of classifier-based safety gates for self-improving AI systems and proposes a more robust verification-based alternative, offering a crucial step towards safer AI development.
Proves classifier-based safety gates fail during self-improvement due to structural impossibility. Introduces Lipschitz ball verifiers and chaining to achieve provable zero-violation parameter traversal up to LLM scale,
Score: 9.0/10 | Significance: 10.0 | Novelty: 9.0 | Quality: 9.0
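To make the idea of ball-based verification with chaining more concrete, here is a minimal sketch of a generic Lipschitz-ball argument. It is not the paper's construction, which this digest does not spell out. It assumes a safety margin function that is positive exactly on safe parameters and is L-Lipschitz with a known constant; the names `certified_radius`, `chain_verify`, and `toy_margin` are hypothetical. If the margin at a checked point is m, every point closer than m / L is also safe, so a path can be covered by a chain of such balls.

```python
import math

def certified_radius(margin: float, lipschitz: float) -> float:
    """Radius of the ball around a checked point where the margin stays positive."""
    if margin <= 0 or lipschitz <= 0:
        return 0.0
    return margin / lipschitz

def chain_verify(start, target, margin_fn, lipschitz, max_steps=1000):
    """Walk from start towards target, only ever stepping inside a certified ball.

    Returns the list of verified centers ending at target, or None if the
    target could not be certified within max_steps.
    """
    point = [float(x) for x in start]
    target = [float(x) for x in target]
    centers = [tuple(point)]
    for _ in range(max_steps):
        radius = certified_radius(margin_fn(point), lipschitz)
        if radius == 0.0:
            return None  # current center is not verifiably safe
        delta = [t - p for p, t in zip(point, target)]
        dist = math.sqrt(sum(d * d for d in delta))
        if dist < radius:
            # Target lies strictly inside the certified ball, so it is safe.
            centers.append(tuple(target))
            return centers
        # Move half a radius towards the target; the new center stays in the ball.
        scale = 0.5 * radius / dist
        point = [p + scale * d for p, d in zip(point, delta)]
        centers.append(tuple(point))
    return None

def toy_margin(p):
    """Hypothetical 1-Lipschitz margin: safe region is the disc of radius 2."""
    return 2.0 - math.sqrt(sum(x * x for x in p))

safe_chain = chain_verify((-1.9, 0.0), (1.9, 0.0), toy_margin, lipschitz=1.0)
unsafe_chain = chain_verify((-1.9, 0.0), (3.0, 0.0), toy_margin, lipschitz=1.0)
```

In this toy setup `safe_chain` is a list of centers that ends at the target, while `unsafe_chain` is `None` because the target lies outside the safe disc. The guarantee is only as good as the assumed Lipschitz constant: an underestimate would make the balls too large and the certificate unsound.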
3. Type-Checked Compliance: Deterministic Guardrails for Agentic Financial Systems Using Lean 4 Theorem Proving
Devakh Rashie, Veda Rashi
Why it matters: This paper introduces a novel, formal-verification-based AI guardrail platform using Lean 4 to ensure deterministic compliance in agentic financial systems, addressing a critical need for mathematically verifiable safety guarantees.
The Lean-Agent Protocol replaces probabilistic guardrails with Lean 4 formal verification, treating agent actions as conjectures that must be proven against regulatory axioms. This ensures deterministic, mathematically guaranteed compliance
Score: 8.0/10 | Significance: 10.0 | Novelty: 9.0 | Quality: 8.0
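As a rough illustration of what "actions as conjectures proven against regulatory axioms" can look like, here is a toy Lean 4 sketch. It is not the Lean-Agent Protocol itself; the `Transfer` structure, the `limit` value, and the `Compliant` rule are hypothetical stand-ins for a real regulatory rule set. The point is only that an action is admitted when its compliance statement type-checks, and rejected when no proof exists.

```lean
-- Hypothetical toy model; not taken from the paper.
structure Transfer where
  amount   : Nat
  approved : Bool

-- Hypothetical rule: transfers above the limit require explicit approval.
def limit : Nat := 10000

def Compliant (t : Transfer) : Prop :=
  t.amount ≤ limit ∨ t.approved = true

-- An action proposed by the agent.
def proposed : Transfer := { amount := 2500, approved := false }

-- The action is accepted only if this proof type-checks.
theorem proposed_ok : Compliant proposed := by
  unfold Compliant
  decide
```

Changing `amount` to a value above `limit` while leaving `approved := false` makes the statement false, so `decide` fails and the action is rejected at check time rather than filtered probabilistically.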
4. The Persistent Vulnerability of Aligned AI Systems
Aengus Lynch
Why it matters: This thesis presents significant advances in AI safety across interpretability, robustness, and agent alignment, demonstrating persistent vulnerabilities in state-of-the-art models.
ACDC automates circuit discovery; Latent Adversarial Training (LAT) removes sleeper agents via residual stream perturbations. Best-of-N jailbreaking reveals power-law scaling in robustness,
Score: 9.0/10 | Significance: 9.0 | Novelty: 9.0 | Quality: 8.0
5. ASI-Evolve: AI Accelerates AI
Weixian Xu, Tiantian Mi, Yixiu Liu...
Why it matters: ASI-Evolve demonstrates a significant step towards AI-driven self-improvement, showcasing AI's potential to accelerate its own development across various critical areas.
ASI-Evolve automates the AI research loop—data, architecture, and RL algorithms—via a cognition base and analyzer. This demonstrates the feasibility of recursive self-improvement, a key safety concern where
Score: 9.0/10 | Significance: 9.0 | Novelty: 9.0 | Quality: 8.0
Honorable Mentions
- Crossing the NL/PL Divide: Information Flow Analysis Across the NL/PL Boundary in LLM-Integrated Code - This paper introduces a novel information flow analysis technique to bridge the gap between natural language prompts and programming languages in LLM-integrated code, enabling more robust security and analysis.
- Coherent Without Grounding, Grounded Without Success: Observability and Epistemic Failure - This paper identifies a critical failure mode in LLMs where coherent explanations mask a disconnect between understanding and action, challenging current AI evaluation practices.
- Thinking Wrong in Silence: Backdoor Attacks on Continuous Latent Reasoning - This paper unveils a novel and potent backdoor attack targeting continuous latent reasoning in language models, highlighting a critical vulnerability in these increasingly popular architectures.
- When Safe Models Merge into Danger: Exploiting Latent Vulnerabilities in LLM Fusion - This paper uncovers a critical vulnerability in LLM merging, demonstrating how seemingly safe models can be combined to create dangerous ones through a novel Trojan attack.
- SOLE-R1: Video-Language Reasoning as the Sole Reward for On-Robot Reinforcement Learning - SOLE-R1 offers a promising approach to aligning robot behavior with human intentions by using video-language models as the sole reward signal, significantly improving robustness and reducing reward hacking.
The Guardrail: Curated AI Safety Research from arXiv
