The Guardrail - AI Safety Weekly Research Digest

6 April 2026

The Guardrail Weekly Digest: 2026-03-30 - 2026-04-05

Week of 2026-03-30 to 2026-04-05

This week we reviewed 951 papers and selected the top 10 for their significance to AI safety research.


Top Papers This Week

1. Reward Hacking as Equilibrium under Finite Evaluation

Jiacheng Wang, Jinbin Huang

Why it matters: This paper provides a rigorous theoretical framework demonstrating that reward hacking is an inherent equilibrium in AI systems, not just a bug, and predicts its severity, with implications for understanding and mitigating AI risks.

Establishes reward hacking as a structural equilibrium using a multi-task principal-agent framework. It derives a computable distortion index to predict hacking and formalizes the "treacherous turn" as a...

Score: 9.0/10 | Significance: 10.0 | Novelty: 9.0 | Quality: 9.0


2. Empirical Validation of the Classification-Verification Dichotomy for AI Safety Gates

Arsenios Scrivens

Why it matters: This paper empirically demonstrates the fundamental limitations of classifier-based safety gates for self-improving AI systems and proposes a more robust verification-based alternative, offering a crucial step towards safer AI development.

Proves classifier-based safety gates fail during self-improvement due to structural impossibility. Introduces Lipschitz ball verifiers and chaining to achieve provable zero-violation parameter traversal up to LLM scale...

Score: 9.0/10 | Significance: 10.0 | Novelty: 9.0 | Quality: 9.0
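
To make the idea of Lipschitz ball verification and chaining more concrete, here is a minimal sketch. It is not the paper's construction: it assumes a safety margin function that is positive on safe parameters and L-Lipschitz, so that a point with margin m certifies every point within distance m / L of it. The names `certified_radius`, `chain_verify`, and `toy_margin` are hypothetical.

```python
import math

def certified_radius(margin: float, lipschitz: float) -> float:
    """Radius of the ball in which an L-Lipschitz margin is guaranteed to stay positive."""
    if lipschitz <= 0:
        raise ValueError("lipschitz must be positive")
    return max(margin, 0.0) / lipschitz

def chain_verify(path, margin_fn, lipschitz) -> bool:
    """Check that each step of `path` lands strictly inside the ball certified at the previous point."""
    if not path:
        raise ValueError("path must contain at least one point")
    if margin_fn(path[0]) <= 0:
        return False  # the starting point itself is not safe
    for current, nxt in zip(path, path[1:]):
        radius = certified_radius(margin_fn(current), lipschitz)
        # If dist < margin / L, then margin(nxt) >= margin(current) - L * dist > 0.
        if math.dist(current, nxt) >= radius:
            return False
    return True

def toy_margin(theta):
    """Toy assumption: parameters are safe inside the unit disk; this margin is 1-Lipschitz."""
    return 1.0 - math.hypot(*theta)

safe_path = [(0.0, 0.0), (0.4, 0.0), (0.7, 0.0)]
risky_path = [(0.0, 0.0), (0.9, 0.0), (1.2, 0.0)]
print(chain_verify(safe_path, toy_margin, lipschitz=1.0))
print(chain_verify(risky_path, toy_margin, lipschitz=1.0))
```

In this sketch the guarantee comes from the Lipschitz bound rather than from classifying each new point, so a step is rejected whenever it leaves the region already certified, even if the destination might have been safe.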


3. Type-Checked Compliance: Deterministic Guardrails for Agentic Financial Systems Using Lean 4 Theorem Proving

Devakh Rashie, Veda Rashi

Why it matters: This paper introduces a novel, formal-verification-based AI guardrail platform using Lean 4 to ensure deterministic compliance in agentic financial systems, addressing a critical need for mathematically verifiable safety guarantees.

The Lean-Agent Protocol replaces probabilistic guardrails with Lean 4 formal verification, treating agent actions as conjectures that must be proven against regulatory axioms. This ensures deterministic, mathematically guaranteed compliance...

Score: 8.0/10 | Significance: 10.0 | Novelty: 9.0 | Quality: 8.0
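
As a rough illustration of treating an agent action as a conjecture to be proven, the Lean 4 sketch below uses a toy model. It is not the Lean-Agent Protocol itself: `Transfer`, `compliant`, the KYC flag, and the 10000 cap are all hypothetical stand-ins for a real regulatory rule.

```lean
-- Toy model only: every name and threshold here is an assumption for illustration.
structure Transfer where
  amount : Nat
  kycVerified : Bool

-- A stand-in "regulatory axiom", encoded as a decidable Boolean check.
def compliant (t : Transfer) : Bool :=
  t.kycVerified && t.amount ≤ 10000

-- The action the agent proposes.
def proposed : Transfer := { amount := 500, kycVerified := true }

-- The conjecture: the proposed action is compliant. It only type-checks if a proof exists.
theorem proposed_ok : compliant proposed = true := by
  decide
```

Under these assumptions, an action that breaks the rule (for example, an amount of 20000) has no proof of `compliant ... = true`, so the corresponding theorem would fail to type-check and the action would be rejected deterministically.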


4. The Persistent Vulnerability of Aligned AI Systems

Aengus Lynch

Why it matters: This thesis presents significant advances in AI safety across interpretability, robustness, and agent alignment, demonstrating persistent vulnerabilities in state-of-the-art models.

ACDC automates circuit discovery; Latent Adversarial Training (LAT) removes sleeper agents via residual stream perturbations. Best-of-N jailbreaking reveals power-law scaling in robustness...

Score: 9.0/10 | Significance: 9.0 | Novelty: 9.0 | Quality: 8.0


5. ASI-Evolve: AI Accelerates AI

Weixian Xu, Tiantian Mi, Yixiu Liu...

Why it matters: ASI-Evolve demonstrates a significant step towards AI-driven self-improvement, showcasing AI's potential to accelerate its own development across various critical areas.

ASI-Evolve automates the AI research loop (data, architecture, and RL algorithms) via a cognition base and analyzer. This demonstrates the feasibility of recursive self-improvement, a key safety concern where...

Score: 9.0/10 | Significance: 9.0 | Novelty: 9.0 | Quality: 8.0


Honorable Mentions

  • Crossing the NL/PL Divide: Information Flow Analysis Across the NL/PL Boundary in LLM-Integrated Code - This paper introduces a novel information flow analysis technique to bridge the gap between natural language prompts and programming languages in LLM-integrated code, enabling more robust security and analysis.
  • Coherent Without Grounding, Grounded Without Success: Observability and Epistemic Failure - This paper identifies a critical failure mode in LLMs where coherent explanations mask a disconnect between understanding and action, challenging current AI evaluation practices.
  • Thinking Wrong in Silence: Backdoor Attacks on Continuous Latent Reasoning - This paper unveils a novel and potent backdoor attack targeting continuous latent reasoning in language models, highlighting a critical vulnerability in these increasingly popular architectures.
  • When Safe Models Merge into Danger: Exploiting Latent Vulnerabilities in LLM Fusion - This paper uncovers a critical vulnerability in LLM merging, demonstrating how seemingly safe models can be combined to create dangerous ones through a novel Trojan attack.
  • SOLE-R1: Video-Language Reasoning as the Sole Reward for On-Robot Reinforcement Learning - SOLE-R1 offers a promising approach to aligning robot behavior with human intentions by using video-language models as the sole reward signal, significantly improving robustness and reducing reward hacking.


The Guardrail: Curated AI Safety Research from arXiv