The Guardrail - AI Safety Weekly Research Digest

30 March 2026

The Guardrail Weekly Digest: 2026-03-23 - 2026-03-29


This week we reviewed 856 papers and selected the top 10 for their significance to AI safety research.


Top Papers This Week

1. Mirage: The Illusion of Visual Understanding

Mohammad Asadi, Jack W. O'Sullivan, Fang Cao...

Why it matters: This paper reveals a critical flaw in multimodal AI systems: they can hallucinate visual information and achieve high benchmark scores without actually 'seeing,' highlighting the urgent need for more robust evaluation methods.

Identifies "mirage reasoning," where VLMs hallucinate visual evidence for text-only inferences, achieving SOTA on multimodal benchmarks without image input. This failure in visual grounding risks catastrophic miscalibration.
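The shortcut this paper describes can be probed with a simple blind baseline: run the same benchmark with the image withheld and compare accuracy. The sketch below is a generic illustration, not the paper's code; the `model` callable and example format are hypothetical stand-ins.

```python
# Hedged sketch (hypothetical, not from the paper): a minimal "blind
# baseline" check that flags mirage-style shortcuts by comparing a VLM's
# accuracy with and without the image.

def blind_gap(model, examples):
    """Return (full_acc, blind_acc): accuracy with and without images.

    `model(question, image=...)` and the example dicts are stand-in
    interfaces, assumed for illustration only.
    """
    full = blind = 0
    for ex in examples:
        if model(ex["question"], image=ex["image"]) == ex["answer"]:
            full += 1
        if model(ex["question"], image=None) == ex["answer"]:
            blind += 1
    n = len(examples)
    return full / n, blind / n

# A small gap (blind_acc close to full_acc) suggests the benchmark is
# answerable from text priors alone, i.e. visual grounding is not tested.
```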

Score: 9.0/10 | Significance: 10.0 | Novelty: 9.0 | Quality: 8.0

Read Paper | PDF


2. Structural Sensitivity in Compressed Transformers: Error Propagation, Lyapunov Stability, and Formally Verified Bounds

Abhinaba Basu

Why it matters: This paper identifies a consistent and surprising hierarchy of sensitivity to compression across transformer architectures, providing valuable insights for safer and more efficient model deployment.

Maps a 5-order-of-magnitude sensitivity hierarchy in transformers, identifying early MLP up-projections as critical failure points. Uses Lyapunov stability and Lean 4 formal verification to bound error propagation.
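A generic way to probe per-layer compression sensitivity, in the spirit of (but not taken from) this paper, is to quantize one layer at a time and measure output drift. The toy linear stack below is a hypothetical stand-in for real transformer modules:

```python
# Hedged sketch (hypothetical): per-layer sensitivity probing via
# one-layer-at-a-time quantization on a toy stack of linear layers.
import random

def quantize(w, bits=4):
    """Uniformly quantize a list of weights to 2**bits levels."""
    lo, hi = min(w), max(w)
    step = (hi - lo) / (2 ** bits - 1) or 1.0
    return [lo + round((x - lo) / step) * step for x in w]

def forward(layers, x):
    """Apply each weight matrix (list of rows) to the vector x in turn."""
    for w in layers:
        x = [sum(wi * xi for wi, xi in zip(row, x)) for row in w]
    return x

def layer_sensitivity(layers, x):
    """L1 output drift caused by quantizing each layer in isolation."""
    base = forward(layers, x)
    scores = []
    for i in range(len(layers)):
        perturbed = list(layers)
        perturbed[i] = [quantize(row) for row in layers[i]]
        out = forward(perturbed, x)
        scores.append(sum(abs(a - b) for a, b in zip(base, out)))
    return scores  # higher score = more compression-sensitive layer
```

Ranking these scores across layers is the kind of measurement from which a sensitivity hierarchy like the paper's could be read off.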

Score: 9.0/10 | Significance: 9.0 | Novelty: 9.0 | Quality: 10.0

Read Paper | PDF


3. BeliefShift: Benchmarking Temporal Belief Consistency and Opinion Drift in LLM Agents

Praveen Kumar Myakala, Manan Agrawal, Rahul Manche

Why it matters: BeliefShift introduces a crucial benchmark for evaluating and understanding how LLM agents' beliefs evolve over time, addressing a critical gap in current AI safety evaluations.

BeliefShift benchmarks longitudinal belief dynamics in LLM agents, quantifying risks like over-alignment and opinion drift. It introduces metrics like BRA and ESI to evaluate how models navigate evolving user beliefs, addressing the safety-critical trade-off between personalization and belief consistency.
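BRA and ESI are the paper's own metrics; as a generic illustration of measuring opinion drift, one can re-ask a fixed probe question after each conversation turn and count stance flips. Everything below is a hypothetical stand-in, not the benchmark's code:

```python
# Hedged sketch (hypothetical, not BeliefShift's metrics): a simple drift
# rate over repeated probes of the same question across a conversation.

def drift_rate(initial_stance, stances_over_time):
    """Fraction of checkpoints where the agent's stance on a fixed probe
    question differs from its initial answer."""
    if not stances_over_time:
        return 0.0
    flips = sum(1 for s in stances_over_time if s != initial_stance)
    return flips / len(stances_over_time)
```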

Score: 9.0/10 | Significance: 9.0 | Novelty: 9.0 | Quality: 8.0

Read Paper | PDF


4. Early Discoveries of Algorithmist I: Promise of Provable Algorithm Synthesis at Scale

Janardhan Kulkarni

Why it matters: This paper introduces Algorithmist, an autonomous agent capable of synthesizing provably correct and empirically effective algorithms, potentially revolutionizing algorithm design and verification.

Algorithmist automates provable algorithm synthesis via a multi-agent loop that aligns code with structured natural-language proofs. This "proof-first" paradigm ensures formal safety guarantees.

Score: 9.0/10 | Significance: 9.0 | Novelty: 9.0 | Quality: 8.0

Read Paper | PDF


5. AgentRFC: Security Design Principles and Conformance Testing for Agent Protocols

Shenghan Zheng, Qifan Zhang

Why it matters: This paper introduces a much-needed security framework for AI agent protocols, addressing critical vulnerabilities in their design and composition that could have significant real-world impact.

AgentRFC formalizes agent protocol security via 11 TLA+ invariants and AgentConform, a tool that model-checks protocol IRs against live SDKs.

Score: 8.4/10 | Significance: 9.0 | Novelty: 9.0 | Quality: 8.0

Read Paper | PDF


Honorable Mentions

  • LLMpedia: A Transparent Framework to Materialize an LLM's Encyclopedic Knowledge at Scale - LLMpedia provides a crucial, transparent, and scalable framework for evaluating and materializing the factual knowledge of LLMs, revealing significant gaps in current benchmark-based assessments.
  • How Far Are Vision-Language Models from Constructing the Real World? A Benchmark for Physical Generative Reasoning - DreamHouse introduces a crucial benchmark for evaluating VLMs' ability to reason about physical construction, exposing critical gaps in current models and highlighting the need for physically-grounded AI.
  • TreeTeaming: Autonomous Red-Teaming of Vision-Language Models via Hierarchical Strategy Exploration - TreeTeaming introduces a novel hierarchical strategy exploration framework for red-teaming vision-language models, significantly improving attack success rates and strategic diversity compared to existing methods.
  • Emergent Formal Verification: How an Autonomous AI Ecosystem Independently Discovered SMT-Based Safety Across Six Domains - This paper demonstrates the surprising emergence of formal verification techniques within an autonomous AI ecosystem, suggesting a fundamental link between system complexity and the independent discovery of safety mechanisms.
  • Epistemic Observability in Language Models - This paper demonstrates a fundamental limitation in text-only supervision of language models, showing that models can be confidently wrong and that this issue is not simply a capability gap, while also providing a practical solution using internal model states.

View all papers on The Guardrail


The Guardrail: Curated AI Safety Research from arXiv
