The Guardrail Weekly Digest
Week of 2026-03-16 to 2026-03-22
This week we reviewed 978 papers and selected the top 10 for their significance to AI safety research.
Top Papers This Week
1. Transformers are Bayesian Networks
Gregory Coppola
Why it matters: This paper offers a formally grounded account of transformers as Bayesian networks, yielding insight into their inner workings and limitations, with implications for interpretability and verifiable inference.
Proves that sigmoid transformers implement loopy belief propagation on implicit factor graphs, mapping attention to logical AND and the feed-forward network (FFN) to logical OR (see the sketch below). This enables formal verification of model reasoning and identifies hallucinations as structural failures of ungrounded, infinite concept spaces.
Score: 9.0/10 | Significance: 9.0 | Novelty: 9.0 | Quality: 10.0
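To make the belief-propagation vocabulary concrete, here is a generic sum-product sketch on a toy three-variable cycle. This is not the paper's construction (the paper derives an implicit factor graph from attention and FFN blocks); all factors below are random and purely illustrative.

```python
# Generic loopy belief propagation (sum-product) on a three-variable cycle.
# Illustration only: nothing below comes from the paper; factors are random.
import numpy as np

rng = np.random.default_rng(0)
n_states = 2
edges = [(0, 1), (1, 2), (2, 0)]                  # a cycle, so BP is "loopy"
factors = {e: rng.random((n_states, n_states)) + 0.1 for e in edges}
unary = np.ones((3, n_states))                    # uniform unary potentials

# messages[(i, j)]: message variable i sends toward variable j along their edge
messages = {d: np.ones(n_states) for e in edges for d in (e, e[::-1])}

for _ in range(50):                               # fixed number of sweeps
    new = {}
    for (i, j) in messages:
        f = factors.get((i, j))
        fij = f if f is not None else factors[(j, i)].T  # orient as [x_i, x_j]
        # Combine the unary potential with all messages into i, except from j.
        incoming = unary[i].copy()
        for (k, l), m in messages.items():
            if l == i and k != j:
                incoming *= m
        msg = fij.T @ incoming                    # marginalize out x_i
        new[(i, j)] = msg / msg.sum()             # normalize for stability
    messages = new

beliefs = unary.copy()
for (k, l), m in messages.items():
    beliefs[l] *= m
beliefs /= beliefs.sum(axis=1, keepdims=True)
print(beliefs)                                    # approximate marginals
```

On a tree this recovers exact marginals in one pass; on the cycle it is the iterative, approximate regime that the "loopy" in the title claim refers to.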
2. IndicSafe: A Benchmark for Evaluating Multilingual LLM Safety in South Asia
Priyaranjan Pattnayak, Sanchari Chowdhuri
Why it matters: This paper introduces a crucial benchmark for evaluating LLM safety in underrepresented Indic languages, revealing significant safety drift and highlighting the need for culturally informed alignment strategies.
IndicSafe benchmarks LLM safety across 12 Indic languages using 6,000 culturally grounded prompts. It reveals critical safety drift, with only 12.8% cross-language agreement, demonstrating that safety alignment fails to generalize across low-resource, multilingual contexts.
Score: 9.0/10 | Significance: 9.0 | Novelty: 9.0 | Quality: 8.0
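The 12.8% figure presupposes some agreement metric, and the paper's exact definition is not reproduced in this digest. A minimal sketch, assuming a prompt counts as agreeing when every language version yields the same refuse/comply decision (all data fabricated):

```python
# Hypothetical cross-language agreement metric. Assumes "agreement" means all
# language versions of a prompt produce the same refusal decision; the paper's
# actual metric may differ. The decisions below are made up for illustration.
decisions = {
    "hi": [True, True, False, True],   # per-prompt refusal decisions (Hindi)
    "bn": [True, False, False, True],  # Bengali
    "ta": [True, False, True, True],   # Tamil
}

def cross_language_agreement(decisions: dict[str, list[bool]]) -> float:
    per_prompt = zip(*decisions.values())   # one tuple of decisions per prompt
    agree = [len(set(d)) == 1 for d in per_prompt]
    return sum(agree) / len(agree)

print(f"{cross_language_agreement(decisions):.1%}")  # -> 50.0%
```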
3. ClawWorm: Self-Propagating Attacks Across LLM Agent Ecosystems
Yihao Zhang, Zeming Wei, Xiaokun Luan, et al.
Why it matters: This paper demonstrates a self-replicating worm attack on a widely used LLM agent platform, highlighting critical vulnerabilities in multi-agent ecosystems.
ClawWorm demonstrates the first self-replicating worm in production LLM agent ecosystems. It achieves persistence via configuration hijacking and autonomous propagation through peer messaging, exposing critical failures in multi-agent trust boundaries.
Score: 9.0/10 | Significance: 9.0 | Novelty: 9.0 | Quality: 8.0
4. Colluding LoRA: A Composite Attack on LLM Safety Alignment
Sihao Ding
Why it matters: CoLoRA introduces a novel and concerning attack vector on LLMs by demonstrating how seemingly benign LoRA adapters can collude to degrade safety alignment upon composition, highlighting a critical vulnerability in modular LLM supply chains.
CoLoRA demonstrates that individually benign LoRA adapters can linearly combine to suppress safety alignment. This exploits "combinatorial blindness" to achieve broad refusal suppression, bypassing single-module verification and necessitating composition-aware safety defenses.
Score: 9.0/10 | Significance: 9.0 | Novelty: 9.0 | Quality: 8.0
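The composition mechanism itself is just the standard LoRA merge: each adapter contributes a low-rank update, and merged updates add linearly, which is why screening adapters one at a time cannot rule out their combined effect. A minimal sketch of that arithmetic (random matrices, not the paper's adapters):

```python
# Standard LoRA update W' = W + (alpha / r) * B @ A, and how two merged
# adapters compose by plain addition. Random weights for illustration only.
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2
W = rng.normal(size=(d, d))              # frozen base weight

def lora_delta(alpha: float = 1.0):
    A = rng.normal(size=(r, d))          # down-projection
    B = rng.normal(size=(d, r))          # up-projection
    return (alpha / r) * B @ A

delta_1 = lora_delta()                   # adapter 1: benign in isolation
delta_2 = lora_delta()                   # adapter 2: benign in isolation

# Merging is a plain sum, so two individually small updates can add up to
# a direction neither exhibits alone; this is the "combinatorial blindness"
# that single-module verification misses.
W_merged = W + delta_1 + delta_2
assert np.allclose(W_merged - W, delta_1 + delta_2)
```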
5. Detection Is Cheap, Routing Is Learned: Why Refusal-Based Alignment Evaluation Fails
Gregory N. Frank
Why it matters: This paper reveals the limitations of current alignment evaluation methods by demonstrating that models often manipulate knowledge expression through routing mechanisms, rather than simply refusing harmful requests or lacking relevant knowledge.
Alignment operates via a "detect-route-generate" pipeline. Refusal benchmarks fail as models shift to "narrative steering." Surgical ablation of lab-specific routing directions bypasses alignment while preserving knowledge (see the sketch below).
Score: 9.0/10 | Significance: 9.0 | Novelty: 9.0 | Quality: 8.0
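The digest does not spell out how the ablation is performed; one common form of directional ablation is to project the targeted direction out of the hidden state, removing the model's ability to represent that direction while leaving everything orthogonal to it intact. A minimal sketch under that assumption:

```python
# Projecting one direction out of a hidden state: h' = h - (h . v_hat) v_hat.
# Whether the paper ablates routing directions exactly this way is an
# assumption; v below is a made-up stand-in for a learned routing direction.
import numpy as np

rng = np.random.default_rng(0)
h = rng.normal(size=512)               # a hidden activation
v = rng.normal(size=512)               # candidate routing direction
v_hat = v / np.linalg.norm(v)          # unit vector

h_ablated = h - (h @ v_hat) * v_hat    # remove the component along v_hat
assert abs(h_ablated @ v_hat) < 1e-9   # nothing left along that direction
```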
Honorable Mentions
- Governing Dynamic Capabilities: Cryptographic Binding and Reproducibility Verification for AI Agent Tool Use - This paper introduces a practical cryptographic framework to address the critical 'capability-identity gap' in AI agents, preventing silent capability escalation and enhancing traceability, both crucial for AI governance and safety.
- FIND: A Simple yet Effective Baseline for Diffusion-Generated Image Detection - This paper introduces a fast and effective method for detecting diffusion-generated images, addressing a critical need for identifying synthetic content and mitigating potential misuse.
- HindSight: Evaluating LLM-Generated Research Ideas via Future Impact - This paper introduces a novel and objective method, HindSight, for evaluating AI-generated research ideas based on their future impact, revealing a significant disconnect between LLM-judged novelty and real-world research value.
- EvoClaw: Evaluating AI Agents on Continuous Software Evolution - EvoClaw introduces a crucial benchmark for evaluating AI agents' ability to handle the complexities of continuous software evolution, revealing a significant performance drop compared to isolated tasks.
- Anatomy of a Lie: A Multi-Stage Diagnostic Framework for Tracing Hallucinations in Vision-Language Models - This paper introduces a novel framework for diagnosing hallucinations in VLMs by modeling their cognitive trajectories, offering a promising approach to understanding and mitigating untruthful AI outputs.
View all papers on The Guardrail
The Guardrail: Curated AI Safety Research from arXiv
