The Guardrail - AI Safety Weekly Research Digest

13 April 2026

The Guardrail Weekly Digest: 2026-04-06 - 2026-04-12

Week of 2026-04-06 to 2026-04-12

This week we reviewed 895 papers and selected the top 10 for their significance to AI safety research.


Top Papers This Week

1. The Defense Trilemma: Why Prompt Injection Defense Wrappers Fail?

Manish Bhatt, Sarthak Munshi, Vineeth Sai Narajala...

Why it matters: This paper rigorously proves fundamental limitations of prompt injection defense wrappers, highlighting the inherent difficulty of creating universally safe input preprocessing methods for LLMs.

Proves a "defense trilemma" where continuity, utility preservation, and completeness are mutually exclusive for wrapper defenses. Using boundary fixation and transversality, it shows a positive-measure

Score: 9.0/10 | Significance: 9.0 | Novelty: 9.0 | Quality: 10.0

Read Paper | PDF


2. TraceSafe: A Systematic Assessment of LLM Guardrails on Multi-Step Tool-Calling Trajectories

Yen-Shan Chen, Sian-Yao Huang, Cheng-Lin Yang...

Why it matters: This paper introduces a benchmark for evaluating guardrails on LLM agents during multi-step tool use, and finds that guardrail effectiveness tracks structural data competence more closely than semantic alignment.

TraceSafe-Bench evaluates guardrails on multi-step tool trajectories, finding that safety efficacy is driven by structural data competence (ρ=0.79) rather than semantic alignment. This identifies structural reasoning as a critical bottleneck for securing agentic workflows.

Score: 9.0/10 | Significance: 9.0 | Novelty: 9.0 | Quality: 8.0

Read Paper | PDF


3. LiveFact: A Dynamic, Time-Aware Benchmark for LLM-Driven Fake News Detection

Cheng Xu, Changhong Jin, Yingjie Niu...

Why it matters: LiveFact introduces a dynamic benchmark for fake news detection that addresses critical limitations of static benchmarks, including benchmark data contamination and the inability to assess temporal reasoning.

LiveFact mitigates benchmark data contamination via a dynamic, time-aware framework for LLM fact-checking. It evaluates epistemic humility and reasoning under temporal uncertainty, ensuring models rely on robust evidence-based

Score: 9.0/10 | Significance: 9.0 | Novelty: 9.0 | Quality: 8.0

Read Paper | PDF


4. Credential Leakage in LLM Agent Skills: A Large-Scale Empirical Study

Zhihao Chen, Ying Zhang, Yi Liu...

Why it matters: This paper uncovers widespread credential leakage vulnerabilities in LLM agent skills, highlighting a critical and previously understudied attack surface.

Maps 10 credential leakage patterns in LLM agent skills, finding 73.5% of leaks stem from stdout exposure. It demonstrates that 76.3% of vulnerabilities

Score: 9.0/10 | Significance: 9.0 | Novelty: 9.0 | Quality: 8.0

Read Paper | PDF
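
To make the stdout exposure pattern concrete, here is a minimal sketch of how a skill can leak a credential through a debug print, and how a wrapper might redact captured output before it reaches the agent's context. This is an illustration only, not the paper's method: the regex patterns, `run_skill_redacted`, and `leaky_skill` are hypothetical names and assumptions, and real credentials take many more shapes than these two patterns cover.

```python
import io
import re
from contextlib import redirect_stdout

# Hypothetical patterns for illustration; not an exhaustive secret detector.
SECRET_PATTERNS = [
    re.compile(r"sk-[A-Za-z0-9]{16,}"),                      # API-key-like token
    re.compile(r"(?i)(password|token|api_key)\s*=\s*\S+"),   # key=value pairs
]


def redact(text: str) -> str:
    """Replace anything matching a secret pattern with a placeholder."""
    for pattern in SECRET_PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    return text


def run_skill_redacted(skill, *args, **kwargs):
    """Run a skill, capture its stdout, and return (result, redacted_output)."""
    buffer = io.StringIO()
    with redirect_stdout(buffer):
        result = skill(*args, **kwargs)
    return result, redact(buffer.getvalue())


def leaky_skill(api_key: str) -> str:
    # A debug print like this is enough to expose the credential on stdout.
    print(f"calling service with api_key={api_key}")
    return "ok"


result, output = run_skill_redacted(leaky_skill, "sk-abcdef1234567890abcd")
print(output)  # the raw key no longer appears in the captured output
```

Pattern-based redaction only catches secrets that look like the patterns, so it is a backstop at best; not printing credentials in the first place remains the more reliable fix.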


5. Poison Once, Exploit Forever: Environment-Injected Memory Poisoning Attacks on Web Agents

Wei Zou, Mingwen Dong, Miguel Romero Calvo...

Why it matters: This paper reveals a novel and alarming vulnerability in web agents where a single poisoned observation can lead to persistent, cross-site attacks, highlighting a critical security gap in emerging AI browsers.

eTAMP enables cross-session, cross-site memory poisoning via environmental observation, bypassing permission-based defenses. It identifies "Frustration Exploitation," where agent stress increases vulnerability, highlighting critical security risks in persistent LLM agent memory.

Score: 9.0/10 | Significance: 9.0 | Novelty: 9.0 | Quality: 8.0

Read Paper | PDF


Honorable Mentions

  • MirageBackdoor: A Stealthy Attack that Induces Think-Well-Answer-Wrong Reasoning - This paper introduces a novel and stealthy backdoor attack on LLMs that bypasses common defenses by manipulating the final answer while preserving correct reasoning, posing a significant threat to AI safety.
  • SkillTrojan: Backdoor Attacks on Skill-Based Agent Systems - This paper introduces a novel and concerning backdoor attack, SkillTrojan, targeting the compositionality of skill-based agent systems, highlighting a critical vulnerability in their architecture.
  • RAGEN-2: Reasoning Collapse in Agentic RL - This paper identifies and addresses a critical failure mode in LLM agents, "template collapse", where reasoning becomes input-agnostic despite high entropy, and introduces a practical solution using mutual information and SNR-aware filtering.
  • How Alignment Routes: Localizing, Scaling, and Controlling Policy Circuits in Language Models - This paper identifies and characterizes a crucial circuit for safety refusal in language models, offering a pathway to control and understand alignment.
  • The Blind Spot of Adaptation: Quantifying and Mitigating Forgetting in Fine-tuned Driving Models - This paper identifies and mitigates catastrophic forgetting in VLMs for autonomous driving, a critical step towards safer and more reliable autonomous systems.

The Guardrail: Curated AI Safety Research from arXiv
