
The Guardrail - AI Safety Weekly Research Digest

21 April 2026

The Guardrail Weekly Digest: 2026-04-13 - 2026-04-19


Week of 2026-04-13 to 2026-04-19

This week we reviewed 1215 papers and selected the top 10 for their significance to AI safety research.


Top Papers This Week

1. Your Agent Is Mine: Measuring Malicious Intermediary Attacks on the LLM Supply Chain

Hanzhi Liu, Chaofan Shou, Hongbo Wen...

Why it matters: This paper uncovers a critical vulnerability in the LLM supply chain where malicious API routers can inject code, exfiltrate secrets, and even drain cryptocurrency, highlighting a significant and previously underappreciated attack surface.

Formalizes malicious intermediary attacks on LLM API routers, identifying payload injection and secret exfiltration. Empirical audits of 428 routers reveal active code injection and credential theft, exposing critical integrity...

Score: 9.0/10 | Significance: 10.0 | Novelty: 9.0 | Quality: 9.0



2. IatroBench: Pre-Registered Evidence of Iatrogenic Harm from AI Safety Measures

David Gringras

Why it matters: This paper demonstrates that AI safety measures can paradoxically cause harm by withholding critical information from vulnerable users, highlighting a crucial failure mode in current alignment strategies.

IatroBench quantifies iatrogenic harm from over-alignment, revealing "identity-contingent withholding" where models withhold life-saving info from laypeople but not professionals. This decoupling...

Score: 9.0/10 | Significance: 10.0 | Novelty: 9.0 | Quality: 8.0



3. $\lambda_A$: A Typed Lambda Calculus for LLM Agent Composition

Qin Liu

Why it matters: This paper introduces a formal, typed lambda calculus for LLM agent composition, enabling rigorous analysis and detection of structural errors in agent configurations, a crucial step towards safer and more reliable AI agent systems.

$\lambda_A$ introduces a Coq-mechanized typed lambda calculus for LLM agents, proving type safety and termination for bounded fixpoints. It enables formal verification of agentic workflows, identifying structural...

Score: 9.0/10 | Significance: 9.0 | Novelty: 9.0 | Quality: 10.0



4. HiL-Bench (Human-in-Loop Benchmark): Do Agents Know When to Ask for Help?

Mohamed Elfeki, Tu Trinh, Kelvin Luu...

Why it matters: This paper introduces HiL-Bench, a benchmark that addresses a gap in current AI evaluation: whether agents can recognize when information is incomplete or ambiguous and appropriately request human assistance.

HiL-Bench introduces Ask-F1 to measure "selective escalation": the ability to detect unresolvable uncertainty and request help. This mitigates silent failures by training agents to recognize epist...

Score: 9.0/10 | Significance: 9.0 | Novelty: 9.0 | Quality: 8.0
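
Editor's sketch: the digest does not define Ask-F1, so the code below assumes one plausible reading, namely the standard F1 score computed over an agent's ask/no-ask decisions. The `Episode` record, its field names, and the `ask_f1` function are hypothetical and are not taken from HiL-Bench; the paper's actual metric may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Episode:
    """One task attempt (hypothetical record format, not HiL-Bench's own)."""
    needed_help: bool  # ground truth: the task could not be resolved without a human
    asked: bool        # agent behaviour: the agent escalated to a human


def ask_f1(episodes: list[Episode]) -> float:
    """Standard F1 over ask decisions, with 'asked' as the positive prediction.

    Assumption: this is only one plausible reading of the metric's name.
    """
    tp = sum(e.needed_help and e.asked for e in episodes)      # correct escalations
    fp = sum(not e.needed_help and e.asked for e in episodes)  # unnecessary asks
    fn = sum(e.needed_help and not e.asked for e in episodes)  # silent failures
    if tp == 0:
        return 0.0  # no correct escalation: avoids division by zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


episodes = [
    Episode(needed_help=True, asked=True),    # correct escalation
    Episode(needed_help=True, asked=False),   # silent failure
    Episode(needed_help=False, asked=True),   # unnecessary interruption
    Episode(needed_help=False, asked=False),  # correctly proceeded alone
]
print(ask_f1(episodes))  # 0.5
```

Under this reading, an agent that always asks gets perfect recall but poor precision, and an agent that never asks scores zero, so the score rewards selective escalation rather than either extreme.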



5. Towards grounded autonomous research: an end-to-end LLM mini research loop on published computational physics

Haonan Huang

Why it matters: This paper demonstrates autonomous LLM agents capable of performing end-to-end scientific research, including reproducing, critiquing, and extending published work, highlighting both the potential and risks of increasingly capable AI agents in research.

Establishes an end-to-end LLM research loop for computational physics that reproduces and critiques literature. Identifying errors in 42% of 111 papers via execution demonstrates a leap in...

Score: 9.0/10 | Significance: 9.0 | Novelty: 9.0 | Quality: 8.0



Honorable Mentions

  • Persona Non Grata: Single-Method Safety Evaluation Is Incomplete for Persona-Imbued LLMs - This paper shows that safety evaluations of persona-imbued LLMs are incomplete if they consider only prompt-based personas, because activation steering exposes distinct vulnerabilities. It argues for safety assessments that combine more than one method.
  • OccuBench: Evaluating AI Agents on Real-World Professional Tasks via Language World Models - OccuBench offers a crucial, scalable benchmark for evaluating AI agents' performance and robustness across a wide range of real-world professional tasks, addressing a significant gap in current AI safety evaluations.
  • KnowU-Bench: Towards Interactive, Proactive, and Personalized Mobile Agent Evaluation - KnowU-Bench introduces a crucial benchmark for evaluating personalized mobile agents, highlighting significant gaps in preference acquisition and intervention calibration that are vital for safe and trustworthy AI assistants.
  • ImplicitMemBench: Measuring Unconscious Behavioral Adaptation in Large Language Models - ImplicitMemBench introduces a benchmark for the unconscious behavioral adaptation of LLMs and reveals significant limitations in their ability to learn and apply procedures automatically, a key aspect of safe and reliable AI agents.
  • Breaking the Generator Barrier: Disentangled Representation for Generalizable AI-Text Detection - This paper presents a novel method for detecting AI-generated text that is robust to unseen generators, addressing a critical challenge in AI safety as LLMs become more sophisticated.




The Guardrail: Curated AI Safety Research from arXiv
