The Guardrail - AI Safety Weekly Research Digest

18 May 2026

The Guardrail Weekly Digest: 2026-05-11 - 2026-05-17

Week of 2026-05-11 to 2026-05-17

This week, we selected 10 papers from the 2591 reviewed, with a strong focus on agent vulnerabilities, robustness, and evaluations. Two papers expose critical security risks in agentic workflows: "Comment and Control" and "Demystifying and Detecting Agentic Workflow Injection Vulnerabilities in GitHub Actions" both demonstrate how adversaries can manipulate LLM agents integrated into real-world automation platforms. On the evaluation front, TRIAGE reveals significant deficiencies in LLMs' prospective metacognitive control under resource constraints, while "No Attack Required" introduces semantic fuzzing to uncover hidden specification violations in deployed agent skills. Finally, DiffusionHijack highlights a novel supply-chain vulnerability in diffusion models via PRNG hijacking and proposes a quantum random number generator as a defense.


Top Papers This Week

1. DiffusionHijack: Supply-Chain PRNG Backdoor Attack on Diffusion Models and Quantum Random Number Defense

Ziyang You, Liling Zheng, Xiaoke Yang...

Why it matters: This paper reveals a critical supply-chain vulnerability in diffusion models by demonstrating a PRNG hijacking attack and proposes a quantum random number generator as a robust defense.

DiffusionHijack shows that an attacker who compromises a diffusion model’s PRNG can inject content deterministically and without detection, all without altering model weights. The proposed defense draws randomness from a quantum random number generator (QRNG), a hardware-level source of information-theoretic entropy.

Score: 9.0/10 | Significance: 10.0 | Novelty: 9.0 | Quality: 8.0
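
To make the PRNG dependence concrete, here is a minimal sketch of why whoever controls the random number generator controls a sampler's starting noise. It is not the paper's attack: `make_rng`, `initial_noise`, and `ATTACKER_SEED` are hypothetical names, and a list of Gaussian draws stands in for a real diffusion model's latent.

```python
import random

ATTACKER_SEED = 999  # hypothetical value fixed by the attacker


def make_rng(user_seed, hijacked=False):
    """Return the PRNG a toy sampler would use.

    A hijacked build silently ignores the user's seed (assumed behavior,
    for illustration only).
    """
    return random.Random(ATTACKER_SEED if hijacked else user_seed)


def initial_noise(rng, n=4):
    """Draw the Gaussian starting noise that a toy sampler would denoise."""
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


honest_a = initial_noise(make_rng(1))
honest_b = initial_noise(make_rng(2))
hijacked_a = initial_noise(make_rng(1, hijacked=True))
hijacked_b = initial_noise(make_rng(2, hijacked=True))

# Honest PRNGs: different user seeds give different starting noise.
print("honest seeds differ:", honest_a != honest_b)
# Hijacked PRNG: every user seed collapses to the attacker's noise.
print("hijacked seeds identical:", hijacked_a == hijacked_b)
```

Nothing in the model weights changes in this sketch, which is why a weight-level audit alone would not notice the swap.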



2. TRIAGE: Evaluating Prospective Metacognitive Control in LLMs under Resource Constraints

Zabir Al Nazi, Shubhashis Roy Dipta

Why it matters: This paper introduces a novel framework, TRIAGE, to evaluate and reveal significant deficiencies in LLMs' prospective metacognitive control, a crucial capability for resource-efficient autonomous agents.

TRIAGE introduces a framework to measure prospective metacognitive control in LLMs, quantifying their ability to allocate compute and sequence tasks under constraints. This is critical for AI safety, as autonomous agents must reliably manage resources to avoid failure.

Score: 9.0/10 | Significance: 9.0 | Novelty: 9.0 | Quality: 8.0



3. No Attack Required: Semantic Fuzzing for Specification Violations in Agent Skills

Ying Li, Hongbo Wen, Yanju Chen...

Why it matters: This paper introduces a novel semantic fuzzing technique to uncover hidden specification violations in agent skills, revealing a significant vulnerability in deployed AI systems.

Sefz introduces semantic fuzzing to detect specification violations where LLM agents breach their own guardrails during benign execution. By mapping guardrails to reachability goals, it identifies critical failures in 30% of tested skills, exposing systemic design flaws.

Score: 9.0/10 | Significance: 9.0 | Novelty: 9.0 | Quality: 8.0



4. Comment and Control: Hijacking Agentic Workflows via Context-Grounded Evolution

Neil Fendley, Zhengyu Liu, Aonan Guan...

Why it matters: This paper identifies and exploits a novel vulnerability in agentic workflows, demonstrating how adversaries can manipulate LLM agents in automation platforms to perform unwanted actions, highlighting a critical security risk in increasingly popular AI-powered systems.

JAW introduces a framework for hijacking agentic workflows by using hybrid program analysis to evolve inputs that bypass LLM security. It demonstrates critical vulnerabilities in automation platforms, enabling credential exfiltration via prompt-injection chains.

Score: 9.0/10 | Significance: 9.0 | Novelty: 9.0 | Quality: 8.0



5. Demystifying and Detecting Agentic Workflow Injection Vulnerabilities in GitHub Actions

Shenao Wang, Xinyi Hou, Zhao Liu...

Why it matters: This paper identifies and detects a novel and critical vulnerability, Agentic Workflow Injection (AWI), in GitHub Actions that arises from the integration of LLM-based agents, demonstrating a practical security risk in real-world AI-assisted workflows.

Agentic Workflow Injection (AWI) is a novel class of vulnerabilities in which untrusted event data propagates into LLM prompts or downstream scripts. TaintAWI systematically detects these flaws, uncovering 496 exploitable vulnerabilities in GitHub Actions.

Score: 9.0/10 | Significance: 9.0 | Novelty: 9.0 | Quality: 8.0
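
As a rough illustration of the pattern described above, the sketch below flags places where a workflow interpolates `github.event` fields inline, including into a prompt. It is a simple regex scan, not the paper's TaintAWI analysis, and it does no real taint tracking. The action name `example/llm-agent-action` and its `prompt` input are hypothetical.

```python
import re

# GitHub Actions expressions that interpolate event fields, which an
# outside contributor can often influence (issue titles, bodies, comments).
UNTRUSTED_EXPR = re.compile(r"\$\{\{\s*(github\.event\.[\w.]+)\s*\}\}")


def find_untrusted_interpolations(workflow_text):
    """Return (line_number, expression) pairs for inline event-data use."""
    findings = []
    for lineno, line in enumerate(workflow_text.splitlines(), start=1):
        for match in UNTRUSTED_EXPR.finditer(line):
            findings.append((lineno, match.group(1)))
    return findings


# Hypothetical workflow used only for this example.
EXAMPLE_WORKFLOW = """\
on: issues
jobs:
  triage:
    runs-on: ubuntu-latest
    steps:
      - uses: example/llm-agent-action@v1
        with:
          prompt: "Summarize this issue: ${{ github.event.issue.title }}"
      - run: echo "${{ github.event.issue.body }}"
"""

findings = find_untrusted_interpolations(EXAMPLE_WORKFLOW)
for lineno, expr in findings:
    print(f"line {lineno}: {expr}")
```

A scan like this only finds direct interpolation. Data that reaches a prompt indirectly, for example through an environment variable or an intermediate step output, would need the kind of flow tracking the paper describes.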



Honorable Mentions

  • Rethinking RL for LLM Reasoning: It's Sparse Policy Selection, Not Capability Learning - This paper challenges the prevailing view of RL as capability learning, demonstrating that it is primarily sparse policy selection and offering a much more efficient alternative.
  • The Last Word Often Wins: A Format Confound in Chain-of-Thought Corruption Studies - This paper identifies and rigorously demonstrates a critical format confound in chain-of-thought corruption studies, undermining a key method for evaluating faithfulness and interpretability.
  • Proteus: A Self-Evolving Red Team for Agent Skill Ecosystems - Proteus introduces a novel self-evolving red-teaming framework to expose vulnerabilities in agent skill ecosystems, highlighting the limitations of current auditing methods against adaptive attackers.
  • When Reasoning Traces Become Performative: Step-Level Evidence that Chain-of-Thought Is an Imperfect Oversight Channel - This paper reveals that chain-of-thought reasoning can be a misleading oversight channel, as models often arrive at an answer before the trace suggests, challenging the assumption that CoT accurately reflects the model's decision-making process.
  • Can a Single Message Paralyze the AI Infrastructure? The Rise of AbO-DDoS Attacks through Targeted Mobius Injection - This paper introduces a novel and stealthy attack, Mobius Injection, that can weaponize AI agents into launching devastating DDoS attacks, highlighting a critical systemic risk in agentic systems.

This digest reviewed 2591 papers and selected the top 10 for their significance to AI safety research.



The Guardrail: Curated AI Safety Research from arXiv
