The Guardrail - AI Safety Weekly Research Digest logo

The Guardrail - AI Safety Weekly Research Digest

Archives
16 February 2026

The Guardrail Weekly Digest: 2026-02-09 - 2026-02-15

Weekly Digest

The Guardrail Weekly Digest

Week of 2026-02-09 to 2026-02-15

From 1,023 papers reviewed this week, we selected 10 addressing critical challenges in securing agentic AI systems and LLM deployments. Notable contributions include cryptographic approaches to deterministic AI security, the discovery of implicit memory as a hidden channel enabling temporal backdoors in LLMs, and the first large-scale empirical study of malicious agent skills in the wild.


Top Papers This Week

1. Authenticated Workflows: A Systems Approach to Protecting Agentic AI

Mohan Rajagopalan, Vinay Rao

Why it matters: Authenticated workflows offer a promising deterministic security layer for agentic AI by enforcing intent and integrity at every boundary, addressing a critical gap in current defenses.

Authenticated workflows provide a deterministic trust layer for agentic AI by enforcing cryptographic integrity and intent at prompt, tool, data, and context boundaries. Using MAPL, it replaces probabilistic guardrails with

Score: 9.0/10 | Significance: 9.0 | Novelty: 9.0 | Quality: 9.0

Read Paper | PDF


2. Stateless Yet Not Forgetful: Implicit Memory as a Hidden Channel in LLMs

Ahmed Salem, Andrew Paverd, Sahar Abdelnabi

Why it matters: This paper uncovers a novel and concerning vulnerability in LLMs: implicit memory, which enables persistent state across interactions without explicit memory modules, leading to potential security risks like time bomb backdoors.

Implicit memory allows LLMs to bypass statelessness by encoding state in outputs for later recovery, enabling "time bomb" backdoors triggered by specific interaction sequences. This creates hidden channels for covert agent coordination

Score: 9.0/10 | Significance: 9.0 | Novelty: 9.0 | Quality: 8.0

Read Paper | PDF


3. Protecting Context and Prompts: Deterministic Security for Non-Deterministic AI

Mohan Rajagopalan, Vinay Rao

Why it matters: This paper introduces a novel cryptographic approach to LLM security, providing verifiable provenance and Byzantine resistance against prompt injection and context manipulation attacks, a critical step towards robust and trustworthy AI systems.

Authenticated prompts and context leverage cryptographic provenance and hash chains to ensure LLM input integrity. A formal policy algebra provides Byzantine resistance, shifting security from reactive detection to provable, preventative guarantees against prompt injection.

Score: 9.0/10 | Significance: 9.0 | Novelty: 9.0 | Quality: 8.0

Read Paper | PDF


4. When the Prompt Becomes Visual: Vision-Centric Jailbreak Attacks for Large Image Editing Models

Jiacheng Hou, Yining Sun, Ruochong Jin...

Why it matters: This paper identifies and addresses a novel and critical vulnerability in vision-centric image editing models, demonstrating successful visual jailbreak attacks and proposing a practical defense.

VJA introduces the first visual-to-visual jailbreak for image editing models, using visual prompts (marks, arrows) to bypass safety filters. It provides IESBench and an introspective

Score: 9.0/10 | Significance: 9.0 | Novelty: 9.0 | Quality: 8.0

Read Paper | PDF


5. Malicious Agent Skills in the Wild: A Large-Scale Security Empirical Study

Yi Liu, Zhihao Chen, Yanjun Zhang...

Why it matters: This paper provides the first large-scale empirical study of malicious agent skills, revealing significant vulnerabilities and attack patterns in emerging LLM-based agent ecosystems.

Establishes the first labeled dataset of 157 malicious LLM agent skills, identifying "Data Thieves" and "Agent Hijackers." It reveals how advanced attacks use undocumented "shadow features"

Score: 9.0/10 | Significance: 9.0 | Novelty: 9.0 | Quality: 8.0

Read Paper | PDF


Honorable Mentions

  • WorldTravel: A Realistic Multimodal Travel-Planning Benchmark with Tightly Coupled Constraints - WorldTravel introduces a challenging new benchmark exposing the limitations of current AI agents in handling real-world planning with tightly coupled constraints and multimodal perception, highlighting critical gaps in perception-action integration and long-horizon reasoning.
  • Vulnerabilities in Partial TEE-Shielded LLM Inference with Precomputed Noise - This paper uncovers critical vulnerabilities in TEE-shielded LLM inference, demonstrating practical attacks that compromise model confidentiality and integrity, highlighting the risks of precomputed noise in deployed LLMs.
  • Towards Autonomous Mathematics Research - This paper demonstrates significant progress towards autonomous AI agents capable of conducting original mathematical research, raising important questions about the future of AI-driven discovery and potential risks.
  • LogicSkills: A Structured Benchmark for Formal Reasoning in Large Language Models - LogicSkills provides a crucial benchmark for dissecting the formal reasoning abilities of LLMs, revealing weaknesses in symbolization and countermodel construction that highlight potential vulnerabilities.
  • The Landscape of Prompt Injection Threats in LLM Agents: From Taxonomy to Analysis - This paper provides a timely and comprehensive analysis of prompt injection attacks in LLM agents, highlighting critical vulnerabilities in context-dependent tasks and introducing a new benchmark to address this gap.

This digest reviewed 1023 papers and selected the top 10 for their significance to AI safety research.

View all papers on The Guardrail


The Guardrail: Curated AI Safety Research from arXiv

Don't miss what's next. Subscribe to The Guardrail - AI Safety Weekly Research Digest:
Twitter
LinkedIn
Powered by Buttondown, the easiest way to start and grow your newsletter.