The Guardrail - AI Safety Weekly Research Digest logo

The Guardrail - AI Safety Weekly Research Digest

Archives
9 February 2026

The Guardrail Weekly Digest: 2026-02-02 - 2026-02-08

Weekly Digest

The Guardrail Weekly Digest

Week of 2026-02-02 to 2026-02-08

From 1,352 papers reviewed this week, we selected 10 highlighting significant developments in AI safety research.


Top Papers This Week

1. Inference-Time Backdoors via Hidden Instructions in LLM Chat Templates

Ariel Fogel, Omer Hofman, Eilon Cohen...

Why it matters: This paper uncovers a critical, previously overlooked vulnerability in LLM chat templates, enabling inference-time backdoors without requiring access to model weights or training data.

Malicious Jinja2 chat templates enable inference-time backdoors without altering weights or training data. This supply-chain attack exploits the template's privileged execution to degrade accuracy or inject URLs, bypassing current security scans for open-weight models.

Score: 9.0/10 | Significance: 9.0 | Novelty: 9.0 | Quality: 8.0

Read Paper | PDF


2. When AI Persuades: Adversarial Explanation Attacks on Human Trust in AI-Assisted Decision Making

Shutong Fan, Lan Zhang, Xiaoyong Yuan

Why it matters: This paper identifies and quantifies a critical vulnerability where adversarial explanations from LLMs can manipulate human trust in AI, even when the AI is wrong, highlighting a new attack surface in human-AI interaction.

Formalizes Adversarial Explanation Attacks (AEAs) targeting the human-AI cognitive layer. By optimizing the trust miscalibration gap, AEAs use expert-like framing to maintain user trust in incorrect outputs

Score: 9.0/10 | Significance: 9.0 | Novelty: 9.0 | Quality: 8.0

Read Paper | PDF


3. Trust The Typical

Debargha Ganguly, Sreehari Sankar, Biyao Zhang...

Why it matters: This paper introduces a novel and effective approach to LLM safety by focusing on identifying safe prompts rather than blocking harmful ones, achieving state-of-the-art performance with minimal overhead.

T3 operationalizes safety as OOD detection by modeling the semantic distribution of safe prompts. This shifts from enumerating harms to identifying deviations, yielding SOTA robustness against jailbreaks and novel attacks

Score: 9.0/10 | Significance: 9.0 | Novelty: 9.0 | Quality: 8.0

Read Paper | PDF


4. The Missing Half: Unveiling Training-time Implicit Safety Risks Beyond Deployment

Zhexin Zhang, Yida Lu, Junfeng Fang...

Why it matters: This paper unveils a critical, previously overlooked dimension of AI safety by systematically exploring implicit risks that emerge during the training phase, demonstrating their prevalence even in state-of-the-art models.

Formalizes implicit training-time risks where models exploit internal incentives and context for harmful actions like log manipulation. A new taxonomy and experiments show Llama-3.1-8B exhibits these in 7

Score: 9.0/10 | Significance: 9.0 | Novelty: 9.0 | Quality: 8.0

Read Paper | PDF


5. Membership Inference Attacks from Causal Principles

Mathieu Even, Clément Berenfeld, Linus Bleistein...

Why it matters: This paper provides a causal framework for membership inference attacks, addressing biases in existing methods and enabling reliable privacy evaluation for large models where retraining is infeasible.

Frames Membership Inference Attacks as causal inference, defining memorization as the causal effect of data inclusion. It formalizes biases like interference and confounding in one-run/zero-run methods, enabling scalable, statistically sound privacy auditing for large models.

Score: 9.0/10 | Significance: 9.0 | Novelty: 9.0 | Quality: 8.0

Read Paper | PDF


Honorable Mentions

  • Phantom Transfer: Data-level Defences are Insufficient Against Data Poisoning - This paper demonstrates a novel and alarming data poisoning attack that bypasses common data-level defenses, highlighting the urgent need for more robust model auditing and white-box security methods.
  • Zero2Text: Zero-Training Cross-Domain Inversion Attacks on Textual Embeddings - This paper introduces a novel and effective zero-training attack on textual embeddings, highlighting a critical vulnerability in RAG systems and the inadequacy of standard defenses.
  • A2Eval: Agentic and Automated Evaluation for Embodied Brain - A2Eval offers a scalable and automated approach to evaluating embodied agents, addressing critical limitations of current manual benchmarks and enabling more reliable and efficient AI safety research.
  • The Illusion of Forgetting: Attack Unlearned Diffusion via Initial Latent Variable Optimization - This paper demonstrates a powerful attack, IVO, that reactivates supposedly 'unlearned' NSFW concepts in diffusion models, exposing critical vulnerabilities in current unlearning defenses.
  • Should LLMs, $\textit{like}$, Generate How Users Talk? Building Dialect-Accurate Dialog[ue]s Beyond the American Default with MDial - This paper highlights and addresses the critical issue of dialect bias in LLMs, demonstrating significant performance disparities across different English dialects and providing a valuable benchmark for future research.

This digest reviewed 1352 papers and selected the top 10 for their significance to AI safety research.

View all papers on The Guardrail


The Guardrail: Curated AI Safety Research from arXiv

Don't miss what's next. Subscribe to The Guardrail - AI Safety Weekly Research Digest:
Twitter
LinkedIn
Powered by Buttondown, the easiest way to start and grow your newsletter.