
The Guardrail - AI Safety Weekly Research Digest

26 January 2026

The Guardrail Weekly Digest: 2026-01-19 - 2026-01-25

Weekly Digest


This week we reviewed 664 papers and selected the top 10 for their significance to AI safety research.


Top Papers This Week

1. The Responsibility Vacuum: Organizational Failure in Scaled Agent Systems

Oleg Romanchuk, Roman Bondar

Why it matters: This paper identifies a critical structural failure in AI governance: the 'responsibility vacuum,' where accountability erodes as AI systems scale, underscoring the dangers of unchecked automation.

Identifies a "responsibility vacuum" where agent throughput exceeds human epistemic capacity, forcing ritualized approval via proxy signals. The authors argue that CI automation amplifies this gap, making individual accountability structurally impossible and necessitating system-level ownership.

Score: 9.0/10 | Significance: 10.0 | Novelty: 9.0 | Quality: 8.0

Read Paper | PDF


2. UbuntuGuard: A Culturally-Grounded Policy Benchmark for Equitable AI Safety in African Languages

Tassallah Abdullahi, Macton Mgonzo, Mardiyyah Oduwole...

Why it matters: UbuntuGuard introduces a crucial, culturally grounded safety benchmark for African languages, exposing the limitations of current Western-centric AI safety models and paving the way for more equitable and reliable AI systems globally.

UbuntuGuard introduces the first African policy-based safety benchmark using expert-authored adversarial queries to mitigate cultural misalignment. It enables evaluation of runtime-enforceable policies, revealing that English-centric benchmarks overestimate…

Score: 9.0/10 | Significance: 10.0 | Novelty: 9.0 | Quality: 8.0

Read Paper | PDF


3. Privacy Collapse: Benign Fine-Tuning Can Break Contextual Privacy in Language Models

Anmol Goel, Cornelius Emde, Sangdoo Yun...

Why it matters: This paper identifies a critical and previously overlooked vulnerability in fine-tuned language models: seemingly benign fine-tuning can trigger a 'privacy collapse' in which models leak sensitive information despite performing well on standard benchmarks.

Benign fine-tuning triggers "privacy collapse," a silent failure where models lose contextual privacy reasoning and violate memory boundaries despite passing safety benchmarks. Mechanistic analysis shows privacy representations are uniquely fragile…

Score: 9.0/10 | Significance: 9.0 | Novelty: 9.0 | Quality: 8.0

Read Paper | PDF


4. VidLeaks: Membership Inference Attacks Against Text-to-Video Models

Li Wang, Wenyu Chen, Ning Yu...

Why it matters: VidLeaks pioneers membership inference attacks against text-to-video models, revealing significant privacy vulnerabilities in these increasingly prevalent systems.

VidLeaks introduces the first membership inference framework for T2V models, using Spatial Reconstruction Fidelity and Temporal Generative Stability to detect training data leakage. It reveals severe black-box privacy risks in…

Score: 9.0/10 | Significance: 9.0 | Novelty: 9.0 | Quality: 8.0

Read Paper | PDF
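The summary above describes a fidelity-based membership test. For readers unfamiliar with the idea, here is a minimal illustrative sketch of a generic threshold-based membership inference attack; the scores, thresholds, and function names below are made up for illustration and are not the paper's actual metrics or implementation.

```python
# Illustrative sketch (not the VidLeaks implementation): a generic
# threshold-based membership inference test. The attacker scores how
# faithfully a model reproduces a candidate sample; training-set
# members tend to score higher than non-members.

def reconstruction_score(generated: list[float], candidate: list[float]) -> float:
    """Toy fidelity score: negative mean squared error between frame features."""
    assert len(generated) == len(candidate)
    mse = sum((g - c) ** 2 for g, c in zip(generated, candidate)) / len(candidate)
    return -mse

def infer_membership(score: float, threshold: float) -> bool:
    """Predict 'member' when fidelity exceeds a calibrated threshold."""
    return score > threshold

# A sample the model reproduces almost exactly (likely a training member)
member_score = reconstruction_score([0.9, 0.8, 0.7], [0.91, 0.79, 0.7])
# A sample it reproduces poorly (likely a non-member)
nonmember_score = reconstruction_score([0.2, 0.9, 0.1], [0.8, 0.1, 0.7])

print(infer_membership(member_score, threshold=-0.01))     # True
print(infer_membership(nonmember_score, threshold=-0.01))  # False
```

In practice the threshold is calibrated on reference data, and real attacks combine several signals (the paper names two) rather than a single score.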


5. Beyond Max Tokens: Stealthy Resource Amplification via Tool Calling Chains in LLM Agents

Kaiyu Zhou, Yongsen Zheng, Yicheng He...

Why it matters: This paper introduces a novel and stealthy economic denial-of-service attack against LLM agents by exploiting tool-calling chains, highlighting a critical vulnerability in agentic systems.

Introduces a stealthy multi-turn DoS attack using MCTS-optimized tool server responses to trigger resource-intensive tool-calling chains. By maintaining correct final outputs, it inflates…

Score: 9.0/10 | Significance: 9.0 | Novelty: 9.0 | Quality: 8.0

Read Paper | PDF
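The stealth of this attack class comes from inflating resource consumption while the final answer stays correct. A minimal illustrative sketch of that measurement (a toy agent loop with invented names, not the paper's method or evaluation harness):

```python
# Illustrative sketch (not the paper's attack): measuring an agent's
# resource amplification -- how many tool calls a task consumes under
# attacker-padded tool responses versus a clean baseline, while the
# final answer is unchanged.

from dataclasses import dataclass

@dataclass
class CostMeter:
    tool_calls: int = 0
    tokens: int = 0

    def record(self, response: str) -> None:
        self.tool_calls += 1
        self.tokens += len(response.split())  # crude whitespace token count

def run_agent(tool_responses: list[str]) -> tuple[str, CostMeter]:
    """Toy agent loop: consumes each tool response, returns a fixed answer."""
    meter = CostMeter()
    for resp in tool_responses:
        meter.record(resp)
    return "42", meter  # the final answer is identical in both runs

benign = ["result: 42"]
# A malicious tool server chains verbose follow-up work before answering.
malicious = ["partial result, call tool B"] * 10 + ["result: 42"]

answer_a, cost_a = run_agent(benign)
answer_b, cost_b = run_agent(malicious)
amplification = cost_b.tool_calls / cost_a.tool_calls
print(answer_a == answer_b, amplification)  # True 11.0
```

Because output correctness is preserved, standard quality checks never flag the run; only cost accounting like the meter above reveals the amplification.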


Honorable Mentions

  • Turn-Based Structural Triggers: Prompt-Free Backdoors in Multi-Turn LLMs - This paper reveals a novel and highly effective backdoor attack on multi-turn LLMs that exploits dialogue structure, highlighting a critical and previously overlooked vulnerability.
  • Zero-Permission Manipulation: Can We Trust Large Multimodal Model Powered GUI Agents? - This paper uncovers a critical vulnerability in multimodal GUI agents: they can be manipulated into performing unintended actions without requiring any special permissions, exposing a significant security risk in their current design.
  • Spurious Rewards Paradox: Mechanistically Understanding How RLVR Activates Memorization Shortcuts in LLMs - This paper uncovers a mechanistic explanation for how spurious rewards in RLVR can lead to memorization shortcuts in LLMs, providing a roadmap for mitigating data contamination.
  • Hierarchical Sparse Circuit Extraction from Billion-Parameter Language Models through Scalable Attribution Graph Decomposition - This paper introduces a scalable method for extracting interpretable circuits from large language models, a crucial step towards understanding and controlling their behavior.
  • Learning from Synthetic Data: Limitations of ERM - This paper highlights the vulnerability of ERM to synthetic data contamination and proposes alternative learning algorithms, addressing a critical robustness issue in the age of pervasive LLM-generated content.


View all papers on The Guardrail


The Guardrail: Curated AI Safety Research from arXiv

Don't miss what's next. Subscribe to The Guardrail - AI Safety Weekly Research Digest: