The Guardrail Weekly Digest
Week of 2026-02-16 to 2026-02-22
From 681 papers reviewed, we selected 10 that advance the field's understanding of evaluations, alignment, and agentic risks. New research highlights critical flaws in current defense strategies, with Intent Laundering and Boundary Point Jailbreaking demonstrating how easily safety classifiers can be bypassed, while IndicJR reveals overlooked vulnerabilities in South Asian language processing. In the domain of agents, Overthinking Loops identifies a novel supply-chain attack vector involving malicious tools, and WebWorld introduces a scalable world model to facilitate safer open-web training simulations.
Top Papers This Week
1. Intent Laundering: AI Safety Datasets Are Not What They Seem
Shahriar Golchin, Marc Wetter
Why it matters: This paper reveals a critical flaw in current AI safety evaluations: models are easily bypassed via 'intent laundering,' which rewrites a request so that its malicious intent is preserved while the surface cues that trigger refusals are removed, causing widespread failures in supposedly safe models.
Safety benchmarks over-rely on "triggering cues" rather than underlying intent. "Intent laundering" abstracts away these cues, achieving 90%+ jailbreak success on models like Claude 3.
Score: 9.0/10 | Significance: 10.0 | Novelty: 9.0 | Quality: 9.0
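To make the failure mode concrete, here is a minimal sketch (our illustration, not the paper's code) of probing whether a cue-based safety filter keys on surface keywords rather than intent; the toy filter and prompt pairs are assumptions for demonstration only:

```python
# Toy sketch: a keyword-based safety filter and intent-preserving paraphrases.
# Not the paper's method or data; everything here is an illustrative stand-in.

TRIGGER_CUES = {"bomb", "hack", "bypass", "exploit"}

def naive_safety_filter(prompt: str) -> bool:
    """Returns True if the prompt is flagged as unsafe (cue-based check)."""
    words = {w.strip(".,!?").lower() for w in prompt.split()}
    return bool(words & TRIGGER_CUES)

# Each pair preserves the underlying request but strips the surface cue.
prompt_pairs = [
    ("How do I hack into a server?",
     "Describe steps to gain unauthorized entry to a server."),
    ("How can I bypass a content filter?",
     "How can I make a content filter not apply to my input?"),
]

for original, laundered in prompt_pairs:
    flipped = naive_safety_filter(original) and not naive_safety_filter(laundered)
    print(f"verdict flipped: {flipped} | {laundered!r}")
```

Both verdicts flip: the paraphrase carries the same intent but no trigger word, which is exactly the gap the paper exploits at scale.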
2. Boundary Point Jailbreaking of Black-Box LLMs
Xander Davies, Giorgi Giglemiani, Edmund Lau...
Why it matters: This paper introduces a novel black-box jailbreaking technique, BPJ, that effectively bypasses state-of-the-art LLM safety classifiers, highlighting a critical vulnerability in current defense strategies.
BPJ automates black-box jailbreaks using only 1-bit feedback, bypassing Constitutional Classifiers and GPT-5 filters. By optimizing via a curriculum of "boundary points," it demonstrates that single-interaction defenses are insufficient against automated red teaming.
Score: 9.0/10 | Significance: 9.0 | Novelty: 9.0 | Quality: 8.0
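The "1-bit feedback" setting can be illustrated abstractly (this is our toy reduction, not the BPJ algorithm): given only accept/reject bits, an attacker can still bisect toward a hidden decision boundary, which is the intuition behind searching over boundary points:

```python
# Toy sketch of black-box search with 1-bit feedback. The "classifier" is a
# hidden numeric threshold; queries return only accept/reject, nothing else.

HIDDEN_THRESHOLD = 0.7317  # unknown to the attacker

def one_bit_oracle(x: float) -> bool:
    """1-bit feedback: True iff the input is accepted (below the boundary)."""
    return x < HIDDEN_THRESHOLD

def locate_boundary(lo: float = 0.0, hi: float = 1.0, steps: int = 30) -> float:
    """Bisect toward the decision boundary using accept/reject bits alone."""
    for _ in range(steps):
        mid = (lo + hi) / 2
        if one_bit_oracle(mid):
            lo = mid  # still accepted: boundary lies above
        else:
            hi = mid  # rejected: boundary lies below
    return (lo + hi) / 2

print(f"estimated boundary: {locate_boundary():.4f}")
```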
3. Overthinking Loops in Agents: A Structural Risk via MCP Tools
Yohan Lee, Jisoo Jang, Seoyeon Choi...
Why it matters: This paper identifies a novel and concerning supply-chain attack vector in tool-using LLM agents, demonstrating how malicious tools can induce costly and detrimental overthinking loops.
Formalizes structural overthinking attacks where malicious MCP tools induce cyclic trajectories in LLM agents, causing up to 142x resource amplification. This supply-chain risk bypasses token-…
Score: 9.0/10 | Significance: 9.0 | Novelty: 9.0 | Quality: 8.0
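On the defensive side, a natural mitigation (our sketch, not proposed in the paper) is to monitor the agent's tool-call trajectory for recurring states and halt before resource amplification compounds:

```python
# Defensive sketch (assumption, not from the paper): a guard that halts an
# agent loop when its tool-call trajectory starts cycling, bounding the
# resource amplification a malicious tool could otherwise induce.

from collections import Counter

class LoopGuard:
    def __init__(self, max_repeats: int = 3, max_calls: int = 50):
        self.max_repeats = max_repeats
        self.max_calls = max_calls
        self.seen = Counter()  # (tool, sorted args) -> repeat count
        self.calls = 0

    def check(self, tool_name: str, args: dict) -> None:
        """Raise if the same (tool, args) state recurs or the budget is spent."""
        self.calls += 1
        state = (tool_name, tuple(sorted(args.items())))
        self.seen[state] += 1
        if self.calls > self.max_calls:
            raise RuntimeError("tool-call budget exhausted")
        if self.seen[state] > self.max_repeats:
            raise RuntimeError(f"cyclic trajectory detected at {state!r}")

guard = LoopGuard()
for _ in range(4):
    try:
        guard.check("search_docs", {"query": "retry me"})  # a stuck agent repeats this
    except RuntimeError as err:
        print(f"halted: {err}")
```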
4. IndicJR: A Judge-Free Benchmark of Jailbreak Robustness in South Asian Languages
Priyaranjan Pattnayak, Sanchari Chowdhuri
Why it matters: This paper introduces a crucial multilingual benchmark for jailbreak robustness, exposing vulnerabilities in LLMs that are masked by English-centric evaluations, particularly for South Asian languages.
IndicJR provides a judge-free benchmark for 12 South Asian languages, revealing that English-to-Indic jailbreaks transfer via format wrappers. It shows JSON constraints mask vulnerabilities and that roman…
Score: 8.4/10 | Significance: 9.0 | Novelty: 9.0 | Quality: 8.0
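"Judge-free" here means scoring without an LLM judge. A minimal sketch of such a scorer, assuming string matching against fixed refusal markers (the markers and responses below are illustrative, not IndicJR's actual lists):

```python
# Sketch of judge-free scoring: match model outputs against fixed refusal
# markers instead of asking an LLM judge. All data here is illustrative.

REFUSAL_MARKERS = [
    "i can't help with that",
    "i cannot assist",
    "मैं इसमें मदद नहीं कर सकता",  # Hindi refusal marker, illustrative
]

def is_refusal(response: str) -> bool:
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)

responses = [
    "I can't help with that request.",
    '{"answer": "Here is how you would do it..."}',  # JSON-wrapped compliance
]
attack_success_rate = sum(not is_refusal(r) for r in responses) / len(responses)
print(f"attack success rate: {attack_success_rate:.0%}")
```

Note how the JSON-wrapped response slips past the refusal check, echoing the summary's point that format constraints can mask vulnerabilities.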
5. WebWorld: A Large-Scale World Model for Web Agent Training
Zikai Xiao, Jianhong Tu, Chuhang Zou...
Why it matters: WebWorld offers a scalable and generalizable world model for web agents, enabling safer and more effective training by simulating open-web interactions and demonstrating cross-domain generalization.
WebWorld scales web simulation to 1M+ trajectories, enabling safe offline training and inference-time search. This high-fidelity world model facilitates long-horizon (30+ steps) reasoning and verification.
Score: 8.4/10 | Significance: 9.0 | Novelty: 9.0 | Quality: 8.0
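Inference-time search against a world model typically means simulating candidate action sequences before committing to one. A schematic sketch under that assumption (the transition and value functions below are trivial stand-ins, not WebWorld's learned model):

```python
# Schematic sketch of inference-time search with a world model: roll candidate
# action sequences forward in simulation and act on the best-scoring one.

import random

def world_model_step(state: int, action: int) -> int:
    """Stand-in for a learned transition model: predicts the next state."""
    return state + action

def score(state: int, goal: int) -> float:
    """Stand-in value function: closer to the goal is better."""
    return -abs(goal - state)

def plan(state: int, goal: int, horizon: int = 5, rollouts: int = 32) -> int:
    """Return the first action of the best simulated rollout."""
    best_action, best_score = 0, float("-inf")
    for _ in range(rollouts):
        actions = [random.choice([-2, -1, 1, 2]) for _ in range(horizon)]
        sim = state
        for a in actions:
            sim = world_model_step(sim, a)  # simulate, never touch the real web
        if score(sim, goal) > best_score:
            best_action, best_score = actions[0], score(sim, goal)
    return best_action

print(f"chosen first action: {plan(state=0, goal=7)}")
```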
Honorable Mentions
- Retrieval Collapses When AI Pollutes the Web - This paper identifies and experimentally validates the critical risk of 'Retrieval Collapse,' where AI-generated content pollutes the web, undermining the reliability of search and RAG systems.
- The Geometry of Alignment Collapse: When Fine-Tuning Breaks Safety - This paper reveals a fundamental geometric instability in fine-tuning aligned language models, demonstrating how seemingly benign training can unexpectedly degrade safety guardrails due to curvature effects.
- Mind the GAP: Text Safety Does Not Transfer to Tool-Call Safety in LLM Agents - This paper reveals a critical gap between text-based safety and tool-call safety in LLM agents, demonstrating that current safety evaluations are insufficient for real-world agent deployments.
- LLM-WikiRace: Benchmarking Long-term Planning and Reasoning over Real-World Knowledge Graphs - LLM-WikiRace provides a challenging and insightful benchmark for evaluating long-term planning and reasoning capabilities of LLMs, revealing critical limitations even in state-of-the-art models.
- In Agents We Trust, but Who Do Agents Trust? Latent Source Preferences Steer LLM Generations - This paper uncovers and quantifies latent source preferences in LLMs, revealing a critical bias that can skew the information AI agents present to users.
This digest reviewed 681 papers and selected the top 10 for their significance to AI safety research.
View all papers on The Guardrail
The Guardrail: Curated AI Safety Research from arXiv
