The Guardrail Weekly Digest
Week of 2026-02-23 to 2026-03-01
From 732 papers reviewed, we selected 10 that advance research in alignment, agent security, and robustness. This week's highlights include FeatureBleed, which uncovers privacy leaks in sparsity-optimized accelerators, and Skill-Inject, a study demonstrating how malicious skill files can compromise LLM agents. Additional research addresses catastrophic forgetting through Non-Interfering Weight Fields and establishes new benchmarks for detecting financial document forgeries and red-teaming clinical AI systems.
Top Papers This Week
1. FeatureBleed: Inferring Private Enriched Attributes From Sparsity-Optimized AI Accelerators
Darsh Asher, Farshad Dizani, Joshua Kalyanapu...
Why it matters: This paper unveils a novel hardware-level attack, FeatureBleed, demonstrating how AI accelerator optimizations can leak sensitive backend data, highlighting a critical vulnerability in privacy-preserving AI systems.
FeatureBleed exploits zero-skipping in AI accelerators (Intel AMX, NVIDIA A100) to infer private enriched attributes via end-to-end timing. This hardware-level side channel lets an attacker learn sensitive backend data from inference latency alone; a toy illustration of the timing leak follows this entry.
Score: 8.4/10 | Significance: 9.0 | Novelty: 9.0 | Quality: 8.0
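For intuition, here is a minimal, self-contained sketch of the kind of leak the paper describes: a zero-skipping matrix-vector product whose latency scales with the number of non-zero inputs. The explicit skip loop and the assumption that enriched attributes make inputs denser are our illustrative simplifications, not the paper's measurements on AMX or A100 hardware.

```python
import time
import numpy as np

def zero_skipping_matvec(W, x):
    """Toy model of a zero-skipping accelerator: multiply-accumulates are
    skipped entirely for zero inputs, so latency tracks input density."""
    y = np.zeros(W.shape[0])
    for j, xj in enumerate(x):
        if xj != 0.0:          # the skipped work is what leaks timing
            y += W[:, j] * xj
    return y

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 512))

# Hypothetical records: the "enriched" attribute populates extra features,
# making the input denser than a plain record.
plain    = np.where(rng.random(512) < 0.05, rng.standard_normal(512), 0.0)
enriched = np.where(rng.random(512) < 0.30, rng.standard_normal(512), 0.0)

def time_inference(x, reps=50):
    t0 = time.perf_counter()
    for _ in range(reps):
        zero_skipping_matvec(W, x)
    return (time.perf_counter() - t0) / reps

# An observer who only sees end-to-end latency can separate the two records.
print(f"plain record:    {time_inference(plain)*1e3:.3f} ms")
print(f"enriched record: {time_inference(enriched)*1e3:.3f} ms")
```

In this toy setting the denser "enriched" record consistently takes longer, which is exactly the signal an attacker timing end-to-end inference would exploit.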
2. Non-Interfering Weight Fields: Treating Model Parameters as a Continuously Extensible Function
Sarim Chaudhry
Why it matters: This paper introduces Non-Interfering Weight Fields, a novel approach to mitigate catastrophic forgetting in large language models by representing weights as a function of capability coordinates, enabling modular and extensible AI.
NIWF replaces static weights with a learned weight field, using functional locks on coordinate regions to guarantee zero catastrophic forgetting. This enables modular, versioned capability updates, ensuring safety guardrails remain intact during sequential fine-tuning; a toy sketch of the locked-region idea follows this entry.
Score: 8.4/10 | Significance: 9.0 | Novelty: 9.0 | Quality: 8.0
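The sketch below only illustrates the general idea: weights are looked up from disjoint capability-coordinate bins, and a locked bin is never written again, so later updates to other capabilities cannot change its behaviour. The bin-lookup parameterization and the ToyWeightField class are our assumptions for illustration, not the paper's actual formulation.

```python
import numpy as np

class ToyWeightField:
    """Minimal sketch of a weight field W(c): each capability lives in its own
    coordinate bin, and locking a bin freezes its parameters permanently."""

    def __init__(self, n_bins, out_dim, in_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.blocks = rng.standard_normal((n_bins, out_dim, in_dim)) * 0.1
        self.locked = np.zeros(n_bins, dtype=bool)

    def weights(self, c):
        # c in [0, 1) selects a coordinate bin; disjoint support means
        # bins never interfere with one another.
        return self.blocks[int(c * len(self.blocks))]

    def lock(self, c):
        self.locked[int(c * len(self.blocks))] = True

    def update(self, c, grad, lr=0.1):
        k = int(c * len(self.blocks))
        if self.locked[k]:
            return  # functional lock: locked regions are never written
        self.blocks[k] -= lr * grad

field = ToyWeightField(n_bins=4, out_dim=2, in_dim=3)
x = np.ones(3)

before = field.weights(0.1) @ x     # "safety" capability at coordinate 0.1
field.lock(0.1)                     # freeze the safety region
field.update(0.6, np.ones((2, 3)))  # fine-tune a new capability elsewhere
after = field.weights(0.1) @ x

print(np.allclose(before, after))   # True: locked behaviour is unchanged
```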
3. AIForge-Doc: A Benchmark for Detecting AI-Forged Tampering in Financial and Form Documents
Jiaqi Wu, Yuchen Zhou, Muduo Xu...
Why it matters: AIForge-Doc introduces a crucial benchmark for detecting AI-generated forgeries in financial documents, exposing the vulnerability of current detection methods to this emerging threat.
AIForge-Doc provides the first benchmark for detecting diffusion-based inpainting forgeries in financial documents. It reveals a critical security gap: SOTA detectors and VLMs (GPT-4o) fail to identify AI-forged tampering; a minimal evaluation sketch follows this entry.
Score: 8.4/10 | Significance: 9.0 | Novelty: 8.0 | Quality: 8.0
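A benchmark like this ultimately reduces to scoring a detector on labelled authentic and inpainted documents. The sketch below shows what such an evaluation loop might look like; the Sample dataclass, field names, and placeholder detector are hypothetical and not AIForge-Doc's actual API.

```python
from dataclasses import dataclass

@dataclass
class Sample:
    image_path: str
    is_forged: bool   # ground truth: diffusion-inpainted vs authentic

def evaluate(detector, samples):
    """Score a tampering detector on a benchmark split. `detector` is any
    callable mapping an image path to a predicted probability of forgery."""
    tp = fp = tn = fn = 0
    for s in samples:
        pred = detector(s.image_path) >= 0.5
        if pred and s.is_forged:
            tp += 1
        elif pred and not s.is_forged:
            fp += 1
        elif not pred and s.is_forged:
            fn += 1
        else:
            tn += 1
    return {
        "accuracy": (tp + tn) / max(len(samples), 1),
        "forgery_recall": tp / max(tp + fn, 1),  # share of forgeries caught
    }

# Hypothetical usage with a placeholder detector that misses AI inpainting:
samples = [Sample("doc_001.png", True), Sample("doc_002.png", False)]
print(evaluate(lambda path: 0.1, samples))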
4. Skill-Inject: Measuring Agent Vulnerability to Skill File Attacks
David Schmotz, Luca Beurer-Kellner, Sahar Abdelnabi...
Why it matters: This paper identifies and benchmarks a critical new attack vector with Skill-Inject, demonstrating the vulnerability of LLM agents to malicious skill files and highlighting the urgent need for context-aware authorization frameworks.
Skill-Inject benchmarks LLM agent vulnerability to prompt injection via third-party skill files. Frontier models show up to 80% attack success rates (ASR) for exfiltration and destructive actions, indicating that scaling alone cannot resolve these supply-chain risks; a toy authorization-gate sketch follows this entry.
Score: 9.0/10 | Significance: 9.0 | Novelty: 8.0 | Quality: 8.0
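One mitigation direction the authors point toward is context-aware authorization. The toy gate below only illustrates the idea: while a third-party skill is active, the agent may call only tools the skill's manifest declared, so an injected exfiltration instruction cannot reach an undeclared tool. The Skill dataclass, authorize function, and tool names are assumptions for illustration, not an API from the paper.

```python
from dataclasses import dataclass

@dataclass
class Skill:
    name: str
    declared_tools: frozenset  # capabilities the skill's manifest declares up front

def authorize(active_skill: Skill, requested_tool: str) -> bool:
    """Toy context-aware authorization gate: while a third-party skill is
    active, the agent may only call tools the skill declared. Injected
    instructions can still steer the model, but undeclared exfiltration or
    destructive tools are refused."""
    return requested_tool in active_skill.declared_tools

summarizer = Skill("pdf_summarizer", frozenset({"read_file"}))
print(authorize(summarizer, "read_file"))    # True: declared capability
print(authorize(summarizer, "send_email"))   # False: injected exfiltration attempt blocked
```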
5. Assessing Risks of Large Language Models in Mental Health Support: A Framework for Automated Clinical AI Red Teaming
Ian Steenstra, Paola Pedrelli, Weiyan Shi...
Why it matters: This paper introduces a crucial framework for red-teaming AI mental health support systems, uncovering significant safety risks like validating patient delusions and mishandling suicide risk.
Introduces automated clinical red teaming using AI patient agents with dynamic cognitive-affective models to audit LLM psychotherapists. It identifies iatrogenic risks such as "AI Psychosis" and failures in suicide-risk de-escalation; a toy audit-step sketch follows this entry.
Score: 9.0/10 | Significance: 9.0 | Novelty: 8.0 | Quality: 8.0
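To make the red-teaming loop concrete, here is a toy audit step: a simulated patient carries an explicit cognitive-affective state, and each therapist reply is checked for delusion validation or a missed risk cue. The keyword checks, PatientState fields, and thresholds are deliberately simplistic placeholders, not the paper's dynamic patient models.

```python
from dataclasses import dataclass

@dataclass
class PatientState:
    delusion_strength: float   # 0..1, conviction in a fixed false belief
    risk_disclosed: bool       # whether a suicide-risk cue has been voiced

def red_team_turn(state, patient_utterance, therapist_reply):
    """Toy audit of one exchange: flag replies that reinforce a delusion or
    ignore a disclosed risk cue. `therapist_reply` would come from the
    system under test; here it is just a string."""
    findings = []
    reply = therapist_reply.lower()
    if state.delusion_strength > 0.5 and "you are right" in reply:
        findings.append("validated patient delusion")
    if state.risk_disclosed and not any(
        k in reply for k in ("safety", "crisis", "help line", "988")
    ):
        findings.append("missed suicide-risk cue")
    return findings

state = PatientState(delusion_strength=0.8, risk_disclosed=True)
print(red_team_turn(
    state,
    "Everyone is monitoring me and I don't want to be here anymore.",
    "You are right, they probably are watching you.",
))
```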
Honorable Mentions
- Mitigating "Epistemic Debt" in Generative AI-Scaffolded Novice Programming using Metacognitive Scripts - This paper identifies and mitigates a critical risk of over-reliance on generative AI in novice programming, which produces 'fragile experts' with shallow understanding and weak maintainability skills.
- Examining and Addressing Barriers to Diversity in LLM-Generated Ideas - This paper identifies and mitigates the concerning trend of LLMs homogenizing ideas, offering practical prompting strategies to enhance diversity and foster innovation in human-AI collaborations.
- When Pretty Isn't Useful: Investigating Why Modern Text-to-Image Models Fail as Reliable Training Data Generators - This paper reveals a surprising and concerning trend: newer, more visually impressive text-to-image models are worse at generating useful synthetic training data, challenging assumptions about the utility of these models for scalable AI training.
- Silent Egress: When Implicit Prompt Injection Makes LLM Agents Leak Without a Trace - This paper identifies a novel and dangerous attack vector, 'silent egress,' where LLM agents leak sensitive information through implicit prompt injection from web previews, highlighting the need for system-level security measures.
- A Framework for Assessing AI Agent Decisions and Outcomes in AutoML Pipelines - This paper introduces a crucial decision-centric evaluation framework for agentic AutoML systems, enabling the detection of faulty reasoning and improved governance of autonomous ML.
This digest reviewed 732 papers and selected the top 10 for their significance to AI safety research.
View all papers on The Guardrail
The Guardrail: Curated AI Safety Research from arXiv
