The Guardrail - AI Safety Weekly Research Digest

2 February 2026

The Guardrail Weekly Digest: 2026-01-26 - 2026-02-01


This week we reviewed 733 papers and selected the top 10 for their significance to AI safety research.


Top Papers This Week

1. Code over Words: Overcoming Semantic Inertia via Code-Grounded Reasoning

Manjie Xu, Isabella Yin, Xinyi Tu...

Why it matters: This paper demonstrates that grounding reasoning in code, rather than natural language, can overcome semantic inertia in LLMs, leading to more robust and reliable AI systems.

Mitigates "semantic inertia"—the failure to override pre-trained priors with in-context rules—by representing dynamics as executable code. This LCV approach decouples logic from semantics, reversing inverse scaling to ensure models follow dynamic, counterfactual safety constra...

Score: 9.0/10 | Significance: 9.0 | Novelty: 9.0 | Quality: 8.0

Read Paper | PDF
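
The core idea, as we read the summary: instead of asking the model to hold a counterfactual rule in natural language, where pre-trained priors tend to win, the rule is written as executable code and the world's dynamics are computed rather than recalled. A minimal Python sketch of that pattern follows; it is our illustration of the general idea, not the paper's LCV implementation, and the rule and names are invented:

    # Counterfactual world rule, stated as code instead of prose.
    # In this toy world, any object flagged buoyant_override floats,
    # a rule that contradicts pre-trained physics priors.

    def floats(density_g_per_cm3: float, buoyant_override: bool) -> bool:
        """Executable dynamics: the in-context override always wins."""
        if buoyant_override:
            return True
        return density_g_per_cm3 < 1.0

    # Because the rule is executed rather than paraphrased, the answer
    # cannot drift back toward the model's prior ("iron sinks").
    print(floats(7.87, buoyant_override=True))   # True, per the stated rule
    print(floats(7.87, buoyant_override=False))  # False, per ordinary physics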


2. Do LLMs Favor LLMs? Quantifying Interaction Effects in Peer Review

Vibhhu Sharma, Thorsten Joachims, Sarah Dean

Why it matters: This paper provides crucial empirical evidence on how LLMs influence the peer review process, revealing biases and limitations that could undermine scientific integrity.

Quantifies interaction effects in LLM-assisted peer review, finding that LLM reviews exhibit rating compression and leniency toward low-quality work rather than direct favoritism. This identifies a critical reliability concern for peer-review pipelines that rely on LLM assistance.

Score: 9.0/10 | Significance: 9.0 | Novelty: 8.0 | Quality: 8.0

Read Paper | PDF
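
For readers who want to see what "rating compression" and "leniency" amount to operationally, here is a rough sketch of how one might compute both from paired human and LLM scores. The numbers and the low-quality threshold are made up for illustration; the paper's actual methodology is more careful:

    import statistics

    # Hypothetical paired review scores (1-10) for the same submissions.
    human = [2, 3, 4, 5, 6, 7, 8, 9]
    llm   = [5, 5, 6, 6, 6, 7, 7, 8]

    # Rating compression: LLM scores cluster in a narrower band.
    print("human stdev:", round(statistics.stdev(human), 2))
    print("llm stdev:  ", round(statistics.stdev(llm), 2))

    # Leniency: mean LLM score on submissions humans rated poorly.
    low_quality = [l for h, l in zip(human, llm) if h <= 4]
    print("llm mean on low-quality subset:", round(statistics.mean(low_quality), 2))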


3. When Benchmarks Leak: Inference-Time Decontamination for LLMs

Jianzhe Chai, Yu Zhe, Jun Sakuma

Why it matters: This paper introduces a novel and effective method for mitigating benchmark contamination in LLMs during evaluation, preserving benchmark integrity while minimizing performance degradation.

DeconIEP uses instance-adaptive embedding perturbations, guided by a reference model, to suppress memorization-driven shortcuts. This inference-time decontamination ensures safety benchmarks reflect true generalization, preventing the dangerous overestimation of model safety that contaminated benchmarks can produce.

Score: 8.0/10 | Significance: 9.0 | Novelty: 8.0 | Quality: 8.0

Read Paper | PDF
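
We have not verified DeconIEP's exact procedure beyond the summary, but the shape it describes, perturbing input embeddings by an amount chosen per instance from a reference-model signal, can be sketched in a few lines of numpy. Every name and the scaling rule below are our assumptions:

    import numpy as np

    rng = np.random.default_rng(0)

    def perturb_embeddings(emb, memorization_score, max_scale=0.05):
        """Add Gaussian noise scaled by how memorized the instance looks.

        memorization_score in [0, 1] might come from a reference model,
        e.g. a loss gap; the paper's actual criterion may differ.
        """
        noise = rng.normal(0.0, max_scale * memorization_score, size=emb.shape)
        return emb + noise

    tokens = rng.normal(size=(12, 768))        # one benchmark prompt, embedded
    clean  = perturb_embeddings(tokens, 0.0)   # unmemorized: left unchanged
    noisy  = perturb_embeddings(tokens, 0.9)   # likely leaked: perturbed
    print(np.allclose(clean, tokens), np.allclose(noisy, tokens))  # True False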


4. Timely Machine: Awareness of Time Makes Test-Time Scaling Agentic

Yichuan Ma, Linyang Li, Yongkang Chen...

Why it matters: This paper introduces a crucial temporal dimension to agentic AI evaluation, revealing the importance of time-aware reasoning and adaptation for effective test-time scaling.

Redefines test-time scaling as wall-clock time for agents, using Timely-RL to optimize reasoning based on temporal budgets. This enhances reliability by ensuring agents adapt planning to tool latency, a critical capability for safe and predictable real-world deployment.

Score: 8.0/10 | Significance: 9.0 | Novelty: 8.0 | Quality: 8.0

Read Paper | PDF
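
The reframing is easy to picture: the agent's budget is wall-clock seconds, so tool latency counts against it. Timely-RL is a training method, which we do not reproduce; the sketch below only illustrates the deployment-time behavior it optimizes for, with made-up tools and latencies:

    import time

    # Hypothetical tools with rough expected latencies in seconds.
    TOOLS = {"cached_lookup": 0.5, "web_search": 3.0}

    def pick_tool(seconds_left):
        """Use the slow, higher-quality tool only if its latency fits."""
        if seconds_left >= TOOLS["web_search"] * 1.5:  # keep slack to reason
            return "web_search"
        return "cached_lookup"

    deadline = time.monotonic() + 8.0  # 8-second episode budget
    while (remaining := deadline - time.monotonic()) > 0:
        tool = pick_tool(remaining)
        print(f"{remaining:.1f}s left -> {tool}")
        time.sleep(min(TOOLS[tool], remaining))  # simulate the tool call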


5. Reflect: Transparent Principle-Guided Reasoning for Constitutional Alignment at Scale

Henry Bell, Caroline Zhang, Mohammed Mobasserul Haque...

Why it matters: Reflect offers a novel and practical inference-time method for aligning LLMs with constitutional principles, enhancing safety and robustness without requiring retraining or human annotation.

REFLECT enables training-free constitutional alignment via in-context self-evaluation, critique, and revision. It improves safety by mitigating rare, high-impact principle violations and providing transparent reasoning traces.

Score: 8.0/10 | Significance: 9.0 | Novelty: 8.0 | Quality: 8.0

Read Paper | PDF
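
The loop itself is a familiar inference-time pattern: draft, self-evaluate against explicit principles, critique, revise, keeping the trace. A schematic sketch with a placeholder llm() call follows; the actual REFLECT prompts and principle set are in the paper, not reproduced here:

    PRINCIPLES = [
        "Do not provide instructions that enable physical harm.",
        "Acknowledge uncertainty instead of fabricating facts.",
    ]

    def llm(prompt: str) -> str:
        """Placeholder for any chat-completion call; swap in a real client."""
        raise NotImplementedError

    def reflect(user_query: str, max_rounds: int = 2) -> str:
        draft = llm(f"Answer the user:\n{user_query}")
        for _ in range(max_rounds):
            # Self-evaluate the draft against each principle, in context.
            critique = llm(
                "List any violations of these principles:\n"
                f"Principles: {PRINCIPLES}\nAnswer: {draft}"
            )
            if "no violations" in critique.lower():
                break
            # Revise using the critique; keep (draft, critique) as a
            # transparent reasoning trace, per the paper's stated goal.
            draft = llm(f"Revise to fix:\n{critique}\nAnswer: {draft}")
        return draft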


Honorable Mentions

  • Beyond Divergent Creativity: A Human-Based Evaluation of Creativity in Large Language Models - This paper introduces a more human-aligned metric for evaluating LLM creativity, revealing potential trade-offs between alignment and genuine creative output.
  • LLMs Can Unlearn Refusal with Only 1,000 Benign Samples - This paper reveals a critical vulnerability in LLM safety alignment by demonstrating how easily refusal mechanisms can be bypassed with minimal fine-tuning, highlighting the reliance on superficial token memorization.
  • MortalMATH: Evaluating the Conflict Between Reasoning Objectives and Emergency Contexts - This paper reveals a dangerous trade-off between reasoning ability and safety awareness in LLMs, showing that specialized models can prioritize task completion even when users describe life-threatening emergencies.
  • RIFT: Reordered Instruction Following Testbed To Evaluate Instruction Following in Singular Multistep Prompt Structures - RIFT reveals a critical vulnerability in LLMs' instruction following capabilities when faced with non-sequential prompts, highlighting a significant limitation for complex workflow applications.
  • Lost in Simulation: LLM-Simulated Users are Unreliable Proxies for Human Users in Agentic Evaluations - This paper reveals critical flaws in using LLM-simulated users for agent evaluation, highlighting biases and miscalibrations that can lead to inaccurate assessments of agent performance across diverse populations.


View all papers on The Guardrail


The Guardrail: Curated AI Safety Research from arXiv
