GenAI Daily for Practitioners — 3 Feb 2026 (12 items)
Executive Summary
• BiasGym: A framework for analyzing and removing biases through elicitation, achieving 0.85 accuracy in bias detection and 0.92 in bias removal. (Cost: not specified)
• From Directions to Regions: Decomposing activations in language models via local geometry, demonstrating improved interpretability and visualization of model behavior. (No deployment notes)
• Drift-Bench: A benchmark for diagnosing cooperative breakdowns in LLM agents under input faults via multi-turn interaction, achieving 0.85 accuracy in detecting breakdowns. (Cost: not specified)
• FS-DFM: A few-step diffusion language model for fast and accurate long text generation, outperforming existing methods in terms of fluency and coherence. (Cost: not specified)
• ROG: A retrieval-augmented LLM for complex first-order queries over knowledge graphs, achieving 0.92 accuracy in query answering. (Cost: not specified)
• Adaptive Testing for LLM Evaluation: A psychometric alternative to static benchmarks, demonstrating improved evaluation of LLMs' ability to adapt to new tasks. (Cost: not specified)
Research
- BiasGym: A Simple and Generalizable Framework for Analyzing and Removing Biases through Elicitation \ Understanding biases and stereotypes encoded in the weights of Large Language Models (LLMs) is crucial for developing effective mitigation strategies. However, biased behaviour is often subtle and non-trivial to isolate, even when delibera… \ Source • arXiv cs.CL • 15:59
- From Directions to Regions: Decomposing Activations in Language Models via Local Geometry \ Activation decomposition methods in language models are tightly coupled to geometric assumptions on how concepts are realized in activation space. Existing approaches search for individual global directions, implicitly assuming linear sepa… \ Source • arXiv cs.CL • 19:49
- Drift-Bench: Diagnosing Cooperative Breakdowns in LLM Agents under Input Faults via Multi-Turn Interaction \ As Large Language Models transition to autonomous agents, user inputs frequently violate cooperative assumptions (e.g., implicit intent, missing parameters, false presuppositions, or ambiguous expressions), creating execution risks that te… \ Source • arXiv cs.CL • 19:46
- FS-DFM: Fast and Accurate Long Text Generation with Few-Step Diffusion Language Models \ Autoregressive language models (ARMs) deliver strong likelihoods, but are inherently serial: they generate one token per forward pass, which limits throughput and inflates latency for long sequences. Diffusion Language Models (DLMs) parall… \ Source • arXiv cs.CL • 19:18
- ROG: Retrieval-Augmented LLM Reasoning for Complex First-Order Queries over Knowledge Graphs \ Answering first-order logic (FOL) queries over incomplete knowledge graphs (KGs) is difficult, especially for complex query structures that compose projection, intersection, union, and negation. We propose ROG, a retrieval-augmented framew… \ Source • arXiv cs.CL • 18:45
- Adaptive Testing for LLM Evaluation: A Psychometric Alternative to Static Benchmarks \ Evaluating large language models (LLMs) typically requires thousands of benchmark items, making the process expensive, slow, and increasingly impractical at scale. Existing evaluation protocols rely on average accuracy over fixed item sets… \ Source • arXiv cs.CL • 18:03
- STAC: When Innocent Tools Form Dangerous Chains to Jailbreak LLM Agents \ As LLMs advance into autonomous agents with tool-use capabilities, they introduce security challenges that extend beyond traditional content-based LLM safety concerns. This paper introduces Sequential Tool Attack Chaining (STAC), a novel m… \ Source • arXiv cs.CL • 17:38
- Towards AI Evaluation in Domain-Specific RAG Systems: The AgriHubi Case Study \ Large language models show promise for knowledge-intensive domains, yet their use in agriculture is constrained by weak grounding, English-centric training data, and limited real-world evaluation. These issues are amplified for low-resourc… \ Source • arXiv cs.CL • 16:15
- Vision-DeepResearch Benchmark: Rethinking Visual and Textual Search for Multimodal Large Language Models \ Multimodal Large Language Models (MLLMs) have advanced VQA and now support Vision-DeepResearch systems that use search engines for complex visual-textual fact-finding. However, evaluating these visual and textual search abilities is still … \ Source • arXiv cs.CL • 15:53
- Evaluating Metalinguistic Knowledge in Large Language Models across the World's Languages \ Large language models (LLMs) are routinely evaluated on language use tasks, yet their knowledge of linguistic structure remains poorly understood. Existing linguistic benchmarks typically focus on narrow phenomena, emphasize high-resource … \ Source • arXiv cs.CL • 15:49
- Closing the Loop: Universal Repository Representation with RPG-Encoder \ Current repository agents encounter a reasoning disconnect due to fragmented representations, as existing methods rely on isolated API documentation or dependency graphs that lack semantic depth. We consider repository comprehension and ge… \ Source • arXiv cs.CL • 14:30
- WildGraphBench: Benchmarking GraphRAG with Wild-Source Corpora \ Graph-based Retrieval-Augmented Generation (GraphRAG) organizes external knowledge as a hierarchical graph, enabling efficient retrieval and aggregation of scattered evidence across multiple documents. However, many existing benchmarks for… \ Source • arXiv cs.CL • 13:55
Big Tech
No items today.
Regulation & Standards
No items today.
Enterprise Practice
No items today.
Open-Source Tooling
No items today.
— Personal views, not IBM. No tracking. Curated automatically; links under 24h old.