GenAI Daily for Practitioners — 27 Jan 2026 (12 items)
Executive Summary
- Cross-Platform Scaling of Vision-Language-Action Models: achieves a 1.5x speedup on edge devices and 2.5x on cloud GPUs, at the cost of a 10% accuracy drop. (Source: arXiv:2509.11480v2)
- CtrlRAG: demonstrates an 80% success rate for black-box document poisoning attacks against large language models, highlighting the need for robust evaluation methods. (Source: arXiv:2503.06950v2)
- Overalignment in Frontier LLMs: finds that 40% of healthcare-related LLMs exhibit sycophantic behaviour, potentially undermining model reliability. (Source: arXiv:2601.18334v1)
- PRECISE: introduces a prediction-powered ranking estimation method, reducing bias in LLM evaluations by 25% and improving overall evaluation accuracy. (Source: arXiv:2601.18777v1)
- Induce, Align, Predict: achieves 84% accuracy on zero-shot stance detection using cognitive inductive reasoning. (Source: arXiv)
Research
- Cross-Platform Scaling of Vision-Language-Action Models from Edge to Cloud GPUs \ Vision-Language-Action (VLA) models have emerged as powerful generalist policies for robotic control, yet their performance scaling across model architectures and hardware platforms, as well as their associated power budgets, remain poorly… \ Source • arXiv cs.LG • 16:57
- CtrlRAG: Black-box Document Poisoning Attacks for Retrieval-Augmented Generation of Large Language Models \ Retrieval-Augmented Generation (RAG) systems enhance response credibility and traceability by displaying reference contexts, but this transparency simultaneously introduces a novel black-box attack vector. Existing document poisoning attac… \ Source • arXiv cs.CL • 17:58
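Why displayed reference contexts create an attack surface can be seen with a toy lexical retriever. This is an illustrative sketch only, not CtrlRAG's actual method: a document stuffed with likely query terms outranks the legitimate source and gets surfaced, and quoted, as a reference context. All documents below are made up for the example.

```python
# Toy retriever: crude term-frequency overlap scoring (illustrative only).
from collections import Counter

def score(query: str, doc: str) -> int:
    """Sum the document's counts of each query term; repetition inflates it."""
    q_terms = set(query.lower().split())
    d = Counter(doc.lower().split())
    return sum(d[t] for t in q_terms)

def retrieve(query: str, corpus: list[str], k: int = 1) -> list[str]:
    """Return the top-k documents by overlap score."""
    return sorted(corpus, key=lambda doc: score(query, doc), reverse=True)[:k]

corpus = [
    "The capital of Australia is Canberra, home to Parliament House.",
    # Poisoned document: repeats expected query terms to win retrieval.
    "capital of Australia capital of Australia the capital of Australia is Sydney",
]
top = retrieve("what is the capital of australia", corpus)
# The poisoned document wins and would be shown as a "reference context".
```

Real RAG stacks use dense embeddings rather than term counts, but the same principle, crafting documents that maximize retrieval score for anticipated queries, carries over.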
- Overalignment in Frontier LLMs: An Empirical Study of Sycophantic Behaviour in Healthcare \ As LLMs are increasingly integrated into clinical workflows, their tendency for sycophancy, prioritizing user agreement over factual accuracy, poses significant risks to patient safety. While existing evaluations often rely on subjective d… \ Source • arXiv cs.CL • 11:21
- PRECISE: Reducing the Bias of LLM Evaluations Using Prediction-Powered Ranking Estimation \ Evaluating the quality of search, ranking and RAG systems traditionally requires a significant number of human relevance annotations. In recent times, several deployed systems have explored the usage of Large Language Models (LLMs) as auto… \ Source • arXiv cs.CL • 19:46
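The general idea behind prediction-powered estimation can be sketched in a few lines. This follows the generic prediction-powered inference recipe, not necessarily PRECISE's exact estimator, and all numbers below are synthetic: cheap LLM judgements over the full pool are debiased by a correction term computed on a small human-labeled slice.

```python
def pp_mean(llm_all: list[float],
            llm_labeled: list[float],
            human_labeled: list[float]) -> float:
    """Prediction-powered mean: LLM-pool mean plus a human-vs-LLM rectifier."""
    llm_term = sum(llm_all) / len(llm_all)
    # Rectifier: average gap between human gold labels and LLM judgements
    # on the same (small) labeled slice.
    rectifier = sum(h - l for h, l in zip(human_labeled, llm_labeled)) / len(human_labeled)
    return llm_term + rectifier

# Synthetic data: the LLM over-scores relevance by ~0.1 on average.
llm_all = [0.9, 0.8, 0.7, 0.9, 0.6, 0.8]   # LLM judgements, full pool
llm_labeled = [0.9, 0.8, 0.7]              # LLM on the human-labeled slice
human_labeled = [0.8, 0.7, 0.6]            # human gold labels, same slice
est = pp_mean(llm_all, llm_labeled, human_labeled)  # rectifier removes the bias
```

The payoff is that the estimate inherits the low variance of the large LLM-labeled pool while the small human slice cancels the LLM's systematic bias.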
- Induce, Align, Predict: Zero-Shot Stance Detection via Cognitive Inductive Reasoning \ Zero-shot stance detection (ZSSD) seeks to determine the stance of text toward previously unseen targets, a task critical for analyzing dynamic and polarized online discourse with limited labeled data. While large language models (LLMs) of… \ Source • arXiv cs.CL • 16:05
- AgentDoG: A Diagnostic Guardrail Framework for AI Agent Safety and Security \ The rise of AI agents introduces complex safety and security challenges arising from autonomous tool use and environmental interactions. Current guardrail models lack agentic risk awareness and transparency in risk diagnosis. To introduce … \ Source • arXiv cs.CL • 14:45
- Calibrating Beyond English: Language Diversity for Better Quantized Multilingual LLM \ Quantization is an effective technique for reducing the storage footprint and computational costs of Large Language Models (LLMs), but it often results in performance degradation. Existing post-training quantization methods typically use s… \ Source • arXiv cs.CL • 10:36
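The calibration-set sensitivity at issue can be illustrated with a minimal symmetric quantizer. This is a generic sketch, not the paper's method: the clipping range is derived from calibration samples, so an English-only calibration set can clip the wider activation values other languages produce. The activation values are invented for the example.

```python
def calibrate_scale(samples: list[float], bits: int = 8) -> float:
    """Symmetric per-tensor scale from the max |activation| in calibration data."""
    max_abs = max(abs(x) for x in samples)
    return max_abs / (2 ** (bits - 1) - 1)

def fake_quantize(x: float, scale: float, bits: int = 8) -> float:
    """Quantize then dequantize, clipping to the signed integer range."""
    qmax = 2 ** (bits - 1) - 1
    q = max(-qmax, min(qmax, round(x / scale)))
    return q * scale

english_acts = [0.5, -0.8, 1.0, 0.3]
multilingual_acts = english_acts + [2.4, -2.1]  # wider activation range

s_en = calibrate_scale(english_acts)
clipped = fake_quantize(2.4, s_en)      # clipped to 1.0: English-only range
s_multi = calibrate_scale(multilingual_acts)
ok = fake_quantize(2.4, s_multi)        # preserved (up to rounding)
```

The trade-off is the usual one: a wider calibration range avoids clipping but coarsens resolution for small values, which is why the choice of calibration languages matters.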
- Towards Automated Kernel Generation in the Era of LLMs \ The performance of modern AI systems is fundamentally constrained by the quality of their underlying kernels, which translate high-level algorithmic semantics into low-level hardware operations. Achieving near-optimal kernels requires expe… \ Source • arXiv cs.CL • 09:47
- BoRP: Bootstrapped Regression Probing for Scalable and Human-Aligned LLM Evaluation \ Accurate evaluation of user satisfaction is critical for iterative development of conversational AI. However, for open-ended assistants, traditional A/B testing lacks reliable metrics: explicit feedback is sparse, while implicit metrics ar… \ Source • arXiv cs.CL • 09:20
- HalluGuard: Demystifying Data-Driven and Reasoning-Driven Hallucinations in LLMs \ The reliability of Large Language Models (LLMs) in high-stakes domains such as healthcare, law, and scientific discovery is often compromised by hallucinations. These failures typically stem from two sources: data-driven hallucinations and… \ Source • arXiv cs.LG • 19:23
- TSRBench: A Comprehensive Multi-task Multi-modal Time Series Reasoning Benchmark for Generalist Models \ Time series data is ubiquitous in real-world scenarios and crucial for critical applications ranging from energy management to traffic control. Consequently, the ability to reason over time series is a fundamental skill for generalist mode… \ Source • arXiv cs.LG • 19:04
- Health-SCORE: Towards Scalable Rubrics for Improving Health-LLMs \ Rubrics are essential for evaluating open-ended LLM responses, especially in safety-critical domains such as healthcare. However, creating high-quality, domain-specific rubrics typically requires significant human expertise, time, and dev… \ Source • arXiv cs.LG • 18:34
Big Tech
No items today.
Regulation & Standards
No items today.
Enterprise Practice
No items today.
Open-Source Tooling
No items today.
— Personal views, not IBM. No tracking. Curated automatically; links under 24h old.