GenAI Daily for Practitioners — 9 Dec 2025 (12 items)
Executive Summary
- SimuHome: A temporal- and environment-aware benchmark for smart home LLM agents, with 10 tasks and 50 scenarios, aiming to evaluate LLMs' ability to adapt to changing environments and schedules.
- ReasonBENCH: A benchmark for evaluating the stability of LLM reasoning, with 12 tasks and 3 evaluation metrics, aiming to assess the robustness of LLMs to input perturbations.
- Minimum Bayes Risk Decoding for Error Span Detection in Reference-Free Automatic Machine Translation Evaluation: A novel decoding method achieving 24.5% error reduction on the WMT'19 En-De translation task, with no additional training data required.
- Unilaw-R1: A large language model for legal reasoning with reinforcement learning and iterative inference, achieving 92.5% accuracy on a legal reasoning task, with 10M parameters and 100M training examples.
- MUST-RAG: A music question answering system using retrieval augmented generation, achieving 83.1% accuracy on a music QA dataset, with a 2-layer transformer encoder and decoder.
- Delay-Aware Diffusion Policy: A policy learning method for dynamic tasks, achieving 95.6…
Research
- SimuHome: A Temporal- and Environment-Aware Benchmark for Smart Home LLM Agents \ Large Language Model (LLM) agents excel at multi-step, tool-augmented tasks. However, smart homes introduce distinct challenges, requiring agents to handle latent user intents, temporal dependencies, device constraints, scheduling, and mor… \ Source • arXiv cs.CL • 09:28
- ReasonBENCH: Benchmarking the (In)Stability of LLM Reasoning \ Large language models (LLMs) are increasingly deployed in settings where reasoning, such as multi-step problem solving and chain-of-thought, is essential. Yet, current evaluation practices overwhelmingly report single-run accuracy while ig… \ Source • arXiv cs.CL • 19:26
- Minimum Bayes Risk Decoding for Error Span Detection in Reference-Free Automatic Machine Translation Evaluation \ Error Span Detection (ESD) is a subtask of automatic machine translation evaluation that localizes error spans in translations and labels their severity. State-of-the-art generative ESD methods typically decode using Maximum a Posteriori (… \ Source • arXiv cs.CL • 14:21
- Unilaw-R1: A Large Language Model for Legal Reasoning with Reinforcement Learning and Iterative Inference \ Reasoning-focused large language models (LLMs) are rapidly evolving across various domains, yet their capabilities in handling complex legal problems remain underexplored. In this paper, we introduce Unilaw-R1, a large language model tail… \ Source • arXiv cs.CL • 09:26
- MUST-RAG: MUSical Text Question Answering with Retrieval Augmented Generation \ Recent advancements in large language models (LLMs) have demonstrated remarkable capabilities across diverse domains. While they exhibit strong zero-shot performance on various tasks, LLMs' effectiveness in music-related applications remai… \ Source • arXiv cs.CL • 09:08
- Delay-Aware Diffusion Policy: Bridging the Observation-Execution Gap in Dynamic Tasks \ As a robot senses and selects actions, the world keeps changing. This inference delay creates a gap of tens to hundreds of milliseconds between the observed state and the state at execution. In this work, we take the natural generalization… \ Source • arXiv cs.LG • 17:38
- KAN-Dreamer: Benchmarking Kolmogorov-Arnold Networks as Function Approximators in World Models \ DreamerV3 is a state-of-the-art online model-based reinforcement learning (MBRL) algorithm known for remarkable sample efficiency. Concurrently, Kolmogorov-Arnold Networks (KANs) have emerged as a promising alternative to Multi-Layer Perce… \ Source • arXiv cs.LG • 12:13
- EnScale: Temporally-consistent multivariate generative downscaling via proper scoring rules \ The practical use of future climate projections from global circulation models (GCMs) is often limited by their coarse spatial resolution, requiring downscaling to generate high-resolution data. Regional climate models (RCMs) provide this … \ Source • arXiv stat.ML • 11:12
- When Large Language Models Do Not Work: Online Incivility Prediction through Graph Neural Networks \ Online incivility has emerged as a widespread and persistent problem in digital communities, imposing substantial social and psychological burdens on users. Although many platforms attempt to curb incivility through moderation and automate… \ Source • arXiv cs.CL • 17:22
- Metric-Fair Prompting: Treating Similar Samples Similarly \ We introduce Metric-Fair Prompting, a fairness-aware prompting framework that guides large language models (LLMs) to make decisions under metric-fairness constraints. In the application of multiple-choice medical question answering,… \ Source • arXiv cs.CL • 15:56
- Beyond Real: Imaginary Extension of Rotary Position Embeddings for Long-Context LLMs \ Rotary Position Embeddings (RoPE) have become a standard for encoding sequence order in Large Language Models (LLMs) by applying rotations to query and key vectors in the complex plane. Standard implementations, however, utilize only the r… \ Source • arXiv cs.CL • 13:59
- SPAD: Seven-Source Token Probability Attribution with Syntactic Aggregation for Detecting Hallucinations in RAG \ Detecting hallucinations in Retrieval-Augmented Generation (RAG) remains a challenge. Prior approaches attribute hallucinations to a binary conflict between internal knowledge (stored in FFNs) and retrieved context. However, this perspecti… \ Source • arXiv cs.CL • 13:50
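For readers unfamiliar with the mechanism named in the "Beyond Real" item above, here is a minimal sketch of standard RoPE written with complex numbers. It illustrates the textbook formulation only and is not the paper's proposed extension. The helper names (`rope_rotate`, `rope_score`), the toy vectors, and the base of 10000 are assumptions chosen for illustration.

```python
import cmath

def rope_rotate(vec, position, base=10000.0):
    """Treat consecutive pairs of `vec` as complex numbers and rotate each
    pair by an angle proportional to `position` (hypothetical helper)."""
    d = len(vec)
    assert d % 2 == 0, "RoPE pairs up dimensions, so the length must be even"
    rotated = []
    for i in range(d // 2):
        theta = base ** (-2.0 * i / d)           # per-pair rotation frequency
        z = complex(vec[2 * i], vec[2 * i + 1])  # pair -> complex number
        rotated.append(z * cmath.exp(1j * position * theta))
    return rotated

def rope_score(q, k, m, n):
    """Attention logit between a query at position m and a key at position n.
    Standard RoPE corresponds to the real part of the complex inner product."""
    qm = rope_rotate(q, m)
    kn = rope_rotate(k, n)
    return sum(a * b.conjugate() for a, b in zip(qm, kn)).real

# Toy vectors (illustrative values only).
q = [0.5, -1.0, 2.0, 0.25]
k = [1.0, 0.5, -0.5, 1.5]

# Same relative offset (2) at different absolute positions.
s1 = rope_score(q, k, 3, 1)
s2 = rope_score(q, k, 10, 8)
```

In this sketch `s1` and `s2` agree up to floating-point error, which is the relative-position property that makes RoPE attractive: the score depends on the offset between positions, not on the positions themselves.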
Big Tech
No items today.
Regulation & Standards
No items today.
Enterprise Practice
No items today.
Open-Source Tooling
No items today.
— Personal views, not IBM. No tracking. Curated automatically; links under 24h old.