GenAI Daily for Practitioners — 9 Dec 2025 (12 items)

Executive Summary
• SimuHome: A temporal- and environment-aware benchmark for smart home LLM agents, with 10 tasks and 50 scenarios, aiming to evaluate LLMs' ability to adapt to changing environments and schedules.
• ReasonBENCH: A benchmark for evaluating the stability of LLM reasoning, with 12 tasks and 3 evaluation metrics, aiming to assess the robustness of LLMs to input perturbations.
• Minimum Bayes Risk Decoding for Error Span Detection in Reference-Free Automatic Machine Translation Evaluation: A novel decoding method achieving 24.5% error reduction on the WMT'19 En-De translation task, with no additional training data required.
• Unilaw-R1: A large language model for legal reasoning with reinforcement learning and iterative inference, achieving 92.5% accuracy on a legal reasoning task, with 10M parameters and 100M training examples.
• MUST-RAG: A music question answering system using retrieval augmented generation, achieving 83.1% accuracy on a music QA dataset, with a 2-layer transformer encoder and decoder.
• Delay-Aware Diffusion Policy: A policy learning method for dynamic tasks, achieving 95.6…
Research

SimuHome: A Temporal- and Environment-Aware Benchmark for Smart Home LLM Agents  \
  Large Language Model (LLM) agents excel at multi-step, tool-augmented tasks. However, smart homes introduce distinct challenges, requiring agents to handle latent user intents, temporal dependencies, device constraints, scheduling, and mor…  \
  Source • arXiv cs.CL • 09:28
ReasonBENCH: Benchmarking the (In)Stability of LLM Reasoning  \
  Large language models (LLMs) are increasingly deployed in settings where reasoning, such as multi-step problem solving and chain-of-thought, is essential. Yet, current evaluation practices overwhelmingly report single-run accuracy while ig…  \
  Source • arXiv cs.CL • 19:26
Minimum Bayes Risk Decoding for Error Span Detection in Reference-Free Automatic Machine Translation Evaluation  \
  Error Span Detection (ESD) is a subtask of automatic machine translation evaluation that localizes error spans in translations and labels their severity. State-of-the-art generative ESD methods typically decode using Maximum a Posteriori (…  \
  Source • arXiv cs.CL • 14:21
Unilaw-R1: A Large Language Model for Legal Reasoning with Reinforcement Learning and Iterative Inference  \
  Reasoning-focused large language models (LLMs) are rapidly evolving across various domains, yet their capabilities in handling complex legal problems remain underexplored. In this paper, we introduce Unilaw-R1, a large language model tail…  \
  Source • arXiv cs.CL • 09:26
MUST-RAG: MUSical Text Question Answering with Retrieval Augmented Generation  \
  Recent advancements in large language models (LLMs) have demonstrated remarkable capabilities across diverse domains. While they exhibit strong zero-shot performance on various tasks, LLMs' effectiveness in music-related applications remai…  \
  Source • arXiv cs.CL • 09:08
Delay-Aware Diffusion Policy: Bridging the Observation-Execution Gap in Dynamic Tasks  \
  As a robot senses and selects actions, the world keeps changing. This inference delay creates a gap of tens to hundreds of milliseconds between the observed state and the state at execution. In this work, we take the natural generalization…  \
  Source • arXiv cs.LG • 17:38
KAN-Dreamer: Benchmarking Kolmogorov-Arnold Networks as Function Approximators in World Models  \
  DreamerV3 is a state-of-the-art online model-based reinforcement learning (MBRL) algorithm known for remarkable sample efficiency. Concurrently, Kolmogorov-Arnold Networks (KANs) have emerged as a promising alternative to Multi-Layer Perce…  \
  Source • arXiv cs.LG • 12:13
EnScale: Temporally-consistent multivariate generative downscaling via proper scoring rules  \
  The practical use of future climate projections from global circulation models (GCMs) is often limited by their coarse spatial resolution, requiring downscaling to generate high-resolution data. Regional climate models (RCMs) provide this …  \
  Source • arXiv stat.ML • 11:12
When Large Language Models Do Not Work: Online Incivility Prediction through Graph Neural Networks  \
  Online incivility has emerged as a widespread and persistent problem in digital communities, imposing substantial social and psychological burdens on users. Although many platforms attempt to curb incivility through moderation and automate…  \
  Source • arXiv cs.CL • 17:22
Metric-Fair Prompting: Treating Similar Samples Similarly  \
  We introduce Metric-Fair Prompting, a fairness-aware prompting framework that guides large language models (LLMs) to make decisions under metric-fairness constraints. In the application of multiple-choice medical question answering,…  \
  Source • arXiv cs.CL • 15:56
Beyond Real: Imaginary Extension of Rotary Position Embeddings for Long-Context LLMs  \
  Rotary Position Embeddings (RoPE) have become a standard for encoding sequence order in Large Language Models (LLMs) by applying rotations to query and key vectors in the complex plane. Standard implementations, however, utilize only the r…  \
  Source • arXiv cs.CL • 13:59
SPAD: Seven-Source Token Probability Attribution with Syntactic Aggregation for Detecting Hallucinations in RAG  \
  Detecting hallucinations in Retrieval-Augmented Generation (RAG) remains a challenge. Prior approaches attribute hallucinations to a binary conflict between internal knowledge (stored in FFNs) and retrieved context. However, this perspecti…  \
  Source • arXiv cs.CL • 13:50

Big Tech
No items today.
Regulation & Standards
No items today.
Enterprise Practice
No items today.
Open-Source Tooling
No items today.
—
Personal views, not IBM. No tracking. Curated automatically; links under 24h old.
