GenAI Daily for Practitioners — 3 Mar 2026 (12 items)
Executive Summary
• Legal RAG Bench: a benchmark for legal retrieval-augmented generation; evaluated systems reach 23.4% average accuracy (reported with 95% confidence intervals), and 85.7% of generated text is judged relevant by human evaluators. (Source: https://arxiv.org/abs/2603.01710v1)
• Scaling Retrieval Augmented Generation: industry deployment lessons stress tuning hyperparameters and adapting models to specific use cases, with an average 15% performance improvement. (Source: https://arxiv.org/abs/2603.02153v1)
• Learning to Draft: adaptive speculative decoding with reinforcement learning achieves 12.4% higher accuracy than baseline models, with potential for significant cost savings in writing tasks. (Source: https://arxiv.org/abs/2603.01639v1)
• AMemGym: interactive memory benchmarking for assistants in long-horizon conversations, with a 30% average performance improvement over state-of-the-art models. (Source: https://arxiv.org/abs/2603.01966v1)
• Beyond the Grid: layout-informed multi-vector retrieval with parsed visual document representations …
Research
- Legal RAG Bench: an end-to-end benchmark for legal RAG \ We introduce Legal RAG Bench, a benchmark and evaluation methodology for assessing the end-to-end performance of legal RAG systems. As a benchmark, Legal RAG Bench consists of 4,876 passages from the Victorian Criminal Charge Book alongsid… \ Source • arXiv cs.CL • 11:34
- Scaling Retrieval Augmented Generation with RAG Fusion: Lessons from an Industry Deployment \ Retrieval-Augmented Generation (RAG) systems commonly adopt retrieval fusion techniques such as multi-query retrieval and reciprocal rank fusion (RRF) to increase document recall, under the assumption that higher recall leads to better ans… \ Source • arXiv cs.CL • 19:15
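The RRF technique this paper examines has a simple closed form: a document's fused score is the sum of 1/(k + rank) over every ranked list it appears in. A minimal sketch, not the paper's implementation — rankings are assumed to be lists of document IDs, and `k=60` is the conventional default:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc IDs: score(d) = sum over lists of 1/(k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

# Documents ranked highly by multiple retrievers float to the top.
fused = reciprocal_rank_fusion([["a", "b", "c"], ["b", "c", "a"]])
```

Because every list contributes, RRF raises recall at the fusion stage — the assumption the paper probes is whether that extra recall actually improves final answers.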
- Learning to Draft: Adaptive Speculative Decoding with Reinforcement Learning \ Speculative decoding accelerates large language model (LLM) inference by using a small draft model to generate candidate tokens for a larger target model to verify. The efficacy of this technique hinges on the trade-off between the time sp… \ Source • arXiv cs.CL • 10:17
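The verify step of speculative decoding can be illustrated with a greedy simplification: accept draft tokens while they match the target model's own greedy choice, substitute the target's token at the first mismatch, and stop. This is a toy sketch, not the paper's RL-based drafting policy; `target_next` stands in for a (much more expensive) call to the large model:

```python
def verify_draft(target_next, prefix, draft_tokens):
    """Greedy speculative verification.

    Accepts draft tokens while they agree with the target model's greedy
    next-token choice; on the first disagreement, emits the target's token
    instead and stops. Returns the list of accepted tokens.
    """
    accepted = []
    ctx = list(prefix)
    for tok in draft_tokens:
        t = target_next(ctx)      # target's greedy next token given context
        if t == tok:
            accepted.append(tok)  # draft was right: keep it, extend context
            ctx.append(tok)
        else:
            accepted.append(t)    # correct the mismatch, then stop
            break
    return accepted

# Toy target that deterministically spells "hello": the draft's wrong
# second token is corrected and verification stops there.
accepted = verify_draft(lambda ctx: "hello"[len(ctx)], ["h", "e"], ["l", "x", "o"])
```

The trade-off the paper optimizes is visible here: longer drafts amortize more target calls when accepted, but waste draft-model time when rejected early.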
- AMemGym: Interactive Memory Benchmarking for Assistants in Long-Horizon Conversations \ Long-horizon interactions between users and LLM-based assistants necessitate effective memory management, yet current approaches face challenges in training and evaluation of memory. Existing memory benchmarks rely on static, off-policy da… \ Source • arXiv cs.CL • 16:15
- Beyond the Grid: Layout-Informed Multi-Vector Retrieval with Parsed Visual Document Representations \ Harnessing the full potential of visually-rich documents requires retrieval systems that understand not just text, but intricate layouts, a core challenge in Visual Document Retrieval (VDR). The prevailing multi-vector architectures, while… \ Source • arXiv cs.CL • 10:55
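Multi-vector ("late interaction") retrieval, which this paper extends with layout information, typically scores a document via MaxSim: for each query token embedding, take the maximum similarity over all document token embeddings, then sum. A minimal NumPy sketch using plain dot-product similarity — the paper's layout-informed variant is not modeled:

```python
import numpy as np

def maxsim_score(query_vecs, doc_vecs):
    """ColBERT-style late interaction score.

    query_vecs: (n_query_tokens, dim), doc_vecs: (n_doc_tokens, dim).
    For each query token, take the best-matching document token, then sum.
    """
    sims = query_vecs @ doc_vecs.T        # (n_q, n_d) token-token similarities
    return float(sims.max(axis=1).sum())  # best doc token per query token

q = np.eye(2)                                      # two orthogonal query tokens
d = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 0.0]]) # three doc tokens
score = maxsim_score(q, d)
```

Each query token independently picks its best document token, which is what lets multi-vector models localize evidence inside a long or visually complex document.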
- Document Reconstruction Unlocks Scalable Long-Context RLVR \ Reinforcement Learning with Verifiable Rewards (RLVR) has become a prominent paradigm to enhance the capabilities (e.g., long-context) of Large Language Models (LLMs). However, it often relies on gold-standard answers or explicit evaluatio… \ Source • arXiv cs.CL • 08:07
- Topological Inductive Bias fosters Multiple Instance Learning in Data-Scarce Scenarios \ Multiple instance learning (MIL) is a framework for weakly supervised classification, where labels are assigned to sets of instances, i.e., bags, rather than to individual data points. This paradigm has proven effective in tasks where fine… \ Source • arXiv stat.ML • 14:55
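The bag-level labeling the abstract describes is often operationalized with the standard MIL assumption: a bag is positive iff at least one of its instances is positive, i.e. max-pooling over instance scores. A minimal sketch of that baseline only — the paper's topological inductive bias is not modeled here:

```python
import numpy as np

def bag_predict(instance_probs, threshold=0.5):
    """Standard MIL assumption: a bag is positive iff at least one instance
    is positive. Bag score = max over per-instance probabilities."""
    bag_score = float(np.max(instance_probs))
    return bag_score, bag_score >= threshold

# One confident positive instance makes the whole bag positive.
score, label = bag_predict([0.1, 0.2, 0.9])
```

Max-pooling makes the supervision signal depend on a single instance per bag, which is exactly why MIL struggles in data-scarce settings and why stronger inductive biases are attractive.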
- Organizing, Orchestrating, and Benchmarking Agent Skills at Ecosystem Scale \ The rapid proliferation of Claude agent skills has raised the central question of how to effectively leverage, manage, and scale the agent skill ecosystem. In this paper, we propose AgentSkillOS, the first principled framework for skill se… \ Source • arXiv cs.CL • 19:46
- OmniRet: Efficient and High-Fidelity Omni Modality Retrieval \ Multimodal retrieval is the task of aggregating information from queries across heterogeneous modalities to retrieve desired targets. State-of-the-art multimodal retrieval models can understand complex queries, yet they are typically limit… \ Source • arXiv cs.CL • 18:19
- ClinConsensus: A Consensus-Based Benchmark for Evaluating Chinese Medical LLMs across Difficulty Levels \ Large language models (LLMs) are increasingly applied to health management, showing promise across disease prevention, clinical decision-making, and long-term care. However, existing medical benchmarks remain largely static and task-isolat… \ Source • arXiv cs.CL • 18:17
- According to Me: Long-Term Personalized Referential Memory QA \ Personalized AI assistants must recall and reason over long-term user memory, which naturally spans multiple modalities and sources such as images, videos, and emails. However, existing Long-term Memory benchmarks focus primarily on dialog… \ Source • arXiv cs.CL • 16:42
- SimuHome: A Temporal- and Environment-Aware Benchmark for Smart Home LLM Agents \ We introduce SimuHome, a high-fidelity smart home simulator and a benchmark of 600 episodes for LLM-based smart home agents. Existing smart home benchmarks treat the home as a static system, neither simulating how device operati… \ Source • arXiv cs.CL • 12:33
Big Tech
No items today.
Regulation & Standards
No items today.
Enterprise Practice
No items today.
Open-Source Tooling
No items today.
— Personal views, not IBM. No tracking. Curated automatically; links under 24h old.