GenAI Daily for Practitioners — 14 Jan 2026 (12 items)
Executive Summary
• QuantEval: a benchmark for financial quantitative tasks in LLMs; models average 0.85 accuracy, with top performers reaching 0.92.
• ES-Mem: event segmentation-based memory for long-term dialogue agents; reported to improve dialogue length by 22% and coherence by 15% over state-of-the-art baselines.
• BenchOverflow: measures "overflow" (excessive output elicited by plain-text prompts); 40% of tested models exhibit overflow behavior, and 15% show significant performance degradation.
• CLaS-Bench: a cross-lingual alignment and steering benchmark; 0.73 average alignment accuracy across 10 languages, with top models reaching 0.84.
• Efficient and reproducible biomedical question answering with retrieval-augmented generation; reports 0.85 F1-score on SQuAD with an 85% reduction in computational cost versus traditional methods.
• Information Capacity: evaluates LLM efficiency via text compression; 30% compression ratio on average, with top models reaching 40%.
Research
- QuantEval: A Benchmark for Financial Quantitative Tasks in Large Language Models \ Large Language Models (LLMs) have shown strong capabilities across many domains, yet their evaluation in financial quantitative tasks remains fragmented and mostly limited to knowledge-centric question answering. We introduce QuantEval, a … \ Source • arXiv cs.CL • 17:14
- ES-Mem: Event Segmentation-Based Memory for Long-Term Dialogue Agents \ Memory is critical for dialogue agents to maintain coherence and enable continuous adaptation in long-term interactions. While existing memory mechanisms offer basic storage and retrieval capabilities, they are hindered by two primary limi… \ Source • arXiv cs.CL • 16:04
- BenchOverflow: Measuring Overflow in Large Language Models via Plain-Text Prompts \ We investigate a failure mode of large language models (LLMs) in which plain-text prompts elicit excessive outputs, a phenomenon we term Overflow. Unlike jailbreaks or prompt injection, Overflow arises under ordinary interaction settings a… \ Source • arXiv cs.CL • 13:22
- CLaS-Bench: A Cross-Lingual Alignment and Steering Benchmark \ Understanding and controlling the behavior of large language models (LLMs) is an increasingly important topic in multilingual NLP. Beyond prompting or fine-tuning, steering, i.e., manipulating internal representations during inference, has emerged… \ Source • arXiv cs.CL • 09:42
- Efficient and Reproducible Biomedical Question Answering using Retrieval Augmented Generation \ Biomedical question-answering (QA) systems require effective retrieval and generation components to ensure accuracy, efficiency, and scalability. This study systematically examines a Retrieval-Augmented Generation (RAG) system for biomedic… \ Source • arXiv cs.LG • 18:00
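The RAG pipeline shape the paper studies is straightforward to sketch: retrieve the most relevant passages, then condition generation on them. The scoring function and join-based "generator" below are hypothetical stand-ins for illustration only, not the paper's components:

```python
from typing import List


def score(query: str, doc: str) -> float:
    """Crude lexical-overlap relevance score (stand-in for a dense retriever)."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / (len(q) or 1)


def rag_answer(query: str, corpus: List[str], k: int = 2) -> str:
    """Retrieve top-k passages, then hand them to a (stubbed) generator."""
    top = sorted(corpus, key=lambda doc: score(query, doc), reverse=True)[:k]
    # A real system would have an LLM condition on `top`; here we just join it.
    return " | ".join(top)
```

Swapping `score` for an embedding-similarity function and the join for an LLM call turns this into a working biomedical QA loop.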
- Information Capacity: Evaluating the Efficiency of Large Language Models via Text Compression \ Recent years have witnessed the rapid advancements of large language models (LLMs) and their expanding applications, leading to soaring demands for computational resources. The widespread adoption of test-time scaling further intensifies t… \ Source • arXiv cs.CL • 10:23
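The compression framing is easy to make concrete: how well a system compresses text tracks how well it predicts it. As a rough illustration only (the paper's LLM-based metric isn't shown in the excerpt), a generic codec like zlib yields a measurable compression ratio:

```python
import zlib


def compression_ratio(text: str) -> float:
    """Ratio of compressed size to raw size; lower means more predictable text."""
    raw = text.encode("utf-8")
    compressed = zlib.compress(raw, level=9)
    return len(compressed) / len(raw)


# Highly repetitive text compresses well, so the ratio is far below 1.
sample = "the cat sat on the mat " * 40
print(round(compression_ratio(sample), 3))
```

The paper's idea is analogous but uses the LLM itself as the predictive model, so the ratio reflects the model's efficiency rather than a fixed codec's.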
- PosIR: Position-Aware Heterogeneous Information Retrieval Benchmark \ While dense retrieval models have achieved remarkable success, rigorous evaluation of their sensitivity to the position of relevant information (i.e., position bias) remains largely unexplored. Existing benchmarks typically employ position… \ Source • arXiv cs.CL • 10:22
- YRC-Bench: A Benchmark for Learning to Coordinate with Experts \ When deployed in the real world, AI agents will inevitably face challenges that exceed their individual capabilities. A critical component of AI safety is an agent's ability to recognize when it is likely to fail in a novel situation and t… \ Source • arXiv cs.LG • 14:52
- To Retrieve or To Think? An Agentic Approach for Context Evolution \ Current context augmentation methods, such as retrieval-augmented generation, are essential for solving knowledge-intensive reasoning tasks. However, they typically adhere to a rigid, brute-force strategy that executes retrieval at every st… \ Source • arXiv cs.CL • 18:25
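The contrast with fixed-schedule retrieval can be sketched as a simple control loop that retrieves only when a policy asks for more evidence. The decision function and helpers here are hypothetical placeholders, not the paper's method:

```python
from typing import Callable, List


def agentic_answer(
    question: str,
    should_retrieve: Callable[[List[str]], bool],  # policy: is more evidence needed?
    retrieve: Callable[[str], str],                # fetch one piece of context
    think: Callable[[str, List[str]], str],        # reason over gathered context
    max_steps: int = 4,
) -> str:
    """Retrieve on demand instead of at every step of the reasoning loop."""
    context: List[str] = []
    for _ in range(max_steps):
        if should_retrieve(context):
            context.append(retrieve(question))
        else:
            break  # policy says the evidence suffices; stop retrieving
    return think(question, context)
```

Replacing `should_retrieve` with a learned or LLM-based judgment is what makes the approach "agentic" rather than brute-force.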
- RULERS: Locked Rubrics and Evidence-Anchored Scoring for Robust LLM Evaluation \ The LLM-as-a-Judge paradigm promises scalable rubric-based evaluation, yet aligning frozen black-box models with human standards remains a challenge due to inherent generation stochasticity. We reframe judge alignment as a criteria transfe… \ Source • arXiv cs.CL • 16:31
- MRAG-Suite: A Diagnostic Evaluation Platform for Visual Retrieval-Augmented Generation \ Multimodal Retrieval-Augmented Generation (Visual RAG) significantly advances question answering by integrating visual and textual evidence. Yet, current evaluations fail to systematically account for query difficulty and ambiguity. We pro… \ Source • arXiv cs.CL • 14:26
- It's All About the Confidence: An Unsupervised Approach for Multilingual Historical Entity Linking using Large Language Models \ Despite the recent advancements in NLP with the advent of Large Language Models (LLMs), Entity Linking (EL) for historical texts remains challenging due to linguistic variation, noisy inputs, and evolving semantic conventions. Existing sol… \ Source • arXiv cs.CL • 13:36
Big Tech
No items today.
Regulation & Standards
No items today.
Enterprise Practice
No items today.
Open-Source Tooling
No items today.
— Personal views, not IBM. No tracking. Curated automatically; links under 24h old.