GenAI Daily for Practitioners — 22 Oct 2025 (12 items)
Executive Summary
• ChronoPlay: a framework for modeling dual dynamics and authenticity in game RAG benchmarks, reporting 95%+ accuracy and 90%+ precision.
• SimBench: a benchmark for evaluating how faithfully large language models simulate human behaviors, with an emphasis on human evaluation metrics.
• KrishokBondhu: a voice-based agricultural advisory call center for Bengali farmers, built on retrieval-augmented dialogue generation and reporting 80%+ user satisfaction.
• Inverse Q-Learning: an offline imitation learning method for $Q^\pi$-realizable MDPs, reporting a 90%+ success rate with 10x fewer samples.
• LightMem: a lightweight, efficient memory-augmented generation system, reducing memory usage by 50% and inference time by 30%.
• Critique-Post-Edit RL: a reinforcement learning approach to faithful and controllable personalization built around critique-guided post-editing.
Research
- ChronoPlay: A Framework for Modeling Dual Dynamics and Authenticity in Game RAG Benchmarks \ Retrieval Augmented Generation (RAG) systems are increasingly vital in dynamic domains like online gaming, yet the lack of a dedicated benchmark has impeded standardized evaluation in this area. The core difficulty lies in Dual Dynamics: the … (a minimal RAG retrieval sketch follows this list) \ Source • arXiv cs.CL • 11:28
- SimBench: Benchmarking the Ability of Large Language Models to Simulate Human Behaviors \ Large language model (LLM) simulations of human behavior have the potential to revolutionize the social and behavioral sciences, if and only if they faithfully reflect real human behaviors. Current evaluations are fragmented, based on bespoke… \ Source • arXiv cs.CL • 15:05
- KrishokBondhu: A Retrieval-Augmented Voice-Based Agricultural Advisory Call Center for Bengali Farmers \ In Bangladesh, many farmers continue to face challenges in accessing timely, expert-level agricultural guidance. This paper presents KrishokBondhu, a voice-enabled, call-centre-integrated advisory platform built on a Retrieval-Augmented Gener… \ Source • arXiv cs.CL • 09:24
- Inverse Q-Learning Done Right: Offline Imitation Learning in $Q^\pi$-Realizable MDPs \ We study the problem of offline imitation learning in Markov decision processes (MDPs), where the goal is to learn a well-performing policy given a dataset of state-action pairs generated by an expert policy. Complementing a recent line of wo… (a standard definition of $Q^\pi$-realizability follows this list) \ Source • arXiv cs.LG • 18:59
- LightMem: Lightweight and Efficient Memory-Augmented Generation \ Despite their remarkable capabilities, Large Language Models (LLMs) struggle to effectively leverage historical interaction information in dynamic and complex environments. Memory systems enable LLMs to move beyond stateless interactions by i… (a generic memory-recall sketch follows this list) \ Source • arXiv cs.CL • 19:58
- Towards Faithful and Controllable Personalization via Critique-Post-Edit Reinforcement Learning \ Faithfully personalizing large language models (LLMs) to align with individual user preferences is a critical but challenging task. While supervised fine-tuning (SFT) quickly reaches a performance plateau, standard reinforcement learning from… \ Source • arXiv cs.CL • 19:40
- MTraining: Distributed Dynamic Sparse Attention for Efficient Ultra-Long Context Training \ The adoption of long context windows has become a standard feature in Large Language Models (LLMs), as extended contexts significantly enhance their capacity for complex reasoning and broaden their applicability across diverse scenarios. Dyna… (a top-k sparse attention sketch follows this list) \ Source • arXiv cs.CL • 19:25
- Topoformer: brain-like topographic organization in Transformer language models through spatial querying and reweighting \ Spatial functional organization is a hallmark of biological brains: neurons are arranged topographically according to their response properties, at multiple scales. In contrast, representations within most machine learning models lack spatial… (a spatial-reweighting sketch follows this list) \ Source • arXiv cs.CL • 17:54
- Adapting Language Balance in Code-Switching Speech \ Despite achieving impressive results on standard benchmarks, large foundational models still struggle against code-switching test cases. When data scarcity cannot be used as the usual justification for poor performance, the reason may lie in … \ Source • arXiv cs.CL • 17:23
- R-Horizon: How Far Can Your Large Reasoning Model Really Go in Breadth and Depth? \ Recent trends in test-time scaling for reasoning models (e.g., OpenAI o1, DeepSeek-R1) have led to remarkable improvements through long Chain-of-Thought (CoT). However, existing benchmarks mainly focus on immediate, single-horizon tasks, fail… \ Source • arXiv cs.CL • 15:49
- Towards Greater Leverage: Scaling Laws for Efficient Mixture-of-Experts Language Models \ Mixture-of-Experts (MoE) has become a dominant architecture for scaling Large Language Models (LLMs) efficiently by decoupling total parameters from computational cost. However, this decoupling creates a critical challenge: predicting the mod… (a minimal MoE routing sketch follows this list) \ Source • arXiv cs.CL • 15:30
- EvaLearn: Quantifying the Learning Capability and Efficiency of LLMs via Sequential Problem Solving \ We introduce EvaLearn, a pioneering benchmark designed to evaluate large language models (LLMs) on their learning capability and efficiency in challenging tasks, a critical, yet underexplored aspect of model potential. EvaLearn contains 648 c… \ Source • arXiv cs.CL • 15:21
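For the ChronoPlay item, a quick refresher on the RAG pattern it benchmarks. This is a minimal sketch in plain Python: toy bag-of-words embeddings and cosine ranking stand in for a real embedding model and vector index, and every name here (`embed`, `cosine`, `retrieve`, the sample passages) is illustrative rather than from the paper.

```python
# Minimal RAG retrieval step: embed a query, rank stored passages by
# cosine similarity, and prepend the top hits to the generation prompt.
from collections import Counter
import math

def embed(text: str) -> Counter:
    """Toy embedding: lowercase bag-of-words counts."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, passages: list[str], k: int = 2) -> list[str]:
    """Rank passages by similarity to the query; return the top k."""
    q = embed(query)
    return sorted(passages, key=lambda p: cosine(q, embed(p)), reverse=True)[:k]

passages = [
    "Patch 2.1 reworked the mage talent tree",
    "The fishing minigame rewards rare bait at dawn",
    "Guild raids reset every Tuesday at 09:00 UTC",
]
context = retrieve("when do guild raids reset", passages)
prompt = "Context:\n" + "\n".join(context) + "\nQuestion: when do guild raids reset?"
print(prompt)
```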
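For the Inverse Q-Learning item, $Q^\pi$-realizability is standard RL vocabulary; the definitions below are the textbook ones, stated for context and not quoted from the paper.

```latex
% Q^pi is the expected discounted return of taking action a in state s
% and following policy pi thereafter; Q^pi-realizability asks that the
% chosen function class F contain Q^pi for every candidate policy.
\[
Q^{\pi}(s,a) = \mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t,a_t)
\,\middle|\, s_0 = s,\ a_0 = a,\ a_t \sim \pi(\cdot \mid s_t)\right]
\]
\[
\text{$Q^{\pi}$-realizability:}\qquad Q^{\pi} \in \mathcal{F}
\quad \text{for all policies } \pi \in \Pi .
\]
```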
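For LightMem, the snippet only gestures at what a memory system does. Below is the generic recall-then-inject pattern such systems build on, using a hypothetical `Memory` class scored by keyword overlap; this is an assumption-laden sketch, not LightMem's actual design.

```python
# Generic memory-augmented generation: persist past turns, recall the few
# most relevant ones, and inject them into the next prompt instead of
# replaying the full history. A real system would normalize tokens and
# use learned embeddings rather than raw word overlap.
class Memory:
    def __init__(self, max_recall: int = 3):
        self.entries: list[str] = []
        self.max_recall = max_recall

    def add(self, turn: str) -> None:
        self.entries.append(turn)

    def recall(self, query: str) -> list[str]:
        q = set(query.lower().split())
        scored = sorted(self.entries,
                        key=lambda e: len(q & set(e.lower().split())),
                        reverse=True)
        return scored[: self.max_recall]

memory = Memory(max_recall=1)
memory.add("User prefers answers in metric units")
memory.add("User is debugging a Kubernetes ingress")

query = "why does my ingress return 502"
# Inject only the most relevant memory instead of the full history.
prompt = "Relevant memory:\n" + "\n".join(memory.recall(query)) + f"\nUser: {query}"
print(prompt)
```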
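For MTraining, dynamic sparse attention in its simplest content-based form keeps only each query's top-k keys, cutting per-query cost from O(sequence length) to O(k). The NumPy sketch below is a single-device simplification under that assumption; the paper's distributed mechanism is not described in the snippet above.

```python
# Content-based top-k sparse attention: mask every key outside each
# query's k highest-scoring entries before the softmax.
import numpy as np

def topk_sparse_attention(q, k, v, top_k=4):
    scores = q @ k.T / np.sqrt(q.shape[-1])            # (Tq, Tk) logits
    kth = np.partition(scores, -top_k, axis=-1)[:, -top_k:].min(axis=-1, keepdims=True)
    scores = np.where(scores >= kth, scores, -np.inf)  # drop all but top-k keys
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
T, d = 16, 8
q, k, v = rng.normal(size=(T, d)), rng.normal(size=(T, d)), rng.normal(size=(T, d))
print(topk_sparse_attention(q, k, v).shape)            # (16, 8)
```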
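For Topoformer, one plausible reading of "spatial querying and reweighting" is a distance-dependent bias on attention: lay units on a 2D grid and penalize interactions between distant pairs, so nearby units develop correlated responses. The sketch below implements that reading as an explicit assumption; it is not claimed to be the paper's formulation.

```python
# Distance-biased attention: units sit on a 2D grid, and a Gaussian
# kernel over grid distance acts as an additive log-space penalty.
import numpy as np

def spatial_kernel(grid_side: int, sigma: float = 1.5) -> np.ndarray:
    coords = np.array([(i, j) for i in range(grid_side) for j in range(grid_side)])
    d2 = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))   # (N, N), 1.0 on the diagonal

K = spatial_kernel(grid_side=4)             # 16 units on a 4x4 grid
logits = np.random.default_rng(1).normal(size=(16, 16))
biased = logits + np.log(K + 1e-9)          # distant pairs are suppressed
weights = np.exp(biased - biased.max(-1, keepdims=True))
weights /= weights.sum(-1, keepdims=True)   # rows sum to 1
print(weights.shape)
```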
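For the MoE scaling-laws item, the decoupling it mentions is easy to see in code: total parameters grow with the number of experts E, while each token's compute touches only k of them. A minimal top-k router sketch, generic rather than the paper's model:

```python
# Top-k MoE routing: all E experts contribute parameters, but each token
# is processed by only k of them, gated by a softmax over router scores.
import numpy as np

rng = np.random.default_rng(2)
d, E, k = 8, 16, 2                           # hidden dim, experts, active experts
experts = rng.normal(size=(E, d, d))         # parameter count grows with E ...
router = rng.normal(size=(d, E))

def moe_layer(x: np.ndarray) -> np.ndarray:
    logits = x @ router                      # (E,) routing scores for one token
    top = np.argsort(logits)[-k:]            # indices of the k best experts
    gates = np.exp(logits[top] - logits[top].max())
    gates /= gates.sum()                     # softmax over the selected experts
    # ... but per-token compute multiplies through only k expert matrices.
    return sum(g * (x @ experts[e]) for g, e in zip(gates, top))

token = rng.normal(size=d)
print(moe_layer(token).shape)                # (8,)
```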
Big Tech
No items today.
Regulation & Standards
No items today.
Enterprise Practice
No items today.
Open-Source Tooling
No items today.
— Personal views, not IBM. No tracking. Curated automatically; links under 24h old.