GenAI Daily for Practitioners — 21 Oct 2025 (12 items)
Executive Summary
- HUME: Measuring the Human-Model Performance Gap in Text Embedding Tasks
  + Finds a significant gap between human and model performance in text embedding tasks
  + Suggests that human evaluation is necessary for robust model assessment
  + No concrete deployment or cost information provided
- SimBench: Benchmarking the Ability of Large Language Models to Simulate Human Behaviors
  + Introduces a benchmark for evaluating LLMs' ability to simulate human behaviors
Research
- HUME: Measuring the Human-Model Performance Gap in Text Embedding Tasks \ Comparing human and model performance offers a valuable perspective for understanding the strengths and limitations of embedding models, highlighting where they succeed and where they fail to capture meaning and nuance. However, such comparis… \ Source • arXiv cs.CL • 18:44
- SimBench: Benchmarking the Ability of Large Language Models to Simulate Human Behaviors \ Large language model (LLM) simulations of human behavior have the potential to revolutionize the social and behavioral sciences, if and only if they faithfully reflect real human behaviors. Current evaluations are fragmented, based on bespoke… \ Source • arXiv cs.CL • 15:14
- Addressing Antisocial Behavior in Multi-Party Dialogs Through Multimodal Representation Learning \ Antisocial behavior (ASB) on social media -- including hate speech, harassment, and cyberbullying -- poses growing risks to platform safety and societal well-being. Prior research has focused largely on networks such as X and Reddit, while … \ Source • arXiv cs.CL • 10:27
- Neural Bayes estimation and selection for complex bivariate extremal dependence models \ Likelihood-free approaches are appealing for performing inference on complex dependence models, either because it is not possible to formulate a likelihood function, or its evaluation is very computationally costly. This is the case for sever… \ Source • arXiv stat.ML • 17:01
- Foundational Automatic Evaluators: Scaling Multi-Task Generative Evaluator Training for Reasoning-Centric Domains \ Finetuning specialized generative evaluators has emerged as a popular paradigm to meet the increasing demand for scalable evaluation during both training and test-time. However, recent work has largely focused on applying new methodology, suc… \ Source • arXiv cs.CL • 19:52
- UltraCUA: A Foundation Model for Computer Use Agents with Hybrid Action \ Multimodal agents for computer use rely exclusively on primitive actions (click, type, scroll) that require accurate visual grounding and lengthy execution chains, leading to cascading failures and performance bottlenecks. While other agents … \ Source • arXiv cs.CL • 19:48
- MIRAGE: Agentic Framework for Multimodal Misinformation Detection with Web-Grounded Reasoning \ Misinformation spreads across web platforms through billions of daily multimodal posts that combine text and images, overwhelming manual fact-checking capacity. Supervised detection models require domain-specific training data and fail to gen… \ Source • arXiv cs.CL • 16:40
- Annotation-Efficient Universal Honesty Alignment \ Honesty alignment -- the ability of large language models (LLMs) to recognize their knowledge boundaries and express calibrated confidence -- is essential for trustworthy deployment. Existing methods either rely on training-free confidence estimati… \ Source • arXiv cs.CL • 15:05
- Lingua Custodi's participation at the WMT 2025 Terminology shared task \ While BERT is an effective method for learning monolingual sentence embeddings for semantic similarity and embedding based transfer learning, BERT based cross-lingual sentence embeddings have yet to be explored. We systematically investigate m… \ Source • arXiv cs.CL • 15:00
- Empowering Real-World: A Survey on the Technology, Practice, and Evaluation of LLM-driven Industry Agents \ With the rise of large language models (LLMs), LLM agents capable of autonomous reasoning, planning, and executing complex tasks have become a frontier in artificial intelligence. However, how to translate the research on general agents into … \ Source • arXiv cs.CL • 14:46
- CodeVisionary: An Agent-based Framework for Evaluating Large Language Models in Code Generation \ Large language models (LLMs) have demonstrated strong capabilities in code generation, underscoring the critical need for rigorous and comprehensive evaluation. Existing evaluation approaches fall into three categories, including human-center… \ Source • arXiv cs.CL • 14:00
- Infinity Parser: Layout Aware Reinforcement Learning for Scanned Document Parsing \ Document parsing from scanned images into structured formats remains a significant challenge due to its complexly intertwined elements such as text paragraphs, figures, formulas, and tables. Existing supervised fine-tuning methods often strug… \ Source • arXiv cs.CL • 13:03
Big Tech
No items today.
Regulation & Standards
No items today.
Enterprise Practice
No items today.
Open-Source Tooling
No items today.
— Personal views, not IBM. No tracking. Curated automatically; links under 24h old.