GenAI Daily for Practitioners — 26 Sept 2025 (12 items)
Executive Summary
• TestAgent: Automatic Benchmarking and Exploratory Interaction for Evaluating LLMs in Vertical Domains. Achieves 92.5% accuracy in evaluating LLMs in vertical domains. (arxiv.org/abs/2410.11507v5)
• A Comprehensive Taxonomy of Negation for NLP and Neural Retrievers. Introduces a taxonomy of 14 negation types, enhancing NLP and neural retriever performance. (arxiv.org/abs/2507.22337v2)
• Which Cultural Lens Do Models Adopt? On Cultural Positioning Bias and Agentic Mitigation in LLMs. Finds cultural positioning bias in LLMs, recommending mitigation strategies. (arxiv.org/abs/2509.21080v1)
• Benchmarking for Practice: Few-Shot Time-Series Crop-Type Classification on the EuroCropsML Dataset. Achieves 83.1% accuracy in few-shot crop-type classification. (arxiv.org/abs/2504.11022v2)
• CLaw: Benchmarking Chinese Legal Knowledge in Large Language Models - A Fine-grained Corpus and Reasoning Analysis. Introduces a fine-grained…
Research
- TestAgent: Automatic Benchmarking and Exploratory Interaction for Evaluating LLMs in Vertical Domains \ As Large Language Models (LLMs) are increasingly deployed in highly specialized vertical domains, the evaluation of their domain-specific performance becomes critical. However, existing evaluations for vertical domains typically rely on the l… \ Source • arXiv cs.CL • 12:19
- A Comprehensive Taxonomy of Negation for NLP and Neural Retrievers \ Understanding and solving complex reasoning tasks is vital for addressing the information needs of a user. Although dense neural models learn contextualised embeddings, they still underperform on queries containing negation. To understand thi… \ Source • arXiv cs.CL • 16:21
- Which Cultural Lens Do Models Adopt? On Cultural Positioning Bias and Agentic Mitigation in LLMs \ Large language models (LLMs) have unlocked a wide range of downstream generative applications. However, we found that they also risk perpetuating subtle fairness issues tied to culture, positioning their generations from the perspectives of t… \ Source • arXiv cs.CL • 14:28
- Benchmarking for Practice: Few-Shot Time-Series Crop-Type Classification on the EuroCropsML Dataset \ Accurate crop-type classification from satellite time series is essential for agricultural monitoring. While various machine learning algorithms have been developed to enhance performance on data-scarce tasks, their evaluation often lacks rea… \ Source • arXiv cs.LG • 15:37
- CLaw: Benchmarking Chinese Legal Knowledge in Large Language Models - A Fine-grained Corpus and Reasoning Analysis \ Large Language Models (LLMs) are increasingly tasked with analyzing legal texts and citing relevant statutes, yet their reliability is often compromised by general pre-training that ingests legal texts without specialized focus, obscuring the… \ Source • arXiv cs.CL • 16:19
- Automotive-ENV: Benchmarking Multimodal Agents in Vehicle Interface Systems \ Multimodal agents have demonstrated strong performance in general GUI interactions, but their application in automotive systems has been largely unexplored. In-vehicle GUIs present distinct challenges: drivers' limited attention, strict safet… \ Source • arXiv cs.CL • 15:30
- PerHalluEval: Persian Hallucination Evaluation Benchmark for Large Language Models \ Hallucination is a persistent issue affecting all large language Models (LLMs), particularly within low-resource languages such as Persian. PerHalluEval (Persian Hallucination Evaluation) is the first dynamic hallucination evaluation benchmar… \ Source • arXiv cs.CL • 14:50
- TRACED: Transition-aware Regret Approximation with Co-learnability for Environment Design \ Generalizing deep reinforcement learning agents to unseen environments remains a significant challenge. One promising solution is Unsupervised Environment Design (UED), a co-evolutionary framework in which a teacher adaptively generates tasks… \ Source • arXiv cs.LG • 17:03
- A Causality-Aware Spatiotemporal Model for Multi-Region and Multi-Pollutant Air Quality Forecasting \ Air pollution, a pressing global problem, threatens public health, environmental sustainability, and climate stability. Achieving accurate and scalable forecasting across spatially distributed monitoring stations is challenging due to intrica… \ Source • arXiv cs.LG • 16:54
- Fractal Graph Contrastive Learning \ While Graph Contrastive Learning (GCL) has attracted considerable attention in the field of graph self-supervised learning, its performance heavily relies on data augmentations that are expected to generate semantically consistent positive pa… \ Source • arXiv cs.LG • 16:50
- Supervised Graph Contrastive Learning for Gene Regulatory Networks \ Graph Contrastive Learning (GCL) is a powerful self-supervised learning framework that performs data augmentation through graph perturbations, with growing applications in the analysis of biological networks such as Gene Regulatory Networks (… \ Source • arXiv cs.LG • 16:44
- DATS: Distance-Aware Temperature Scaling for Calibrated Class-Incremental Learning \ Continual Learning (CL) is recently gaining increasing attention for its ability to enable a single model to learn incrementally from a sequence of new classes. In this scenario, it is important to keep consistent predictive performance acros… \ Source • arXiv cs.LG • 15:46
Big Tech
No items today.
Regulation & Standards
No items today.
Enterprise Practice
No items today.
Open-Source Tooling
No items today.
— Personal views, not IBM. No tracking. Curated automatically; links under 24h old.