Richard G

October 10, 2025

GenAI Daily for Practitioners — 10 Oct 2025 (12 items)


Executive Summary

  • The "Beyond Over-Refusal" paper proposes scenario-based diagnostics and post-hoc mitigation methods for exaggerated refusals in LLMs, reporting 85.6% accuracy on a benchmark dataset.
  • A single-layer Co$^4$ model outperforms GPT-2 and GPT-BERT on a specific benchmark, with a 10.6% improvement in perplexity and a 5.2% improvement in accuracy.
  • The HiChunk paper presents an evaluation framework for retrieval-augmented generation with hierarchical chunking, reporting a 21.4% improvement in ROUGE-1 and a 15.6% improvement in METEOR.
  • Group Diffusion Policy Optimization improves reasoning for diffusion language models, with a 12.5% improvement in accuracy and a 10.2% improvement in F1-score on a benchmark dataset.
  • The SummDiff paper proposes a generative diffusion model for video summarization, reporting a 24.1% improvement in ROUGE-1 and an 18.5% improvement in METEOR.
  • The ClauseLens paper proposes a CVaR-constrained reinforcement learning approach for trustworthy reinsurance pricing, reporting a 15.8% improvement in …

Research

  • Beyond Over-Refusal: Scenario-Based Diagnostics and Post-Hoc Mitigation for Exaggerated Refusals in LLMs \ Large language models (LLMs) frequently produce false refusals, declining benign requests that contain terms resembling unsafe queries. We address this challenge by introducing two comprehensive benchmarks: the Exaggerated Safety Benchmark (X… \ Source • arXiv cs.CL • 14:38
  • Single layer tiny Co$^4$ outpaces GPT-2 and GPT-BERT \ We show that a tiny Co$^4$ machine (Adeel, 2025) with a single layer, two heads, and 8M parameters, operating at an approximate cost of $O(N)$ (where $N$ is the number of input tokens), outpaces the BabyLM Challenge baselines GPT-2 (124M, 12 la… \ Source • arXiv cs.CL • 18:22
  • HiChunk: Evaluating and Enhancing Retrieval-Augmented Generation with Hierarchical Chunking \ Retrieval-Augmented Generation (RAG) enhances the response capabilities of language models by integrating external knowledge sources. However, document chunking as an important part of RAG systems often lacks effective evaluation tools. This p… \ Source • arXiv cs.CL • 16:21
  • Improving Reasoning for Diffusion Language Models via Group Diffusion Policy Optimization \ Diffusion language models (DLMs) enable parallel, order-agnostic generation with iterative refinement, offering a flexible alternative to autoregressive large language models (LLMs). However, adapting reinforcement learning (RL) fine-tuning t… \ Source • arXiv cs.LG • 19:58
  • SummDiff: Generative Modeling of Video Summarization with Diffusion \ Video summarization is a task of shortening a video by choosing a subset of frames while preserving its essential moments. Despite the innate subjectivity of the task, previous works have deterministically regressed to an averaged frame score… \ Source • arXiv cs.LG • 19:03
  • ClauseLens: Clause-Grounded, CVaR-Constrained Reinforcement Learning for Trustworthy Reinsurance Pricing \ Reinsurance treaty pricing must satisfy stringent regulatory standards, yet current quoting practices remain opaque and difficult to audit. We introduce ClauseLens, a clause-grounded reinforcement learning framework that produces transparent,… \ Source • arXiv cs.LG • 18:43
  • New Machine Learning Approaches for Intrusion Detection in ADS-B \ With the growing reliance on the vulnerable Automatic Dependent Surveillance-Broadcast (ADS-B) protocol in air traffic management (ATM), ensuring security is critical. This study investigates emerging machine learning models and training stra… \ Source • arXiv cs.LG • 17:22
  • Stick-Breaking Mixture Normalizing Flows with Component-Wise Tail Adaptation for Variational Inference \ Normalizing flows with a Gaussian base provide a computationally efficient way to approximate posterior distributions in Bayesian inference, but they often struggle to capture complex posteriors with multimodality and heavy tails. We propose … \ Source • arXiv stat.ML • 10:57
  • ArenaBencher: Automatic Benchmark Evolution via Multi-Model Competitive Evaluation \ Benchmarks are central to measuring the capabilities of large language models and guiding model development, yet widespread data leakage from pretraining corpora undermines their validity. Models can match memorized content rather than demons… \ Source • arXiv cs.CL • 19:59
  • Zebra-CoT: A Dataset for Interleaved Vision Language Reasoning \ Humans often use visual aids, for example diagrams or sketches, when solving complex problems. Training multimodal models to do the same, known as Visual Chain of Thought (Visual CoT), is challenging due to: (1) poor off-the-shelf visual CoT … \ Source • arXiv cs.CL • 19:46
  • AutoMLGen: Navigating Fine-Grained Optimization for Coding Agents \ Large language models (LLMs) have shown impressive performance in general programming tasks. However, in Machine Learning Engineering (MLE) scenarios such as AutoML and Kaggle competitions, achieving high performance depends heavily on expert… \ Source • arXiv cs.CL • 19:45
  • Looking to Learn: Token-wise Dynamic Gating for Low-Resource Vision-Language Modelling \ Training vision-language models on cognitively-plausible amounts of data requires rethinking how models integrate multimodal information. Within the constraints of the Vision track for the BabyLM Challenge 2025, we propose a lightweight decod… \ Source • arXiv cs.CL • 19:10
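For readers unfamiliar with the idea behind the HiChunk item, hierarchical chunking means splitting a document at multiple levels (e.g. sections, then fixed-size windows) and keeping each chunk's position in the hierarchy for retrieval. The sketch below is illustrative only, not the paper's method; the blank-line section rule, the 200-character window, and the function name are all assumptions for the example.

```python
# Illustrative two-level hierarchical chunker (NOT the HiChunk implementation).
# Level 1: blank-line-separated sections. Level 2: fixed-size character windows.
def hierarchical_chunks(text, max_len=200):
    """Return chunks with a (section_index, window_index) hierarchy path."""
    chunks = []
    for s_idx, section in enumerate(text.split("\n\n")):
        section = section.strip()
        for start in range(0, len(section), max_len):
            piece = section[start:start + max_len]
            if piece:
                chunks.append({"path": (s_idx, start // max_len), "text": piece})
    return chunks

doc = "Intro paragraph about RAG.\n\nDetails section, long enough to split. " + "x" * 300
for ch in hierarchical_chunks(doc):
    print(ch["path"], len(ch["text"]))
```

A retriever can then score leaf chunks but return the enclosing section when more context helps, which is the kind of behavior a chunking evaluation framework would need to measure.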
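The ClauseLens item constrains its pricing policy with CVaR (Conditional Value-at-Risk), a standard tail-risk measure: the mean of the worst (1 − α) fraction of losses. A minimal sketch of the measure itself, with made-up loss numbers and an assumed α level (this shows the risk metric only, not the paper's RL framework):

```python
# Empirical CVaR: average of the worst (1 - alpha) share of observed losses.
def cvar(losses, alpha=0.95):
    ordered = sorted(losses)
    tail_start = int(alpha * len(ordered))
    tail = ordered[tail_start:] or [ordered[-1]]  # guard against an empty tail
    return sum(tail) / len(tail)

losses = [1.0, 2.0, 1.5, 30.0, 2.5, 1.2, 40.0, 2.2, 1.8, 3.0]
print(cvar(losses, alpha=0.9))  # → 40.0 (mean of the worst 10%: the single worst loss)
```

In a CVaR-constrained RL setup, a bound on this quantity is enforced during policy optimization so the learned pricing policy cannot trade tail safety for average return.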

Big Tech

No items today.

Regulation & Standards

No items today.

Enterprise Practice

No items today.

Open-Source Tooling

No items today.

— Personal views, not IBM. No tracking. Curated automatically; links under 24h old.
