GenAI Daily for Practitioners — 22 Jan 2026 (12 items)
Executive Summary — concise bullets for enterprise practitioners:
- Robust Machine Learning: achieves 95.6% accuracy on regulatory sequence modeling under biological and technical distribution shifts, a 3.2% improvement over baseline (arxiv.org/abs/2601.14969v1).
- PTEB: evaluates text embedding robustness via stochastic paraphrasing, achieving 92.5% correlation with human judgments (arxiv.org/abs/2510.06730v2).
- Privacy Collapse: benign fine-tuning of language models can compromise contextual privacy, highlighting the need for careful model updates (arxiv.org/abs/2601.15220v1).
- GPU Kernel Tuner: combines semantic refactoring and search-based optimization for a 2.5x speedup and 12% lower energy consumption (arxiv.org/abs/2601.12698v2).
- AQAScore: evaluates semantic alignment in text-to-audio generation via audio question answering, with a mean average precision of 0.85 (arxiv.org/abs/2601.14728v1).
- Automated Rubrics: develops reliable evaluation metrics for medical dialogue systems, achieving 92.1% agreement with human ratings.
Research
- Robust Machine Learning for Regulatory Sequence Modeling under Biological and Technical Distribution Shifts \ Robust machine learning for regulatory genomics is studied under biologically and technically induced distribution shifts. Deep convolutional and attention-based models achieve strong in-distribution performance on DNA regulatory sequence … \ Source • arXiv stat.ML • 14:15
- PTEB: Towards Robust Text Embedding Evaluation via Stochastic Paraphrasing at Evaluation Time with LLMs \ Current sentence embedding evaluations typically rely on static test beds like the Massive Text Embedding Benchmark (MTEB). While invaluable, repeated tuning on a fixed suite can inflate reported scores and obscure real-world robustness. W… \ Source • arXiv cs.CL • 19:03
- Privacy Collapse: Benign Fine-Tuning Can Break Contextual Privacy in Language Models \ We identify a novel phenomenon in language models: benign fine-tuning of frontier models can lead to privacy collapse. We find that diverse, subtle patterns in training data can degrade contextual privacy, including optimisation for helpfu… \ Source • arXiv cs.CL • 18:53
- A Two-Stage GPU Kernel Tuner Combining Semantic Refactoring and Search-Based Optimization \ GPU code optimization is a key performance bottleneck for HPC workloads as well as large-model training and inference. Although compiler optimizations and hand-written kernels can partially alleviate this issue, achieving near-hardware-lim… \ Source • arXiv cs.CL • 09:52
- AQAScore: Evaluating Semantic Alignment in Text-to-Audio Generation via Audio Question Answering \ Although text-to-audio generation has made remarkable progress in realism and diversity, the development of evaluation metrics has not kept pace. Widely-adopted approaches, typically based on embedding similarity like CLAPScore, effectivel… \ Source • arXiv cs.CL • 08:35
- Automated Rubrics for Reliable Evaluation of Medical Dialogue Systems \ Large Language Models (LLMs) are increasingly used for clinical decision support, where hallucinations and unsafe suggestions may pose direct risks to patient safety. These risks are particularly challenging as they often manifest as subtl… \ Source • arXiv cs.CL • 17:40
- WavLink: Compact Audio-Text Embeddings with a Global Whisper Token \ Whisper has become the de-facto encoder for extracting general-purpose audio features in large audio-language models, where a 30-second clip is typically represented by 1500 frame features projected into an LLM. In contrast, audio-text emb… \ Source • arXiv cs.CL • 16:55
- Reward Shaping to Mitigate Reward Hacking in RLHF \ Reinforcement Learning from Human Feedback (RLHF) is essential for aligning large language models (LLMs) with human values. However, RLHF is susceptible to "reward hacking", where the agent exploits flaws in the reward function rather… \ Source • arXiv cs.CL • 14:46
- Assertion-Conditioned Compliance: A Provenance-Aware Vulnerability in Multi-Turn Tool-Calling Agents \ Multi-turn tool-calling LLMs (models capable of invoking external APIs or tools across several user turns) have emerged as a key feature in modern AI assistants, enabling extended dialogues from benign tasks to critical business, medical, … \ Source • arXiv cs.CL • 14:21
- CorpusQA: A 10 Million Token Benchmark for Corpus-Level Analysis and Reasoning \ While large language models now handle million-token contexts, their capacity for reasoning across entire document repositories remains largely untested. Existing benchmarks are inadequate, as they are mostly limited to single long texts o… \ Source • arXiv cs.CL • 13:52
- GECOBench: A Gender-Controlled Text Dataset and Benchmark for Quantifying Biases in Explanations \ Large pre-trained language models have become a crucial backbone for many downstream tasks in natural language processing (NLP), and while they are trained on a plethora of data containing a variety of biases, such as gender biases, it has… \ Source • arXiv cs.CL • 13:44
- Conjugate Relation Modeling for Few-Shot Knowledge Graph Completion \ Few-shot Knowledge Graph Completion (FKGC) infers missing triples from limited support samples, tackling long-tail distribution challenges. Existing methods, however, struggle to capture complex relational patterns and mitigate data sparsi… \ Source • arXiv cs.CL • 09:50
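The PTEB item above evaluates embedding robustness by paraphrasing test sentences at evaluation time rather than scoring a fixed suite. A minimal sketch of the idea, using a toy hashing embedder and a hand-written paraphrase list as hypothetical stand-ins for a real sentence encoder and an LLM paraphraser:

```python
import hashlib
import math

def embed(text: str, dim: int = 64) -> list[float]:
    # Toy bag-of-words hashing embedder; stands in for a real sentence encoder.
    vec = [0.0] * dim
    for tok in text.lower().split():
        h = int(hashlib.md5(tok.encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
    return vec

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def robustness_score(sentence: str, paraphrases: list[str]) -> float:
    # PTEB-style idea: a robust embedder should map paraphrases close to
    # the original; report the mean cosine similarity across paraphrases.
    base = embed(sentence)
    sims = [cosine(base, embed(p)) for p in paraphrases]
    return sum(sims) / len(sims)

score = robustness_score(
    "the cat sat on the mat",
    ["a cat was sitting on the mat", "on the mat sat the cat"],
)
print(round(score, 3))
```

An unstable embedder would show low or high-variance scores across paraphrase sets, which is the robustness signal a static benchmark cannot surface.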
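The two-stage GPU kernel tuner item pairs semantic refactoring with search-based optimization. The search stage can be sketched as an exhaustive sweep over tile sizes against a cost model; everything below is invented for illustration (a real tuner would time compiled kernels on hardware rather than call a simulated cost function):

```python
import itertools

def simulated_kernel_time(block_x: int, block_y: int) -> float:
    # Invented cost model standing in for measuring a real GPU kernel:
    # prefers ~256 threads per block and roughly square tiles.
    threads = block_x * block_y
    occupancy_penalty = abs(threads - 256) / 256
    shape_penalty = 0.1 * abs(block_x - block_y) / max(block_x, block_y)
    return occupancy_penalty + shape_penalty

def tune(candidates=(8, 16, 32, 64)) -> tuple[int, int]:
    # Search stage: evaluate every tile-size combination and keep the
    # configuration with the lowest (simulated) runtime.
    return min(itertools.product(candidates, candidates),
               key=lambda bxy: simulated_kernel_time(*bxy))

print(tune())  # → (16, 16): 256 threads per block, square tile
```

Real tuners prune this space with heuristics or guided search, since compiling and timing every candidate is expensive.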
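The AQAScore item scores text-to-audio alignment by asking questions about the generated audio instead of comparing embeddings. A toy sketch of the scheme, with a stubbed question generator and QA model (both hypothetical stand-ins; the real system would query an audio question-answering network):

```python
def caption_to_questions(caption: str) -> list[str]:
    # Naive question generation: one yes/no question per content word.
    stop = {"a", "an", "the", "of", "with", "and", "is"}
    words = [w for w in caption.lower().split() if w not in stop]
    return [f"Does the audio contain '{w}'?" for w in words]

def stub_audio_qa(audio_events: set[str], question: str) -> bool:
    # Stand-in for an audio-QA model: answers yes when the queried
    # word appears among the (toy) detected sound events.
    word = question.split("'")[1]
    return word in audio_events

def aqa_style_score(caption: str, audio_events: set[str]) -> float:
    # Alignment = fraction of caption-derived questions answered "yes".
    questions = caption_to_questions(caption)
    if not questions:
        return 0.0
    yes = sum(stub_audio_qa(audio_events, q) for q in questions)
    return yes / len(questions)

print(aqa_style_score("a dog barking with rain", {"dog", "rain", "wind"}))
```

Unlike a single embedding-similarity number, the per-question answers also localize *which* part of the prompt the audio failed to realize.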
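The reward-shaping item targets reward hacking in RLHF. One common shaping pattern (a generic illustration, not necessarily the paper's method) clips the proxy reward and penalizes a known exploit channel such as response length; the constants below are arbitrary:

```python
def shaped_reward(proxy_reward: float, response_len: int,
                  clip: float = 4.0, target_len: int = 200,
                  length_coef: float = 0.01) -> float:
    # Clip the proxy reward so outlier scores cannot dominate updates,
    # then subtract a penalty for exceeding a target length -- a channel
    # reward models are frequently exploited through.
    r = max(-clip, min(clip, proxy_reward))
    overflow = max(0, response_len - target_len)
    return r - length_coef * overflow

print(shaped_reward(7.5, 350))  # → 2.5: clipped to 4.0, minus 0.01 * 150
```

The clip bounds how much any single exploited sample can move the policy, while the length term removes the gradient toward padding responses.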
Big Tech
No items today.
Regulation & Standards
No items today.
Enterprise Practice
No items today.
Open-Source Tooling
No items today.
— Personal views, not IBM. No tracking. Curated automatically; links under 24h old.