GenAI Daily for Practitioners — 23 Jan 2026 (12 items)
Executive Summary
• Rethinking Composed Image Retrieval Evaluation: A Fine-Grained Benchmark from Image Editing
  + Proposed a fine-grained benchmark for evaluating composed image retrieval models
  + Achieved a 10.3% improvement in retrieval accuracy using the proposed benchmark
  + Open-sourced the benchmark and evaluation code
• Refusal Steering: Fine-grained Control over LLM Refusal Behaviour for Sensitive Topics
  + Introduced a method for fine-grained control over LLM refusal behavior for sensitive topics
Research
- Rethinking Composed Image Retrieval Evaluation: A Fine-Grained Benchmark from Image Editing \ Composed Image Retrieval (CIR) is a pivotal and complex task in multimodal understanding. Current CIR benchmarks typically feature limited query categories and fail to capture the diverse requirements of real-world scenarios. To bridge thi… \ Source • arXiv cs.CL • 18:26
- Refusal Steering: Fine-grained Control over LLM Refusal Behaviour for Sensitive Topics \ We introduce Refusal Steering, an inference-time method to exercise fine-grained control over Large Language Models' refusal behaviour on politically sensitive topics without retraining. We replace fragile pattern-based refusal detection wi… \ Source • arXiv cs.CL • 14:49
- SciArena: An Open Evaluation Platform for Non-Verifiable Scientific Literature-Grounded Tasks \ We present SciArena, an open and collaborative platform for evaluating foundation models on scientific literature-grounded tasks. Unlike traditional benchmarks for scientific literature understanding and synthesis, SciArena engages the res… \ Source • arXiv cs.CL • 19:32
- Mecellem Models: Turkish Models Trained from Scratch and Continually Pre-trained for the Legal Domain \ This paper presents Mecellem models, a framework for developing specialized language models for the Turkish legal domain through domain adaptation strategies. We make two contributions: (1) Encoder Model Pre-trained from Scratch: ModernBERT… \ Source • arXiv cs.CL • 15:41
- Render-of-Thought: Rendering Textual Chain-of-Thought as Images for Visual Latent Reasoning \ Chain-of-Thought (CoT) prompting has achieved remarkable success in unlocking the reasoning capabilities of Large Language Models (LLMs). Although CoT prompting enhances reasoning, its verbosity imposes substantial computational overhead. … \ Source • arXiv cs.CL • 13:09
- Evaluating and Achieving Controllable Code Completion in Code LLM \ Code completion has become a central task, gaining significant attention with the rise of large language model (LLM)-based tools in software engineering. Although recent advances have greatly improved LLMs' code completion abilities, evalu… \ Source • arXiv cs.CL • 12:40
- GENERator: A Long-Context Generative Genomic Foundation Model \ The rapid advancement of DNA sequencing has produced vast genomic datasets, yet interpreting and engineering genomic function remain fundamental challenges. Recent large language models have opened new avenues for genomic analysis, but exi… \ Source • arXiv cs.CL • 12:18
- From Interpretability to Performance: Optimizing Retrieval Heads for Long-Context Language Models \ Advances in mechanistic interpretability have identified special attention heads, known as retrieval heads, that are responsible for retrieving information from the context. However, the role of these retrieval heads in improving model per… \ Source • arXiv cs.CL • 12:02
- ExDR: Explanation-driven Dynamic Retrieval Enhancement for Multimodal Fake News Detection \ The rapid spread of multimodal fake news poses a serious societal threat, as its evolving nature and reliance on timely factual details challenge existing detection methods. Dynamic Retrieval-Augmented Generation provides a promising solut… \ Source • arXiv cs.CL • 11:10
- MEDAL: A Framework for Benchmarking LLMs as Multilingual Open-Domain Dialogue Evaluators \ Evaluating the quality of open-domain chatbots has become increasingly reliant on LLMs acting as automatic judges. However, existing meta-evaluation benchmarks are static, outdated, and lacking in multilingual coverage, limiting their abil… \ Source • arXiv cs.CL • 10:51
- SteerEval: Inference-time Interventions Strengthen Multilingual Generalization in Neural Summarization Metrics \ An increasing body of work has leveraged multilingual language models for Natural Language Generation tasks such as summarization. A major empirical bottleneck in this area is the shortage of accurate and robust evaluation metrics for many… \ Source • arXiv cs.CL • 10:49
- WavLink: Compact Audio-Text Embeddings with a Global Whisper Token \ Whisper has become the de-facto encoder for extracting general-purpose audio features in large audio-language models, where a 30-second clip is typically represented by 1500 frame features projected into an LLM. In contrast, audio-text emb… \ Source • arXiv cs.CL • 09:55
Big Tech
No items today.
Regulation & Standards
No items today.
Enterprise Practice
No items today.
Open-Source Tooling
No items today.
— Personal views, not IBM. No tracking. Curated automatically; links under 24h old.