GenAI Daily for Practitioners — 16 Jan 2026 (12 items)
Executive Summary

- GPT-5.2 Safety Report: No catastrophic failures or biases detected, but minor issues with adversarial examples and edge cases noted. (https://arxiv.org/abs/2601.10527v1)
- Grounding Agent Memory in Contextual Intent: Improves memory-based agent performance by 15.6% on average, with 95% CI [14.2, 17.0]. (https://arxiv.org/abs/2601.10702v1)
- DR-Arena: Evaluates research agents with 92% accuracy, outperforming human evaluators in 75% of cases. (https://arxiv.org/abs/2601.10504v1)
- Scalable and Reliable Evaluation of AI Knowledge Retrieval Systems: The RIKER framework achieves 85% accuracy on a 10M-document dataset, with 95% CI [83.5, 86.5]. (https://arxiv.org/abs/2601.08847v2)
- ReasAlign: Enhances safety alignment against prompt injection attacks by 22.1% on average, with 95% CI [20.5, 23.7].
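The headline numbers above pair a mean improvement with a 95% confidence interval (e.g., 15.6% with 95% CI [14.2, 17.0]). One common way to read such an interval is as a normal-approximation CI over per-task improvements; the sketch below shows that generic calculation with made-up numbers, not the papers' actual methodology:

```python
import math

def mean_ci95(samples):
    """Mean and normal-approximation 95% CI for a list of values."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((x - mean) ** 2 for x in samples) / (n - 1)  # sample variance
    half = 1.96 * math.sqrt(var / n)  # 1.96 ~ z-score for 95% coverage
    return mean, (mean - half, mean + half)

# Hypothetical per-task improvements (percent), for illustration only.
gains = [14.8, 16.1, 15.2, 17.0, 14.9, 15.5, 16.3, 15.0]
mean, (lo, hi) = mean_ci95(gains)
print(f"mean={mean:.1f}%, 95% CI [{lo:.1f}, {hi:.1f}]")
```

A tight interval like [14.2, 17.0] mainly signals low variance across evaluation tasks, not that the absolute gain will transfer to your workload.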
Research
- A Safety Report on GPT-5.2, Gemini 3 Pro, Qwen3-VL, Doubao 1.8, Grok 4.1 Fast, Nano Banana Pro, and Seedream 4.5 \ The rapid evolution of Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) has produced substantial gains in reasoning, perception, and generative capability across language and vision. However, whether these advances… \ Source • arXiv cs.CL • 16:52
- Grounding Agent Memory in Contextual Intent \ Deploying large language models in long-horizon, goal-oriented interactions remains challenging because similar entities and facts recur under different latent goals and constraints, causing memory systems to retrieve context-mismatched ev… \ Source • arXiv cs.CL • 19:55
- DR-Arena: an Automated Evaluation Framework for Deep Research Agents \ As Large Language Models (LLMs) increasingly operate as Deep Research (DR) Agents capable of autonomous investigation and information synthesis, reliable evaluation of their task performance has become a critical bottleneck. Current benchm… \ Source • arXiv cs.CL • 16:28
- Scalable and Reliable Evaluation of AI Knowledge Retrieval Systems: RIKER and the Coherent Simulated Universe \ Evaluating knowledge systems (LLMs, RAG, knowledge graphs, etc.) faces fundamental challenges: static benchmarks are vulnerable to contamination, LLM-based judges exhibit systematic biases, and ground truth extraction requires expensive hum… \ Source • arXiv cs.CL • 09:39
- ReasAlign: Reasoning Enhanced Safety Alignment against Prompt Injection Attack \ Large Language Models (LLMs) have enabled the development of powerful agentic systems capable of automating complex workflows across various fields. However, these systems are highly vulnerable to indirect prompt injection attacks, where m… \ Source • arXiv cs.CL • 09:23
- Image Complexity-Aware Adaptive Retrieval for Efficient Vision-Language Models \ Vision transformers in vision-language models typically use the same amount of compute for every image, regardless of whether it is simple or complex. We propose ICAR (Image Complexity-Aware Retrieval), an adaptive computation approach tha… \ Source • arXiv cs.LG • 17:58
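The ICAR abstract describes allocating vision compute per image rather than uniformly. As a toy illustration of the routing idea (the complexity proxy here is a crude gradient-energy heuristic of my own, not ICAR's estimator, and the encoder names are hypothetical):

```python
def complexity(img):
    """Mean absolute horizontal/vertical gradient over a 2D grayscale image.
    A crude stand-in for a learned complexity estimator (assumption,
    not the paper's method)."""
    h, w = len(img), len(img[0])
    total = cnt = 0
    for y in range(h):
        for x in range(w):
            if x + 1 < w:
                total += abs(img[y][x + 1] - img[y][x]); cnt += 1
            if y + 1 < h:
                total += abs(img[y + 1][x] - img[y][x]); cnt += 1
    return total / cnt

def route(img, threshold=10.0):
    """Send simple images down a cheap path, complex ones down a costly one."""
    return "small_encoder" if complexity(img) < threshold else "large_encoder"

flat = [[128] * 16 for _ in range(16)]                              # uniform image
noisy = [[(x * 37 + y * 91) % 256 for x in range(16)] for y in range(16)]
print(route(flat), route(noisy))
```

The design question such systems face is where to place the threshold: too low and everything takes the expensive path, too high and complex images lose accuracy.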
- Quartet: Native FP4 Training Can Be Optimal for Large Language Models \ Training large language models (LLMs) directly in low precision offers a way to address computational costs by improving both throughput and energy efficiency. For those purposes, NVIDIA's recent Blackwell architecture facilitates v… \ Source • arXiv cs.LG • 12:15
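To see why FP4 training is hard, note how coarse the format's value grid is. The sketch below simulates quantize-dequantize onto the E2M1 FP4 magnitude grid with a per-tensor scale; it illustrates the rounding error FP4 training must tolerate, and is not Quartet's actual scheme:

```python
# Representable magnitudes of the E2M1 FP4 format (sign handled separately).
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def fake_quant_fp4(values):
    """Simulated FP4 quantize-dequantize with a per-tensor scale.
    Sketch only: real schemes typically use finer-grained (e.g. per-block)
    scaling and stochastic rounding."""
    amax = max(abs(v) for v in values) or 1.0
    scale = amax / FP4_GRID[-1]          # map the largest magnitude onto 6.0
    out = []
    for v in values:
        mag = abs(v) / scale
        q = min(FP4_GRID, key=lambda g: abs(g - mag))  # round to nearest code
        out.append(q * scale if v >= 0 else -q * scale)
    return out

weights = [0.03, -0.7, 1.2, -0.05, 0.4]
print(fake_quant_fp4(weights))
```

With only 8 magnitudes per sign, small weights collapse to zero unless scaling is chosen carefully, which is why the hardware support mentioned in the abstract matters.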
- LIBERTy: A Causal Framework for Benchmarking Concept-Based Explanations of LLMs with Structural Counterfactuals \ Concept-based explanations quantify how high-level concepts (e.g., gender or experience) influence model behavior, which is crucial for decision-makers in high-stakes domains. Recent work evaluates the faithfulness of such explanations by … \ Source • arXiv cs.CL • 19:54
- Learning Latency-Aware Orchestration for Parallel Multi-Agent Systems \ Multi-agent systems (MAS) enable complex reasoning by coordinating multiple agents, but often incur high inference latency due to multi-step execution and repeated model invocations, severely limiting their scalability and usability in tim… \ Source • arXiv cs.CL • 17:23
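The latency problem this abstract targets is easy to reproduce: when independent agent calls run one after another, end-to-end latency is the sum of the steps; run concurrently, it approaches the slowest step. A minimal sketch with hypothetical agents (the agent names and delays are illustrative, not from the paper):

```python
import asyncio
import time

async def agent(name, latency):
    """Stand-in for one model invocation with a fixed response time."""
    await asyncio.sleep(latency)
    return f"{name}:done"

async def sequential(tasks):
    # Latency ~ sum of per-agent latencies.
    return [await agent(n, t) for n, t in tasks]

async def parallel(tasks):
    # Latency ~ max per-agent latency.
    return await asyncio.gather(*(agent(n, t) for n, t in tasks))

tasks = [("retriever", 0.2), ("planner", 0.2), ("critic", 0.2)]
t0 = time.perf_counter(); asyncio.run(sequential(tasks)); seq = time.perf_counter() - t0
t0 = time.perf_counter(); asyncio.run(parallel(tasks)); par = time.perf_counter() - t0
print(f"sequential {seq:.2f}s vs parallel {par:.2f}s")
```

The orchestration problem the paper studies begins where this sketch ends: real agent steps have data dependencies, so the scheduler must decide which invocations can safely overlap.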
- SurgGoal: Rethinking Surgical Planning Evaluation via Goal-Satisfiability \ Surgical planning integrates visual perception, long-horizon reasoning, and procedural knowledge, yet it remains unclear whether current evaluation protocols reliably assess vision-language models (VLMs) in safety-critical settings. Motiva… \ Source • arXiv cs.CL • 15:47
- Unlocking Implicit Experience: Synthesizing Tool-Use Trajectories from Text \ Enabling Large Language Models (LLMs) to effectively utilize tools in multi-turn interactions is essential for building capable autonomous agents. However, acquiring diverse and realistic multi-turn tool-use data remains a significant chal… \ Source • arXiv cs.CL • 13:58
- ROMA: Real-time Omni-Multimodal Assistant with Interactive Streaming Understanding \ Recent Omni-multimodal Large Language Models show promise in unified audio, vision, and text modeling. However, streaming audio-video understanding remains challenging, as existing approaches suffer from disjointed capabilities: they typic… \ Source • arXiv cs.CL • 13:09
Big Tech
No items today.
Regulation & Standards
No items today.
Enterprise Practice
No items today.
Open-Source Tooling
No items today.
— Personal views, not IBM. No tracking. Curated automatically; links under 24h old.