GenAI Daily for Practitioners — 12 May 2026 (12 items)
Executive Summary
• Workspace-Bench 1.0: A benchmark for AI agents on workspace tasks with large-scale file dependencies, where agents must identify, reason over, exploit, and update explicit and implicit dependencies among heterogeneous files. (Source: arXiv:2605.03596v2)
• Inductive Entity Representations from Text via Link Prediction: Builds entity representations for knowledge graphs from text, trained via link prediction. (Source: arXiv:2010.03496v4)
• Phoenix-VL 1.5 Medium: Technical report on a 123B-parameter natively multimodal and multilingual foundation model, adapted to regional languages and the Singapore context. (Source: arXiv:2605.10391v1)
• SLR: Automated Synthesis for Scalable Logical Reasoning: An end-to-end framework for systematic evaluation and training of LLMs on logical reasoning, synthesized automatically from a user's task specification. (Source: arXiv:2506.15787v6)
• Calibrating Scientific Foundation Models with Inference-Time Stochastic Attention: Targets calibrated predictive uncertainty for Transformer-based scientific foundation models, whose current architectures give deterministic outputs. (Source: arXiv:2604.19530v2)
• WildClawBench
Research
- Workspace-Bench 1.0: Benchmarking AI Agents on Workspace Tasks with Large-Scale File Dependencies \ Workspace learning requires AI agents to identify, reason over, exploit, and update explicit and implicit dependencies among heterogeneous files in a worker's workspace, enabling them to complete both routine and advanced tasks effectively… \ Source • arXiv cs.CL • 19:14
- Inductive Entity Representations from Text via Link Prediction \ Knowledge Graphs (KG) are of vital importance for multiple applications on the web, including information retrieval, recommender systems, and metadata annotation. Regardless of whether they are built manually by domain experts or with auto… \ Source • arXiv cs.CL • 16:01
- Phoenix-VL 1.5 Medium Technical Report \ We introduce Phoenix-VL 1.5 Medium, a 123B-parameter natively multimodal and multilingual foundation model, adapted to regional languages and the Singapore context. Developed as a sovereign AI asset, it demonstrates that deep domain adapta… \ Source • arXiv cs.CL • 13:36
- SLR: Automated Synthesis for Scalable Logical Reasoning \ We introduce SLR, an end-to-end framework for systematic evaluation and training of Large Language Models (LLMs) via Scalable Logical Reasoning. Given a user's task specification, SLR automatically synthesizes (i) an instruction prompt for… \ Source • arXiv cs.CL • 11:25
- Calibrating Scientific Foundation Models with Inference-Time Stochastic Attention \ Transformer-based scientific foundation models are increasingly deployed in high-stakes settings, but current architectures give deterministic outputs and provide limited support for calibrated predictive uncertainty. We propose Stochastic… \ Source • arXiv cs.LG • 18:27
- WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation \ Large language and vision-language models increasingly power agents that act on a user's behalf through command-line interface (CLI) harnesses. However, most agent benchmarks still rely on synthetic sandboxes, short-horizon tasks, mock-ser… \ Source • arXiv cs.CL • 19:49
- VeRO: An Evaluation Harness for Agents to Optimize Agents \ An important emerging application of coding agents is agent optimization: the iterative improvement of a target agent through edit-execute-evaluate cycles. Despite its relevance, the community lacks a systematic understanding of coding age… \ Source • arXiv cs.CL • 18:21
- LITMUS: Benchmarking Behavioral Jailbreaks of LLM Agents in Real OS Environments \ The rapid proliferation of LLM-based autonomous agents in real operating system environments introduces a new category of safety risk beyond content safety: behavior jailbreak, where an adversary induces an agent to execute dangerous OS-le… \ Source • arXiv cs.CL • 18:14
- AlpsBench: An LLM Personalization Benchmark for Real-Dialogue Memorization and Preference Alignment \ As Large Language Models (LLMs) evolve into lifelong AI assistants, LLM personalization has become a critical frontier. However, progress is currently bottlenecked by the absence of a gold-standard evaluation benchmark. Existing benchmarks… \ Source • arXiv cs.CL • 18:07
- Recursive Language Models \ We study allowing large language models (LLMs) to process arbitrarily long prompts through the lens of inference-time scaling. We propose Recursive Language Models (RLMs), a general inference paradigm that treats long prompts as part of an… \ Source • arXiv cs.CL • 17:26
- Elastic MoE: Unlocking the Inference-Time Scalability of Mixture-of-Experts \ Mixture-of-Experts (MoE) models typically fix the number of activated experts $k$ at both training and inference. However, real-world deployments often face heterogeneous hardware, fluctuating workloads, and diverse quality-latency require… \ Source • arXiv cs.CL • 16:39
- VISTA: A Generative Egocentric Video Framework for Daily Assistance \ Training AI agents to proactively assist humans in daily activities, from routine household tasks to urgent safety situations, requires large-scale visual data. However, capturing such scenarios in the real world is often difficult, costly… \ Source • arXiv cs.CL • 15:50
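Practitioner note on the Elastic MoE item above: the abstract's "number of activated experts $k$" refers to top-k routing, where a router scores every expert and only the k best are run. The sketch below is a minimal, generic illustration of that mechanism with k exposed as a call-time parameter. It is not the paper's method. The toy experts, router logits, and function names are all hypothetical, and it says nothing about how a model should be trained to tolerate a changing k.

```python
import math

def top_k_gate(logits, k):
    """Return {expert_index: weight} for the k highest-scoring experts."""
    if not 1 <= k <= len(logits):
        raise ValueError("k must be between 1 and the number of experts")
    # Indices of the k largest router logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax over the selected logits only (subtract the max for stability).
    m = max(logits[i] for i in top)
    exps = {i: math.exp(logits[i] - m) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}

def moe_forward(x, experts, router_logits, k):
    """Weighted sum of the outputs of the k selected experts for a scalar x."""
    gates = top_k_gate(router_logits, k)
    return sum(w * experts[i](x) for i, w in gates.items())

# Hypothetical toy experts: each is a simple scalar function.
experts = [lambda x: x + 1.0, lambda x: 2.0 * x, lambda x: x * x, lambda x: -x]
# Hypothetical router scores for one input; a real router computes these from x.
router_logits = [0.1, 2.0, 1.5, -1.0]

if __name__ == "__main__":
    # Same layer, different compute budgets: only k experts are evaluated.
    for k in (1, 2, 4):
        print(k, moe_forward(3.0, experts, router_logits, k))
```

Raising k evaluates more experts per input (more compute), while lowering it evaluates fewer, which is the quality-latency trade-off the abstract describes.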
Big Tech
No items today.
Regulation & Standards
No items today.
Enterprise Practice
No items today.
Open-Source Tooling
No items today.
— Personal views, not IBM. No tracking. Curated automatically; links under 24h old.