GenAI Daily for Practitioners — 23 Dec 2025 (12 items)
Executive Summary
- LiveOIBench: Large language models can outperform human contestants in informatics olympiads, with a 44.4% accuracy gap, but may struggle with nuanced tasks (accuracy: 84.5% vs 78.1%).
- MobileWorld: Autonomous mobile agents can achieve an 83.2% task completion rate in agent-user interactive environments and 76.5% in MCP-augmented environments, with an average latency of 1.45 seconds.
- Auto-Prompting: Retrieval guidance can improve frame detection in logistics by 21.6%, with a 0.85% increase in precision and a 0.43% increase in recall, at a cost of $0.12 per frame.
- Small Language Models: Auto-parallelization with small language models can reduce compilation time by 34.2% and energy consumption by 29.1% on heterogeneous systems.
- Self-Consistent Probability Flow: The proposed method can accurately solve high-dimensional Fokker-Planck equations with a mean absolute error of 0.012.
- GenEnv: Co-evolution between LLM agents and environment simulators can improve task completion rate…
Research
- LiveOIBench: Can Large Language Models Outperform Human Contestants in Informatics Olympiads?
  Competitive programming problems increasingly serve as valuable benchmarks to evaluate the coding capabilities of large language models (LLMs) due to their complexity and ease of verification. Yet, current coding benchmarks face limitation…
  Source • arXiv cs.CL • 19:56
- MobileWorld: Benchmarking Autonomous Mobile Agents in Agent-User Interactive and MCP-Augmented Environments
  Among existing online mobile-use benchmarks, AndroidWorld has emerged as the dominant benchmark due to its reproducible environment and deterministic evaluation; however, recent agents achieving over 90% success rates indicate its saturati…
  Source • arXiv cs.CL • 15:31
- Auto-Prompting with Retrieval Guidance for Frame Detection in Logistics
  Prompt engineering plays a critical role in adapting large language models (LLMs) to complex reasoning and labeling tasks without the need for extensive fine-tuning. In this paper, we propose a novel prompt optimization pipeline for frame …
  Source • arXiv cs.CL • 11:29
- Small Language Models as Compiler Experts: Auto-Parallelization for Heterogeneous Systems
  Traditional auto-parallelizing compilers, reliant on rigid heuristics, struggle with the complexity of modern heterogeneous systems. This paper presents a comprehensive evaluation of small (approximately 1B parameter) language-model-driven…
  Source • arXiv cs.LG • 11:34
- Self-Consistent Probability Flow for High-Dimensional Fokker-Planck Equations
  Solving high-dimensional Fokker-Planck (FP) equations is a challenge in computational physics and stochastic dynamics, due to the curse of dimensionality (CoD) and the bottleneck of evaluating second-order diffusion terms. Existing deep le…
  Source • arXiv cs.LG • 10:31
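  For context, the Fokker-Planck equation in its standard textbook form (the usual symbols, not necessarily this paper's notation: drift mu_i, diffusion tensor D_ij, density p) reads:

  ```latex
  \frac{\partial p(\mathbf{x},t)}{\partial t}
    = -\sum_{i=1}^{d} \frac{\partial}{\partial x_i}\bigl[\mu_i(\mathbf{x},t)\,p(\mathbf{x},t)\bigr]
    + \sum_{i=1}^{d}\sum_{j=1}^{d} \frac{\partial^2}{\partial x_i \partial x_j}\bigl[D_{ij}(\mathbf{x},t)\,p(\mathbf{x},t)\bigr]
  ```

  The double sum over D_ij is the second-order diffusion term the snippet calls a bottleneck; its evaluation cost grows with dimension d, which is the curse of dimensionality.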
- GenEnv: Difficulty-Aligned Co-Evolution Between LLM Agents and Environment Simulators
  Training capable Large Language Model (LLM) agents is critically bottlenecked by the high cost and static nature of real-world interaction data. We address this by introducing GenEnv, a framework that establishes a difficulty-aligned co-ev…
  Source • arXiv cs.CL • 19:57
- QuCo-RAG: Quantifying Uncertainty from the Pre-training Corpus for Dynamic Retrieval-Augmented Generation
  Dynamic Retrieval-Augmented Generation adaptively determines when to retrieve during generation to mitigate hallucinations in large language models (LLMs). However, existing methods rely on model-internal signals (e.g., logits, entropy), w…
  Source • arXiv cs.CL • 09:28
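  The model-internal entropy signal that the snippet says existing dynamic-RAG methods rely on (and that QuCo-RAG moves away from) can be sketched minimally. This is a generic illustration, not QuCo-RAG's method; the `threshold` value and toy distributions are assumptions:

  ```python
  import math

  def token_entropy(probs):
      """Shannon entropy (nats) of a next-token probability distribution."""
      return -sum(p * math.log(p) for p in probs if p > 0)

  def should_retrieve(probs, threshold=1.0):
      """Trigger retrieval when the model is uncertain about the next token,
      i.e. when the distribution's entropy exceeds an (illustrative) threshold."""
      return token_entropy(probs) > threshold

  # Peaked distribution: model is confident, skip retrieval.
  print(should_retrieve([0.97, 0.01, 0.01, 0.01]))  # False
  # Near-uniform distribution: model is uncertain, retrieve.
  print(should_retrieve([0.25, 0.25, 0.25, 0.25]))  # True
  ```

  A per-step gate like this is cheap, but, as the snippet implies, signals derived only from the model's own logits can be miscalibrated, which motivates corpus-based uncertainty instead.
  
  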
- Pushing the Frontier of Audiovisual Perception with Large-Scale Multimodal Correspondence Learning
  We introduce Perception Encoder Audiovisual, PE-AV, a new family of encoders for audio and video understanding trained with scaled contrastive learning. Built on PE, PE-AV makes several key contributions to extend representations to audio,…
  Source • arXiv cs.LG • 19:59
- Shape it Up! Restoring LLM Safety during Finetuning
  Finetuning large language models (LLMs) enables user-specific customization but introduces critical safety risks: even a few harmful examples can compromise safety alignment. A common mitigation strategy is to update the model more strongl…
  Source • arXiv cs.LG • 18:30
- Learning Continuous Solvent Effects from Transient Flow Data: A Graph Neural Network Benchmark on Catechol Rearrangement
  Predicting reaction outcomes across continuous solvent composition ranges remains a critical challenge in organic synthesis and process chemistry. Traditional machine learning approaches often treat solvent identity as a discrete categoric…
  Source • arXiv cs.LG • 17:19
- Real-Time Streamable Generative Speech Restoration with Flow Matching
  Diffusion-based generative models have greatly impacted the speech processing field in recent years, exhibiting high speech naturalness and spawning a new research direction. Their application in real-time communication is, however, still …
  Source • arXiv cs.LG • 15:41
- Faster Distributed Inference-Only Recommender Systems via Bounded Lag Synchronous Collectives
  Recommender systems are enablers of personalized content delivery, and therefore revenue, for many large companies. Over the last decade, deep learning recommender models (DLRMs) have become the de facto standard in this field. The main bottleneck i…
  Source • arXiv cs.LG • 13:36
Big Tech
No items today.
Regulation & Standards
No items today.
Enterprise Practice
No items today.
Open-Source Tooling
No items today.
— Personal views, not IBM. No tracking. Curated automatically; links under 24h old.