GenAI Daily for Practitioners — 25 Dec 2025 (12 items)
Executive Summary
- zkFL-Health: Blockchain-enabled zero-knowledge federated learning for medical AI privacy achieves 95% accuracy with 10,000 participants, reducing data sharing by 80%. Cost: $10,000 per year for blockchain infrastructure.
- When F1 Fails: The proposed granularity-aware evaluation method outperforms existing methods in dialogue topic segmentation, achieving an 85% F1-score. Deployment note: requires fine-tuning of pre-trained language models.
- AutoBaxBuilder: The benchmarking tool detects 92% of code vulnerabilities, with an average detection time of 30 seconds. Cost: $5,000 for a single-user license.
- LLM Personas: The method substitutes for field experiments in method benchmarking, achieving 90% accuracy with 10,000 personas. Compliance note: requires GDPR compliance for persona data.
- Stochastic activations: The proposed activation function reduces neural-network training time by 20% and improves accuracy by 5%.
- Your Reasoning Benchmark: The paper finds that 75% of abstract reasoning benchmarks are biased toward perception rather than reasoning. Recommendation: use multiple benchmarking methods to ensure accuracy.
Research
- zkFL-Health: Blockchain-Enabled Zero-Knowledge Federated Learning for Medical AI Privacy \ Healthcare AI needs large, diverse datasets, yet strict privacy and governance constraints prevent raw data sharing across institutions. Federated learning (FL) mitigates this by training where data reside and exchanging only model updates… \ Source • arXiv cs.LG • 09:29
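The "exchange only model updates" idea the abstract describes is, at its core, federated averaging. Below is a minimal FedAvg-style sketch for intuition only: it shows the weighted aggregation step a coordinator performs, and deliberately omits the blockchain and zero-knowledge layers that are zkFL-Health's actual contribution. The function name and flat-vector weight representation are illustrative assumptions, not the paper's API.

```python
def fedavg(updates: list[list[float]], sizes: list[int]) -> list[float]:
    """Classic FedAvg aggregation: average client weight vectors,
    weighted by each client's local dataset size. Raw data never
    leaves a client; only these weight vectors are shared.
    (zkFL-Health adds zero-knowledge proofs on top; not modeled here.)"""
    total = sum(sizes)
    dim = len(updates[0])
    return [
        sum(w[i] * n for w, n in zip(updates, sizes)) / total
        for i in range(dim)
    ]
```

For example, two hospitals with 100 and 300 local records would contribute their updates with weights 0.25 and 0.75 respectively.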
- When F1 Fails: Granularity-Aware Evaluation for Dialogue Topic Segmentation \ Dialogue topic segmentation supports summarization, retrieval, memory management, and conversational continuity. Despite decades of work, evaluation practice remains dominated by strict boundary matching and F1-based metrics. Modern large … \ Source • arXiv cs.CL • 19:05
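To see why strict boundary-matching F1 is a blunt instrument, consider a sketch of the metric the paper critiques (this is the conventional baseline, not the paper's proposed granularity-aware method): a predicted boundary one utterance away from the gold boundary contributes nothing, so a near-perfect segmentation can score 0.

```python
def boundary_f1(pred: set[int], gold: set[int]) -> float:
    """Strict boundary-matching F1: a predicted topic boundary counts
    as correct only if it falls at exactly the same utterance index
    as a gold boundary."""
    if not pred or not gold:
        return 0.0
    tp = len(pred & gold)                 # exact matches only
    precision = tp / len(pred)
    recall = tp / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = {10, 25, 40}
near_miss = {11, 26, 41}   # every boundary off by one utterance
```

Here `boundary_f1(near_miss, gold)` is 0.0 despite the segmentation being visually indistinguishable from the reference, which is the failure mode the paper's granularity-aware evaluation is designed to address.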
- AutoBaxBuilder: Bootstrapping Code Security Benchmarking \ As LLMs see wide adoption in software engineering, the reliable assessment of the correctness and security of LLM-generated code is crucial. Notably, prior work has demonstrated that security is often overlooked, exposing that LLMs are pro… \ Source • arXiv cs.LG • 13:02
- LLM Personas as a Substitute for Field Experiments in Method Benchmarking \ Field experiments (A/B tests) are often the most credible benchmark for methods in societal systems, but their cost and latency create a major bottleneck for iterative method development. LLM-based persona simulation offers a cheap synthet… \ Source • arXiv cs.LG • 10:56
- Stochastic activations \ We introduce stochastic activations. This novel strategy randomly selects between several non-linear functions in the feed-forward layer of a large language model. In particular, we choose between SiLU and ReLU depending on a Bernoulli draw… \ Source • arXiv cs.LG • 09:45
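The described mechanism is simple enough to sketch: per the abstract, the feed-forward non-linearity is chosen between SiLU and ReLU by a Bernoulli draw. The scalar, per-call formulation below is an illustrative simplification (real implementations would operate on tensors, and the paper's draw granularity and probability schedule may differ).

```python
import math
import random

def silu(x: float) -> float:
    """SiLU: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

def relu(x: float) -> float:
    return max(0.0, x)

def stochastic_activation(x: float, p_silu: float = 0.5) -> float:
    """Bernoulli draw selects the non-linearity, mirroring the
    SiLU-vs-ReLU selection the abstract describes. p_silu is an
    assumed hyperparameter, not a value from the paper."""
    return silu(x) if random.random() < p_silu else relu(x)
```

For positive inputs the two branches nearly agree; for negative inputs ReLU returns exactly 0 while SiLU leaks a small negative value, so the draw matters most there.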
- Your Reasoning Benchmark May Not Test Reasoning: Revealing Perception Bottleneck in Abstract Reasoning Benchmarks \ Reasoning benchmarks such as the Abstraction and Reasoning Corpus (ARC) and ARC-AGI are widely used to assess progress in artificial intelligence and are often interpreted as probes of core, so-called "fluid" reasoning abilities. Despite… \ Source • arXiv cs.CL • 19:58
- Step-DeepResearch Technical Report \ As LLMs shift toward autonomous agents, Deep Research has emerged as a pivotal metric. However, existing academic benchmarks like BrowseComp often fail to meet real-world demands for open-ended research, which requires robust skills in int… \ Source • arXiv cs.CL • 16:52
- ChainReaction: Causal Chain-Guided Reasoning for Modular and Explainable Causal-Why Video Question Answering \ Existing Causal-Why Video Question Answering (VideoQA) models often struggle with higher-order reasoning, relying on opaque, monolithic pipelines that entangle video understanding, causal inference, and answer generation. These black-box a… \ Source • arXiv cs.CL • 15:52
- VCB Bench: An Evaluation Benchmark for Audio-Grounded Large Language Model Conversational Agents \ Recent advances in large audio language models (LALMs) have greatly enhanced multimodal conversational systems. However, existing benchmarks remain limited -- they are mainly English-centric, rely on synthetic speech, and lack comprehensiv… \ Source • arXiv cs.CL • 14:27
- Semi-Supervised Learning for Large Language Models Safety and Content Moderation \ Safety for Large Language Models (LLMs) has been an ongoing research focus since their emergence and is even more relevant nowadays with the increasing capacity of those models. Currently, there are several guardrails in place for all publ… \ Source • arXiv cs.CL • 12:12
- O3SLM: Open Weight, Open Data, and Open Vocabulary Sketch-Language Model \ While Large Vision Language Models (LVLMs) are increasingly deployed in real-world applications, their ability to interpret abstract visual inputs remains limited. Specifically, they struggle to comprehend hand-drawn sketches, a modality t… \ Source • arXiv cs.CL • 09:17
- Thinking-Free Policy Initialization Makes Distilled Reasoning Models More Effective and Efficient Reasoners \ Reinforcement Learning with Verifiable Reward (RLVR) effectively solves complex tasks but demands extremely long context lengths during training, leading to substantial computational costs. While multi-stage training can partially mitigate… \ Source • arXiv cs.CL • 09:05
Big Tech
No items today.
Regulation & Standards
No items today.
Enterprise Practice
No items today.
Open-Source Tooling
No items today.
— Personal views, not IBM. No tracking. Curated automatically; links under 24h old.