ProEval Cuts Benchmark Eval Samples 8-65x
- Benchmark Eval Becomes a Probability Problem. Google's ProEval treats LLM benchmark scoring as Bayesian estimation with a pretrained Gaussian process surrogate, cutting sample budgets 8-65x while holding estimation error to 1% (a toy version of the estimator is sketched after this list).
- FT vs ICL Finally Has a Clean Comparison. On formal-language tasks, in-distribution fine-tuning wins clearly, out of distribution the two tie, and ICL's sensitivity to model scale and tokenization turns out to be structural rather than noise.
- Copyrighted Corpora Get a Legal Workaround. Annotations are released in plaintext while the source text ships only as non-reversible hashes; cross-edition alignment still hits 98.7-99.79% token match (the release pattern is sketched after this list).
- SAM in the Clinic Stalls on Prompts, Not the Model. Saliency-guided anatomical priors plus cross-slice consistency keep SAM stable when the only input is a sloppy midline point.
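
For the ProEval item above, a minimal sketch of the core idea, assuming a scikit-learn Gaussian process over synthetic item features; the kernel, the features, and the crude uncertainty estimate are illustrative assumptions, not ProEval's actual pretrained surrogate.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)

# Hypothetical benchmark: each item has a feature vector (e.g. a prompt
# embedding) and a 0/1 correctness score for the model under test.
n_items, dim = 2000, 16
item_features = rng.normal(size=(n_items, dim))
true_scores = (item_features[:, 0] + rng.normal(0, 0.5, n_items) > 0).astype(float)

# Evaluate the model on a small random subsample only.
budget = 100
idx = rng.choice(n_items, size=budget, replace=False)

gp = GaussianProcessRegressor(
    kernel=RBF(length_scale=2.0) + WhiteKernel(noise_level=0.1),
    normalize_y=True,
)
gp.fit(item_features[idx], true_scores[idx])

# Posterior over all items -> estimate of the benchmark mean, with a
# rough uncertainty that ignores posterior correlations between items.
mean, std = gp.predict(item_features, return_std=True)
print(f"estimated score {mean.mean():.3f} +/- ~{std.mean() / np.sqrt(n_items):.3f} "
      f"from {budget}/{n_items} items (full eval: {true_scores.mean():.3f})")
```

The payoff is the uncertainty: the sample budget can grow until the posterior interval is tight enough, rather than being fixed up front.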
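
For the copyrighted-corpora item, a sketch of the release pattern under stated assumptions: annotations ship in plaintext, keyed to salted, context-windowed hashes of the source tokens, so anyone holding a licensed copy of the text can realign them without the text itself ever shipping. The SHA-256 construction, separator, and context width here are hypothetical, not the paper's.

```python
import hashlib

def token_key(token: str, context: tuple, salt: bytes) -> str:
    """Hash a token together with its left context, so the release is not
    a dictionary-attackable lookup table of single low-entropy tokens."""
    payload = salt + "\x1f".join((*context, token)).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

def release(tokens, annotations, salt: bytes, ctx: int = 4):
    """Emit (hash, annotation) pairs; the plaintext tokens never leave."""
    return [
        {"key": token_key(t, tuple(tokens[max(0, i - ctx):i]), salt), "ann": a}
        for i, (t, a) in enumerate(zip(tokens, annotations))
    ]

def align(my_tokens, released, salt: bytes, ctx: int = 4):
    """Re-attach annotations to one's own (possibly different-edition)
    copy by recomputing the same keys over that text."""
    table = {r["key"]: r["ann"] for r in released}
    hits = []
    for i, tok in enumerate(my_tokens):
        key = token_key(tok, tuple(my_tokens[max(0, i - ctx):i]), salt)
        if key in table:
            hits.append((i, tok, table[key]))
    return hits
```

Alignment only fires where the local context matches exactly, which is why the cross-edition match rates the paper reports hinge on tokenization stability.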
Also Notable
- Searching Surveillance Footage for Anomalous Behavior via Text. A cascade framework runs coarse alignment first and then refines, splitting geometric structure from semantic intent across two stages (a generic two-stage skeleton is sketched after this list).
- VLM Pseudo-Labels Carry Systematic Bias in Open-Vocabulary Detection. Hierarchical consistency constraints debias the labels so objectness doesn't inherit pretraining-distribution skew.
- Same Person, Different Roles Across Events in a Video. Multimodal coreference makes identity-role mapping explicit so VidSitu stops fragmenting one person into many.
- Text-to-Motion Modeled at Multiple Time Scales Separately. Hierarchical flow matching captures coarse structure and fine motion each at its own scale, avoiding the single-scale tradeoff.
- Semi-Supervised Medical Segmentation Goes Beyond Masks. Generative dual-distribution alignment adds feature-level supervision, mining more signal from unlabeled data.
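
For the surveillance-search item, a generic coarse-then-refine retrieval skeleton, not the paper's actual two stages: a cheap normalized dot-product prunes candidates, then a costlier scorer re-ranks only the survivors. The embeddings and the fine scorer below are placeholders.

```python
import numpy as np

def cascade_search(query_vec, clip_vecs, fine_scorer, shortlist=50, top_k=5):
    """Stage 1: coarse cosine similarity prunes to a shortlist.
    Stage 2: an expensive scorer re-ranks only the shortlist."""
    q = query_vec / np.linalg.norm(query_vec)
    c = clip_vecs / np.linalg.norm(clip_vecs, axis=1, keepdims=True)
    cand = np.argsort(-(c @ q))[:shortlist]
    fine = np.array([fine_scorer(query_vec, clip_vecs[i]) for i in cand])
    return cand[np.argsort(-fine)][:top_k]

# Demo with random embeddings and a stand-in re-ranker.
rng = np.random.default_rng(1)
clips = rng.normal(size=(10_000, 64))
query = rng.normal(size=64)
print(cascade_search(query, clips, lambda q, v: -np.linalg.norm(q - v)))
```

The structural point matches the item: the cheap stage only has to be recall-safe, while precision lives entirely in the second stage.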