ProEval Cuts Benchmark Eval Samples 8-65x
- Benchmark Eval Becomes a Probability Problem. Google's ProEval treats LLM benchmark scoring as Bayesian estimation with a pretrained Gaussian process surrogate, cutting sample budgets 8-65x while holding estimation error to 1% (a toy version of the estimator is sketched after this list).
- FT vs ICL Finally Has a Clean Comparison. On formal-language tasks, in-distribution fine-tuning wins clearly, out of distribution the two tie, and ICL's sensitivity to model scale and tokenization turns out to be structural rather than noise.
- Copyrighted Corpora Get a Legal Workaround. Annotations are released in plaintext while the source text ships only as non-reversible hashes; cross-edition alignment still hits 98.7-99.79% token match (the release pattern is sketched after this list).
- SAM in the Clinic Stalls on Prompts, Not the Model. Saliency-guided anatomical priors plus cross-slice consistency keep SAM stable when the only input is a sloppy midline point.
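
For the ProEval item above, a minimal sketch of the core idea, assuming a scikit-learn Gaussian process over synthetic item features; the kernel, the features, and the crude uncertainty estimate are illustrative assumptions, not ProEval's actual pretrained surrogate.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)

# Hypothetical benchmark: each item has a feature vector (e.g. a prompt
# embedding) and a 0/1 correctness score for the model under test.
n_items, dim = 2000, 16
item_features = rng.normal(size=(n_items, dim))
true_scores = (item_features[:, 0] + rng.normal(0, 0.5, n_items) > 0).astype(float)

# Evaluate the model on a small random subsample only.
budget = 100
idx = rng.choice(n_items, size=budget, replace=False)

gp = GaussianProcessRegressor(
    kernel=RBF(length_scale=2.0) + WhiteKernel(noise_level=0.1),
    normalize_y=True,
)
gp.fit(item_features[idx], true_scores[idx])

# Posterior over all items -> estimate of the benchmark mean, with a
# rough uncertainty that ignores posterior correlations between items.
mean, std = gp.predict(item_features, return_std=True)
print(f"estimated score {mean.mean():.3f} +/- ~{std.mean() / np.sqrt(n_items):.3f} "
      f"from {budget}/{n_items} items (full eval: {true_scores.mean():.3f})")
```

The payoff is the uncertainty: the sample budget can grow until the posterior interval is tight enough, rather than being fixed up front.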
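
For the copyrighted-corpora item, a sketch of the release pattern under stated assumptions: annotations ship in plaintext, keyed to salted, context-windowed hashes of the source tokens, so anyone holding a licensed copy of the text can realign them without the text itself ever shipping. The SHA-256 construction, separator, and context width here are hypothetical, not the paper's.

```python
import hashlib

def token_key(token: str, context: tuple, salt: bytes) -> str:
    """Hash a token together with its left context, so the release is not
    a dictionary-attackable lookup table of single low-entropy tokens."""
    payload = salt + "\x1f".join((*context, token)).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

def release(tokens, annotations, salt: bytes, ctx: int = 4):
    """Emit (hash, annotation) pairs; the plaintext tokens never leave."""
    return [
        {"key": token_key(t, tuple(tokens[max(0, i - ctx):i]), salt), "ann": a}
        for i, (t, a) in enumerate(zip(tokens, annotations))
    ]

def align(my_tokens, released, salt: bytes, ctx: int = 4):
    """Re-attach annotations to one's own (possibly different-edition)
    copy by recomputing the same keys over that text."""
    table = {r["key"]: r["ann"] for r in released}
    hits = []
    for i, tok in enumerate(my_tokens):
        key = token_key(tok, tuple(my_tokens[max(0, i - ctx):i]), salt)
        if key in table:
            hits.append((i, tok, table[key]))
    return hits
```

Alignment only fires where the local context matches exactly, which is why the cross-edition match rates the paper reports hinge on tokenization stability.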
Also Notable
- Searching Surveillance Footage for Anomalous Behavior via Text. A cascade framework runs coarse alignment first and then refines, splitting geometric structure from semantic intent across two stages (a generic two-stage skeleton is sketched after this list).
- VLM Pseudo-Labels Carry Systematic Bias in Open-Vocabulary Detection. Hierarchical consistency constraints debias the labels so objectness doesn't inherit pretraining-distribution skew.
- Same Person, Different Roles Across Events in a Video. Multimodal coreference makes identity-role mapping explicit so VidSitu stops fragmenting one person into many.
- Text-to-Motion Modeled at Multiple Time Scales Separately. Hierarchical flow matching captures coarse structure and fine motion each at its own scale, avoiding the single-scale tradeoff.
- Semi-Supervised Medical Segmentation Goes Beyond Masks. Generative dual-distribution alignment adds feature-level supervision, mining more signal from unlabeled data.
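
For the surveillance-search item, a generic coarse-then-refine retrieval skeleton, not the paper's actual two stages: a cheap normalized dot-product prunes candidates, then a costlier scorer re-ranks only the survivors. The embeddings and the fine scorer below are placeholders.

```python
import numpy as np

def cascade_search(query_vec, clip_vecs, fine_scorer, shortlist=50, top_k=5):
    """Stage 1: coarse cosine similarity prunes to a shortlist.
    Stage 2: an expensive scorer re-ranks only the shortlist."""
    q = query_vec / np.linalg.norm(query_vec)
    c = clip_vecs / np.linalg.norm(clip_vecs, axis=1, keepdims=True)
    cand = np.argsort(-(c @ q))[:shortlist]
    fine = np.array([fine_scorer(query_vec, clip_vecs[i]) for i in cand])
    return cand[np.argsort(-fine)][:top_k]

# Demo with random embeddings and a stand-in re-ranker.
rng = np.random.default_rng(1)
clips = rng.normal(size=(10_000, 64))
query = rng.normal(size=64)
print(cascade_search(query, clips, lambda q, v: -np.linalg.norm(q - v)))
```

The structural point matches the item: the cheap stage only has to be recall-safe, while precision lives entirely in the second stage.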