dots.tts Hits 54ms First Packet, SWE Agent Self-Evolves Past 50%
- Open-source TTS takes the continuous-latent route, with three design choices all aimed at deployment. dots.tts is a 2B continuous autoregressive speech model, released under Apache 2.0, that pushes first-packet latency down to 54–85 ms and reports WER of 0.94% (Chinese) and 1.30% (English) on Seed-TTS-Eval.
- One set of weights for every camera. UniSHARP unifies perspective, wide-angle, fisheye, and panoramic cameras into a single panoramic latent space for monocular view synthesis, instead of training a separate model per camera type.
- Let the coding agent write its own problems and see if it escapes its comfort zone. Socratic-SWE mines an agent's execution traces for reusable skills, generates tasks from them, and reaches 50.40% on SWE-bench Verified after three iterations.
- Tabular foundation models start slimming down for deployment. TabSwift matches the heavier TabPFN v2 and TabICL with a lightweight row-wise attention backbone, adds per-layer early exit, and bets squarely on low latency.
Also Notable
- Same Prompt Keeps Yielding Similar Images, and You Can Restore Diversity Without Retraining: tackles mode collapse in flow-based text-to-image with representation modulation, no retraining needed. ("Breaking the Lock-in: Diversifying Text-to-Image Generation via Representation Modulation")
- VLMs Read Events but Miss Fine Motion, So Borrow From Video Diffusion: injects video diffusion motion priors into VLMs to fix fine-grained motion understanding. ("MotionEnhancer: Leveraging Video Diffusion for Motion-Enhanced Vision-Language Models")
- Easy Questions Shouldn't Burn as Many Tokens as Hard Ones: curbs overthinking by scaling reasoning to difficulty, with difficulty modeling that evolves during training. ("DyCon: Dynamic Reasoning Control via Evolving Difficulty Modeling")
- AIGC Detectors Fail on a New Generator, So This Exposes the Criteria: builds interpretable, transferable forensic concepts to counter black-box detectors' generalization collapse. ("ForensicConcept: Transferable Forensic Concepts for AIGI Detection")
- Skip Skeletons and Pose Estimation, Learn Character Animation Straight From Driving Video: avoids error propagation from pose estimation under occlusion and complex poses. ("Beyond Skeletons: Learning Animation Directly from Driving Videos")
- Unsupervised Disease Staging That Explains Its Representations and Clusters: uses Huntington's disease to add the interpretability clinical use needs. ("Explaining Unsupervised Disease Staging in Huntington's Disease")
- LLM Research Watches Semantics and Spelling but Ignores Sound: a benchmark for Chinese phonological understanding to fill the gap. ("Phun-Bench: Evaluating LLMs on Phonological Understanding in Chinese")
- Adding Text Supervision Improves Geospatial Representations in VLMs: helps overlooked dimensions like geolocation and spatial reasoning. ("Textual Supervision Enhances Geospatial Representations in Vision-Language Models")
- Cultural Alignment Always Says What to Suppress, So This Defines What Counts as Coherent: uses Korean culture to give cultural alignment a constructive, positive definition. ("Korean Culture into LLM Alignment: Toward Cultural Coherence")