Machine Translation Digest for Sep 10 2025
Here is today's selection of cs.CL papers focusing on advancements in language model alignment and evaluation. The papers explore multilingual alignment, enhanced explainability in writing assessment, and improvements in translation and detection of language nuances, highlighting a focus on precision and authenticity in language processing.
CM-Align: Consistency-based Multilingual Alignment for Large Language Models
Current large language models (LLMs) generally show a significant performance gap in alignment between English and other languages. To bridge this gap, existing research typically leverages the model's responses in English as a reference to select the best/worst responses in other languages, which are then used for Direct Preference Optimization (DPO) training. However, we argue that there are two limitations in the current methods that result in noisy multilingual preference data and further limited alignment performance: 1) Not all English responses are of high quality, and using a response with low quality may mislead the alignment for other languages. 2) Current methods usually use biased or heuristic approaches to construct multilingual preference pairs. To address these limitations, we design a consistency-based data selection method to construct high-quality multilingual preference data for improving multilingual alignment (CM-Align). Specifically, our method includes two parts: consistency-guided English reference selection and cross-lingual consistency-based multilingual preference data construction. Experimental results on three LLMs and three common tasks demonstrate the effectiveness and superiority of our method, which further indicates the necessity of constructing high-quality preference data.
Toward Subtrait-Level Model Explainability in Automated Writing Evaluation
Subtrait (latent-trait components) assessment presents a promising path toward enhancing transparency of automated writing scores. We prototype explainability and subtrait scoring with generative language models and show modest correlation between human subtrait and trait scores, and between automated and human subtrait scores. Our approach provides details to demystify scores for educators and students.
MoVoC: Morphology-Aware Subword Construction for Geez Script Languages
Subword-based tokenization methods often fail to preserve morphological boundaries, a limitation especially pronounced in low-resource, morphologically complex languages such as those written in the Geez script. To address this, we present MoVoC (Morpheme-aware Subword Vocabulary Construction) and train MoVoC-Tok, a tokenizer that integrates supervised morphological analysis into the subword vocabulary. This hybrid segmentation approach combines morpheme-based and Byte Pair Encoding (BPE) tokens to preserve morphological integrity while maintaining lexical meaning. To tackle resource scarcity, we curate and release manually annotated morpheme data for four Geez script languages and a morpheme-aware vocabulary for two of them. While the proposed tokenization method does not lead to significant gains in automatic translation quality, we observe consistent improvements in intrinsic metrics, MorphoScore, and Boundary Precision, highlighting the value of morphology-aware segmentation in enhancing linguistic fidelity and token efficiency. Our morpheme-annotated datasets and tokenizer will be publicly available to support further research in low-resource, morphologically rich languages. Our code and data are available on GitHub: https://github.com/hailaykidu/MoVoC
Automatic Detection of Inauthentic Templated Responses in English Language Assessments
In high-stakes English Language Assessments, low-skill test takers may employ memorized materials called templates'' on essay questions togame'' or fool the automated scoring system. In this study, we introduce the automated detection of inauthentic, templated responses (AuDITR) task, describe a machine learning-based approach to this task and illustrate the importance of regularly updating these models in production.
Natural Language Translation of Formal Proofs through Informalization of Proof Steps and Recursive Summarization along Proof Structure
This paper proposes a natural language translation method for machine-verifiable formal proofs that leverages the informalization (verbalization of formal language proof steps) and summarization capabilities of LLMs. For evaluation, it was applied to formal proof data created in accordance with natural language proofs taken from an undergraduate-level textbook, and the quality of the generated natural language proofs was analyzed in comparison with the original natural language proofs. Furthermore, we will demonstrate that this method can output highly readable and accurate natural language proofs by applying it to existing formal proof library of the Lean proof assistant.
| Curated by yukajii.com |