# 🌍 Large-scale Multilingual Translation (LMT)

• [📢 News](#-news) • [🤗 Open Resources](#-open-resources) • [📄 Contents](#-contents)

The LMT project aims to advance the frontier of Multilingual Machine Translation (MMT) by building **Inclusive**, **Scalable**, and **High-performance** multilingual translation models.

# 📢 News

- *2025.11.11*: Our **LMT** paper [Beyond English: Toward Inclusive and Scalable Multilingual Machine Translation with LLMs](https://arxiv.org/abs/2511.07003) and the corresponding [models](https://huggingface.co/NiuTrans/LMT-60-8B) are released.

# 🤗 Open Resources

We have made the following resources available:

| Resource | Description | Link |
|------------------|-----------------------------------------------------|-----------------------------------------------------------|
| LMT-60-*B | Our high-performance multilingual translation models, covering 60 languages and 234 translation directions. Available in four sizes: 0.6B / 1.7B / 4B / 8B. | [LMT-60-0.6B](https://huggingface.co/NiuTrans/LMT-60-0.6B)<br>[LMT-60-1.7B](https://huggingface.co/NiuTrans/LMT-60-1.7B)<br>[LMT-60-4B](https://huggingface.co/NiuTrans/LMT-60-4B)<br>[LMT-60-8B](https://huggingface.co/NiuTrans/LMT-60-8B) |
| LMT-60-*B-Base | Base models obtained by continued pre-training of Qwen3 on 90B tokens, serving as the foundation for large-scale translation adaptation. Available in four sizes: 0.6B / 1.7B / 4B / 8B. Note that these models have not been fine-tuned for translation with our SFT data; for translation tasks, please refer to the LMT-60-*B series. | [LMT-60-0.6B-Base](https://huggingface.co/NiuTrans/LMT-60-0.6B-Base)<br>[LMT-60-1.7B-Base](https://huggingface.co/NiuTrans/LMT-60-1.7B-Base)<br>[LMT-60-4B-Base](https://huggingface.co/NiuTrans/LMT-60-4B-Base)<br>[LMT-60-8B-Base](https://huggingface.co/NiuTrans/LMT-60-8B-Base) |
| LMT-60-sft-data | Our SFT dataset, which includes the FLORES-200 devset, NTREX-128, SMol, and the WMT14–23 and IWSLT17–24 test sets, totaling 567K samples. | [LMT-60-sft-data](https://huggingface.co/datasets/NiuTrans/LMT-60-sft-data) |
| FLORES-mn_cn | A new Chinese–Mongolian evaluation set annotated by native speakers to extend the FLORES-200 benchmark. | [FLORES-mn_cn](https://huggingface.co/datasets/NiuTrans/FLORES-mn_cn) |
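
If you want to pull these resources programmatically, the snippet below is a minimal sketch using the standard `datasets` and `huggingface_hub` APIs. The repository IDs come from the table above; split names and column schemas are not documented here, so inspect what you load before relying on it.

```python
# Minimal sketch: fetch the released LMT resources from the Hugging Face Hub.
# Repo IDs are taken from the table above; splits/columns are assumptions,
# so print the loaded objects and check the dataset cards before use.
from datasets import load_dataset
from huggingface_hub import snapshot_download

# SFT data (567K samples) and the new Chinese–Mongolian evaluation set.
sft_data = load_dataset("NiuTrans/LMT-60-sft-data")
flores_mn_cn = load_dataset("NiuTrans/FLORES-mn_cn")
print(sft_data)        # shows available splits and columns
print(flores_mn_cn)

# Optionally download model weights to a local directory
# (AutoModelForCausalLM.from_pretrained also accepts the repo ID directly).
local_dir = snapshot_download("NiuTrans/LMT-60-0.6B")
print("model files at:", local_dir)
```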
# 📄 Contents

## Beyond English: Toward Inclusive and Scalable Multilingual Machine Translation with LLMs

### Introduction

In this project, we take a step toward overcoming the prevailing English-centric bias in MMT. We introduce **LMT**, a suite of **Chinese-English-centric** MMT models trained on **90B** mixed monolingual and bilingual tokens, covering **60 languages across 234 translation directions** and achieving **SOTA performance** among models with similar language coverage.

Our work makes the following main contributions:

- We identify and analyze a previously overlooked issue, **directional degeneration**, in large-scale multilingual SFT with multi-way data, and propose a simple yet effective **Strategic Downsampling** method to mitigate it.
- We propose **Parallel Multilingual Prompting (PMP)**, which enhances cross-lingual transfer by incorporating an auxiliary parallel sentence into the instruction (a rough illustration is sketched after the Usage section below).
- We release LMT, a suite of **large-scale Chinese–English-centric** multilingual translation models in four sizes (0.6B/1.7B/4B/8B), providing strong baselines for future MMT research.

### Support Languages

| Resource Tier | Languages |
| :---- | :---- |
| High-resource Languages (13) | Arabic(ar), English(en), Spanish(es), German(de), French(fr), Italian(it), Japanese(ja), Dutch(nl), Polish(pl), Portuguese(pt), Russian(ru), Turkish(tr), Chinese(zh) |
| Medium-resource Languages (18) | Bulgarian(bg), Bengali(bn), Czech(cs), Danish(da), Modern Greek(el), Persian(fa), Finnish(fi), Hindi(hi), Hungarian(hu), Indonesian(id), Korean(ko), Norwegian(nb), Romanian(ro), Slovak(sk), Swedish(sv), Thai(th), Ukrainian(uk), Vietnamese(vi) |
| Low-resource Languages (29) | Amharic(am), Azerbaijani(az), Tibetan(bo), Modern Hebrew(he), Croatian(hr), Armenian(hy), Icelandic(is), Javanese(jv), Georgian(ka), Kazakh(kk), Central Khmer(km), Kirghiz(ky), Lao(lo), Chinese Mongolian(mn_cn), Marathi(mr), Malay(ms), Burmese(my), Nepali(ne), Pashto(ps), Sinhala(si), Swahili(sw), Tamil(ta), Telugu(te), Tajik(tg), Tagalog(tl), Uighur(ug), Urdu(ur), Uzbek(uz), Yue Chinese(yue) |

### Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "NiuTrans/LMT-60-8B"

tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Translate the following text from English into Chinese. English: The concept came from China where plum blossoms were the flower of choice. Chinese:"
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(**model_inputs, max_new_tokens=512, num_beams=5, do_sample=False)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()
# Decode only the newly generated tokens (the translation).
response = tokenizer.decode(output_ids, skip_special_tokens=True)
print("response:", response)
```

For more details, please refer to [src/inference.py](./src/inference.py).
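
To make the PMP idea from the contributions above more concrete, the snippet below sketches what such a prompt could look like: the instruction carries both the source sentence and an auxiliary parallel sentence in a third language. The template wording, the choice of German as the auxiliary language, and the helper name `build_pmp_prompt` are illustrative assumptions, not the released format; the exact template used by LMT is described in the paper and in [src/inference.py](./src/inference.py).

```python
# Hedged sketch of a Parallel Multilingual Prompting (PMP)-style prompt.
# This template is an illustrative guess, NOT the exact format used to train
# LMT; consult the paper / src/inference.py for the real template.

def build_pmp_prompt(src_lang: str, tgt_lang: str, aux_lang: str,
                     src_text: str, aux_text: str) -> str:
    """Build a translation instruction that also carries an auxiliary parallel sentence."""
    return (
        f"Translate the following text from {src_lang} into {tgt_lang}. "
        f"{aux_lang}: {aux_text} "
        f"{src_lang}: {src_text} "
        f"{tgt_lang}:"
    )

# Example: English -> Chinese, with a German parallel sentence as the auxiliary signal.
pmp_prompt = build_pmp_prompt(
    src_lang="English",
    tgt_lang="Chinese",
    aux_lang="German",
    src_text="The concept came from China where plum blossoms were the flower of choice.",
    aux_text="Das Konzept stammt aus China, wo Pflaumenblüten die Blume der Wahl waren.",
)
print(pmp_prompt)

# The resulting prompt can be fed through the same apply_chat_template()
# + generate() pipeline shown in the Usage section above.
```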
## Reference

Email: luoyingfeng_neu@outlook.com

If you find our paper useful for your research, please kindly cite it:

```bibtex
@misc{luoyf2025lmt,
      title={Beyond English: Toward Inclusive and Scalable Multilingual Machine Translation with LLMs},
      author={Yingfeng Luo and Ziqiang Xu and Yuxuan Ouyang and Murun Yang and Dingyang Lin and Kaiyan Chang and Tong Zheng and Bei Li and Peinan Feng and Quan Du and Tong Xiao and Jingbo Zhu},
      year={2025},
      eprint={2511.07003},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2511.07003},
}
```