📢 Accepted to EACL 2026 (Findings)
Official implementation of SrcMix, a simple and effective training-time strategy for improving machine translation into extremely low-resource languages (ELRLs).
Extremely low-resource languages (ELRLs) often suffer from:
- Fewer than 6K parallel sentences
- Limited supervision
- High typological diversity
- Weak transfer from massively multilingual models
Standard multilingual training (naïve concatenation or many-to-many setups) often performs poorly in ELRL settings due to:
- Negative transfer
- Cross-lingual interference
- Fragmented decoder supervision
SrcMix addresses this by introducing multilinguality only on the source side, while keeping the decoder specialized to a single target ELRL.
Let:
- Target ELRL: n1
- Related source languages: n2, n3, ..., nN
- High-resource language: H
To train H → n1, we mix the following translation directions:
- H → n1
- n2 → n1
- n3 → n1
- ...
- nN → n1
Key design choices:
- Remove explicit source language identifiers
- Force implicit structural transfer
- Preserve decoder specialization
Unlike naïve multilingual mixing, SrcMix prevents decoder fragmentation and improves performance under extreme data scarcity.
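The sketch below gives a rough picture of source-side mixing. The function `build_srcmix_corpus`, its argument format, and the field names are illustrative assumptions, not the repository's actual API; see srcmix/data/mixing.py for the real implementation.

```python
# Minimal sketch of source-side mixing: pool H -> n1 together with each
# related-language pair ni -> n1 into one training set, keep only the target
# ELRL on the decoder side, and drop explicit source-language tags.
import random

def build_srcmix_corpus(pairs_by_source, seed=42):
    """pairs_by_source maps a source-language code (e.g. "hin") to a list of
    (source_sentence, target_n1_sentence) pairs, all sharing the same target ELRL n1.
    Returns a shuffled list of {"src": ..., "tgt": ...} examples with no explicit
    source-language identifier, so structural transfer has to happen implicitly."""
    mixed = []
    for pairs in pairs_by_source.values():
        for src_sent, tgt_sent in pairs:
            mixed.append({"src": src_sent, "tgt": tgt_sent})  # no source-language tag
    random.Random(seed).shuffle(mixed)
    return mixed
```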
This work introduces the first publicly available machine translation training and evaluation dataset for Angika, an Indo-Aryan language spoken in eastern India.
Dataset available here:
https://huggingface.co/datasets/snjev310/AngikaMT
The dataset includes:
- Parallel training data
- Evaluation splits
- FLORES-aligned dev/test sets
- Clean preprocessing for reproducibility
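As a quick illustration, the dataset can be pulled with the 🤗 `datasets` library; the split and column names used below are assumptions, so check the dataset card for the actual configuration.

```python
# Illustrative only: split and field names are assumptions, see the dataset card.
from datasets import load_dataset

ds = load_dataset("snjev310/AngikaMT")
print(ds)              # lists the available splits
print(ds["train"][0])  # inspect the fields of one example (assumes a "train" split)
```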
srcmix/
│
├── srcmix/
│ ├── training/
│ │ ├── train_single.py
│ │ ├── train_srcmix.py
│ │ └── train_naive_mix.py
│ │
│ ├── data/
│ │ ├── build_dataset.py
│ │ ├── mixing.py
│ │ └── splits.py
│ │
│ ├── evaluation/
│ │ └── evaluate.py
│ │
│ └── utils/
│ └── config.py
│
├── configs/
├── scripts/
└── README.md
bash scripts/run_single.sh
bash scripts/run_srcmix.sh
bash scripts/run_ablations.sh
python srcmix/evaluation/evaluate.py --peft_model_path path_to_adapter --test_data path_to_tokenized_test
Metrics:
- BLEU (sacreBLEU)
- chrF++
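For reference, both metrics can be computed with the sacrebleu Python API, as in the sketch below; the hypothesis and reference lists are placeholders, and srcmix/evaluation/evaluate.py may configure the metrics differently.

```python
# Sketch of metric computation with the sacrebleu Python API.
import sacrebleu

hypotheses = ["the model output sentence"]    # placeholder system outputs
references = [["the reference translation"]]  # one list per reference set

bleu = sacrebleu.corpus_bleu(hypotheses, references)
chrf = sacrebleu.corpus_chrf(hypotheses, references, word_order=2)  # word_order=2 gives chrF++
print(f"BLEU: {bleu.score:.2f}  chrF++: {chrf.score:.2f}")
```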
- Base models: Aya-101 (CohereForAI) and mT5-Large
- Fine-tuning: LoRA (r=16)
- Training data: ~6K sentences per language
- Evaluation: FLORES-aligned splits
- Metrics: BLEU and chrF++
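A minimal sketch of a LoRA (r=16) setup with the peft library is shown below; the target modules, alpha, and dropout values are assumptions rather than the paper's exact configuration, which lives in configs/.

```python
# Illustrative LoRA (r=16) setup with peft; hyperparameters other than r are assumed.
from transformers import AutoModelForSeq2SeqLM
from peft import LoraConfig, get_peft_model, TaskType

base_model = AutoModelForSeq2SeqLM.from_pretrained("google/mt5-large")

lora_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=16,                       # rank used in the paper's setup
    lora_alpha=32,              # assumed value
    lora_dropout=0.05,          # assumed value
    target_modules=["q", "v"],  # mT5 attention projections; assumed choice
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
```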
- Introduce SrcMix, a source-side mixing strategy tailored for ELRLs.
- Show that naïve multilingual mixing underperforms in extreme data scarcity.
- Release the first MT training & evaluation dataset for Angika.
- Provide systematic comparison across language families and scripts.
- Demonstrate consistent gains across multiple ELRLs.
If you use this dataset or the associated research in your work, please cite it as follows:
We will provide the final BibTeX entry once the paper is publicly available through the EACL proceedings.
This work was supported by a Ph.D. grant from the TCS Research Foundation and the Amazon-IIT Bombay AI/ML initiative.
Sanjeev Kumar, CSE, IIT Bombay
Email: sanjeev@cse.iitb.ac.in