📢 Accepted to EACL 2026 (Findings)
Official implementation of SrcMix, a simple and effective training-time strategy for improving machine translation into extremely low-resource languages (ELRLs).
Extremely low-resource languages (ELRLs) often suffer from:
- Fewer than 6K parallel sentences
- Limited supervision
- High typological diversity
- Weak transfer from massively multilingual models
Standard multilingual training (naïve concatenation or many-to-many setups) often performs poorly in ELRL settings due to:
- Negative transfer
- Cross-lingual interference
- Fragmented decoder supervision
SrcMix addresses this by introducing multilinguality only on the source side, while keeping the decoder specialized to a single target ELRL.
Let:
- Target ELRL: n1
- Related source languages: n2, n3, ..., nN
- High-resource language: H
To train H → n1, we mix the following translation directions:
- H → n1
- n2 → n1
- n3 → n1
- ...
- nN → n1
Key design choices:
- Remove explicit source language identifiers
- Force implicit structural transfer
- Preserve decoder specialization
Unlike naïve multilingual mixing, SrcMix prevents decoder fragmentation and improves performance under extreme data scarcity.
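The sketch below gives a rough picture of source-side mixing. The function `build_srcmix_corpus`, its argument format, and the field names are illustrative assumptions, not the repository's actual API; see srcmix/data/mixing.py for the real implementation.

```python
# Minimal sketch of source-side mixing: pool H -> n1 together with each
# related-language pair ni -> n1 into one training set, keep only the target
# ELRL on the decoder side, and drop explicit source-language tags.
import random

def build_srcmix_corpus(pairs_by_source, seed=42):
    """pairs_by_source maps a source-language code (e.g. "hin") to a list of
    (source_sentence, target_n1_sentence) pairs, all sharing the same target ELRL n1.
    Returns a shuffled list of {"src": ..., "tgt": ...} examples with no explicit
    source-language identifier, so structural transfer has to happen implicitly."""
    mixed = []
    for pairs in pairs_by_source.values():
        for src_sent, tgt_sent in pairs:
            mixed.append({"src": src_sent, "tgt": tgt_sent})  # no source-language tag
    random.Random(seed).shuffle(mixed)
    return mixed
```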
This work introduces the first publicly available machine translation training and evaluation dataset for Angika, an Indo-Aryan language spoken in eastern India.
Dataset available here:
https://huggingface.co/datasets/snjev310/AngikaMT
The dataset includes:
- Parallel training data
- Evaluation splits
- FLORES-aligned dev/test sets
- Clean preprocessing for reproducibility
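As a quick illustration, the dataset can be pulled with the 🤗 `datasets` library; the split and column names used below are assumptions, so check the dataset card for the actual configuration.

```python
# Illustrative only: split and field names are assumptions, see the dataset card.
from datasets import load_dataset

ds = load_dataset("snjev310/AngikaMT")
print(ds)              # lists the available splits
print(ds["train"][0])  # inspect the fields of one example (assumes a "train" split)
```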
srcmix/
│
├── srcmix/
│ ├── training/
│ │ ├── train_single.py
│ │ ├── train_srcmix.py
│ │ └── train_naive_mix.py
│ │
│ ├── data/
│ │ ├── build_dataset.py
│ │ ├── mixing.py
│ │ └── splits.py
│ │
│ ├── evaluation/
│ │ └── evaluate.py
│ │
│ └── utils/
│ └── config.py
│
├── configs/
├── scripts/
└── README.md
bash scripts/run_single.sh
bash scripts/run_srcmix.sh
bash scripts/run_ablations.sh
python srcmix/evaluation/evaluate.py --peft_model_path path_to_adapter --test_data path_to_tokenized_test
Metrics:
- BLEU (sacreBLEU)
- chrF++
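For reference, both metrics can be computed with the sacrebleu Python API, as in the sketch below; the hypothesis and reference lists are placeholders, and srcmix/evaluation/evaluate.py may configure the metrics differently.

```python
# Sketch of metric computation with the sacrebleu Python API.
import sacrebleu

hypotheses = ["the model output sentence"]    # placeholder system outputs
references = [["the reference translation"]]  # one list per reference set

bleu = sacrebleu.corpus_bleu(hypotheses, references)
chrf = sacrebleu.corpus_chrf(hypotheses, references, word_order=2)  # word_order=2 gives chrF++
print(f"BLEU: {bleu.score:.2f}  chrF++: {chrf.score:.2f}")
```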
- Base models: Aya-101 (CohereForAI) and mT5-Large
- Fine-tuning: LoRA (r=16)
- Training data: ~6K sentences per language
- Evaluation: FLORES-aligned splits
- Metrics: BLEU and chrF++
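A minimal sketch of a LoRA (r=16) setup with the peft library is shown below; the target modules, alpha, and dropout values are assumptions rather than the paper's exact configuration, which lives in configs/.

```python
# Illustrative LoRA (r=16) setup with peft; hyperparameters other than r are assumed.
from transformers import AutoModelForSeq2SeqLM
from peft import LoraConfig, get_peft_model, TaskType

base_model = AutoModelForSeq2SeqLM.from_pretrained("google/mt5-large")

lora_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=16,                       # rank used in the paper's setup
    lora_alpha=32,              # assumed value
    lora_dropout=0.05,          # assumed value
    target_modules=["q", "v"],  # mT5 attention projections; assumed choice
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
```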
- Introduce SrcMix, a source-side mixing strategy tailored for ELRLs.
- Show that naïve multilingual mixing underperforms in extreme data scarcity.
- Release the first MT training & evaluation dataset for Angika.
- Provide systematic comparison across language families and scripts.
- Demonstrate consistent gains across multiple ELRLs.
If you use this dataset or the associated research in your work, please cite it as follows:
We will provide the final BibTeX entry once the paper is publicly available through the EACL proceedings.
This work was supported by a Ph.D. grant from the TCS Research Foundation and the Amazon-IIT Bombay AI/ML initiative.
Sanjeev Kumar, CSE, IIT Bombay
Email: sanjeev@cse.iitb.ac.in