This repository contains datasets and code accompanying the paper:
Syntactic Feature Encoding in Multilingual BERT: A Layer-Wise Probing Study on Code-Switched Language
Chaitanya Chakka, Mithun Kumar S R, Aruna Malapati, Asif Ekbal
Code-switching (mixing two or more languages in a sentence) poses unique challenges for multilingual models such as mBERT.
This work introduces novel probing datasets and conducts a layer-wise probing analysis of mBERT on English–Spanish code-switched text.
We focus on five syntactic properties:
- Gender
- Mood
- Number
- Person
- Tense
Our results show that:
- Lower layers in mBERT encode surface-level syntactic features.
- Higher layers capture abstract, semantic properties.
- Code-switching disrupts the encoding of contextual features, particularly gender and person.
Each dataset is stored in the Hugging Face `datasets` on-disk format. For example:
```
Lince_Number_spaeng/
├── train/
│   ├── data-00000-of-00001.arrow
│   ├── dataset_info.json
│   └── state.json
├── validation/
│   ├── data-00000-of-00001.arrow
│   ├── dataset_info.json
│   └── state.json
└── test/
    ├── data-00000-of-00001.arrow
    ├── dataset_info.json
    └── state.json
```
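Load a dataset with `load_from_disk`:
```python
from datasets import load_from_disk

# Path to one of the dataset directories in this repository
ds = load_from_disk("Lince_Tense_spaeng")
print(ds)              # lists the available splits and their sizes
print(ds['train'][0])  # first example of the training split
```
- Tokenization: WordPiece (mBERT tokenizer)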
- Embeddings: extracted from all 12 layers
- Probing model: simple classification head (frozen mBERT)
- Evaluation: Matthews Correlation Coefficient (MCC) and F1-score
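The setup listed above can be sketched roughly as follows. This is a minimal illustration, not the code used in the paper: the checkpoint name `bert-base-multilingual-cased`, the mean pooling over tokens, the single linear layer as the "simple classification head", and the helper names `load_mbert`, `layerwise_embeddings` and `train_linear_probe` are all assumptions made for the example.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Assumption: the standard cased mBERT checkpoint; the README does not name one.
MODEL_NAME = "bert-base-multilingual-cased"


def load_mbert(name=MODEL_NAME):
    """Load the WordPiece tokenizer and a frozen encoder (downloads weights on first use)."""
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModel.from_pretrained(name).eval()
    for param in model.parameters():
        param.requires_grad_(False)  # keep mBERT frozen; only the probe is trained
    return tokenizer, model


def layerwise_embeddings(sentences, tokenizer, model):
    """Return mean-pooled sentence vectors per layer, shape (num_layers, batch, hidden)."""
    enc = tokenizer(sentences, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        out = model(**enc, output_hidden_states=True)
    # hidden_states[0] is the embedding output; entries 1.. are the transformer layers
    layers = torch.stack(out.hidden_states[1:])          # (layers, batch, tokens, hidden)
    mask = enc["attention_mask"][None, :, :, None].to(layers.dtype)
    # Assumption: average over non-padding tokens to get one vector per sentence
    return (layers * mask).sum(dim=2) / mask.sum(dim=2)


def train_linear_probe(features, labels, num_classes, epochs=50, lr=1e-2):
    """Fit a linear classifier on fixed features from a single layer."""
    probe = torch.nn.Linear(features.shape[-1], num_classes)
    optimizer = torch.optim.Adam(probe.parameters(), lr=lr)
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = torch.nn.functional.cross_entropy(probe(features), labels)
        loss.backward()
        optimizer.step()
    return probe
```

With `tokenizer, model = load_mbert()`, calling `layerwise_embeddings(sentences, tokenizer, model)` on a list of strings yields a tensor of shape `(12, len(sentences), 768)` for this checkpoint; training one probe per layer index and comparing scores gives a layer-wise profile.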
| Task | Best Layer | MCC | F1 |
|---|---|---|---|
| Gender | Layer 1 | 0.60 | 0.97 |
| Number | Layer 1-2 | 0.74 | 0.89 |
| Mood | Layer 3 | 0.75 | 0.97 |
| Person | Layer 5 | 0.69 | 0.96 |
| Tense | Layer 5 | 0.76 | 0.95 |
➡️ Lower layers: strong for Gender, Number, Person
➡️ Mid-to-upper layers: stronger for Mood, Tense
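MCC and F1 can diverge noticeably when classes are imbalanced, which is one possible reason a row in the table can pair a high F1 with a more moderate MCC. The sketch below computes both metrics in plain Python on a synthetic toy example. The labels and counts are hypothetical and are not taken from the datasets, and support-weighted F1 is an assumption, since the averaging scheme is not stated here.

```python
from collections import Counter
from math import sqrt


def multiclass_mcc(y_true, y_pred):
    """Multi-class Matthews Correlation Coefficient (equals the binary MCC for two classes)."""
    n = len(y_true)
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    labels = set(true_counts) | set(pred_counts)
    numerator = correct * n - sum(true_counts[k] * pred_counts[k] for k in labels)
    spread_true = n * n - sum(true_counts[k] ** 2 for k in labels)
    spread_pred = n * n - sum(pred_counts[k] ** 2 for k in labels)
    denominator = sqrt(spread_true * spread_pred)
    return numerator / denominator if denominator else 0.0


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights proportional to class support."""
    total = 0.0
    for label, support in Counter(y_true).items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        total += support * f1
    return total / len(y_true)


# Hypothetical imbalanced toy data: 95 majority-class and 5 minority-class examples.
y_true = ["masc"] * 95 + ["fem"] * 5
# The classifier gets every majority example right but only 2 of 5 minority examples.
y_pred = ["masc"] * 95 + ["fem"] * 2 + ["masc"] * 3

print(f"MCC = {multiclass_mcc(y_true, y_pred):.2f}")       # MCC = 0.62
print(f"weighted F1 = {weighted_f1(y_true, y_pred):.2f}")  # weighted F1 = 0.96
```

Because MCC accounts for how well the minority class is recovered, it is the stricter of the two scores in this toy setting.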
If you use these datasets or results, please cite:
```bibtex
@article{chakka2025syntactic,
  title={Syntactic Feature Encoding in Multilingual BERT: A Layer-Wise Probing Study on Code-Switched Language},
  author={Chakka, Chaitanya and S R, Mithun Kumar and Malapati, Aruna and Ekbal, Asif},
  journal={Preprint},
  year={2025}
}
```
This project is licensed under the MIT License.
- LINCE Benchmark
- UniMorph Project
- HPC support from BITS Pilani, Hyderabad Campus