
Syntactic Feature Probing in Code-Switched Text


This repository contains datasets and code accompanying the paper:

Syntactic Feature Encoding in Multilingual BERT: A Layer-Wise Probing Study on Code-Switched Language
Chaitanya Chakka, Mithun Kumar S R, Aruna Malapati, Asif Ekbal


📖 Overview

Code-switching (mixing two or more languages within a single sentence, e.g. "Yo quiero watch a movie tonight") poses unique challenges for multilingual models such as mBERT.
This work introduces novel probing datasets and conducts a layer-wise probing analysis of mBERT on English–Spanish code-switched text.

We focus on five syntactic properties:

  • Gender
  • Mood
  • Number
  • Person
  • Tense

Our results show that:

  • Lower layers in mBERT encode surface-level syntactic features.
  • Higher layers capture abstract, semantic properties.
  • Code-switching disrupts contextual feature capture, particularly for gender and person.

📂 Repository Structure

Lince_Number_spaeng/
├── train/
│   ├── data-00000-of-00001.arrow
│   ├── dataset_info.json
│   └── state.json
├── validation/
│   ├── data-00000-of-00001.arrow
│   ├── dataset_info.json
│   └── state.json
└── test/
    ├── data-00000-of-00001.arrow
    ├── dataset_info.json
    └── state.json

The Number dataset is shown above; the other probing tasks (e.g. Lince_Tense_spaeng, loaded below) follow the same layout, saved in the Hugging Face datasets on-disk format.

🚀 Usage

Load the dataset

from datasets import load_from_disk

# Each probing dataset is saved in the datasets on-disk format,
# so it can be reloaded directly from its directory:
ds = load_from_disk("Lince_Tense_spaeng")
print(ds)              # DatasetDict with train/validation/test splits
print(ds['train'][0])  # first training example
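
The column names and label set for each probing task are not documented above; they can be inspected with the standard datasets API before training a probe:

print(ds['train'].column_names)  # field names available for probing
print(ds['train'].features)      # feature types, including the label classes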

Probing setup

  • Tokenization: WordPiece (mBERT tokenizer)
  • Embeddings: extracted from all 12 layers
  • Probing model: simple classification head (frozen mBERT)
  • Evaluation: Matthews Correlation Coefficient (MCC) and F1-score
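
A minimal sketch of this setup, assuming the saved splits expose a raw-text column and an integer label (the names "sentence" and "label" below are placeholders; verify them with column_names as above). It extracts frozen mBERT activations from one layer, mean-pools them into sentence vectors, and trains a logistic-regression probe scored with MCC and macro-F1:

import torch
from datasets import load_from_disk
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, matthews_corrcoef
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased",
                                  output_hidden_states=True)
model.eval()  # mBERT stays frozen; only the probe below is trained

def layer_embeddings(sentences, layer):
    # hidden_states[0] is the embedding layer; 1-12 are the transformer layers
    enc = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).hidden_states[layer]   # (batch, seq_len, 768)
    mask = enc["attention_mask"].unsqueeze(-1)       # zero out padding positions
    return ((hidden * mask).sum(1) / mask.sum(1)).numpy()

ds = load_from_disk("Lince_Tense_spaeng")
# "sentence"/"label" are placeholder column names -- check ds['train'].column_names
X_train = layer_embeddings(ds["train"]["sentence"], layer=5)
X_test = layer_embeddings(ds["test"]["sentence"], layer=5)

probe = LogisticRegression(max_iter=1000).fit(X_train, ds["train"]["label"])
pred = probe.predict(X_test)
print("MCC:", matthews_corrcoef(ds["test"]["label"], pred))
print("F1 :", f1_score(ds["test"]["label"], pred, average="macro"))

Looping layer over range(13) (embedding layer plus 12 transformer layers) reproduces the layer-wise comparison.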

📊 Results (Summary)

Task     Best Layer   MCC    F1
Gender   1            0.60   0.97
Number   1–2          0.74   0.89
Mood     3            0.75   0.97
Person   5            0.69   0.96
Tense    5            0.76   0.95

➡️ Lower layers (1–2): strongest for Gender and Number
➡️ Mid-to-upper layers (3–5): stronger for Mood, Person, and Tense


📚 Citation

If you use these datasets or results, please cite:

@article{chakka2025syntactic,
  title={Syntactic Feature Encoding in Multilingual BERT: A Layer-Wise Probing Study on Code-Switched Language},
  author={Chakka, Chaitanya and S R, Mithun Kumar and Malapati, Aruna and Ekbal, Asif},
  journal={Preprint},
  year={2025}
}

πŸ“ License

This project is licensed under the MIT License.

