This repository provides a reference implementation for the paper PluRel: Synthetic Data unlocks Scaling Laws for Relational Foundation Models.
PluRel is a framework for synthesizing diverse multi-tabular relational databases using Structural Causal Models (SCMs). This repository provides:
- Scalable generation of synthetic relational data (from scratch or SQL schemas) compatible with relbench.
- High-performance context sampling via a Rust-based sampler (rustler).
- Pretraining of relational transformers on synthetic data.
Set up the development and testing environment with pixi.
```bash
# setup pixi environment
$ pixi install

# Compile and install the Rust sampler
$ cd rustler && pixi run maturin develop --uv --release && cd ..

# Run tests
$ pixi run pytest

# Lint and format code
$ pixi run ruff check .
$ pixi run ruff format .
```
- The `SyntheticDataset` class can be used to create relbench-compatible dataset objects.
- It only requires a `seed` and a `Config` object that contains `database`, `scm`, and `dag` level params for sampling. See the example below.
```python
from plurel import SyntheticDataset, Config

# create relbench compatible dataset
dataset = SyntheticDataset(seed=0, config=Config())

# create database which can be cached via relbench APIs
db = dataset.make_db()
```
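The returned `db` is a relbench-compatible database object. As a quick sanity check it can be inspected through relbench's `Database`/`Table` attributes; the snippet below is a minimal sketch assuming the standard `table_dict` and `.df` interface (provided by relbench, not by PluRel itself).

```python
# Minimal sketch: iterate over the generated tables via relbench's Database/Table
# interface (table_dict and .df are assumed relbench attributes).
for table_name, table in db.table_dict.items():
    print(table_name, table.df.shape)
```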
The `Config` class controls all aspects of synthetic database generation through three parameter groups:

| Parameters | Description |
|---|---|
| `DatabaseParams` | Table layout (BarabasiAlbert, ReverseRandomTree, WattsStrogatz), number of tables, row counts, column counts, and timestamp ranges. |
| `SCMParams` | SCM graph layouts, column types, MLP initialization, activation functions, noise distributions, and time-series trend/cycle parameters. |
| `DAGParams` | DAG-specific parameters such as edge dropout, in-degree limits, and rewiring probabilities for different graph types. |
```python
from plurel import Choices, Config, DatabaseParams, SCMParams

config = Config(
    database_params=DatabaseParams(num_tables_choices=Choices(kind="range", value=[5, 10])),
    schema_file="path/to/schema.sql",  # optional: generate from a SQL schema
    cache_dir="~/.cache/plurel",       # optional: cache generated databases
)
```
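The other two parameter groups can be configured in the same way. The sketch below is illustrative only: the `scm_params`/`dag_params` keyword names and the `DAGParams` import path are assumptions, and all groups are left at their defaults.

```python
# Sketch only: keyword names scm_params / dag_params and the DAGParams import
# are assumptions; consult the Config definition for the exact field names.
from plurel import Config, DatabaseParams, SCMParams, DAGParams

config = Config(
    database_params=DatabaseParams(),  # table layout, row/column counts, timestamps
    scm_params=SCMParams(),            # SCM graphs, column types, noise, trends/cycles
    dag_params=DAGParams(),            # edge dropout, in-degree limits, rewiring
)
```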
We also provide a multiprocessing-based script to generate databases in parallel.

```bash
$ pixi run python scripts/synthetic_gen.py \
    --seed_offset 0 \
    --num_dbs 1000 \
    --num_proc 16 \
    --preprocess
```

| Argument | Description |
|---|---|
| `--seed_offset` | Seed offset for database generation. DBs will be named `rel-synthetic-<seed>`. |
| `--num_dbs` | Number of databases to generate. |
| `--num_proc` | Number of parallel processes (default: number of CPU cores). |
| `--preprocess` | Run preprocessing and embedding steps. Omit to skip. |
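Conceptually, the script fans the seeds out across worker processes. Below is a minimal sketch of that pattern using only the `SyntheticDataset` API shown above; the `generate_one` helper is hypothetical and not the script's actual implementation.

```python
# Sketch of the parallel-generation pattern: each worker builds one database
# for one seed (named rel-synthetic-<seed> when caching is enabled).
from multiprocessing import Pool

from plurel import Config, SyntheticDataset


def generate_one(seed: int) -> int:
    dataset = SyntheticDataset(seed=seed, config=Config())
    dataset.make_db()  # build (and, with cache_dir set, cache) the database
    return seed


if __name__ == "__main__":
    seed_offset, num_dbs, num_proc = 0, 8, 4
    with Pool(processes=num_proc) as pool:
        pool.map(generate_one, range(seed_offset, seed_offset + num_dbs))
```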
> [!NOTE]
> Check out the notebooks in `examples/` for synthesizing from SQL schemas.
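As a minimal sketch of schema-driven generation (reusing only the `schema_file` option shown above; the schema path is a placeholder):

```python
# Sketch: generate a relbench-compatible database from an existing SQL schema.
# The schema path is a placeholder; see the notebooks in examples/ for details.
from plurel import Config, SyntheticDataset

config = Config(schema_file="path/to/schema.sql")
db = SyntheticDataset(seed=0, config=config).make_db()
```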
The preprocessed synthetic data is available on the Hugging Face Hub at kvignesh1420/plurel.
- Install the Hugging Face CLI (if not present)

```bash
pixi add huggingface_hub
```

- Create the destination

```bash
mkdir -p ~/scratch/pre
```

- Download the repository contents into `~/scratch/pre`

```bash
pixi run hf download kvignesh1420/plurel \
    --repo-type dataset \
    --local-dir ~/scratch/pre
```
The preprocessed relbench data is available on the Hugging Face Hub at hvag976/relational-transformer.

```bash
pixi run hf download hvag976/relational-transformer \
    --repo-type dataset \
    --local-dir ~/scratch/pre
```
The synthetic pretrained model checkpoints are hosted on the Hugging Face Hub at kvignesh1420/relational-transformer-plurel.

```bash
$ mkdir -p ~/scratch/rt_hf_ckpts
$ pixi run hf download kvignesh1420/relational-transformer-plurel \
    --repo-type model \
    --local-dir ~/scratch/rt_hf_ckpts
```
One of the downloaded checkpoints will be listed as:

```bash
$ ls ~/scratch/rt_hf_ckpts
# model pretrained on a dataset of size 4B tokens curated from 1024 synthetic RDBs
synthetic-pretrain_rdb_1024_size_4b.pt
```
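To sanity-check a downloaded checkpoint before training, the snippet below is a minimal sketch that assumes only that the `.pt` file is a standard PyTorch-serialized object; its exact key layout is not documented here.

```python
# Sketch: load the checkpoint on CPU and peek at its structure.
# weights_only=False is used because this is a trusted local file whose
# contents layout is an assumption, not documented in this README.
import os

import torch

ckpt_path = os.path.expanduser(
    "~/scratch/rt_hf_ckpts/synthetic-pretrain_rdb_1024_size_4b.pt"
)
ckpt = torch.load(ckpt_path, map_location="cpu", weights_only=False)
print(type(ckpt))
if isinstance(ckpt, dict):
    print(list(ckpt.keys()))
```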
- Baseline (real-world) pretraining on relbench datasets with a randomly initialized relational-transformer (RT) model.

```bash
$ pixi run torchrun --standalone --nproc_per_node=1 scripts/baseline_pretrain.py
```

- Synthetic pretraining on a varying number of databases and dataset sizes with a randomly initialized RT model.
```bash
$ pixi run torchrun --standalone --nproc_per_node=1 scripts/synthetic_pretrain.py
```

- Continued pretraining on relbench datasets using the synthetic pretrained models. For faster experimentation, the models downloaded from Hugging Face (stored in `~/scratch/rt_hf_ckpts`) can be passed to the `load_ckpt_path` argument in the training script.
```bash
$ pixi run torchrun --standalone --nproc_per_node=1 scripts/cntd_pretrain.py
```
If you find this work useful, please cite our paper:

```bibtex
@misc{kothapalli2026plurel,
  title={{PluRel:} Synthetic Data unlocks Scaling Laws for Relational Foundation Models},
  author={Vignesh Kothapalli and Rishabh Ranjan and Valter Hudovernik and Vijay Prakash Dwivedi and Johannes Hoffart and Carlos Guestrin and Jure Leskovec},
  year={2026},
  eprint={2602.04029},
  archivePrefix={arXiv},
  primaryClass={cs.DB},
  url={https://arxiv.org/abs/2602.04029},
}
```

If you use the architecture, training loop, or sampler code, please also cite the Relational Transformer paper:
```bibtex
@inproceedings{ranjan2025relationaltransformer,
  title={{Relational Transformer:} Toward Zero-Shot Foundation Models for Relational Data},
  author={Rishabh Ranjan and Valter Hudovernik and Mark Znidar and Charilaos Kanatsoulis and Roshan Upendra and Mahmoud Mohammadi and Joe Meyer and Tom Palczewski and Carlos Guestrin and Jure Leskovec},
  booktitle={The Fourteenth International Conference on Learning Representations},
  year={2026}
}
```
