Conversation

@Magus4450 (Contributor)

Summary

  • Add support for building model components (image encoder, bytes encoder, latent transformer, bytes decoder) from JSON config files instead of requiring HuggingFace pretrained model names
  • This enables training models from scratch with custom architectures without needing to publish configs to HuggingFace first
  • Add example pretraining configs for The Pile dataset with Pythia-style hyperparameters
  • Fix streaming dataset feature detection for datasets whose features resolve to None (a minimal sketch of the fix follows this list)
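
In streaming mode, the Hugging Face datasets library can report features as None until the schema is resolved. A minimal sketch of a workaround, assuming that API (the PR's actual code may differ):

```python
# Hedged sketch of the streaming-features workaround; assumes the
# Hugging Face `datasets` API, not the PR's literal code.
from datasets import load_dataset

def get_column_names(ds):
    # Streaming (iterable) datasets may not know their schema up front,
    # in which case `ds.features` is None.
    if ds.features is not None:
        return list(ds.features)
    # Fall back to peeking at the first example to discover the columns.
    first_example = next(iter(ds))
    return list(first_example.keys())

# Illustrative dataset id; The Pile mirrors on the Hub vary.
ds = load_dataset("monology/pile-uncopyrighted", split="train", streaming=True)
print(get_column_names(ds))  # e.g. ['text', 'meta']
```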

Changes

New Arguments

Each component now accepts an optional *_config argument (a path to a JSON config file):

  • image_encoder_config
  • bytes_encoder_config
  • latent_transformer_config
  • bytes_decoder_config

When a config file is provided, it takes precedence over the corresponding *_model_name_or_path. The config file must include a model_type field (e.g., bert, gpt_neox, llama).
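
A minimal sketch of that selection logic, assuming transformers' AutoConfig.for_model and AutoModel.from_config (the PR's actual helper may be named and structured differently):

```python
# Hedged sketch: build a component from a JSON config if given,
# otherwise from a pretrained checkpoint. Not the PR's literal code.
import json
from transformers import AutoConfig, AutoModel

def build_component(config_path=None, model_name_or_path=None):
    if config_path is not None:
        # The *_config file takes precedence over *_model_name_or_path.
        with open(config_path) as f:
            cfg_dict = json.load(f)
        # `model_type` selects the architecture (bert, gpt_neox, llama, ...).
        model_type = cfg_dict.pop("model_type")
        config = AutoConfig.for_model(model_type, **cfg_dict)
        # from_config initializes weights randomly: training from scratch.
        return AutoModel.from_config(config)
    return AutoModel.from_pretrained(model_name_or_path)
```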

Example Usage

```yaml
# Use a JSON config instead of a pretrained model
bytes_encoder_model_name_or_path: null
bytes_encoder_config: configs/bert-custom.json
```

configs/bert-custom.json:

```json
{
  "model_type": "bert",
  "hidden_size": 256,
  "num_hidden_layers": 6,
  "num_attention_heads": 4,
  "intermediate_size": 1024,
  "max_position_embeddings": 512
}
```
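
As a quick sanity check (same hedged transformers API as the sketch above), the example file can be parsed before launching a run:

```python
# Verify that configs/bert-custom.json round-trips into a BERT config.
import json
from transformers import AutoConfig

with open("configs/bert-custom.json") as f:
    cfg_dict = json.load(f)
config = AutoConfig.for_model(cfg_dict.pop("model_type"), **cfg_dict)
assert config.hidden_size == 256
assert config.num_hidden_layers == 6
```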

Pretraining Configs

Added example configs in welt_training/experiments/pretrain/:

  • pile-pretrain-30m-no-image.yaml - Full pretraining config with Pythia-style hyperparameters
  • test-run.yaml - Quick sanity check config
  • bert-tiny.json - Example bytes encoder config
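
For orientation only, a hypothetical excerpt showing how a run config such as test-run.yaml might reference the example encoder config; keys beyond the documented *_config / *_model_name_or_path pair are illustrative:

```yaml
# Hypothetical excerpt; only the *_config / *_model_name_or_path keys
# are documented by this PR, the rest are illustrative.
bytes_encoder_model_name_or_path: null
bytes_encoder_config: welt_training/experiments/pretrain/bert-tiny.json
dataset_name: monology/pile-uncopyrighted
streaming: true
max_steps: 10
```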

Test plan

  • Run the test config: python -m welt_training.train welt_training/experiments/pretrain/test-run.yaml
  • Verify model builds correctly from JSON configs
  • Verify streaming dataset loading works with The Pile

🤖 Generated with Claude Code

@AmitMY (Contributor) left a comment:

looks good!

@AmitMY merged commit f863b03 into sign:main on Jan 31, 2026
1 of 2 checks passed