Conversation

@Magus4450 (Contributor)

Summary

  • Add support for building model components (image encoder, bytes encoder, latent transformer, bytes decoder) from JSON config files instead of requiring HuggingFace pretrained model names
  • This enables training models from scratch with custom architectures without needing to publish configs to HuggingFace first
  • Add example pretraining configs for The Pile dataset with Pythia-style hyperparameters
  • Fix streaming dataset feature detection for datasets whose features resolve to None (a minimal sketch of the fix follows this list)
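
In streaming mode, the Hugging Face datasets library can report features as None until the schema is resolved. A minimal sketch of a workaround, assuming that API (the PR's actual code may differ):

```python
# Hedged sketch of the streaming-features workaround; assumes the
# Hugging Face `datasets` API, not the PR's literal code.
from datasets import load_dataset

def get_column_names(ds):
    # Streaming (iterable) datasets may not know their schema up front,
    # in which case `ds.features` is None.
    if ds.features is not None:
        return list(ds.features)
    # Fall back to peeking at the first example to discover the columns.
    first_example = next(iter(ds))
    return list(first_example.keys())

# Illustrative dataset id; The Pile mirrors on the Hub vary.
ds = load_dataset("monology/pile-uncopyrighted", split="train", streaming=True)
print(get_column_names(ds))  # e.g. ['text', 'meta']
```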

Changes

New Arguments

Each component now accepts an optional *_config argument (a path to a JSON config file):

  • image_encoder_config
  • bytes_encoder_config
  • latent_transformer_config
  • bytes_decoder_config

When a config file is provided, it takes precedence over the corresponding *_model_name_or_path. The config file must include a model_type field (e.g., bert, gpt_neox, llama).
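
A minimal sketch of that selection logic, assuming transformers' AutoConfig.for_model and AutoModel.from_config (the PR's actual helper may be named and structured differently):

```python
# Hedged sketch: build a component from a JSON config if given,
# otherwise from a pretrained checkpoint. Not the PR's literal code.
import json
from transformers import AutoConfig, AutoModel

def build_component(config_path=None, model_name_or_path=None):
    if config_path is not None:
        # The *_config file takes precedence over *_model_name_or_path.
        with open(config_path) as f:
            cfg_dict = json.load(f)
        # `model_type` selects the architecture (bert, gpt_neox, llama, ...).
        model_type = cfg_dict.pop("model_type")
        config = AutoConfig.for_model(model_type, **cfg_dict)
        # from_config initializes weights randomly: training from scratch.
        return AutoModel.from_config(config)
    return AutoModel.from_pretrained(model_name_or_path)
```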

Example Usage

```yaml
# Use a JSON config instead of a pretrained model
bytes_encoder_model_name_or_path: null
bytes_encoder_config: configs/bert-custom.json
```

configs/bert-custom.json:

```json
{
  "model_type": "bert",
  "hidden_size": 256,
  "num_hidden_layers": 6,
  "num_attention_heads": 4,
  "intermediate_size": 1024,
  "max_position_embeddings": 512
}
```
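
As a quick sanity check (same hedged transformers API as the sketch above), the example file can be parsed before launching a run:

```python
# Verify that configs/bert-custom.json round-trips into a BERT config.
import json
from transformers import AutoConfig

with open("configs/bert-custom.json") as f:
    cfg_dict = json.load(f)
config = AutoConfig.for_model(cfg_dict.pop("model_type"), **cfg_dict)
assert config.hidden_size == 256
assert config.num_hidden_layers == 6
```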

Pretraining Configs

Added example configs in welt_training/experiments/pretrain/:

  • pile-pretrain-30m-no-image.yaml - Full pretraining config with Pythia-style hyperparameters
  • test-run.yaml - Quick sanity check config
  • bert-tiny.json - Example bytes encoder config
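
For orientation only, a hypothetical excerpt showing how a run config such as test-run.yaml might reference the example encoder config; keys beyond the documented *_config / *_model_name_or_path pair are illustrative:

```yaml
# Hypothetical excerpt; only the *_config / *_model_name_or_path keys
# are documented by this PR, the rest are illustrative.
bytes_encoder_model_name_or_path: null
bytes_encoder_config: welt_training/experiments/pretrain/bert-tiny.json
dataset_name: monology/pile-uncopyrighted
streaming: true
max_steps: 10
```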

Test plan

  • Run the test config: python -m welt_training.train welt_training/experiments/pretrain/test-run.yaml
  • Verify model builds correctly from JSON configs
  • Verify streaming dataset loading works with The Pile

🤖 Generated with Claude Code

@AmitMY (Contributor) left a comment:

looks good!

@AmitMY merged commit f863b03 into sign:main on Jan 31, 2026
1 of 2 checks passed