Training: make resuming training from a checkpoint simple #1025

@trivoldus28

It is currently quite hard to resume training from a checkpoint: you have to hardcode the path to a specific checkpoint.

It is also buggy:

  • A resumed run starts counting iterations from 0 again, so its new checkpoints overwrite the existing ones unless you use a different path
  • wandb also overwrites old snapshots with the new ones
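
One possible direction, sketched below, is to discover the newest checkpoint automatically and continue the iteration count from it instead of restarting at 0. This is only an illustration and not the project's actual code: the file naming scheme `model_checkpoint_<iteration>`, the helper names `find_latest_checkpoint` and `resume_iteration`, and the convention that checkpoint N is written after iteration N completes are all assumptions.

```python
import re
from pathlib import Path
from typing import Optional, Tuple

# Hypothetical naming scheme: "model_checkpoint_<iteration>".
CHECKPOINT_RE = re.compile(r"^model_checkpoint_(\d+)$")


def find_latest_checkpoint(checkpoint_dir) -> Optional[Tuple[Path, int]]:
    """Return (path, iteration) of the highest-numbered checkpoint, or None."""
    directory = Path(checkpoint_dir)
    if not directory.is_dir():
        return None
    best = None
    for path in directory.iterdir():
        match = CHECKPOINT_RE.match(path.name)
        if match is None:
            continue  # ignore files that are not checkpoints
        iteration = int(match.group(1))
        if best is None or iteration > best[1]:
            best = (path, iteration)
    return best


def resume_iteration(checkpoint_dir) -> int:
    """Return the first iteration still to run.

    Assumes checkpoint N is saved after iteration N has completed, so a
    resumed run continues at N + 1; a fresh run starts at 0.
    """
    latest = find_latest_checkpoint(checkpoint_dir)
    return 0 if latest is None else latest[1] + 1
```

With something like this, the training loop could start at `resume_iteration(checkpoint_dir)`, so later checkpoints get higher numbers rather than overwriting earlier ones. The wandb side might be handled separately, for example by reusing the run via the `id` and `resume` arguments of `wandb.init` and logging with the continued step count, though whether that fits this project's logging setup is untested here.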
