Issues with saving and loading checkpoints when using multiple gpus. #15

@dhkim2810

Description

Hi, thank you for sharing your code.

I'm trying to follow your instructions, but when I run the discovery code, it fails to load the pretrained model.
My environment is

  • Ubuntu 16.04 LTS
  • 2× Nvidia RTX 3090
  • python 3.8, cuda 11.0, pytorch 1.7.1, torchvision 0.8.2
  • same version of pytorch-lightning and lightning-bolts as the repo

My errors are

Traceback (most recent call last):
File "main_discover.py", line 280, in
main(args)
File "main_discover.py", line 266, in main
model = Discoverer(**args.__dict__)
File "main_discover.py", line 70, in __init__
state_dict = torch.load(self.hparams.pretrained, map_location=self.device)
File "/home/dircon/anaconda3/envs/uno/lib/python3.8/site-packages/torch/serialization.py", line 594, in load
return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
File "/home/dircon/anaconda3/envs/uno/lib/python3.8/site-packages/torch/serialization.py", line 853, in _load
result = unpickler.load()
File "/home/dircon/anaconda3/envs/uno/lib/python3.8/site-packages/torch/serialization.py", line 845, in persistent_load
load_tensor(data_type, size, key, _maybe_decode_ascii(location))
File "/home/dircon/anaconda3/envs/uno/lib/python3.8/site-packages/torch/serialization.py", line 833, in load_tensor
storage = zip_file.get_storage_from_record(name, size, dtype).storage()
RuntimeError: [enforce fail at inline_container.cc:145] . PytorchStreamReader failed reading file data/94820505364016: invalid header or archive is corrupted

I believe it's due to distributed data parallel (DDP), but how can I stop multiple cards from saving the model?
