Issues with saving and loading checkpoints when using multiple gpus. #15

@dhkim2810

Description

Hi, thank you for sharing your code.

I'm trying to follow your instructions, but when I run the discovery code, it fails to load the pretrained model.
My environment is

  • Ubuntu 16.04 LTS
  • 2× Nvidia RTX 3090
  • python 3.8, cuda 11.0, pytorch 1.7.1, torchvision 0.8.2
  • same version of pytorch-lightning and lightning-bolts as the repo

My errors are

Traceback (most recent call last):
File "main_discover.py", line 280, in
main(args)
File "main_discover.py", line 266, in main
model = Discoverer(**args.__dict__)
File "main_discover.py", line 70, in __init__
state_dict = torch.load(self.hparams.pretrained, map_location=self.device)
File "/home/dircon/anaconda3/envs/uno/lib/python3.8/site-packages/torch/serialization.py", line 594, in load
return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
File "/home/dircon/anaconda3/envs/uno/lib/python3.8/site-packages/torch/serialization.py", line 853, in _load
result = unpickler.load()
File "/home/dircon/anaconda3/envs/uno/lib/python3.8/site-packages/torch/serialization.py", line 845, in persistent_load
load_tensor(data_type, size, key, _maybe_decode_ascii(location))
File "/home/dircon/anaconda3/envs/uno/lib/python3.8/site-packages/torch/serialization.py", line 833, in load_tensor
storage = zip_file.get_storage_from_record(name, size, dtype).storage()
RuntimeError: [enforce fail at inline_container.cc:145] . PytorchStreamReader failed reading file data/94820505364016: invalid header or archive is corrupted

I believe it's due to distributed data parallel (DDP), but how can I stop multiple cards from saving the model?
