Checkpoint logging and doc fixes #270

Merged: 3 commits into mosaicml:dev on Jan 26, 2022

Conversation

@ajaysaini725 (Contributor) commented Jan 25, 2022

Example of checkpoint loading:

INFO:composer.cli.launcher:Starting DDP on local node for global_rank(0-0)
/mnt/cota/ajay/composer/composer/cli/launcher.py:105: UserWarning: AutoSelectPortWarning: The DDP port was auto-selected. This may lead to race conditions when launching multiple training processes simultaneously. To eliminate this race condition, explicitely specify a port with --master_port PORT_NUMBER
  warnings.warn("AutoSelectPortWarning: The DDP port was auto-selected. "
INFO:composer.cli.launcher:DDP Store: tcp://127.0.0.1:59695
INFO:composer.cli.launcher:Launching process for local_rank(0), global_rank(0)
UserWarning: The given NumPy array is not writeable, and PyTorch does not support non-writeable tensors. This means you can write to the underlying (supposedly non-writeable) NumPy array using the tensor. You may want to copy the array to protect its data or make it writeable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at  ../torch/csrc/utils/tensor_numpy.cpp:180.) (source: /usr/local/lib/python3.8/dist-packages/torchvision/datasets/mnist.py:498)
Config
------------------------------
algorithms: []
callbacks: []
compute_training_metrics: false
datadir: null
dataloader:
  num_workers: 8
  persistent_workers: true
  pin_memory: true
  prefetch_factor: 2
  timeout: 0.0
ddp_sync_strategy: null
deepspeed: null
deterministic_mode: false
device:
  gpu: {}
dist_timeout: 15.0
eval_batch_size: 1000
eval_subset_num_batches: null
grad_accum: 1
grad_clip_norm: null
load_checkpoint:
  checkpoint: mosaic_states.pt
  chunk_size: 1048576
  load_weights_only: false
  object_store: null
  progress_bar: true
  strict_model_weights: false
log_level: INFO
loggers:
- tqdm: {}
max_duration: 10ep
model:
  mnist_classifier:
    initializers:
    - KAIMING_NORMAL
    - BN_UNIFORM
    num_classes: 10
optimizer:
  sgd:
    dampening: 0.0
    lr: 0.1
    momentum: 0.9
    nesterov: false
    weight_decay: 0.0001
precision: AMP
profiler: null
save_checkpoint: null
schedulers:
- cosine_decay:
    T_max: 10ep
    eta_min: 0.0
    interval: epoch
    verbose: false
seed: 17
train_batch_size: 2048
train_dataset:
  mnist:
    datadir: /datasets/mnist
    download: true
    drop_last: true
    is_train: true
    shuffle: true
    synthetic_device: cpu
    synthetic_memory_format: CONTIGUOUS_FORMAT
    synthetic_num_unique_samples: 100
    use_synthetic: false
train_subset_num_batches: null
val_dataset:
  mnist:
    datadir: /datasets/mnist
    download: true
    drop_last: false
    is_train: false
    shuffle: false
    synthetic_device: cpu
    synthetic_memory_format: CONTIGUOUS_FORMAT
    synthetic_num_unique_samples: 100
    use_synthetic: false
validate_every_n_batches: -1
validate_every_n_epochs: 1
------------------------------

INFO:composer.trainer.checkpoint:Trainer checkpoint loaded from mosaic_states.pt.
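
As a reading aid for what "Trainer checkpoint loaded" implies, here is a minimal plain-PyTorch sketch of full-state checkpoint loading. This is not Composer's actual implementation: the model/optimizer/scheduler stand-ins and the state-dict keys (`"model"`, `"optimizer"`, `"scheduler"`) are assumptions for illustration; only the file name `mosaic_states.pt` and the `strict_model_weights` / `load_weights_only` settings come from the config above.

```python
import torch
from torch import nn, optim

# Hypothetical stand-ins for the MNIST classifier, SGD optimizer, and
# cosine-decay scheduler described in the config dump above.
model = nn.Linear(28 * 28, 10)
optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9,
                      weight_decay=1e-4)
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10)

# Deserialize the saved trainer state; map_location="cpu" avoids
# requiring the original GPU layout.
state = torch.load("mosaic_states.pt", map_location="cpu")

# The key names below are assumed for illustration.
# strict=False mirrors `strict_model_weights: false` in the config.
model.load_state_dict(state["model"], strict=False)

# Because `load_weights_only: false`, optimizer and scheduler state
# would be restored too, so training resumes mid-schedule.
optimizer.load_state_dict(state["optimizer"])
scheduler.load_state_dict(state["scheduler"])
```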

Example of saving a checkpoint:

INFO:composer.cli.launcher:Starting DDP on local node for global_rank(0-0)
/mnt/cota/ajay/composer/composer/cli/launcher.py:105: UserWarning: AutoSelectPortWarning: The DDP port was auto-selected. This may lead to race conditions when launching multiple training processes simultaneously. To eliminate this race condition, explicitely specify a port with --master_port PORT_NUMBER
  warnings.warn("AutoSelectPortWarning: The DDP port was auto-selected. "
INFO:composer.cli.launcher:DDP Store: tcp://127.0.0.1:53247
INFO:composer.cli.launcher:Launching process for local_rank(0), global_rank(0)
UserWarning: The given NumPy array is not writeable, and PyTorch does not support non-writeable tensors. This means you can write to the underlying (supposedly non-writeable) NumPy array using the tensor. You may want to copy the array to protect its data or make it writeable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at  ../torch/csrc/utils/tensor_numpy.cpp:180.) (source: /usr/local/lib/python3.8/dist-packages/torchvision/datasets/mnist.py:498)
Config
------------------------------
algorithms: []
callbacks: []
compute_training_metrics: false
datadir: null
dataloader:
  num_workers: 8
  persistent_workers: true
  pin_memory: true
  prefetch_factor: 2
  timeout: 0.0
ddp_sync_strategy: null
deepspeed: null
deterministic_mode: false
device:
  gpu: {}
dist_timeout: 15.0
eval_batch_size: 1000
eval_subset_num_batches: null
grad_accum: 1
grad_clip_norm: null
load_checkpoint: null
log_level: INFO
loggers:
- tqdm: {}
max_duration: 10ep
model:
  mnist_classifier:
    initializers:
    - KAIMING_NORMAL
    - BN_UNIFORM
    num_classes: 10
optimizer:
  sgd:
    dampening: 0.0
    lr: 0.1
    momentum: 0.9
    nesterov: false
    weight_decay: 0.0001
precision: AMP
profiler: null
save_checkpoint:
  folder: checkpoints
  interval: 10
  interval_unit: ep
schedulers:
- cosine_decay:
    T_max: 10ep
    eta_min: 0.0
    interval: epoch
    verbose: false
seed: 17
train_batch_size: 2048
train_dataset:
  mnist:
    datadir: /datasets/mnist
    download: true
    drop_last: true
    is_train: true
    shuffle: true
    synthetic_device: cpu
    synthetic_memory_format: CONTIGUOUS_FORMAT
    synthetic_num_unique_samples: 100
    use_synthetic: false
train_subset_num_batches: null
val_dataset:
  mnist:
    datadir: /datasets/mnist
    download: true
    drop_last: false
    is_train: false
    shuffle: false
    synthetic_device: cpu
    synthetic_memory_format: CONTIGUOUS_FORMAT
    synthetic_num_unique_samples: 100
    use_synthetic: false
validate_every_n_batches: -1
validate_every_n_epochs: 1
------------------------------

Epoch 0: 100%|██████████| 29/29 [00:01<00:00, 21.93it/s, loss/train=0.2796]                                                                                                                                                                     
Epoch 1, Batch 29 (val): 100%|██████████| 10/10 [00:00<00:00, 78.61it/s, accuracy/val=0.9089]                                                                                                                                                   
Epoch 1: 100%|██████████| 29/29 [00:00<00:00, 45.42it/s, loss/train=0.1645]                                                                                                                                                                     
Epoch 2, Batch 58 (val): 100%|██████████| 10/10 [00:00<00:00, 80.64it/s, accuracy/val=0.9538]                                                                                                                                                   
Epoch 2: 100%|██████████| 29/29 [00:00<00:00, 47.89it/s, loss/train=0.0957]                                                                                                                                                                     
Epoch 3, Batch 87 (val): 100%|██████████| 10/10 [00:00<00:00, 81.33it/s, accuracy/val=0.9708]                                                                                                                                                   
Epoch 3: 100%|██████████| 29/29 [00:00<00:00, 46.96it/s, loss/train=0.1058]                                                                                                                                                                     
Epoch 4, Batch 116 (val): 100%|██████████| 10/10 [00:00<00:00, 81.68it/s, accuracy/val=0.9760]                                                                                                                                                  
Epoch 4: 100%|██████████| 29/29 [00:00<00:00, 47.22it/s, loss/train=0.0686]                                                                                                                                                                     
Epoch 5, Batch 145 (val): 100%|██████████| 10/10 [00:00<00:00, 80.40it/s, accuracy/val=0.9803]                                                                                                                                                  
Epoch 5: 100%|██████████| 29/29 [00:00<00:00, 46.34it/s, loss/train=0.0852]                                                                                                                                                                     
Epoch 6, Batch 174 (val): 100%|██████████| 10/10 [00:00<00:00, 81.10it/s, accuracy/val=0.9816]                                                                                                                                                  
Epoch 6: 100%|██████████| 29/29 [00:00<00:00, 46.63it/s, loss/train=0.0660]                                                                                                                                                                     
Epoch 7, Batch 203 (val): 100%|██████████| 10/10 [00:00<00:00, 81.81it/s, accuracy/val=0.9815]                                                                                                                                                  
Epoch 7: 100%|██████████| 29/29 [00:00<00:00, 47.10it/s, loss/train=0.0535]                                                                                                                                                                     
Epoch 8, Batch 232 (val): 100%|██████████| 10/10 [00:00<00:00, 78.34it/s, accuracy/val=0.9823]                                                                                                                                                  
Epoch 8: 100%|██████████| 29/29 [00:00<00:00, 46.71it/s, loss/train=0.0646]                                                                                                                                                                     
Epoch 9, Batch 261 (val): 100%|██████████| 10/10 [00:00<00:00, 81.50it/s, accuracy/val=0.9837]                                                                                                                                                  
Epoch 9: 100%|██████████| 29/29 [00:00<00:00, 46.53it/s, loss/train=0.0490]                                                                                                                                                                     
Epoch 10, Batch 290 (val): 100%|██████████| 10/10 [00:00<00:00, 80.95it/s, accuracy/val=0.9829]                                                                                                                                                 
INFO:composer.trainer.checkpoint:Trainer checkpoint saved to /mnt/cota/ajay/composer/runs/2022-01-26T00:42:24.243183/rank_0/checkpoints/ep10.tar  
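
The `save_checkpoint` block above (`folder` / `interval` / `interval_unit`) is what produces the `ep10.tar` file in the last log line. A rough sketch of that behavior in plain PyTorch, assuming a simple state-dict payload and the `ep{N}.tar` naming inferred from the log (not Composer's exact serialization format):

```python
import os
import torch
from torch import nn, optim

# Minimal stand-ins for the model and optimizer in the config above.
model = nn.Linear(28 * 28, 10)
optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

SAVE_FOLDER = "checkpoints"  # save_checkpoint.folder
SAVE_INTERVAL = 10           # save_checkpoint.interval (interval_unit: ep)
MAX_EPOCHS = 10              # max_duration: 10ep

os.makedirs(SAVE_FOLDER, exist_ok=True)

for epoch in range(1, MAX_EPOCHS + 1):
    ...  # one epoch of training would run here

    # Checkpoint every SAVE_INTERVAL epochs; with interval=10 only
    # ep10.tar is written, matching the log line above.
    if epoch % SAVE_INTERVAL == 0:
        path = os.path.join(SAVE_FOLDER, f"ep{epoch}.tar")
        torch.save({"model": model.state_dict(),
                    "optimizer": optimizer.state_dict()}, path)
```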

@hanlint (Contributor) commented Jan 25, 2022

Could you perhaps link to a w&b run so we can see the logging with the new log.INFO default?

@ajaysaini725 (Contributor, Author) commented:

> Could you perhaps link to a w&b run so we can see the logging with the new log.INFO default?

I included the console output of training when both saving and loading a checkpoint above.

I wasn't able to get a full training run working when loading a checkpoint from a local file, but that seems like a bug and is definitely unrelated to these extra logging statements. I filed #276 to track it and can investigate further later.
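
For context on the log.INFO default under discussion: the `INFO:composer.trainer.checkpoint:...` lines in the output above come from Python's standard `logging` module, so raising the default level to INFO is what makes them visible. A minimal illustration (the logger name is taken from the output above; this is not Composer's configuration code):

```python
import logging

# At the INFO level, messages like the checkpoint load/save lines
# above are emitted; under a WARNING-level default they are hidden.
logging.basicConfig(level=logging.INFO)

log = logging.getLogger("composer.trainer.checkpoint")
log.info("Trainer checkpoint loaded from mosaic_states.pt.")
```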

@ajaysaini725 merged commit f0f5633 into mosaicml:dev on Jan 26, 2022
A-Jacobson pushed a commit that referenced this pull request on Feb 10, 2022: "More checkpoint logging + doc fixes"
coryMosaicML pushed a commit to coryMosaicML/composer that referenced this pull request on Feb 23, 2022: "More checkpoint logging + doc fixes"