Checkpoint logging and doc fixes #270

Merged: 3 commits into mosaicml:dev on Jan 26, 2022

Conversation

@ajaysaini725 (Contributor) commented Jan 25, 2022

Example of checkpoint loading:

INFO:composer.cli.launcher:Starting DDP on local node for global_rank(0-0)
/mnt/cota/ajay/composer/composer/cli/launcher.py:105: UserWarning: AutoSelectPortWarning: The DDP port was auto-selected. This may lead to race conditions when launching multiple training processes simultaneously. To eliminate this race condition, explicitely specify a port with --master_port PORT_NUMBER
  warnings.warn("AutoSelectPortWarning: The DDP port was auto-selected. "
INFO:composer.cli.launcher:DDP Store: tcp://127.0.0.1:59695
INFO:composer.cli.launcher:Launching process for local_rank(0), global_rank(0)
UserWarning: The given NumPy array is not writeable, and PyTorch does not support non-writeable tensors. This means you can write to the underlying (supposedly non-writeable) NumPy array using the tensor. You may want to copy the array to protect its data or make it writeable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at  ../torch/csrc/utils/tensor_numpy.cpp:180.) (source: /usr/local/lib/python3.8/dist-packages/torchvision/datasets/mnist.py:498)
Config
------------------------------
algorithms: []
callbacks: []
compute_training_metrics: false
datadir: null
dataloader:
  num_workers: 8
  persistent_workers: true
  pin_memory: true
  prefetch_factor: 2
  timeout: 0.0
ddp_sync_strategy: null
deepspeed: null
deterministic_mode: false
device:
  gpu: {}
dist_timeout: 15.0
eval_batch_size: 1000
eval_subset_num_batches: null
grad_accum: 1
grad_clip_norm: null
load_checkpoint:
  checkpoint: mosaic_states.pt
  chunk_size: 1048576
  load_weights_only: false
  object_store: null
  progress_bar: true
  strict_model_weights: false
log_level: INFO
loggers:
- tqdm: {}
max_duration: 10ep
model:
  mnist_classifier:
    initializers:
    - KAIMING_NORMAL
    - BN_UNIFORM
    num_classes: 10
optimizer:
  sgd:
    dampening: 0.0
    lr: 0.1
    momentum: 0.9
    nesterov: false
    weight_decay: 0.0001
precision: AMP
profiler: null
save_checkpoint: null
schedulers:
- cosine_decay:
    T_max: 10ep
    eta_min: 0.0
    interval: epoch
    verbose: false
seed: 17
train_batch_size: 2048
train_dataset:
  mnist:
    datadir: /datasets/mnist
    download: true
    drop_last: true
    is_train: true
    shuffle: true
    synthetic_device: cpu
    synthetic_memory_format: CONTIGUOUS_FORMAT
    synthetic_num_unique_samples: 100
    use_synthetic: false
train_subset_num_batches: null
val_dataset:
  mnist:
    datadir: /datasets/mnist
    download: true
    drop_last: false
    is_train: false
    shuffle: false
    synthetic_device: cpu
    synthetic_memory_format: CONTIGUOUS_FORMAT
    synthetic_num_unique_samples: 100
    use_synthetic: false
validate_every_n_batches: -1
validate_every_n_epochs: 1
------------------------------

INFO:composer.trainer.checkpoint:Trainer checkpoint loaded from mosaic_states.pt.
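
As a reading aid for what "Trainer checkpoint loaded" implies, here is a minimal plain-PyTorch sketch of full-state checkpoint loading. This is not Composer's actual implementation: the model/optimizer/scheduler stand-ins and the state-dict keys (`"model"`, `"optimizer"`, `"scheduler"`) are assumptions for illustration; only the file name `mosaic_states.pt` and the `strict_model_weights` / `load_weights_only` settings come from the config above.

```python
import torch
from torch import nn, optim

# Hypothetical stand-ins for the MNIST classifier, SGD optimizer, and
# cosine-decay scheduler described in the config dump above.
model = nn.Linear(28 * 28, 10)
optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9,
                      weight_decay=1e-4)
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10)

# Deserialize the saved trainer state; map_location="cpu" avoids
# requiring the original GPU layout.
state = torch.load("mosaic_states.pt", map_location="cpu")

# The key names below are assumed for illustration.
# strict=False mirrors `strict_model_weights: false` in the config.
model.load_state_dict(state["model"], strict=False)

# Because `load_weights_only: false`, optimizer and scheduler state
# would be restored too, so training resumes mid-schedule.
optimizer.load_state_dict(state["optimizer"])
scheduler.load_state_dict(state["scheduler"])
```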

Example of saving a checkpoint:

INFO:composer.cli.launcher:Starting DDP on local node for global_rank(0-0)
/mnt/cota/ajay/composer/composer/cli/launcher.py:105: UserWarning: AutoSelectPortWarning: The DDP port was auto-selected. This may lead to race conditions when launching multiple training processes simultaneously. To eliminate this race condition, explicitely specify a port with --master_port PORT_NUMBER
  warnings.warn("AutoSelectPortWarning: The DDP port was auto-selected. "
INFO:composer.cli.launcher:DDP Store: tcp://127.0.0.1:53247
INFO:composer.cli.launcher:Launching process for local_rank(0), global_rank(0)
UserWarning: The given NumPy array is not writeable, and PyTorch does not support non-writeable tensors. This means you can write to the underlying (supposedly non-writeable) NumPy array using the tensor. You may want to copy the array to protect its data or make it writeable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at  ../torch/csrc/utils/tensor_numpy.cpp:180.) (source: /usr/local/lib/python3.8/dist-packages/torchvision/datasets/mnist.py:498)
Config
------------------------------
algorithms: []
callbacks: []
compute_training_metrics: false
datadir: null
dataloader:
  num_workers: 8
  persistent_workers: true
  pin_memory: true
  prefetch_factor: 2
  timeout: 0.0
ddp_sync_strategy: null
deepspeed: null
deterministic_mode: false
device:
  gpu: {}
dist_timeout: 15.0
eval_batch_size: 1000
eval_subset_num_batches: null
grad_accum: 1
grad_clip_norm: null
load_checkpoint: null
log_level: INFO
loggers:
- tqdm: {}
max_duration: 10ep
model:
  mnist_classifier:
    initializers:
    - KAIMING_NORMAL
    - BN_UNIFORM
    num_classes: 10
optimizer:
  sgd:
    dampening: 0.0
    lr: 0.1
    momentum: 0.9
    nesterov: false
    weight_decay: 0.0001
precision: AMP
profiler: null
save_checkpoint:
  folder: checkpoints
  interval: 10
  interval_unit: ep
schedulers:
- cosine_decay:
    T_max: 10ep
    eta_min: 0.0
    interval: epoch
    verbose: false
seed: 17
train_batch_size: 2048
train_dataset:
  mnist:
    datadir: /datasets/mnist
    download: true
    drop_last: true
    is_train: true
    shuffle: true
    synthetic_device: cpu
    synthetic_memory_format: CONTIGUOUS_FORMAT
    synthetic_num_unique_samples: 100
    use_synthetic: false
train_subset_num_batches: null
val_dataset:
  mnist:
    datadir: /datasets/mnist
    download: true
    drop_last: false
    is_train: false
    shuffle: false
    synthetic_device: cpu
    synthetic_memory_format: CONTIGUOUS_FORMAT
    synthetic_num_unique_samples: 100
    use_synthetic: false
validate_every_n_batches: -1
validate_every_n_epochs: 1
------------------------------

Epoch 0: 100%|██████████| 29/29 [00:01<00:00, 21.93it/s, loss/train=0.2796]                                                                                                                                                                     
Epoch 1, Batch 29 (val): 100%|██████████| 10/10 [00:00<00:00, 78.61it/s, accuracy/val=0.9089]                                                                                                                                                   
Epoch 1: 100%|██████████| 29/29 [00:00<00:00, 45.42it/s, loss/train=0.1645]                                                                                                                                                                     
Epoch 2, Batch 58 (val): 100%|██████████| 10/10 [00:00<00:00, 80.64it/s, accuracy/val=0.9538]                                                                                                                                                   
Epoch 2: 100%|██████████| 29/29 [00:00<00:00, 47.89it/s, loss/train=0.0957]                                                                                                                                                                     
Epoch 3, Batch 87 (val): 100%|██████████| 10/10 [00:00<00:00, 81.33it/s, accuracy/val=0.9708]                                                                                                                                                   
Epoch 3: 100%|██████████| 29/29 [00:00<00:00, 46.96it/s, loss/train=0.1058]                                                                                                                                                                     
Epoch 4, Batch 116 (val): 100%|██████████| 10/10 [00:00<00:00, 81.68it/s, accuracy/val=0.9760]                                                                                                                                                  
Epoch 4: 100%|██████████| 29/29 [00:00<00:00, 47.22it/s, loss/train=0.0686]                                                                                                                                                                     
Epoch 5, Batch 145 (val): 100%|██████████| 10/10 [00:00<00:00, 80.40it/s, accuracy/val=0.9803]                                                                                                                                                  
Epoch 5: 100%|██████████| 29/29 [00:00<00:00, 46.34it/s, loss/train=0.0852]                                                                                                                                                                     
Epoch 6, Batch 174 (val): 100%|██████████| 10/10 [00:00<00:00, 81.10it/s, accuracy/val=0.9816]                                                                                                                                                  
Epoch 6: 100%|██████████| 29/29 [00:00<00:00, 46.63it/s, loss/train=0.0660]                                                                                                                                                                     
Epoch 7, Batch 203 (val): 100%|██████████| 10/10 [00:00<00:00, 81.81it/s, accuracy/val=0.9815]                                                                                                                                                  
Epoch 7: 100%|██████████| 29/29 [00:00<00:00, 47.10it/s, loss/train=0.0535]                                                                                                                                                                     
Epoch 8, Batch 232 (val): 100%|██████████| 10/10 [00:00<00:00, 78.34it/s, accuracy/val=0.9823]                                                                                                                                                  
Epoch 8: 100%|██████████| 29/29 [00:00<00:00, 46.71it/s, loss/train=0.0646]                                                                                                                                                                     
Epoch 9, Batch 261 (val): 100%|██████████| 10/10 [00:00<00:00, 81.50it/s, accuracy/val=0.9837]                                                                                                                                                  
Epoch 9: 100%|██████████| 29/29 [00:00<00:00, 46.53it/s, loss/train=0.0490]                                                                                                                                                                     
Epoch 10, Batch 290 (val): 100%|██████████| 10/10 [00:00<00:00, 80.95it/s, accuracy/val=0.9829]                                                                                                                                                 
INFO:composer.trainer.checkpoint:Trainer checkpoint saved to /mnt/cota/ajay/composer/runs/2022-01-26T00:42:24.243183/rank_0/checkpoints/ep10.tar  
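
The `save_checkpoint` block above (`folder` / `interval` / `interval_unit`) is what produces the `ep10.tar` file in the last log line. A rough sketch of that behavior in plain PyTorch, assuming a simple state-dict payload and the `ep{N}.tar` naming inferred from the log (not Composer's exact serialization format):

```python
import os
import torch
from torch import nn, optim

# Minimal stand-ins for the model and optimizer in the config above.
model = nn.Linear(28 * 28, 10)
optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

SAVE_FOLDER = "checkpoints"  # save_checkpoint.folder
SAVE_INTERVAL = 10           # save_checkpoint.interval (interval_unit: ep)
MAX_EPOCHS = 10              # max_duration: 10ep

os.makedirs(SAVE_FOLDER, exist_ok=True)

for epoch in range(1, MAX_EPOCHS + 1):
    ...  # one epoch of training would run here

    # Checkpoint every SAVE_INTERVAL epochs; with interval=10 only
    # ep10.tar is written, matching the log line above.
    if epoch % SAVE_INTERVAL == 0:
        path = os.path.join(SAVE_FOLDER, f"ep{epoch}.tar")
        torch.save({"model": model.state_dict(),
                    "optimizer": optimizer.state_dict()}, path)
```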

@hanlint (Contributor) commented Jan 25, 2022

Could you perhaps link to a w&b run so we can see the logging with the new log.INFO default?

@ajaysaini725 (Contributor, Author) commented:

> Could you perhaps link to a w&b run so we can see the logging with the new log.INFO default?

I included the console output of training when both saving and loading a checkpoint above.

I wasn't able to get a full training run working when loading a checkpoint from a local file, but that seems like a bug and is definitely unrelated to these extra logging statements. I filed #276 to track it and can investigate further later.
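
For context on the log.INFO default under discussion: the `INFO:composer.trainer.checkpoint:...` lines in the output above come from Python's standard `logging` module, so raising the default level to INFO is what makes them visible. A minimal illustration (the logger name is taken from the output above; this is not Composer's configuration code):

```python
import logging

# At the INFO level, messages like the checkpoint load/save lines
# above are emitted; under a WARNING-level default they are hidden.
logging.basicConfig(level=logging.INFO)

log = logging.getLogger("composer.trainer.checkpoint")
log.info("Trainer checkpoint loaded from mosaic_states.pt.")
```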

@ajaysaini725 merged commit f0f5633 into mosaicml:dev on Jan 26, 2022
A-Jacobson pushed a commit that referenced this pull request on Feb 10, 2022: "More checkpoint logging + doc fixes"
coryMosaicML pushed a commit to coryMosaicML/composer that referenced this pull request on Feb 23, 2022: "More checkpoint logging + doc fixes"