Fixed loggers and callbacks #240

ravi-mosaicml · 2022-01-18T21:53:37Z

1. Removed rank zero callbacks and loggers, since these hid complexity and led to code that blocked indefinitely when using distributed functions. Closes #239 (Remove RankZeroCallback and RankZeroLogger).
2. Incremented `state.timer` _before_ calling `.eval()` in the trainer. This keeps the batch count consistent for both batch-wise and epoch-wise evaluators; this batch count is printed in the logs.
3. Fixed the TQDM logger so it works properly with gradient accumulation.
4. Removed `LogLevel.ALGORITHM`, `LogLevel.MICROBATCH`, and `LogLevel.VERBOSE`, since they were rarely used. Instead, anything verbose should probably go through Python's built-in `logging` module (it would not be a useful metric anyway), microbatch-level logging should use `BATCH` (since a microbatch is effectively like another GPU), and algorithm-level logging should use `BATCH` or `EPOCH`, depending on where the algorithm runs.
5. Updated the file logger to take a `log_interval` instead of `log_every_n_epochs` and `log_every_n_batches`, and a `flush_interval` instead of `flush_every_n_batches`.
6. Switched the default logger in all YAMLs to tqdm.
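The file logger change replaces separate per-epoch and per-batch counters with a single `log_interval`, plus an independent `flush_interval`. The sketch below illustrates that idea with a hypothetical `IntervalFileLogger` class, assuming both intervals are counted in completed batches. It is not Composer's `FileLogger`; the class name, the `on_batch_end` hook, and the line format are illustrative assumptions.

```python
import io


class IntervalFileLogger:
    """Hypothetical sketch of interval-based file logging (not Composer's FileLogger).

    Assumes both intervals are measured in completed batches.
    """

    def __init__(self, stream, log_interval=1, flush_interval=100):
        if log_interval < 1 or flush_interval < 1:
            raise ValueError("log_interval and flush_interval must be >= 1")
        self.stream = stream
        self.log_interval = log_interval
        self.flush_interval = flush_interval
        self.num_flushes = 0  # tracked only so the example can be checked

    def on_batch_end(self, batch, metrics):
        # `batch` is the 1-indexed number of completed batches (an assumption).
        if batch % self.log_interval == 0:
            self.stream.write(f"batch={batch}: {metrics}\n")
        # Flushing on its own interval lets many writes share one flush.
        if batch % self.flush_interval == 0:
            self.stream.flush()
            self.num_flushes += 1


# Usage: log every 2 batches and flush every 4 batches, over 8 batches.
buf = io.StringIO()
logger = IntervalFileLogger(buf, log_interval=2, flush_interval=4)
for batch in range(1, 9):
    logger.on_batch_end(batch, {"loss": 0.5})
# Lines are written for batches 2, 4, 6, and 8; flushes happen after batches 4 and 8.
```

Keeping the flush interval separate from the log interval means frequent writes do not each pay the cost of a flush.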


jbloxham

All looks good to me! Thanks for nixing the rank zero stuff!


ravi-mosaicml requested a review from jbloxham January 18, 2022 21:53

jbloxham approved these changes Jan 18, 2022


ravi-mosaicml added 5 commits January 18, 2022 23:18

Fixed file log levels; removed microbatch and verbose log levels

74f7617

Renamed flush_every_n_batches to flush_interval

be3389d

Fix tests

2129809

Fix tests for real this time

4eca648

Fix tqdm test on world size 2

30df1dd

ravi-mosaicml merged commit 083aff1 into dev Jan 19, 2022

ravi-mosaicml deleted the ravi/remove_rank_zero branch January 19, 2022 00:16

ravi-mosaicml mentioned this pull request Jan 19, 2022

Update TQDM to support dataloaders of unknown length #230

Closed
