v0.20.0
What's New
1. New Neptune Logger
Composer now supports logging training data to neptune.ai using the NeptuneLogger. To get started:
from composer.loggers import NeptuneLogger

neptune_project = 'test_project'
neptune_api_token = 'test_token'

neptune_logger = NeptuneLogger(
    project=neptune_project,
    api_token=neptune_api_token,
    rank_zero_only=False,
    mode='debug',
    upload_artifacts=True,
)
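For example, here is a minimal sketch of attaching the logger to a Trainer (model and train_dataloader are assumed to be defined elsewhere):
from composer import Trainer

# Pass the logger to the Trainer via the `loggers` argument.
trainer = Trainer(
    model=model,
    train_dataloader=train_dataloader,
    max_duration='1ep',
    loggers=[neptune_logger],
)
trainer.fit()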
We also have an example project demonstrating all the awesome things you can do with this integration! Additional information on the NeptuneLogger can be found in the docs.
2. OOM observer callback with memory visualizations
Composer now has an OOM observer callback. When a model runs out of memory, this callback helps produce a trace that identifies memory allocations, which can be critical for designing strategies to mitigate memory usage.
Example:
from composer import Trainer
from composer.callbacks import OOMObserver
# constructing trainer object with this callback
trainer = Trainer(
    model=model,
    train_dataloader=train_dataloader,
    eval_dataloader=eval_dataloader,
    optimizers=optimizer,
    max_duration="1ep",
    callbacks=[
        OOMObserver(
            folder="traces",
            overwrite=True,
            filename="rank{rank}_oom",
            remote_filename="oci://bucket_name/{run_name}/oom_traces/rank{rank}_oom",
        )
    ],
)
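The dumped traces are standard PyTorch memory snapshots. As a sketch of how you might render one to an HTML timeline offline (the snapshot filename below is hypothetical, and trace_plot is a private PyTorch helper that may change between releases):
import pickle

from torch.cuda._memory_viz import trace_plot

# Load a snapshot dumped by the callback (hypothetical filename) and
# write an interactive HTML timeline of the recorded allocations.
with open('traces/rank0_oom_snapshot.pickle', 'rb') as f:
    snapshot = pickle.load(f)

with open('oom_trace.html', 'w') as f:
    f.write(trace_plot(snapshot))
You can also drag a snapshot file into https://pytorch.org/memory_viz to browse it interactively.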
OOM visualization: (example memory timeline image)
3. Log all GPU rank stdout/stderr to the MosaicML platform
Composer has expanded its integration with the MosaicML platform. You can now view the stdout/stderr of every GPU rank with MCLI logs, enabling more comprehensive analysis of jobs.
Example:
mcli logs <run-name> --node x --gpu x
Note that this defaults to node rank 0 if --node is not provided.
You can also find the logs of any global GPU rank with the command:
mcli logs <run-name> --global-gpu-rank x
Bug Fixes
- Only save RNG on rank 0 by @mvpatel2000 in #2998
- [Auto-microbatch fix] FSDP reshard and cleanup after OOM to fix the cuda memory leak by @bigning in #3030
- Fix skip_first for profiler during resumption by @bigning in #2986
- Race condition fix in checkpoint loading util by @jessechancy in #3001
What's Changed
- Remove .ci folder and move FILE_HEADER and CODEOWNERS by @irenedea in #2957
- Modify UCObjectStore.list_objects to list all files recursively by @irenedea in #2959
- Refactor MemorySnapshot by @cli99 in #2960
- Log all gpu rank stdout/err to MosaicML platform by @jjanezhang in #2839
- Add Torch 2.2 tests by @mvpatel2000 in #2970
- Memory snapshot dump pickle by @cli99 in #2968
- Neptune logger by @AleksanderWWW in #2447
- Fix torch pins in tests by @mvpatel2000 in #2973
- Add a register_model_with_run_id api to MLflowLogger by @dakinggg in #2967
- Remove bespoke codeowners by @mvpatel2000 in #2971
- Add a BEFORE_LOAD event by @snarayan21 in #2974
- More torch 2.2 fixes by @mvpatel2000 in #2975
- Adding the step argument to logger.log_table by @ShashankMosaicML in #2961
- Fix daily tests for torch 2.2 by @mvpatel2000 in #2980
- Format load_path with name by @mvpatel2000 in #2978
- Bump to 0.19.1 by @mvpatel2000 in #2979
- Fix UC object store bugfix by @nancyhung in #2982
- [Bugfix][UC] Add back the full object path by @nancyhung in #2988
- Minor cleanup of UC get_object_size by @dakinggg in #2989
- Pin UC to earlier version by @dakinggg in #2990
- Revert "fix skip_first for resumption" by @bigning in #2991
- Broadcast files for HSDP by @mvpatel2000 in #2914
- Bump ipykernel from 6.29.0 to 6.29.2 by @dependabot in #2994
- Bump yamllint from 1.33.0 to 1.34.0 by @dependabot in #2995
- Refactor update_metric by @maxisawesome in #2965
- Add azure integration test by @mvpatel2000 in #2996
- Fix Profiler schedule skip_first by @bigning in #2992
- Remove planner validation by @mvpatel2000 in #2985
- Fix load for non-HSDP device mesh by @mvpatel2000 in #2997
- Update NCCL arg since torch deprecated old one by @mvpatel2000 in #3000
- Add bias argument to LPLN by @mvpatel2000 in #2999
- Revert "Add bias argument to LPLN" by @mvpatel2000 in #3003
- Revert "Update NCCL arg since torch deprecated old one" by @mvpatel2000 in #3004
- Add torch 2.3 image for aws cluster by @j316chuck in #3002
- Patch torch 2.3 aws naming by @j316chuck in #3006
- Add debug log before training loop starts by @mvpatel2000 in #3005
- Deprecate ffcv code by @j316chuck in #3007
- Remove log for mosaicml logger by @mvpatel2000 in #3008
- [EASY] Always log 1st batch when resuming training by @bigning in #3009
- Use reusable actions for linting by @b-chu in #2948
- Make CodeEval respect device_eval_batch_size by @josejg in #2969
- Use Mosaic constant for GPU file prefix by @jjanezhang in #3018
- Fall back to normal logging when gpu prefix is not present by @jjanezhang in #3020
- Revert "Use reusable actions for linting" to fix CI/CD by @mvpatel2000 in #3023
- Change to pull_request_target by @b-chu in #3025
- Bump gitpython from 3.1.41 to 3.1.42 by @dependabot in #3031
- Bump yamllint from 1.34.0 to 1.35.1 by @dependabot in #3034
- Update torchmetrics requirement from <1.3.1,>=0.10.0 to >=0.10.0,<1.3.2 by @dependabot in #3035
- Bump pypandoc from 1.12 to 1.13 by @dependabot in #3033
- Add tensorboard images support by @Menduist in #3021
- Add sorted to logs for checkpoint broadcast by @mvpatel2000 in #3036
- Friendlier device mesh error by @mvpatel2000 in #3039
- Upgrade to python3.11 for torch nightly by @j316chuck in #3038
- Download symlink once by @mvpatel2000 in #3043
- Add min size to OCI download by @mvpatel2000 in #3044
- Lint fix by @mvpatel2000 in #3045
- Revert "Change to pull_request_target " by @mvpatel2000 in #3047
- Bump composer version 0.19.2 by @j316chuck in #3048
- Update XLA support by @bfontain in #2964
- Bump composer version 0.20.0 by @j316chuck in #3051
- Update ruff. Fix PLE & LOG lints by @Skylion007 in #3050
New Contributors
- @AleksanderWWW made their first contribution in #2447
- @ShashankMosaicML made their first contribution in #2961
- @nancyhung made their first contribution in #2982
- @bigning made their first contribution in #2986
- @jessechancy made their first contribution in #3001
- @josejg made their first contribution in #2969
- @Menduist made their first contribution in #3021
- @bfontain made their first contribution in #2964
Full Changelog: v0.19.1...v0.20.0