v0.20.0
What's New
1. New Neptune Logger
Composer now supports logging training data to neptune.ai using the NeptuneLogger. To get started:
from composer.loggers import NeptuneLogger

neptune_project = 'test_project'
neptune_api_token = 'test_token'

neptune_logger = NeptuneLogger(
    project=neptune_project,
    api_token=neptune_api_token,
    rank_zero_only=False,
    mode='debug',
    upload_artifacts=True,
)
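For example, here is a minimal sketch of attaching the logger to a Trainer (model and train_dataloader are assumed to be defined elsewhere):
from composer import Trainer

# Pass the logger to the Trainer via the `loggers` argument.
trainer = Trainer(
    model=model,
    train_dataloader=train_dataloader,
    max_duration='1ep',
    loggers=[neptune_logger],
)
trainer.fit()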
We also have an example project demonstrating all the awesome things you can do with this integration! Additional information on the NeptuneLogger can be found in the docs.
2. OOM observer callback with memory visualizations
Composer now has an OOM observer callback. When a model runs out of memory, this callback helps produce a trace that identifies memory allocations, which can be critical for designing strategies to mitigate memory usage.
Example:
from composer import Trainer
from composer.callbacks import OOMObserver
# constructing trainer object with this callback
trainer = Trainer(
    model=model,
    train_dataloader=train_dataloader,
    eval_dataloader=eval_dataloader,
    optimizers=optimizer,
    max_duration="1ep",
    callbacks=[
        OOMObserver(
            folder="traces",
            overwrite=True,
            filename="rank{rank}_oom",
            remote_filename="oci://bucket_name/{run_name}/oom_traces/rank{rank}_oom",
        )
    ],
)
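The dumped traces are standard PyTorch memory snapshots. As a sketch of how you might render one to an HTML timeline offline (the snapshot filename below is hypothetical, and trace_plot is a private PyTorch helper that may change between releases):
import pickle

from torch.cuda._memory_viz import trace_plot

# Load a snapshot dumped by the callback (hypothetical filename) and
# write an interactive HTML timeline of the recorded allocations.
with open('traces/rank0_oom_snapshot.pickle', 'rb') as f:
    snapshot = pickle.load(f)

with open('oom_trace.html', 'w') as f:
    f.write(trace_plot(snapshot))
You can also drag a snapshot file into https://pytorch.org/memory_viz to browse it interactively.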
OOM visualization: (example memory timeline image)
3. Log all GPU rank stdout/stderr to the MosaicML platform
Composer has expanded its integration with the MosaicML platform. You can now view the stdout/stderr of every GPU rank with MCLI logs, enabling more comprehensive analysis of jobs.
Example:
mcli logs <run-name> --node x --gpu x
Note that this defaults to node rank 0 if --node is not provided.
You can also find the logs of any global GPU rank with the command:
mcli logs <run-name> --global-gpu-rank x
Bug Fixes
- Only save RNG on rank 0 by @mvpatel2000 in #2998
- [Auto-microbatch fix] FSDP reshard and cleanup after OOM to fix the cuda memory leak by @bigning in #3030
- Fix skip_first for profiler during resumption by @bigning in #2986
- Race condition fix in checkpoint loading util by @jessechancy in #3001
What's Changed
- Remove .ci folder and move FILE_HEADER and CODEOWNERS by @irenedea in #2957
- Modify UCObjectStore.list_objects to list all files recursively by @irenedea in #2959
- Refactor MemorySnapshot by @cli99 in #2960
- Log all gpu rank stdout/err to MosaicML platform by @jjanezhang in #2839
- Add Torch 2.2 tests by @mvpatel2000 in #2970
- Memory snapshot dump pickle by @cli99 in #2968
- Neptune logger by @AleksanderWWW in #2447
- Fix torch pins in tests by @mvpatel2000 in #2973
- Add a register_model_with_run_id api to MLflowLogger by @dakinggg in #2967
- Remove bespoke codeowners by @mvpatel2000 in #2971
- Add a BEFORE_LOAD event by @snarayan21 in #2974
- More torch 2.2 fixes by @mvpatel2000 in #2975
- Adding the step argument to logger.log_table by @ShashankMosaicML in #2961
- Fix daily tests for torch 2.2 by @mvpatel2000 in #2980
- Format load_path with name by @mvpatel2000 in #2978
- Bump to 0.19.1 by @mvpatel2000 in #2979
- Fix UC object store bugfix by @nancyhung in #2982
- [Bugfix][UC] Add back the full object path by @nancyhung in #2988
- Minor cleanup of UC get_object_size by @dakinggg in #2989
- Pin UC to earlier version by @dakinggg in #2990
- Revert "fix skip_first for resumption" by @bigning in #2991
- Broadcast files for HSDP by @mvpatel2000 in #2914
- Bump ipykernel from 6.29.0 to 6.29.2 by @dependabot in #2994
- Bump yamllint from 1.33.0 to 1.34.0 by @dependabot in #2995
- Refactor update_metric by @maxisawesome in #2965
- Add azure integration test by @mvpatel2000 in #2996
- Fix Profiler schedule skip_first by @bigning in #2992
- Remove planner validation by @mvpatel2000 in #2985
- Fix load for non-HSDP device mesh by @mvpatel2000 in #2997
- Update NCCL arg since torch deprecated old one by @mvpatel2000 in #3000
- Add bias argument to LPLN by @mvpatel2000 in #2999
- Revert "Add bias argument to LPLN" by @mvpatel2000 in #3003
- Revert "Update NCCL arg since torch deprecated old one" by @mvpatel2000 in #3004
- Add torch 2.3 image for aws cluster by @j316chuck in #3002
- Patch torch 2.3 aws naming by @j316chuck in #3006
- Add debug log before training loop starts by @mvpatel2000 in #3005
- Deprecate ffcv code by @j316chuck in #3007
- Remove log for mosaicml logger by @mvpatel2000 in #3008
- [EASY] Always log 1st batch when resuming training by @bigning in #3009
- Use reusable actions for linting by @b-chu in #2948
- Make CodeEval respect device_eval_batch_size by @josejg in #2969
- Use Mosaic constant for GPU file prefix by @jjanezhang in #3018
- Fall back to normal logging when gpu prefix is not present by @jjanezhang in #3020
- Revert "Use reusable actions for linting" to fix CI/CD by @mvpatel2000 in #3023
- Change to pull_request_target by @b-chu in #3025
- Bump gitpython from 3.1.41 to 3.1.42 by @dependabot in #3031
- Bump yamllint from 1.34.0 to 1.35.1 by @dependabot in #3034
- Update torchmetrics requirement from <1.3.1,>=0.10.0 to >=0.10.0,<1.3.2 by @dependabot in #3035
- Bump pypandoc from 1.12 to 1.13 by @dependabot in #3033
- Add tensorboard images support by @Menduist in #3021
- Add sorted to logs for checkpoint broadcast by @mvpatel2000 in #3036
- Friendlier device mesh error by @mvpatel2000 in #3039
- Upgrade to python3.11 for torch nightly by @j316chuck in #3038
- Download symlink once by @mvpatel2000 in #3043
- Add min size to OCI download by @mvpatel2000 in #3044
- Lint fix by @mvpatel2000 in #3045
- Revert "Change to pull_request_target " by @mvpatel2000 in #3047
- Bump composer version 0.19.2 by @j316chuck in #3048
- Update XLA support by @bfontain in #2964
- Bump composer version 0.20.0 by @j316chuck in #3051
- Update ruff. Fix PLE & LOG lints by @Skylion007 in #3050
New Contributors
- @AleksanderWWW made their first contribution in #2447
- @ShashankMosaicML made their first contribution in #2961
- @nancyhung made their first contribution in #2982
- @bigning made their first contribution in #2986
- @jessechancy made their first contribution in #3001
- @josejg made their first contribution in #2969
- @Menduist made their first contribution in #3021
- @bfontain made their first contribution in #2964
Full Changelog: v0.19.1...v0.20.0