Update dependency lightning to v2.2.0 #239
Merged
This PR contains the following updates:
lightning: ==2.1.4 -> ==2.2.0
Release Notes
Lightning-AI/lightning (lightning)
v2.2.0: Lightning 2.2
Lightning AI is excited to announce the release of Lightning 2.2 ⚡
Did you know? The Lightning philosophy extends beyond a boilerplate-free deep learning framework: We've been hard at work bringing you Lightning Studio. Code together, prototype, train, deploy, host AI web apps. All from your browser, with zero setup.
While our previous release was packed with many big new features, this time around we're rolling out mainly improvements based on feedback from the community. And of course, as the name implies, this release fully supports the latest PyTorch 2.2 🎉
Highlights
Monitoring Throughput
Lightning now has built-in utilities to measure throughput metrics such as batches/sec, samples/sec and Model FLOP Utilization (MFU) (#18848).
Trainer:
For the Trainer, this comes in the form of a ThroughputMonitor callback. To track samples/sec, you provide a function that tells the monitor how to extract the batch dimension from your input. If you also want to track MFU, you can provide a sample forward pass, and the ThroughputMonitor will estimate the utilization based on the hardware you are running on. The results are automatically sent to the logger if one is configured on the Trainer.
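A minimal sketch of wiring this up with the Trainer; the tiny model, the dataset, and the batch_size_fn keyword shown here are illustrative, so treat the exact constructor signature as an assumption and check the ThroughputMonitor docs for your version:

```python
import torch
import lightning as L
from torch.utils.data import DataLoader, TensorDataset
from lightning.pytorch.callbacks import ThroughputMonitor


class TinyModule(L.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.cross_entropy(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


# batch_size_fn tells the monitor how to extract the batch dimension from
# whatever the dataloader yields (here an (inputs, targets) tuple).
monitor = ThroughputMonitor(batch_size_fn=lambda batch: batch[0].size(0))

data = DataLoader(TensorDataset(torch.randn(512, 32), torch.randint(0, 2, (512,))), batch_size=64)
trainer = L.Trainer(max_epochs=1, callbacks=[monitor])
trainer.fit(TinyModule(), data)
```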
Fabric:
For Fabric, the ThroughputMonitor is a simple utility object on which you call .update() and .compute_and_log() during the training loop. Check out our TinyLlama LLM pretraining script for a full example using Fabric's ThroughputMonitor. The throughput utilities can report metrics such as batches/sec, samples/sec, and MFU.
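A minimal Fabric loop sketch; the constructor argument and the keyword names passed to .update() and .compute_and_log() are assumptions based on the description above, so verify them against the Throughput documentation:

```python
import time
import torch
import lightning as L
from lightning.fabric.utilities import ThroughputMonitor

fabric = L.Fabric(devices=1)
fabric.launch()
throughput = ThroughputMonitor(fabric)

model = torch.nn.Linear(32, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
model, optimizer = fabric.setup(model, optimizer)

t0 = time.perf_counter()
for step in range(1, 101):
    batch = torch.randn(64, 32, device=fabric.device)
    loss = model(batch).sum()
    fabric.backward(loss)
    optimizer.step()
    optimizer.zero_grad()
    if step % 10 == 0:
        # Report the elapsed time and work done so far, then log the rates.
        throughput.update(time=time.perf_counter() - t0, batches=step, samples=step * 64)
        throughput.compute_and_log(step=step)
```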
Improved Handling of Evaluation Mode
When you train a model and have validation enabled, the Trainer automatically calls .eval() when transitioning to the validation loop and .train() when validation ends. Until now, this had the unfortunate side effect that any submodules in your LightningModule that were in evaluation mode got reset to train mode. In Lightning 2.2, the Trainer now captures the mode of every submodule before switching to validation and restores it when validation ends (#18951). This improvement helps users avoid silent correctness bugs and removes boilerplate code for managing frozen layers.
If you have overridden any of the LightningModule.on_{validation,test,predict}_model_{eval,train} hooks, they will still get called and execute your custom logic, but they are no longer required if you added them only to preserve the eval mode of frozen modules.
Converting FSDP Checkpoints
In the previous release, we introduced distributed checkpointing with FSDP to speed up saving and loading checkpoints for big models. These checkpoints use a special format: a folder with one shard file per GPU. While such checkpoints can easily be loaded back with the Lightning Trainer or Fabric, they aren't easy to load or process externally. In Lightning 2.2, we introduced a CLI utility that consolidates the checkpoint folder into a single file that can be loaded with raw PyTorch, for example via torch.load() (#19213). Given you saved a distributed checkpoint, you can then convert it as sketched below.
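A sketch of the workflow with hypothetical paths; the exact CLI flags and the default output file name may differ, so consult the documentation linked below:

```python
# Consolidate the sharded checkpoint folder from the command line, e.g.:
#
#   python -m lightning.pytorch.utilities.consolidate_checkpoint path/to/epoch=1-step=1000.ckpt
#
# (paths are hypothetical; run the module with --help for the exact options
# and the default output location).
import torch

# The consolidated result is a regular PyTorch checkpoint, loadable without Lightning:
checkpoint = torch.load("path/to/consolidated.ckpt", map_location="cpu")
print(checkpoint.keys())
```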
Read more about distributed checkpointing in our documentation: Trainer, Fabric.
Improvements to Compiling DDP/FSDP in Fabric
PyTorch 2.0+ introduced torch.compile, a powerful tool to speed up your models without changing the code. We have now added a comprehensive guide on how to use torch.compile correctly, with tips and tricks to help you troubleshoot common issues. On top of that, Fabric.setup() now re-applies torch.compile on top of DDP/FSDP if you enable these strategies (#19280). You might see fewer graph breaks, but there won't be any significant speed-ups from this alone; we introduced it mainly to make Fabric ready for future improvements in how PyTorch optimizes distributed operations.
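A minimal sketch of combining the two; the model and the device count are placeholders:

```python
import torch
import lightning as L

# Assumes a machine with two accelerators; adjust devices/strategy to your setup.
fabric = L.Fabric(accelerator="auto", devices=2, strategy="ddp")
fabric.launch()

model = torch.nn.Sequential(torch.nn.Linear(32, 64), torch.nn.ReLU(), torch.nn.Linear(64, 2))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Compile first, then hand the compiled model to Fabric. In 2.2, setup()
# re-applies torch.compile on top of the DDP/FSDP wrapper it creates.
model = torch.compile(model)
model, optimizer = fabric.setup(model, optimizer)
```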
Saving and Loading DataLoader State
If you use a dataloader/iterable that implements the .state_dict() and .load_state_dict() interface, the Trainer will now automatically save and load its state in the checkpoint (#19361). Note that the standard PyTorch DataLoader does not support this stateful interface; the feature only works with loaders that implement these two methods. A dataloader that supports full fault tolerance will be included in our upcoming release of Lightning Data, a library to optimize data preprocessing and streaming in the cloud. Stay tuned!
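A minimal sketch of such a stateful iterable; the class is hypothetical, only the .state_dict()/.load_state_dict() contract comes from the release notes:

```python
import torch


class StatefulRandomLoader:
    """A hypothetical iterable whose progress the Trainer can checkpoint and restore."""

    def __init__(self, num_batches=100, batch_size=32):
        self.num_batches = num_batches
        self.batch_size = batch_size
        self.current_batch = 0

    def __iter__(self):
        while self.current_batch < self.num_batches:
            self.current_batch += 1
            yield torch.randn(self.batch_size, 32), torch.randint(0, 2, (self.batch_size,))

    def state_dict(self):
        # Saved into the Trainer checkpoint automatically in 2.2.
        return {"current_batch": self.current_batch}

    def load_state_dict(self, state_dict):
        # Restored automatically when resuming from the checkpoint.
        self.current_batch = state_dict["current_batch"]
```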
Non-strict Checkpoint Loading in Trainer
A feature that has been requested by the community for a long time is non-strict checkpoint loading. By default, a checkpoint in PyTorch is loaded with strict=True to ensure all keys in the saved checkpoint match what's in the model's state dict. However, in some use cases it makes sense to exclude certain weights from the checkpoint. When resuming training, the user would then be required to set strict=False, which wasn't configurable until now. You can now set the attribute strict_loading=False on your LightningModule if you want to allow loading partial checkpoints (#19404). Full documentation here.
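A minimal sketch; the module and checkpoint path are hypothetical, and whether you set the attribute in __init__ or at the class level is a detail to verify against the docs:

```python
import torch
import lightning as L


class PartialModel(L.LightningModule):
    def __init__(self):
        super().__init__()
        self.backbone = torch.nn.Linear(32, 32)
        self.head = torch.nn.Linear(32, 2)
        # Tolerate checkpoints that contain only a subset of this module's weights.
        self.strict_loading = False


# Resuming then loads the checkpoint non-strictly under the hood:
# trainer.fit(PartialModel(), ckpt_path="partial.ckpt")  # hypothetical path
```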
Notable Changes
The 2.0 series of Lightning releases guarantees core API stability: no name changes, argument renaming, hook removals, etc. on core interfaces (Trainer, LightningModule, etc.) unless a feature is specifically marked experimental. Here we list a few behavioral changes that we considered justified because they significantly improve the user experience, improve performance, or fix the correctness of a feature. These changes will likely not impact most users.
ModelCheckpoint's save-last Feature
In Lightning 2.1, we made the ModelCheckpoint(..., save_last=True) feature save a symbolic link to the last saved checkpoint instead of rewriting the checkpoint (#18748). This time saver is especially useful for large models that take a while to save. However, many users were confused by the new behavior and wanted it turned off, saving a copy instead of a symbolic link like before. In Lightning 2.2, we are reverting this decision and making the linking opt-in (#19191), as sketched below.
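A short sketch of the two modes:

```python
from lightning.pytorch.callbacks import ModelCheckpoint

# Default in 2.2: "last.ckpt" is written as a full copy again.
checkpoint_callback = ModelCheckpoint(save_last=True)

# Opt back in to the 2.1 behavior of creating a symbolic link instead,
# which avoids re-writing large checkpoints.
linked_checkpoint_callback = ModelCheckpoint(save_last="link")
```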
Removed Problematic Default Seeding
The seed_everything(x) utility function sets the seed for several libraries like PyTorch, NumPy, and Python in a single line of code. Until now, you were allowed to omit the seed value, in which case the function picked a seed randomly. In certain cases, for example when processes are launched externally (e.g., SLURM, torchelastic, etc.), this default behavior is dangerous because each process independently chooses a random seed, which can affect sampling, randomized validation splits, and other behaviors that rely on every process sharing the same seed. In 2.2, we removed this default behavior and now default to a seed value of 0 (#18846). In the unlikely event that you relied on the previous behavior, you now have to choose the random seed yourself, as sketched below.
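A short sketch of the old versus new behavior:

```python
import random
from lightning.pytorch import seed_everything

seed_everything(42)  # explicit seed: behavior unchanged
seed_everything()    # 2.2: defaults to seed 0 instead of picking one at random

# If you relied on the previous behavior, draw the random seed yourself:
seed_everything(random.randint(0, 1_000_000))
```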
Miscellaneous Changes
- The columns in the metrics.csv file produced by CSVLogger are now sorted alphabetically (#19159)
- Added TransformerEnginePrecision(fallback_compute_dtype=) to control the dtype of operations that don't support fp8 (#19082)
- Renamed the TransformerEnginePrecision(dtype=) argument to weights_dtype and made it required (#19082)
- The LightningModule.load_from_checkpoint() function now calls .configure_model() on the model if it is overridden, to ensure all layers can be loaded from the checkpoint (#19036)
CHANGELOG
PyTorch Lightning
Added
- Added lightning.pytorch.callbacks.ThroughputMonitor to track throughput and log it (#18848)
- The Trainer now restores the mode set through .train() or .eval() on a submodule-level when switching from validation to training (#18951)
- Added TransformerEnginePrecision(fallback_compute_dtype=) to control the dtype of operations that don't support fp8 (#19082)
- Added ModelCheckpoint(save_last='link') to create a symbolic link for the 'last.ckpt' file (#19191)
- Added support for TQDM_MINITERS for setting the refresh rate (#19381)
- Added strategy='deepspeed_stage_1_offload' to the strategy registry (#19075)
- Added the LightningModule.strict_loading = True | False attribute (#19404)
Changed
- seed_everything() without passing in a seed no longer randomly selects a seed, and now defaults to 0 (#18846)
- The LightningModule.on_{validation,test,predict}_model_{eval,train} hooks now only get called if they are overridden by the user (#18951)
- The Trainer.fit() loop no longer calls LightningModule.train() at the start; it now preserves the user's configuration of frozen layers (#18951)
- The LightningModule.load_from_checkpoint() function now calls .configure_model() on the model if it is overridden, to ensure all layers can be loaded from the checkpoint (#19036)
- The step parameter is now used when logging metrics with NeptuneLogger (#19126)
- Renamed the TransformerEnginePrecision(dtype=) argument to weights_dtype and made it required (#19082)
- The columns in the metrics.csv file produced by CSVLogger are now sorted alphabetically (#19159)
- Reverted to saving a checkpoint copy with ModelCheckpoint(save_last=True) instead of creating a symbolic link (#19191)
Deprecated
- Deprecated the classes under lightning.pytorch.plugins with the suffix Plugin in the name (#18840)
Removed
Fixed
- Fixed an issue where the precision="transformer-engine" argument would not replace layers by default (#19082)
- Fixed an issue where layers created in LightningModule.setup or LightningModule.configure_model wouldn't get converted when using the Bitsandbytes or TransformerEngine plugins (#19061)
- Fixed the FSDPStrategy to accept a device_mesh (#19392)
Lightning Fabric
Added
- Added lightning.fabric.utilities.ThroughputMonitor and lightning.fabric.utilities.Throughput to track throughput and log it (#18848)
- Added lightning.fabric.utilities.AttributeDict for convenient dict-attribute access to represent state in scripts (#18943)
- Added TransformerEnginePrecision(fallback_compute_dtype=) to control the dtype of operations that don't support fp8 (#19082)
- torch.compile is now re-applied in Fabric.setup() over the FSDP/DDP wrappers (#19280)
Changed
- seed_everything() without passing in a seed no longer randomly selects a seed, and now defaults to 0 (#18846)
- Renamed the TransformerEnginePrecision(dtype=) argument to weights_dtype and made it required (#19082)
- The columns in the metrics.csv file produced by CSVLogger are now sorted alphabetically (#19159)
Removed
Fixed
- Fixed an issue in get_available_flops (#18952)
- Fixed an issue where the precision="transformer-engine" argument would not replace layers by default (#19082)
- Fixed the FSDPStrategy to accept a device_mesh (#19392)
Full commit list: 2.1.0 -> 2.2.0
Contributors
Everyone who contributed between 2.1 and 2.2, in no particular order:
Veteran
@nik777 @Raalsky @wouterzwerink @AleksanderWWW @awaelchli @nohalon @ioangatop @Borda @ethanwharris @BoringDonut @mauvilsa @parambharat @tchaton @ryan597 @adamjstewart @rasbt @carmocca
New
@hiaoxui @VictorPrins @jaswon @AMHermansen @JalinWang @MF-FOOM @unacanal @Jamim @harishb00 @asingh9530 @dipta007 @daturkel @jerrymannil @mjbommar @shenmishajing @paganpasta @lauritsf @andyland @mathematicalmichael
Did you know?
Chuck Norris is a big fan and daily user of PyTorch Lightning.
Configuration
📅 Schedule: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined).
🚦 Automerge: Disabled by config. Please merge this manually once you are satisfied.
♻ Rebasing: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.
🔕 Ignore: Close this PR and you won't be reminded about this update again.
This PR has been generated by Mend Renovate. View repository job log here.