Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix deterministic mode (and use it for tests); simplify checkpointing tests #203

Merged
merged 13 commits into from
Jan 6, 2022

Conversation

ravi-mosaicml
Copy link
Contributor

@ravi-mosaicml ravi-mosaicml commented Jan 6, 2022

  1. Deterministic mode on GPUs requires the CUBLAS_WORKSPACE_CONFIG. Setting this variable when configuring deterministic mode.
  2. Renamed composer.utils.determinism to composer.utils.reproducibility to match the pytorch naming in the docs
  3. Configured tests to use deterministic mode by default.
  4. Configured checkpoints to not compress by default, since compression is slow
  5. Made the gpu and deepspeed markers mutually exclusive, since they require different process initialization

1. Deterministic mode on GPUs requires the CUBLAS_WORKSPACE_CONFIG environment variable to be set _before_ the training script is launched. The launch script can properly set this variable
2. Renamed composer.utils.determinism to composer.utils.reproducability to match the pytorch naming in the docs
3. Switched the testrunner to use deterministic mode
4. Renamed the run directory environment variable to COMPOSER_RUN_DIRECTORY, so it would have a parallel name to COMPOSER_USE_DETERMINISTIC_MODE
Copy link
Contributor

@Averylamp Averylamp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should remove deterministic as an arg to Trainer, but other than that looks fine

@ravi-mosaicml ravi-mosaicml changed the title Move deterministic mode to the launch script; use it for tests Fix deterministic mode (and use it for tests); simplify checkpointing tests Jan 6, 2022
Copy link
Contributor

@Averylamp Averylamp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm,
Deferring to Jamie on deepspeed tests.
Also am noticing that our root conftest.py is getting large again, when we do have our own fixtures subfolder, so bonus points for moving deepspeed fixtures into its own deepspeed fixtures folder

@ravi-mosaicml ravi-mosaicml merged commit 79d0bec into dev Jan 6, 2022
@ravi-mosaicml ravi-mosaicml deleted the ravi/fix_determinism branch January 6, 2022 23:27
coryMosaicML pushed a commit to coryMosaicML/composer that referenced this pull request Feb 23, 2022
… tests (mosaicml#203)

1. Deterministic mode on GPUs requires the `CUBLAS_WORKSPACE_CONFIG`. Setting this variable when configuring deterministic mode.
2. Renamed composer.utils.determinism to composer.utils.reproducibility to match the pytorch naming in the docs
3. Configured tests to use deterministic mode by default.
4. Configured checkpoints to not compress by default, since compression is slow
5. Made the gpu and deepspeed markers mutually exclusive, since they require different process initialization
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants