
Fix TB Logger + ObjectStore quadratic complexity issue by doing 1 file per flush #1283

Merged (9 commits) Jul 15, 2022

Conversation

@eracah (Contributor) commented Jul 14, 2022

CO-690

This PR:

  • creates and logs to a new file for every flush, by calling writer.close()
  • names each uploaded artifact after its tfevents file, so successive uploads do not overwrite one another on S3 or another object store
  • solves the quadratic complexity issue: previously the same file was re-uploaded to S3 on every flush as it gradually grew in size

wandb report
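The close-per-flush pattern can be sketched with a toy stand-in. This is not the actual composer implementation; the `PerFlushWriter` class and its file naming are hypothetical, mimicking only the idea that each flush finalizes the current tfevents-style file and the next write opens a fresh one with a distinct name:

```python
import os
import tempfile

class PerFlushWriter:
    """Toy stand-in (hypothetical) for a TensorBoard-style writer.
    Each flush closes the current event file so it is never appended
    to again; the next write opens a brand-new file."""

    def __init__(self, log_dir):
        self.log_dir = log_dir
        self._file = None
        self._counter = 0

    def _open(self):
        # New file per flush cycle; the name mimics a tfevents file so
        # the uploaded artifact name is unique for every flush.
        self._counter += 1
        path = os.path.join(self.log_dir, f"events.out.tfevents.{self._counter}")
        self._file = open(path, "a")

    def add_scalar(self, tag, value):
        if self._file is None:
            self._open()
        self._file.write(f"{tag}={value}\n")

    def flush(self):
        # Close the file so this flush produces a distinct, finished
        # artifact; return the name the upload would use.
        if self._file is None:
            return None
        name = os.path.basename(self._file.name)
        self._file.close()
        self._file = None
        return name

with tempfile.TemporaryDirectory() as d:
    w = PerFlushWriter(d)
    w.add_scalar("loss", 1.0)
    first = w.flush()
    w.add_scalar("loss", 0.5)
    second = w.flush()
    print(first != second)  # distinct names, so no overwrite on the object store
```

Because each flush yields a file that is never touched again, every upload transfers only new data instead of the entire growing log.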

@eracah eracah marked this pull request as ready for review July 14, 2022 01:51
@eracah eracah changed the title Fix quadratic issue by doing 1 file per flush Fix TB Logger + ObjectStore quadratic complexity issue by doing 1 file per flush Jul 14, 2022
@hanlint (Contributor) left a comment:
I think more context is needed here as to why we need to create a new file for every flush (and also re-initialize the logger). If the issue is network traffic, wouldn't increasing the flush_interval be sufficient? This seems convoluted and counter to how SummaryWriter is supposed to be used?

@ravi-mosaicml (Contributor) left a comment:

Thanks for fixing this! Overall LGTM, though see comments. Curious if you had a chance to test this manually when saving TB traces to S3 buckets, and then having TB stream from S3? Would be good to validate that this doesn't break anything, even when it works locally. Happy to help test this tomorrow.

@ravi-mosaicml (Contributor) replied:

This seems convoluted and counter to how SummaryWriter is supposed to be used?

My guess is that the authors of SummaryWriter assumed tensorboard users would be training and running tensorboard locally, not saving traces to an object store. In that use case, appending to one file is simpler and does not have O(n^2) overhead, since Linux appends do not require rewriting the entire file.
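The asymmetry described above can be made concrete with a little arithmetic: if each flush adds b bytes and the whole file is re-uploaded on every flush, flush k transfers k·b bytes, so n flushes transfer b·n(n+1)/2 bytes in total, versus just b·n with one file per flush. A back-of-the-envelope sketch (the byte counts are illustrative, not measured from the PR):

```python
def reupload_bytes(n_flushes, bytes_per_flush):
    # Single growing file: flush k re-uploads everything written so far,
    # i.e. k * bytes_per_flush, giving a quadratic total.
    return sum(k * bytes_per_flush for k in range(1, n_flushes + 1))

def per_flush_bytes(n_flushes, bytes_per_flush):
    # One new file per flush: each upload carries only the new data.
    return n_flushes * bytes_per_flush

# 1000 flushes of 1 KiB each: ~512 MB uploaded vs ~1 MB of actual log data.
print(reupload_bytes(1000, 1024))
print(per_flush_bytes(1000, 1024))
```

The gap widens linearly with run length, which is why long training runs made the growing-file behavior painful.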

@eracah (Contributor, Author) commented Jul 14, 2022:

Thanks for fixing this! Overall LGTM, though see comments. Curious if you had a chance to test this manually when saving TB traces to S3 buckets, and then having TB stream from S3? Would be good to validate that this doesn't break anything, even when it works locally. Happy to help test this tomorrow.

Works with streaming tensorboard logs from S3!

@eracah eracah requested review from hanlint and ravi-mosaicml July 14, 2022 23:31
@hanlint (Contributor) left a comment:

Code quality looks good to me, but I didn't test this PR myself.

Only nit: did we have functionality where the ObjectStoreLogger would upload every new file in a folder (possibly used by the profiler traces)? That might obviate the need to reference the private _file_name variable.

@ravi-mosaicml (Contributor) left a comment:

LGTM; this is much simpler than before. Thanks! Approving, assuming we tested this locally and when "streaming" TB logs from a bucket.

@ravi-mosaicml ravi-mosaicml merged commit d5e9305 into mosaicml:dev Jul 15, 2022
ravi-mosaicml pushed a commit that referenced this pull request Jul 16, 2022
…e per flush (#1283)

This PR creates and logs to a new file for every flush by calling writer.close(), and names the artifact based on the tfevents file to prevent overwriting on S3 or other object stores.

Logging to a new file for every flush solves the quadratic complexity issue, where the same file was re-sent to S3 as it gradually grew in size.
3 participants