Fix TB Logger + ObjectStore quadratic complexity issue by doing 1 file per flush #1283
Conversation
I think more context is needed here as to why we need to create a new file for every flush (and also re-initialize the logger). If the issue is network traffic, wouldn't increasing the `flush_interval` be sufficient? This seems convoluted and counter to how `SummaryWriter` is supposed to be used?
Thanks for fixing this! Overall LGTM, though see comments. Curious if you had a chance to test this manually when saving TB traces to S3 buckets, and then having TB stream from S3? Would be good to validate that this doesn't break anything, even when it works locally. Happy to help test this tomorrow.
My guess is that the authors of the SummaryWriter assumed that TensorBoard users would be training and running TensorBoard locally, not saving traces to an object store. In that use case, appending to one file is simpler and does not have O(n^2) overhead, since Linux appends do not require the entire file to be rewritten.
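For intuition, here is a back-of-the-envelope sketch of the quadratic cost (all numbers are hypothetical, purely to illustrate the growth):

```python
# Re-uploading a file that grows by k bytes per flush transfers
# k * (1 + 2 + ... + n) = k * n * (n + 1) / 2 bytes over n flushes -- O(n^2).
k = 1_000  # bytes appended per flush (assumed)
n = 1_000  # number of flushes (assumed)

total = sum(k * i for i in range(1, n + 1))
assert total == k * n * (n + 1) // 2
print(total)  # ~500 MB uploaded in total for a final log file of only ~1 MB
```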
Updated to avoid re-initializing the SummaryWriter. Turns out we can just use writer.close() to get unique files. Works with streaming TensorBoard logs from S3!
Code quality looks good to me, but I didn't test this PR myself.
Only nit: did we have some functionality where the ObjectStoreLogger would upload every new file in a folder, possibly used by the profiler traces? That might obviate the need to access the private `_file_name` attribute.
LGTM; this looks much simpler than before. Thanks! Approving; assuming we tested this locally and when "streaming" tb logs over a bucket.
Fix TB Logger + ObjectStore quadratic complexity issue by doing 1 file per flush (#1283)

This PR creates and logs to a new file for every flush by calling writer.close(), and names the artifact based on the tfevents file to prevent overwriting on S3 or other object stores. Logging to a new file for every flush solves the quadratic complexity issue, where the same file was re-uploaded to S3 as it gradually increased in size.
CO-690
This PR:
- creates and logs to a new file for every flush by calling writer.close()
- names the artifact based on the tfevents file to prevent overwriting on S3 or other object stores (see the sketch after this description)

wandb report
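How the upload side of that might look, as a hypothetical sketch (the `upload_object` callable and the dedup bookkeeping are assumptions, not the PR's actual code):

```python
import glob
import os

def upload_new_event_files(log_dir, uploaded, upload_object):
    """Upload each finalized tfevents file under its own name.

    Keying artifacts on the tfevents filename (rather than one fixed key)
    means earlier uploads are never overwritten on the object store.
    """
    for path in sorted(glob.glob(os.path.join(log_dir, "events.out.tfevents.*"))):
        name = os.path.basename(path)
        if name not in uploaded:
            upload_object(object_name=name, file_path=path)  # hypothetical client call
            uploaded.add(name)
```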