
Fix TB Logger + ObjectStore quadratic complexity issue by doing 1 file per flush #1283

Merged (9 commits) Jul 15, 2022

Conversation

@eracah (Contributor) commented Jul 14, 2022

CO-690

This PR:

  • creates and logs to a new file for every flush, by calling writer.close()
  • names each uploaded artifact after its tfevents file, so successive uploads do not overwrite one another on S3 or another object store
  • solves the quadratic complexity issue: previously the same file was re-uploaded to S3 on every flush as it gradually grew in size

wandb report
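The close-per-flush pattern can be sketched with a toy stand-in. This is not the actual composer implementation; the `PerFlushWriter` class and its file naming are hypothetical, mimicking only the idea that each flush finalizes the current tfevents-style file and the next write opens a fresh one with a distinct name:

```python
import os
import tempfile

class PerFlushWriter:
    """Toy stand-in (hypothetical) for a TensorBoard-style writer.
    Each flush closes the current event file so it is never appended
    to again; the next write opens a brand-new file."""

    def __init__(self, log_dir):
        self.log_dir = log_dir
        self._file = None
        self._counter = 0

    def _open(self):
        # New file per flush cycle; the name mimics a tfevents file so
        # the uploaded artifact name is unique for every flush.
        self._counter += 1
        path = os.path.join(self.log_dir, f"events.out.tfevents.{self._counter}")
        self._file = open(path, "a")

    def add_scalar(self, tag, value):
        if self._file is None:
            self._open()
        self._file.write(f"{tag}={value}\n")

    def flush(self):
        # Close the file so this flush produces a distinct, finished
        # artifact; return the name the upload would use.
        if self._file is None:
            return None
        name = os.path.basename(self._file.name)
        self._file.close()
        self._file = None
        return name

with tempfile.TemporaryDirectory() as d:
    w = PerFlushWriter(d)
    w.add_scalar("loss", 1.0)
    first = w.flush()
    w.add_scalar("loss", 0.5)
    second = w.flush()
    print(first != second)  # distinct names, so no overwrite on the object store
```

Because each flush yields a file that is never touched again, every upload transfers only new data instead of the entire growing log.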

@eracah eracah marked this pull request as ready for review July 14, 2022 01:51
@eracah eracah changed the title Fix quadratic issue by doing 1 file per flush Fix TB Logger + ObjectStore quadratic complexity issue by doing 1 file per flush Jul 14, 2022
@hanlint (Contributor) left a comment:
I think more context is needed here as to why we need to create a new file for every flush (and also re-initialize the logger). If the issue is network traffic, wouldn't increasing the flush_interval be sufficient? This seems convoluted and counter to how SummaryWriter is supposed to be used?

@ravi-mosaicml (Contributor) left a comment:

Thanks for fixing this! Overall LGTM, though see comments. Curious if you had a chance to test this manually when saving TB traces to S3 buckets, and then having TB stream from S3? Would be good to validate that this doesn't break anything, even when it works locally. Happy to help test this tomorrow.

@ravi-mosaicml (Contributor) replied:

This seems convoluted and counter to how SummaryWriter is supposed to be used?

My guess is that the authors of SummaryWriter assumed tensorboard users would be training and running tensorboard locally, not saving traces to an object store. In that use case, appending to one file is simpler and does not have O(n^2) overhead, since Linux appends do not require rewriting the entire file.
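The asymmetry described above can be made concrete with a little arithmetic: if each flush adds b bytes and the whole file is re-uploaded on every flush, flush k transfers k·b bytes, so n flushes transfer b·n(n+1)/2 bytes in total, versus just b·n with one file per flush. A back-of-the-envelope sketch (the byte counts are illustrative, not measured from the PR):

```python
def reupload_bytes(n_flushes, bytes_per_flush):
    # Single growing file: flush k re-uploads everything written so far,
    # i.e. k * bytes_per_flush, giving a quadratic total.
    return sum(k * bytes_per_flush for k in range(1, n_flushes + 1))

def per_flush_bytes(n_flushes, bytes_per_flush):
    # One new file per flush: each upload carries only the new data.
    return n_flushes * bytes_per_flush

# 1000 flushes of 1 KiB each: ~512 MB uploaded vs ~1 MB of actual log data.
print(reupload_bytes(1000, 1024))
print(per_flush_bytes(1000, 1024))
```

The gap widens linearly with run length, which is why long training runs made the growing-file behavior painful.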

@eracah (Contributor, Author) commented Jul 14, 2022:

Thanks for fixing this! Overall LGTM, though see comments. Curious if you had a chance to test this manually when saving TB traces to S3 buckets, and then having TB stream from S3? Would be good to validate that this doesn't break anything, even when it works locally. Happy to help test this tomorrow.

Works with streaming tensorboard logs from S3!

@eracah eracah requested review from hanlint and ravi-mosaicml July 14, 2022 23:31
@hanlint (Contributor) left a comment:

Code quality looks good to me, but I didn't test this PR myself.

Only nit: did we have functionality where the ObjectStoreLogger would upload every new file in a folder (possibly used by the profiler traces)? That might obviate the need to reference the private _file_name variable.

@ravi-mosaicml (Contributor) left a comment:

LGTM; this is much simpler than before. Thanks! Approving, assuming we tested this locally and when "streaming" TB logs from a bucket.

@ravi-mosaicml ravi-mosaicml merged commit d5e9305 into mosaicml:dev Jul 15, 2022
ravi-mosaicml pushed a commit that referenced this pull request Jul 16, 2022
…e per flush (#1283)

This PR creates and logs to a new file for every flush by calling writer.close(), and names the artifact based on the tfevents file to prevent overwriting on S3 or other object stores.

Logging to a new file for every flush solves the quadratic complexity issue, where the same file was re-sent to S3 as it gradually grew in size.
3 participants