-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
program: fall back from data server to multiplexer #4794
Conversation
Summary: Failure cases when obtaining GCS credentials are now handled in a more structured manner. Instead of just logging ad hoc warnings, we return error details to the caller. In particular, this lets the caller distinguish the case where the user has credentials of a format that we don’t support, such as service account private keys. The log messages are nicer, too: in case of a service account key, we recognize and explain the problem instead of just emitting a JSON error. Test Plan: To test each error case, set `GOOGLE_APPLICATION_CREDENTIALS` to: - `/enoent` or `/etc/shadow` to test `Unreadable(...)`; - `/dev/null` to test `Unparseable(...)`; - a service account credentials file to test `Unsupported(...)`. (Or, just a valid JSON file without the needed fields.) Then note that both anonymous access and properly authenticated access still work. wchargin-branch: rust-gcs-error-handling wchargin-source: d133342ff6b388b69f4f2176ef6a74acd32ec695
wchargin-branch: rust-gcs-error-handling wchargin-source: 7d15db9b71c24fb806fae44b4d7697f13a462be0
Summary: RustBoard currently writes normal logs to stderr, and also writes any fatal startup errors to stderr. This makes it impossible to detect and capture startup errors without also suppressing the log output. As of this patch, you can pass `--error-file FILE` to ask RustBoard to write fatal errors to `FILE` instead, so you can leave stderr unencumbered. In this patch, we also improve error handling for unknown protocols (like `--logdir wat://mnist`), which complements `--error-file` by making the startup errors themselves more helpful. This will be used in `tensorboard(1)` to gracefully fall back to the legacy multiplexer in case of unsupported invocations: i.e., option (3) as discussed in #4786. Test Plan: Run the data server with `--logdir wat://`, or point to a GCS logdir while `GOOGLE_APPLICATION_CREDENTIALS` is set to something unreadable. Note that without `--error-file`, these both continue to write an error to stderr, and with `--error-file`, they both write to the given file. wchargin-branch: rust-improve-startup-errors wchargin-source: d438ff46b6d9b637609b0b4f329b55e0be6e4456
Summary: Under `--load_fast=auto`, if the data server exits without writing a port file, we now gracefully degrade to the multiplexer reading path. This subsumes and expands upon the checks from #4786, with semantics now defined by the data server as in #PARENT. This depends on the `--error-file` data server flag added in #PARENT, which is not yet released. To handle this, we add a mechanism for gating flags against data server versions. As [discussed in #4689][1], this raises a concern of what to do when the data server is chosen by a mechanism other than the Python package. We resolve this by simply treating such data servers as bleeding-edge. If you link in a data server yourself, you should use it with a copy of `tb-nightly` built from the same Git commit, or your mileage may vary. See also #4786, which first proposed this functionality and offered some alternative mechanisms. [1]: #4689 (comment) Test Plan: Set your `GOOGLE_APPLICATION_CREDENTIALS` to an invalid JSON file, and try running with `--load_fast=auto` with release v0.4.0 of the data server and also with latest head (`//tensorboard/data/server:install`). Note with `--verbosity 0` that with the old server, `--error-file` is not passed and the data server has to fall back to anonymous creds, whereas with the new server, `--error-file` is passed and TensorBoard correctly falls back to the multiplexer. wchargin-branch: data-server-fallback wchargin-source: ac77fa64fc6b10b6dc80f9962ee291e8084dba80
wchargin-branch: rust-improve-startup-errors wchargin-source: 9f81fe49a0039155702737b014d06099f3457185 # Conflicts: # tensorboard/data/server/cli/dynamic_logdir.rs # tensorboard/data/server/gcs/auth.rs
wchargin-branch: rust-improve-startup-errors wchargin-source: 9f81fe49a0039155702737b014d06099f3457185
wchargin-branch: data-server-fallback wchargin-source: 443e7bdabad0dc2c3d4ab6f47c7ada2e04219acf
wchargin-branch: data-server-fallback wchargin-source: 75a5e10c133d532b959b2b30877dd232cf5fdc03 # Conflicts: # tensorboard/data/server/cli.rs # tensorboard/data/server/cli/dynamic_logdir.rs
wchargin-branch: data-server-fallback wchargin-source: 75a5e10c133d532b959b2b30877dd232cf5fdc03
tensorboard/program.py
Outdated
ingester.start() | ||
return ingester | ||
|
||
if flags.load_fast in ("auto", "true"): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Optional, but I find the control flow here to have gotten pretty convoluted. What about just putting server_ingester.get_server_binary()
, ingester = server_ingester.SubprocessServerDataIngester()
, and ingester.start()
all into a single common helper (say start_data_server_and_ingester()
) and then splitting out the two cases for "auto" and "true"? It does mean you have a broader try-except block but since the exceptions are typed that doesn't seem like the end of the world, and it would simplify the logic to:
if flags.load_fast == "true":
try:
return start_data_server_and_ingester()
except server_ingester.NoDataServerError as e:
sys.stderr.write("Option --load_fast=true not available" ... )
sys.exit(1)
except server_ingester.DataServerStartupError as e:
sys.stderr.write(_DATA_SERVER_STARTUP_ERROR_MESSAGE_TEMPLATE % e)
sys.exit(1)
if flags.load_fast == "auto":
try:
return start_data_server_and_ingester()
except (server_ingester.NoDataServerError, server_ingester.DataServerStartupError) as e:
logger.info("Data server error: %s; falling back to multiplexer", e)
ingester = local_ingester.LocalDataIngester(flags)
ingester.start()
return ingester
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like it; thanks. Done.
@@ -441,23 +467,6 @@ def _make_server(self): | |||
return self.server_class(app, self.flags) | |||
|
|||
|
|||
def _should_use_data_server(load_fast_flag, logdir): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW, it seems like we still want this logic if we want to avoid behavior with a data server <0.5.0 regressing to the way it was before #4786, i.e. where auto
will try to use the data server even if the logdir is not compatible. I suppose if we see this PR as just supplanting #4786 entirely (such that if you want graceful fallback you need to be on 0.5.0+ rather than having the logic replicated here) that's fine, but just pointing it out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair enough. I was intending to supplant, but there’s not much harm in
keeping it for a few more days. Done.
wchargin-branch: data-server-fallback wchargin-source: 3b4de7071af56ce0d420dfd5e7a8c73e11b3e2c8
Summary: Cosmetic mistake in #4794: while refactoring around this logic, I accidentally the line to print the `--load_fast=auto` message. Test Plan: Run with `--load_fast auto` and observe the expected `NOTE: ...`. wchargin-branch: program-advisory-message wchargin-source: 6e67dd90881dd6038ffee3dac91cbe331d5bfe62
Summary: Cosmetic mistake in #4794: while refactoring around this logic, I accidentally the line to print the `--load_fast=auto` message. Test Plan: Run with `--load_fast auto` and observe the expected `NOTE: ...`. wchargin-branch: program-advisory-message
Summary:
Under
--load_fast=auto
, if the data server exits without writing aport file, we now gracefully degrade to the multiplexer reading path.
This subsumes and expands upon the checks from #4786, with semantics now
defined by the data server as in #4793.
This depends on the
--error-file
data server flag added in #4793,which is not yet released. To handle this, we add a mechanism for gating
flags against data server versions. As discussed in #4689, this
raises a concern of what to do when the data server is chosen by a
mechanism other than the Python package. We resolve this by simply
treating such data servers as bleeding-edge. If you link in a data
server yourself, you should use it with a copy of
tb-nightly
builtfrom the same Git commit, or your mileage may vary.
See also #4786, which first proposed this functionality and offered some
alternative mechanisms.
Test Plan:
Set your
GOOGLE_APPLICATION_CREDENTIALS
to an invalid JSON file, andtry running with
--load_fast=auto
with release v0.4.0 of the dataserver and also with latest head (
//tensorboard/data/server:install
).Note with
--verbosity 0
that with the old server,--error-file
isnot passed and the data server has to fall back to anonymous creds,
whereas with the new server,
--error-file
is passed and TensorBoardcorrectly falls back to the multiplexer.
wchargin-branch: data-server-fallback