program: fall back from data server to multiplexer #4794

wchargin · 2021-03-18T17:28:02Z

Summary:
Under --load_fast=auto, if the data server exits without writing a
port file, we now gracefully degrade to the multiplexer reading path.
This subsumes and expands upon the checks from #4786, with semantics now
defined by the data server as in #4793.

This depends on the --error-file data server flag added in #4793,
which is not yet released. To handle this, we add a mechanism for gating
flags against data server versions. As discussed in #4689, this
raises a concern of what to do when the data server is chosen by a
mechanism other than the Python package. We resolve this by simply
treating such data servers as bleeding-edge. If you link in a data
server yourself, you should use it with a copy of tb-nightly built
from the same Git commit, or your mileage may vary.
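
A minimal sketch of how such version gating might look. The table name, the helper names, and the 0.5.0 threshold are assumptions for illustration (the threshold is taken from the review discussion of "0.5.0+"); this is not the code in this PR.

```python
import re

# Hypothetical table: flag name -> minimum data server release accepting it.
# The 0.5.0 threshold for --error-file is an assumption for illustration.
_MIN_VERSION_FOR_FLAG = {"--error-file": (0, 5, 0)}


def parse_version(version):
    """Parses a string like "0.4.0" into a tuple; returns None if unknown."""
    if version is None:
        return None
    match = re.match(r"^(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def supports_flag(server_version, flag):
    """Returns True if a server of the given version should accept `flag`."""
    parsed = parse_version(server_version)
    if parsed is None:
        # Server not chosen via the Python package (or version unreadable):
        # treat it as bleeding-edge and pass every flag.
        return True
    return parsed >= _MIN_VERSION_FOR_FLAG.get(flag, (0, 0, 0))
```

With this shape, a caller would append `--error-file` to the data server's arguments only when `supports_flag(version, "--error-file")` is true, which matches the old-server versus new-server behavior described in the test plan.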

See also #4786, which first proposed this functionality and offered some
alternative mechanisms.

Test Plan:
Set your GOOGLE_APPLICATION_CREDENTIALS to an invalid JSON file, and
try running with --load_fast=auto with release v0.4.0 of the data
server and also with latest head (//tensorboard/data/server:install).
Note with --verbosity 0 that with the old server, --error-file is
not passed and the data server has to fall back to anonymous creds,
whereas with the new server, --error-file is passed and TensorBoard
correctly falls back to the multiplexer.
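
The startup handshake described above could be sketched roughly as follows: poll for the port file, and if the child exits first, surface whatever it wrote to the error file. The function name, parameters, and timeout values here are hypothetical, and the real ingester may differ in detail.

```python
import os
import time


class DataServerStartupError(RuntimeError):
    """Raised when the data server exits before announcing a port."""


def wait_for_port(process, port_file, error_file, timeout=10.0, poll=0.05):
    """Returns the port announced by `process`, or raises if it exits first.

    `process` is anything with `poll()` and `returncode`, such as a
    `subprocess.Popen` object.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        # Check for exit *before* reading the port file, so that a server
        # that wrote its port and then exited is still reported correctly.
        exited = process.poll() is not None
        if os.path.exists(port_file):
            with open(port_file) as f:
                text = f.read().strip()
            if text:
                return int(text)
        if exited:
            message = ""
            if os.path.exists(error_file):
                with open(error_file) as f:
                    message = f.read().strip()
            raise DataServerStartupError(
                message or "exited with status %d" % process.returncode
            )
        time.sleep(poll)
    raise DataServerStartupError("timed out waiting for port file")
```

Under `--load_fast=auto`, the caller would catch `DataServerStartupError` and start the multiplexer instead of exiting.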

wchargin-branch: data-server-fallback

Summary:
Failure cases when obtaining GCS credentials are now handled in a more structured manner. Instead of just logging ad hoc warnings, we return error details to the caller. In particular, this lets the caller distinguish the case where the user has credentials of a format that we don’t support, such as service account private keys.

The log messages are nicer, too: in case of a service account key, we recognize and explain the problem instead of just emitting a JSON error.

Test Plan:
To test each error case, set `GOOGLE_APPLICATION_CREDENTIALS` to:

- `/enoent` or `/etc/shadow` to test `Unreadable(...)`;
- `/dev/null` to test `Unparseable(...)`;
- a service account credentials file to test `Unsupported(...)`. (Or, just a valid JSON file without the needed fields.)

Then note that both anonymous access and properly authenticated access still work.

wchargin-branch: rust-gcs-error-handling
wchargin-source: d133342ff6b388b69f4f2176ef6a74acd32ec695


Summary:
RustBoard currently writes normal logs to stderr, and also writes any fatal startup errors to stderr. This makes it impossible to detect and capture startup errors without also suppressing the log output. As of this patch, you can pass `--error-file FILE` to ask RustBoard to write fatal errors to `FILE` instead, so you can leave stderr unencumbered.

In this patch, we also improve error handling for unknown protocols (like `--logdir wat://mnist`), which complements `--error-file` by making the startup errors themselves more helpful.

This will be used in `tensorboard(1)` to gracefully fall back to the legacy multiplexer in case of unsupported invocations: i.e., option (3) as discussed in #4786.

Test Plan:
Run the data server with `--logdir wat://`, or point to a GCS logdir while `GOOGLE_APPLICATION_CREDENTIALS` is set to something unreadable. Note that without `--error-file`, these both continue to write an error to stderr, and with `--error-file`, they both write to the given file.

wchargin-branch: rust-improve-startup-errors
wchargin-source: d438ff46b6d9b637609b0b4f329b55e0be6e4456


nfelt · 2021-03-19T15:43:47Z

tensorboard/program.py

+            ingester.start()
+            return ingester
+
+        if flags.load_fast in ("auto", "true"):


Optional, but I find the control flow here to have gotten pretty convoluted. What about just putting `server_ingester.get_server_binary()`, `ingester = server_ingester.SubprocessServerDataIngester()`, and `ingester.start()` all into a single common helper (say `start_data_server_and_ingester()`) and then splitting out the two cases for `"auto"` and `"true"`? It does mean you have a broader try-except block, but since the exceptions are typed that doesn't seem like the end of the world, and it would simplify the logic to:

    if flags.load_fast == "true":
        try:
            return start_data_server_and_ingester()
        except server_ingester.NoDataServerError as e:
            sys.stderr.write("Option --load_fast=true not available" ... )
            sys.exit(1)
        except server_ingester.DataServerStartupError as e:
            sys.stderr.write(_DATA_SERVER_STARTUP_ERROR_MESSAGE_TEMPLATE % e)
            sys.exit(1)
    if flags.load_fast == "auto":
        try:
            return start_data_server_and_ingester()
        except (
            server_ingester.NoDataServerError,
            server_ingester.DataServerStartupError,
        ) as e:
            logger.info("Data server error: %s; falling back to multiplexer", e)
    ingester = local_ingester.LocalDataIngester(flags)
    ingester.start()
    return ingester

I like it; thanks. Done.
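
One side benefit of this shape is that the branching can be exercised without a real data server. The following is a hedged, standalone sketch of the same control flow with the two start-up paths injected as callables; `make_ingester`, its parameters, and the locally defined exception classes are hypothetical stand-ins, not names from `tensorboard/program.py`.

```python
import logging
import sys

logger = logging.getLogger(__name__)


class NoDataServerError(RuntimeError):
    """Stand-in for server_ingester.NoDataServerError."""


class DataServerStartupError(RuntimeError):
    """Stand-in for server_ingester.DataServerStartupError."""


def make_ingester(load_fast, start_data_server, start_multiplexer):
    """Chooses and starts an ingester according to the --load_fast value."""
    if load_fast == "true":
        # The user explicitly asked for the data server: failures are fatal.
        try:
            return start_data_server()
        except (NoDataServerError, DataServerStartupError) as e:
            sys.stderr.write("Data server unavailable: %s\n" % e)
            sys.exit(1)
    if load_fast == "auto":
        # Best effort: log and fall through to the multiplexer on failure.
        try:
            return start_data_server()
        except (NoDataServerError, DataServerStartupError) as e:
            logger.info("Data server error: %s; falling back to multiplexer", e)
    return start_multiplexer()
```

A test can then pass a callable that raises `DataServerStartupError` and check that `"auto"` returns the multiplexer ingester while `"true"` exits with status 1.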

nfelt · 2021-03-19T15:48:40Z

tensorboard/program.py

@@ -441,23 +467,6 @@ def _make_server(self):
        return self.server_class(app, self.flags)


-def _should_use_data_server(load_fast_flag, logdir):


FWIW, it seems like we still want this logic if we want to avoid behavior with a data server <0.5.0 regressing to the way it was before #4786, i.e. where auto will try to use the data server even if the logdir is not compatible. I suppose if we see this PR as just supplanting #4786 entirely (such that if you want graceful fallback you need to be on 0.5.0+ rather than having the logic replicated here) that's fine, but just pointing it out.

Fair enough. I was intending to supplant, but there’s not much harm in
keeping it for a few more days. Done.
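
For readers without the diff at hand, a pre-check of this kind might look roughly like the sketch below. Only the two-argument signature comes from the diff context above; the body, and in particular the set of schemes treated as compatible, is an assumption made for illustration.

```python
def should_use_data_server(load_fast_flag, logdir):
    """Hypothetical pre-check: should we even try starting the data server?"""
    if load_fast_flag == "true":
        return True
    if load_fast_flag == "false":
        return False
    # load_fast_flag == "auto": only try when the logdir looks compatible.
    if "://" not in logdir:
        # No URL scheme: assume a local filesystem path.
        return True
    # Assumption: GCS is the only remote scheme the data server handles.
    return logdir.startswith("gs://")
```

Keeping such a check on the Python side means that a data server older than 0.5.0, which cannot report the problem through `--error-file`, is never started on a logdir it cannot read.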


Summary: Cosmetic mistake in #4794: while refactoring around this logic, I accidentally dropped the line that prints the `--load_fast=auto` message. Test Plan: Run with `--load_fast auto` and observe the expected `NOTE: ...`. wchargin-branch: program-advisory-message wchargin-source: 6e67dd90881dd6038ffee3dac91cbe331d5bfe62

wchargin added 4 commits March 17, 2021 17:00

[rust-gcs-error-handling: update diffbase]

29b7555

wchargin-branch: rust-gcs-error-handling wchargin-source: 7d15db9b71c24fb806fae44b4d7697f13a462be0

wchargin added the theme:usability (Areas to reduce confusion and frustration.) and core:rustboard (//tensorboard/data/server/...) labels Mar 18, 2021

google-cla bot added the cla: yes label Mar 18, 2021

wchargin requested a review from nfelt March 18, 2021 17:49

wchargin mentioned this pull request Mar 18, 2021

rust: improve GCS credential error handling #4788

Merged

wchargin added 3 commits March 18, 2021 12:47

[rust-improve-startup-errors: update diffbase]

3884c83

wchargin-branch: rust-improve-startup-errors wchargin-source: 9f81fe49a0039155702737b014d06099f3457185 # Conflicts: # tensorboard/data/server/cli/dynamic_logdir.rs # tensorboard/data/server/gcs/auth.rs

[rust-improve-startup-errors: resolve conflicts]

593d46a

wchargin-branch: rust-improve-startup-errors wchargin-source: 9f81fe49a0039155702737b014d06099f3457185

[data-server-fallback: update diffbase]

e6e0abc

wchargin-branch: data-server-fallback wchargin-source: 443e7bdabad0dc2c3d4ab6f47c7ada2e04219acf

Base automatically changed from wchargin-rust-improve-startup-errors to master March 18, 2021 22:30

wchargin added 2 commits March 18, 2021 17:19

[data-server-fallback: update diffbase]

d444064

wchargin-branch: data-server-fallback wchargin-source: 75a5e10c133d532b959b2b30877dd232cf5fdc03 # Conflicts: # tensorboard/data/server/cli.rs # tensorboard/data/server/cli/dynamic_logdir.rs

[data-server-fallback: resolve conflicts]

65db5cc

wchargin-branch: data-server-fallback wchargin-source: 75a5e10c133d532b959b2b30877dd232cf5fdc03

nfelt approved these changes Mar 19, 2021


This was referenced Mar 19, 2021

Explicitly indicate --logdir_spec/--load_fast mismatch #4802

Open

Fast data loading feedback (--load_fast=true; “RustBoard”) #4784

Open

[data-server-fallback: address review comments]

3867367

wchargin-branch: data-server-fallback wchargin-source: 3b4de7071af56ce0d420dfd5e7a8c73e11b3e2c8

nfelt approved these changes Mar 19, 2021


wchargin merged commit aebedd6 into master Mar 19, 2021

wchargin deleted the wchargin-data-server-fallback branch March 19, 2021 19:44

wchargin mentioned this pull request Mar 22, 2021

program: fix data server advisory message #4809

Merged
