Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

program: fall back from data server to multiplexer #4794

Merged
merged 10 commits into from
Mar 19, 2021

Conversation

wchargin
Copy link
Contributor

Summary:
Under --load_fast=auto, if the data server exits without writing a
port file, we now gracefully degrade to the multiplexer reading path.
This subsumes and expands upon the checks from #4786, with semantics now
defined by the data server as in #4793.

This depends on the --error-file data server flag added in #4793,
which is not yet released. To handle this, we add a mechanism for gating
flags against data server versions. As discussed in #4689, this
raises a concern of what to do when the data server is chosen by a
mechanism other than the Python package. We resolve this by simply
treating such data servers as bleeding-edge. If you link in a data
server yourself, you should use it with a copy of tb-nightly built
from the same Git commit, or your mileage may vary.

See also #4786, which first proposed this functionality and offered some
alternative mechanisms.

Test Plan:
Set your GOOGLE_APPLICATION_CREDENTIALS to an invalid JSON file, and
try running with --load_fast=auto with release v0.4.0 of the data
server and also with latest head (//tensorboard/data/server:install).
Note with --verbosity 0 that with the old server, --error-file is
not passed and the data server has to fall back to anonymous creds,
whereas with the new server, --error-file is passed and TensorBoard
correctly falls back to the multiplexer.

wchargin-branch: data-server-fallback

Summary:
Failure cases when obtaining GCS credentials are now handled in a more
structured manner. Instead of just logging ad hoc warnings, we return
error details to the caller. In particular, this lets the caller
distinguish the case where the user has credentials of a format that we
don’t support, such as service account private keys.

The log messages are nicer, too: in case of a service account key, we
recognize and explain the problem instead of just emitting a JSON error.

Test Plan:
To test each error case, set `GOOGLE_APPLICATION_CREDENTIALS` to:

  - `/enoent` or `/etc/shadow` to test `Unreadable(...)`;
  - `/dev/null` to test `Unparseable(...)`;
  - a service account credentials file to test `Unsupported(...)`.
    (Or, just a valid JSON file without the needed fields.)

Then note that both anonymous access and properly authenticated access
still work.

wchargin-branch: rust-gcs-error-handling
wchargin-source: d133342ff6b388b69f4f2176ef6a74acd32ec695
wchargin-branch: rust-gcs-error-handling
wchargin-source: 7d15db9b71c24fb806fae44b4d7697f13a462be0
Summary:
RustBoard currently writes normal logs to stderr, and also writes any
fatal startup errors to stderr. This makes it impossible to detect and
capture startup errors without also suppressing the log output. As of
this patch, you can pass `--error-file FILE` to ask RustBoard to write
fatal errors to `FILE` instead, so you can leave stderr unencumbered.

In this patch, we also improve error handling for unknown protocols
(like `--logdir wat://mnist`), which complements `--error-file` by
making the startup errors themselves more helpful.

This will be used in `tensorboard(1)` to gracefully fall back to the
legacy multiplexer in case of unsupported invocations: i.e., option (3)
as discussed in #4786.

Test Plan:
Run the data server with `--logdir wat://`, or point to a GCS logdir
while `GOOGLE_APPLICATION_CREDENTIALS` is set to something unreadable.
Note that without `--error-file`, these both continue to write an error
to stderr, and with `--error-file`, they both write to the given file.

wchargin-branch: rust-improve-startup-errors
wchargin-source: d438ff46b6d9b637609b0b4f329b55e0be6e4456
Summary:
Under `--load_fast=auto`, if the data server exits without writing a
port file, we now gracefully degrade to the multiplexer reading path.
This subsumes and expands upon the checks from #4786, with semantics now
defined by the data server as in #PARENT.

This depends on the `--error-file` data server flag added in #PARENT,
which is not yet released. To handle this, we add a mechanism for gating
flags against data server versions. As [discussed in #4689][1], this
raises a concern of what to do when the data server is chosen by a
mechanism other than the Python package. We resolve this by simply
treating such data servers as bleeding-edge. If you link in a data
server yourself, you should use it with a copy of `tb-nightly` built
from the same Git commit, or your mileage may vary.

See also #4786, which first proposed this functionality and offered some
alternative mechanisms.

[1]: #4689 (comment)

Test Plan:
Set your `GOOGLE_APPLICATION_CREDENTIALS` to an invalid JSON file, and
try running with `--load_fast=auto` with release v0.4.0 of the data
server and also with latest head (`//tensorboard/data/server:install`).
Note with `--verbosity 0` that with the old server, `--error-file` is
not passed and the data server has to fall back to anonymous creds,
whereas with the new server, `--error-file` is passed and TensorBoard
correctly falls back to the multiplexer.

wchargin-branch: data-server-fallback
wchargin-source: ac77fa64fc6b10b6dc80f9962ee291e8084dba80
@wchargin wchargin added theme:usability Areas to reduce confusion and frustration. core:rustboard //tensorboard/data/server/... labels Mar 18, 2021
@google-cla google-cla bot added the cla: yes label Mar 18, 2021
@wchargin wchargin requested a review from nfelt March 18, 2021 17:49
wchargin-branch: rust-improve-startup-errors
wchargin-source: 9f81fe49a0039155702737b014d06099f3457185

# Conflicts:
#	tensorboard/data/server/cli/dynamic_logdir.rs
#	tensorboard/data/server/gcs/auth.rs
wchargin-branch: rust-improve-startup-errors
wchargin-source: 9f81fe49a0039155702737b014d06099f3457185
wchargin-branch: data-server-fallback
wchargin-source: 443e7bdabad0dc2c3d4ab6f47c7ada2e04219acf
Base automatically changed from wchargin-rust-improve-startup-errors to master March 18, 2021 22:30
wchargin-branch: data-server-fallback
wchargin-source: 75a5e10c133d532b959b2b30877dd232cf5fdc03

# Conflicts:
#	tensorboard/data/server/cli.rs
#	tensorboard/data/server/cli/dynamic_logdir.rs
wchargin-branch: data-server-fallback
wchargin-source: 75a5e10c133d532b959b2b30877dd232cf5fdc03
ingester.start()
return ingester

if flags.load_fast in ("auto", "true"):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optional, but I find the control flow here to have gotten pretty convoluted. What about just putting server_ingester.get_server_binary(), ingester = server_ingester.SubprocessServerDataIngester(), and ingester.start() all into a single common helper (say start_data_server_and_ingester()) and then splitting out the two cases for "auto" and "true"? It does mean you have a broader try-except block but since the exceptions are typed that doesn't seem like the end of the world, and it would simplify the logic to:

if flags.load_fast == "true":
    try:
        return start_data_server_and_ingester()
    except server_ingester.NoDataServerError as e:
        sys.stderr.write("Option --load_fast=true not available" ... )
        sys.exit(1)
    except server_ingester.DataServerStartupError as e:
        sys.stderr.write(_DATA_SERVER_STARTUP_ERROR_MESSAGE_TEMPLATE % e)
        sys.exit(1)

if flags.load_fast == "auto":
    try:
        return start_data_server_and_ingester()
    except (server_ingester.NoDataServerError, server_ingester.DataServerStartupError) as e:
        logger.info("Data server error: %s; falling back to multiplexer", e)

ingester = local_ingester.LocalDataIngester(flags)
ingester.start()
return ingester

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like it; thanks. Done.

@@ -441,23 +467,6 @@ def _make_server(self):
return self.server_class(app, self.flags)


def _should_use_data_server(load_fast_flag, logdir):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, it seems like we still want this logic if we want to avoid behavior with a data server <0.5.0 regressing to the way it was before #4786, i.e. where auto will try to use the data server even if the logdir is not compatible. I suppose if we see this PR as just supplanting #4786 entirely (such that if you want graceful fallback you need to be on 0.5.0+ rather than having the logic replicated here) that's fine, but just pointing it out.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough. I was intending to supplant, but there’s not much harm in
keeping it for a few more days. Done.

wchargin-branch: data-server-fallback
wchargin-source: 3b4de7071af56ce0d420dfd5e7a8c73e11b3e2c8
@wchargin wchargin merged commit aebedd6 into master Mar 19, 2021
@wchargin wchargin deleted the wchargin-data-server-fallback branch March 19, 2021 19:44
wchargin added a commit that referenced this pull request Mar 22, 2021
Summary:
Cosmetic mistake in #4794: while refactoring around this logic,
I accidentally the line to print the `--load_fast=auto` message.

Test Plan:
Run with `--load_fast auto` and observe the expected `NOTE: ...`.

wchargin-branch: program-advisory-message
wchargin-source: 6e67dd90881dd6038ffee3dac91cbe331d5bfe62
wchargin added a commit that referenced this pull request Mar 23, 2021
Summary:
Cosmetic mistake in #4794: while refactoring around this logic,
I accidentally the line to print the `--load_fast=auto` message.

Test Plan:
Run with `--load_fast auto` and observe the expected `NOTE: ...`.

wchargin-branch: program-advisory-message
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla: yes core:rustboard //tensorboard/data/server/... theme:usability Areas to reduce confusion and frustration.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants