
Nextstrain.org remote #141

Merged
merged 11 commits from trs/remote/nextstrain.org on Feb 28, 2022

Conversation

@tsibley tsibley (Member) commented Jan 25, 2022

Builds upon #140 to add nextstrain.org support to the nextstrain remote family of commands.

Kicking the tires

Works well in my extensive local testing against nextstrain/nextstrain.org#452. To do so yourself, run the nextstrain.org server from that PR:

# in checkout of nextstrain/nextstrain.org repo
git switch restful-api-rw
npm run build
node server.js --verbose

and then in another shell, before running nextstrain remote commands from this PR, make sure to:

export NEXTSTRAIN_DOT_ORG=http://localhost:5000

As confirmation, you should see localhost:5000 in the output of nextstrain remote ls nextstrain.org, e.g.:

$ nextstrain remote ls nextstrain.org
http://localhost:5000/dengue/denv1
http://localhost:5000/dengue/denv2
http://localhost:5000/dengue/denv3
http://localhost:5000/dengue/denv4
http://localhost:5000/dengue/all
http://localhost:5000/ebola
…

@tsibley tsibley requested a review from a team January 25, 2022 20:38
@victorlin victorlin (Member) commented Jan 26, 2022

Diff between this and #140 for easier review: 6d33886...trs/remote/nextstrain.org

These functions are intended to be used by other commands internally and
stderr avoids polluting those commands' stdout.

Renewal happens automatically when needed, but it can be useful to renew
early to, for example, refresh group membership information.

Missed doing this as part of the behaviour change from post-action
return to pre-action yield in "remote.s3: Rework delete() to have a
similar API to the other commands" (cc86a1a).

Using Path is a bit of a fib and means unnecessarily sprinkling Path()
calls around.

A future improvement might be defining a new RemotePath type for this
use, if it'd help with interop between RemoteModule implementations and
the remote commands.  For just interpretability, it didn't seem worth
it.

This helps with some kinds of messages where the interpolation expands
to multiple lines that would otherwise interfere with dedent.

I ran into a couple of new asserts during testing and realized handling
them centrally would be ~easy and a better UX.  If we're doing a good
job, this is unlikely to be seen by users, but no one's perfect.

To be used with API requests to nextstrain.org.  Exposed as an
Authorization header value instead of the bare id token on the principle
of isolating and centralizing code which directly handles tokens.

Uses the new RESTful API for download (GET), upload (PUT), delete
(DELETE) operations and the Charon API ("getAvailable") for listing.
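
For illustration, roughly what those HTTP interactions look like from
Python using requests. The dataset path and token-handling helper here
are stand-ins for the example, not this PR's actual code:

import os
import requests

def authorization() -> str:
    # Stand-in for the CLI's real token handling, which exposes the id
    # token pre-packaged as an Authorization header value.
    return f"Bearer {os.environ['EXAMPLE_ID_TOKEN']}"

base = os.environ.get("NEXTSTRAIN_DOT_ORG", "https://nextstrain.org")
dataset = f"{base}/groups/example/my-dataset"   # hypothetical dataset
media_type = "application/vnd.nextstrain.dataset.main+json"

# List available datasets via the Charon API.
available = requests.get(f"{base}/charon/getAvailable",
                         headers = {"Authorization": authorization()}).json()

# Download (GET), upload (PUT), and delete (DELETE) the dataset's main
# JSON via the RESTful API, selecting the representation by media type.
response = requests.get(dataset, headers = {"Accept": media_type,
                                            "Authorization": authorization()})

requests.put(dataset, data = response.content,
             headers = {"Content-Type": media_type,
                        "Authorization": authorization()})

requests.delete(dataset, headers = {"Authorization": authorization()})
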
Moves information on our two supported remotes to separate doc pages
since we can't reasonably stuff everything into --help.

As a future improvement, I'd love to see a way to read this
documentation at the CLI without having to open a web page or have
internet access, e.g. a new command like `nextstrain help remotes/s3` or
extending the --help-all pattern to `nextstrain remote` to include
description and not just options.
@tsibley tsibley force-pushed the trs/remote/nextstrain.org branch from 9e5112f to a197ac4 on January 27, 2022 06:23
@victorlin victorlin (Member) left a comment

Looks good to me with some basic testing! Learned a few things while looking over this too 🙂

Let me know if there's anything specific you'd like me to test locally.

@tsibley tsibley (Member, Author) commented Jan 29, 2022

@victorlin Thanks for giving it a whirl! Aside from "does it work?" testing, I think the next thing to focus on is how the command behaviour/UX feels. Does it feel right? Is there anything that feels off or that you reached for but didn't work/exist?

Slightly tangentially, one missing thing I'd like to add support for (and which I think @jameshadfield also wants) is specifying a constant filename prefix for local files when doing download --recursively. Currently that mode only supports specifying a local directory, but I think I know how to extend that reasonably.

@tsibley tsibley (Member, Author) commented Feb 1, 2022

My plan is to wait to merge this until nextstrain/nextstrain.org#452 gets deployed, as otherwise it won't work as advertised.

@joverlee521 joverlee521 (Contributor) left a comment

Very cool to see this work so well! I mostly left comments about the general "feel" of the commands, so nothing blocking.

A couple general comments I have:

  1. It would be nice to have a --dry-run option for these commands.
  2. Should user credentials be renewed before running each command?
    Currently, if I remove myself from a Cognito group, I can still run commands for the group as long as I don't renew my local tokens.

default_sidecars = ["root-sequence", "tip-frequencies"]

self.subresources = [
    SubResource("application/vnd.nextstrain.dataset.main+json", ".json", primary = True),
Contributor:

It makes me a little nervous to see all of the hardcoded media types throughout this file...can we house these in a dictionary or enum class?

Member Author:

Maybe? My thinking here was that media types are the kind of constants where the value is the name and vice versa, like, say, HTTP methods. That is, they're different from constants where the value is not synonymous with the name.

For this type of self-describing constant, I think they're more clearly used directly rather than indirectly behind a translation layer (dict/enum) that gives them some other name. My thinking goes: why have multiple names, varying across codebases, for a thing when you can have one that's consistent across codebases?

One benefit of indirection with an enum (but not dict) is that it would let the type checker tell us if we made a typo, but that feels like a marginal gain not worth the indirection to me. (The code will also likely fail to work with a typo in the media type, so would be noticeable.)

Contributor:

For this type of self-describing constant, I think they're more clearly used directly rather than indirectly behind a translation layer (dict/enum) that gives them some other name

That's fair. I just personally have a hard time remembering long strings 😆 I can always refer to the Media Types docs.
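
For reference, the enum alternative being weighed above might look
something like this sketch (only the first media type appears verbatim
in this PR; the sidecar values are guessed from the same naming pattern):

from enum import Enum

class MediaType(Enum):
    DATASET_MAIN    = "application/vnd.nextstrain.dataset.main+json"
    ROOT_SEQUENCE   = "application/vnd.nextstrain.dataset.root-sequence+json"
    TIP_FREQUENCIES = "application/vnd.nextstrain.dataset.tip-frequencies+json"

# A typo'd member like MediaType.DATASET_MIAN fails loudly (flagged by
# the type checker, AttributeError at runtime), unlike a typo'd string.
assert MediaType.DATASET_MAIN.value.endswith("+json")
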

# conventions, we might want to identify files using partial content
# inspection (e.g. does the file contain the expected top-level keys
# for a given sidecar type?) or full schema validation (e.g. try to
# validate all schemas and see which passes).
Contributor:

Full schema validation here would be really cool! It would be nice to give users feedback that their dataset will not be visualized properly because their JSON doesn't match our schema.

Member Author:

Agreed! Part of why I gave our existing schemas proper, resolvable ids (e.g. https://nextstrain.org/schemas/dataset/v2) is so we could leverage those here eventually. :-)

Member Author:

(I'd also like to do server-side validation.)
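
As a sketch of what client-side detection via those resolvable ids could
look like (the fetch-and-validate flow is illustrative, and the schema
draft used by the validator is an assumption):

import json
import requests
from jsonschema import Draft6Validator

SCHEMA_URL = "https://nextstrain.org/schemas/dataset/v2"

def looks_like_dataset(path: str) -> bool:
    # Validate the file against the dataset (v2) schema instead of
    # relying on filename conventions.
    schema = requests.get(SCHEMA_URL).json()
    with open(path) as file:
        document = json.load(file)
    return Draft6Validator(schema).is_valid(document)
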

Comment on lines +176 to +186
if unknowns:
    # XXX TODO: Handle group-logo.png and group-overview.md by making
    # appropriate API requests to the group endpoint (which doesn't yet
    # exist).
    #   -trs, 23 Sept 2021
    raise UserError(f"""
        Only datasets (v2) and narratives are currently supported for
        upload to nextstrain.org, but other files were given:

        {{files}}
        """, files = file_list(unknowns))
Contributor:

I wonder if it would be more user friendly to upload all valid files but print out a warning that the unknowns were excluded from the upload instead of raising an error.

I'm thinking of the use case where I want to upload everything within a data/ directory, but I don't want to figure out the glob pattern to exclude the bad files since I'm not as command line savvy.

Member Author:

Nod. That's fair. This might be too cautious. My thinking was the presence of unexpected files indicates the issue might be larger than just a few skip-able files, e.g. imagining a data/* glob picking up a README.md that we think is narrative because we don't inspect contents or picking up an auspice_config.json that we think is a dataset (again because we don't inspect contents). So, I thought, better to abort and ask the user to be more specific, than proceed and potentially leave the user with a mess to clean up? 🤷‍♂️

Contributor:

Uploading unexpected files technically wouldn't "break" anything, which is why I thought this is overly cautious.

I do agree it's not nice to leave a mess for the user to clean up. This makes me think it's insufficient that we only catch this problem sometimes. Maybe we should check this systematically and not just rely on the coincidence that they included an unknown file.

It would be nice to get confirmation from the user when uploading files. This can be a list of the datasets and narratives or even just a summary of "Please confirm X datasets and X narratives will be uploaded". We can always add a -f option to allow users to bypass this confirmation step.

Member Author:

It wouldn't "break" things in the sense of "this didn't work", but it breaks things in the sense that there'd be garbage datasets/narratives listed on the Groups page, private info might be leaked, etc. And I think it breaks an expectation that these commands won't let you upload just any old files.

I agree the detection can be more systematic, hence my comments about using our schemas to do detection pre-upload instead of naive filename matching.

Confirmation is a good suggestion; I've updated #149 to describe a --prompt (and --no-prompt) mode because they're both related to --dry-run.
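
A possible shape for that confirmation step, as a sketch (the flag name
models the hypothetical -f/--no-prompt bypass, not a decided design):

def confirm_upload(datasets, narratives, no_prompt: bool = False) -> bool:
    # Summarize the plan, then ask before proceeding.
    if no_prompt:
        return True
    print(f"Please confirm: {len(datasets)} dataset(s) and "
          f"{len(narratives)} narrative(s) will be uploaded.")
    return input("Proceed? [y/N] ").strip().lower() == "y"
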

Comment on lines +132 to +135
# I plan to extend the /charon/getAvailable API endpoint (or maybe
# switch to a new endpoint) in the future to include the "sidecars"
# field listing the available sidecars for each dataset, so that
# this code only has to try to fetch what is reported to exist.
Contributor:

This would also be helpful for nextstrain remote ls to add an option to list all files for a dataset (similar to how the remote list for s3 lists all objects in the bucket).

Member Author:

Yep. I imagined ls output to maybe initially surface this like:

$ nextstrain remote ls …
…/beta-cov
…/ncov/19B (tip-frequencies, root-sequence)
…/ncov/19B/2021-02-11 (tip-frequencies, root-sequence)
…/zika (tip-frequencies)

similar to how download does.

But there's also the possibility of jumping right to something fancier, e.g. with rich, as well as doing something machine-readable, e.g. JSON or YAML output mode.

Comment on lines +262 to +276
# If we're recursively downloading under a path
# which is itself a dataset that exists, e.g.
# `nextstrain remote download -r a/b` when the
# following datasets exist:
#
# a/b
# a/b/x
# a/b/y
# a/b/z
#
# then we need to include a little of the given
# path in the local filenames, e.g. strip only "a",
# leaving "b". Otherwise, e.g. if a/b doesn't
# exist itself only a/b/x, etc, then we can strip
# "a/b" entirely.
Contributor:

Hmm, this behavior was very unexpected for me when I was testing the recursive download.

Flu datasets might be a very specific use case:

  1. run remote download -r groups/test/flu /tmp/testing
  2. run remote download -r groups/test/flu/seasonal /tmp/testing

I expected the second command to just overwrite the existing files within /tmp/testing. Instead I got two sets of the same files but with slightly different names because the prefixes were stripped.

Member Author:

Well, "very unexpected" is not what we want! 😂 It's a good point that it's maybe a bit weird that downloading the same datasets two different ways produces different local filenames. I wonder if that situation is likely to crop up in typical usage as opposed to testing? I'd liked keeping filenames shorter than longer (e.g. instead of using the whole URL path) and picking a place to shorten them at that was under the user's control rather than picked/hardcoded by us. But maybe it's too confusing.

There's lots of brainstorming in my notes about possible behaviours here, and there were one or two decent discussions in Slack a while back, but nothing very conclusive. Maybe I should go back and summarize it all in writing and bring it back for wider consideration as part of (or separate from) this PR? Definitely would appreciate more input here.

Contributor:

(Ah, I just realized that I completely missed the flu example in the download help docs)

I actually prefer the opposite: keep the whole URL path so that the user can see exactly what they downloaded without having to inspect the file. They can easily rename the files locally.

Another problem is overwriting since the path prefixes are used to scope/filter downloads.
If I want to download H3N2 and H1N1pdm builds to the same directory and I run

  1. download nextstrain.org/flu/seasonal/h3n2/
  2. download nextstrain.org/flu/seasonal/h1n1pdm/

The second command would just overwrite my first download 😞

Contributor:

(Relevant Slack discussion from a while back)

Member Author:

I actually prefer the opposite: keep the whole URL path so that the user can see exactly what they downloaded without having to inspect the file.

I see the appeal of keeping the whole URL (though I think doing the / → _ transformation on the whole URL is weird), but I guess I don't get why someone wouldn't know what they just downloaded.

They can easily rename the files locally.

For a few files, yes. I'm not so confident in this being a reasonable expectation if someone does nextstrain remote download -r nextstrain.org/ncov (or worse, for flu) and ends up with 42 files (2 databases (GISAID, Open) × 7 regions × 3 files (main, root-sequence, tip-frequencies)).

One of the reasons for download using the shortest local filenames that are still distinguishing is that it helps play nice with upload when handling multiple datasets. Although upload doesn't care about local filename if uploading a single dataset, the filename still matters for multiple uploads and becomes a suffix on the given destination URL.

I'm imagining that a download + upload combo would be used for copying files between Groups, but maybe that would be less common than "download, do secondary analyses, never upload back", for which you'd tend to want longer more descriptive filenames. (The primary usage is still "run a build, upload files", and download doesn't matter for that.)

Member Author:

Another problem is overwriting since the path prefixes are used to scope/filter downloads. If I want to download H3N2 and H1N1pdm builds to the same directory and I run

1. download nextstrain.org/flu/seasonal/h3n2/

2. download nextstrain.org/flu/seasonal/h1n1pdm/

The second command would just overwrite my first download

Yeah, this is not ideal. The workaround is downloading each into a specific directory (or, in the nearish future, giving a filename prefix instead of a dir).

This comes back to the Slack discussion from the fall, where you pointed out that providing a toggle between two different ways of naming local files may be best. What is better for the "download then re-upload" workflow is not better for the "download and analyze" workflow and vice versa.

@tsibley tsibley (Member, Author) commented Feb 11, 2022

  1. It would be nice to have a --dry-run option for these commands.

Agreed. I'll make a new issue for that!

  2. Should user credentials be renewed before running each command?
    Currently, if I remove myself from a Cognito group, I can still run commands for the group as long as I don't renew my local tokens.

Maybe? This staleness is expected and part of Cognito's authn model because it issues JWTs. We can tune the maximum duration of the staleness, however. The access and id tokens are currently valid for an hour after creation and the refresh token for 30 days, which means there's up to an hour of staleness baked into an id token. We can tune it as low as 5 minutes, although this would mean very frequent background renewals.

The server side will still accept a valid token even if the CLI forces renewal before every command, so forcing a renewal keeps honest users honest ("oh, I guess I can't do that anymore") but doesn't prevent savvy/bad users from manually using the pre-renewal token ("oh, you say I can't do that anymore? I'll just use this other token that's still valid and says I can!").

We actually have a longer staleness issue for web sessions right now because we don't check Cognito for updates once a session is established and we don't force periodic reauth. We should reconsider some choices there and fix that as well.

I think the more comprehensive way to deal with this staleness is to track on the server side when things like group membership changes happen and record a timestamp that says "don't accept tokens created before X for user Y", e.g. implement a basic authn revocation system.
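
The core of such a check might be as simple as this sketch (illustrative
Python, although the real server is Node; iat and sub are standard JWT
claims):

def token_acceptable(claims: dict, not_before_by_user: dict) -> bool:
    # Reject tokens issued before the user's recorded "not before"
    # timestamp, which would be bumped whenever e.g. group membership
    # changes.
    issued_at = claims["iat"]                       # seconds since epoch
    not_before = not_before_by_user.get(claims["sub"])
    return not_before is None or issued_at >= not_before
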

Other than tweaking the token lifetimes, which is an ~easy knob, I think we can make the other improvements as part of further work to manage group memberships/roles within the nextstrain.org UI and API.

@tsibley tsibley (Member, Author) commented Feb 11, 2022

Issue for a --dry-run mode → #149
