
Nextstrain.org remote #141

Merged
merged 11 commits from trs/remote/nextstrain.org on Feb 28, 2022

Conversation

@tsibley tsibley (Member) commented Jan 25, 2022

Builds upon #140 to add nextstrain.org support to the nextstrain remote family of commands.

Kicking the tires

Works well in my extensive local testing against nextstrain/nextstrain.org#452. To do so yourself, run the nextstrain.org server from that PR:

# in checkout of nextstrain/nextstrain.org repo
git switch restful-api-rw
npm run build
node server.js --verbose

and then in another shell, before running nextstrain remote commands from this PR, make sure to:

export NEXTSTRAIN_DOT_ORG=http://localhost:5000

As confirmation, you should see localhost:5000 in the output of nextstrain remote ls nextstrain.org, e.g.:

$ nextstrain remote ls nextstrain.org
http://localhost:5000/dengue/denv1
http://localhost:5000/dengue/denv2
http://localhost:5000/dengue/denv3
http://localhost:5000/dengue/denv4
http://localhost:5000/dengue/all
http://localhost:5000/ebola
…

@tsibley tsibley requested a review from a team January 25, 2022 20:38
@victorlin victorlin (Member) commented Jan 26, 2022

Diff between this and #140 for easier review: 6d33886...trs/remote/nextstrain.org

These functions are intended to be used by other commands internally and
stderr avoids polluting those commands' stdout.

Renewal happens automatically when needed, but it can be useful to renew
early to, for example, refresh group membership information.

Missed doing this as part of the behaviour change from post-action
return to pre-action yield in "remote.s3: Rework delete() to have a
similar API to the other commands" (cc86a1a).

Using Path is a bit of a fib and means unnecessarily sprinkling Path()
calls around.

A future improvement might be defining a new RemotePath type for this
use, if it'd help with interop between RemoteModule implementations and
the remote commands.  For just interpretability, it didn't seem worth
it.

This helps with some kinds of messages where the interpolation expands
to multiple lines that would otherwise interfere with dedent.

I ran into a couple of new asserts during testing and realized handling
them centrally would be ~easy and a better UX.  If we're doing a good
job, this is unlikely to be seen by users, but no one's perfect.

To be used with API requests to nextstrain.org.  Exposed as an
Authorization header value instead of the bare id token on the principle
of isolating and centralizing code which directly handles tokens.

Uses the new RESTful API for download (GET), upload (PUT), delete
(DELETE) operations and the Charon API ("getAvailable") for listing.
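
For illustration, roughly what those HTTP interactions look like from
Python using requests. The dataset path and token-handling helper here
are stand-ins for the example, not this PR's actual code:

import os
import requests

def authorization() -> str:
    # Stand-in for the CLI's real token handling, which exposes the id
    # token pre-packaged as an Authorization header value.
    return f"Bearer {os.environ['EXAMPLE_ID_TOKEN']}"

base = os.environ.get("NEXTSTRAIN_DOT_ORG", "https://nextstrain.org")
dataset = f"{base}/groups/example/my-dataset"   # hypothetical dataset
media_type = "application/vnd.nextstrain.dataset.main+json"

# List available datasets via the Charon API.
available = requests.get(f"{base}/charon/getAvailable",
                         headers = {"Authorization": authorization()}).json()

# Download (GET), upload (PUT), and delete (DELETE) the dataset's main
# JSON via the RESTful API, selecting the representation by media type.
response = requests.get(dataset, headers = {"Accept": media_type,
                                            "Authorization": authorization()})

requests.put(dataset, data = response.content,
             headers = {"Content-Type": media_type,
                        "Authorization": authorization()})

requests.delete(dataset, headers = {"Authorization": authorization()})
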
Moves information on our two supported remotes to separate doc pages
since we can't reasonably stuff everything into --help.

As a future improvement, I'd love to see a way to read this
documentation at the CLI without having to open a web page or have
internet access, e.g. a new command like `nextstrain help remotes/s3` or
extending the --help-all pattern to `nextstrain remote` to include
description and not just options.
@tsibley tsibley force-pushed the trs/remote/nextstrain.org branch from 9e5112f to a197ac4 on January 27, 2022 06:23
@victorlin victorlin (Member) left a comment

Looks good to me with some basic testing! Learned a few things while looking over this too 🙂

Let me know if there's anything specific you'd like me to test locally.

@tsibley tsibley (Member, Author) commented Jan 29, 2022

@victorlin Thanks for giving it a whirl! Aside from "does it work?" testing, I think the next thing to focus on is how the command behaviour/UX feels. Does it feel right? Is there anything that feels off or that you reached for but didn't work/exist?

Slightly tangentially, one missing thing I'd like to add support for (and which I think @jameshadfield also wants) is specifying a constant filename prefix for local files when doing download --recursively. Currently that mode only supports specifying a local directory, but I think I know how to extend that reasonably.

@tsibley tsibley (Member, Author) commented Feb 1, 2022

My plan is to wait to merge this until nextstrain/nextstrain.org#452 gets deployed, as otherwise it won't work as advertised.

@joverlee521 joverlee521 (Contributor) left a comment

Very cool to see this work so well! I mostly left comments about the general "feel" of the commands, so nothing blocking.

A couple general comments I have:

  1. It would be nice to have a --dry-run option for these commands.
  2. Should user credentials be renewed before running each command?
    Currently, if I remove myself from a Cognito group, I can still run commands for the group as long as I don't renew my local tokens.

default_sidecars = ["root-sequence", "tip-frequencies"]

self.subresources = [
    SubResource("application/vnd.nextstrain.dataset.main+json", ".json", primary = True),
Contributor:

It makes me a little nervous to see all of the hardcoded media types throughout this file...can we house these in a dictionary or enum class?

Member Author:

Maybe? My thinking here was that media types are the kind of constants where the value is the name and vice versa, like, say, HTTP methods. That is, they're different from constants where the value is not synonymous with the name.

For this type of self-describing constant, I think they're more clearly used directly rather than indirectly behind a translation layer (dict/enum) that gives them some other name. My thinking goes: why have multiple names, varying across codebases, for a thing when you can have one that's consistent across codebases?

One benefit of indirection with an enum (but not dict) is that it would let the type checker tell us if we made a typo, but that feels like a marginal gain not worth the indirection to me. (The code will also likely fail to work with a typo in the media type, so would be noticeable.)

Contributor:

For this type of self-describing constant, I think they're more clearly used directly rather than indirectly behind a translation layer (dict/enum) that gives them some other name

That's fair. I just personally have a hard time remembering long strings 😆 I can always refer to the Media Types docs.
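
For reference, the enum alternative being weighed above might look
something like this sketch (only the first media type appears verbatim
in this PR; the sidecar values are guessed from the same naming pattern):

from enum import Enum

class MediaType(Enum):
    DATASET_MAIN    = "application/vnd.nextstrain.dataset.main+json"
    ROOT_SEQUENCE   = "application/vnd.nextstrain.dataset.root-sequence+json"
    TIP_FREQUENCIES = "application/vnd.nextstrain.dataset.tip-frequencies+json"

# A typo'd member like MediaType.DATASET_MIAN fails loudly (flagged by
# the type checker, AttributeError at runtime), unlike a typo'd string.
assert MediaType.DATASET_MAIN.value.endswith("+json")
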

# conventions, we might want to identify files using partial content
# inspection (e.g. does the file contain the expected top-level keys
# for a given sidecar type?) or full schema validation (e.g. try to
# validate all schemas and see which passes).
Contributor:

Full schema validation here would be really cool! It would be nice to give users feedback that their dataset will not be visualized properly because their JSON doesn't match our schema.

Member Author:

Agreed! Part of why I gave our existing schemas proper, resolvable ids (e.g. https://nextstrain.org/schemas/dataset/v2) is so we could leverage those here eventually. :-)

Member Author:

(I'd also like to do server-side validation.)
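
As a sketch of what client-side detection via those resolvable ids could
look like (the fetch-and-validate flow is illustrative, and the schema
draft used by the validator is an assumption):

import json
import requests
from jsonschema import Draft6Validator

SCHEMA_URL = "https://nextstrain.org/schemas/dataset/v2"

def looks_like_dataset(path: str) -> bool:
    # Validate the file against the dataset (v2) schema instead of
    # relying on filename conventions.
    schema = requests.get(SCHEMA_URL).json()
    with open(path) as file:
        document = json.load(file)
    return Draft6Validator(schema).is_valid(document)
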

Comment on lines +176 to +186
if unknowns:
    # XXX TODO: Handle group-logo.png and group-overview.md by making
    # appropriate API requests to the group endpoint (which doesn't yet
    # exist).
    #   -trs, 23 Sept 2021
    raise UserError(f"""
        Only datasets (v2) and narratives are currently supported for
        upload to nextstrain.org, but other files were given:

        {{files}}
        """, files = file_list(unknowns))
Contributor:

I wonder if it would be more user friendly to upload all valid files but print out a warning that the unknowns were excluded from the upload instead of raising an error.

I'm thinking of the use case where I want to upload everything within a data/ directory, but I don't want to figure out the glob pattern to exclude the bad files since I'm not as command line savvy.

Member Author:

Nod. That's fair. This might be too cautious. My thinking was the presence of unexpected files indicates the issue might be larger than just a few skip-able files, e.g. imagining a data/* glob picking up a README.md that we think is narrative because we don't inspect contents or picking up an auspice_config.json that we think is a dataset (again because we don't inspect contents). So, I thought, better to abort and ask the user to be more specific, than proceed and potentially leave the user with a mess to clean up? 🤷‍♂️

Contributor:

Uploading unexpected files technically wouldn't "break" anything, which is why I thought this is overly cautious.

I do agree it's not nice to leave a mess for the user to clean up. This makes me think it's insufficient that we only catch this problem sometimes. Maybe we should check this systematically and not just rely on the coincidence that they included an unknown file.

It would be nice to get confirmation from the user when uploading files. This can be a list of the datasets and narratives or even just a summary of "Please confirm X datasets and X narratives will be uploaded". We can always add a -f option to allow users to bypass this confirmation step.

Member Author:

It wouldn't "break" things in the sense of "this didn't work", but it breaks things in the sense that there'd be garbage datasets/narratives listed on the Groups page, private info might be leaked, etc. And I think it breaks an expectation that these commands won't let you upload just any old files.

I agree the detection can be more systematic, hence my comments about using our schemas to do detection pre-upload instead of naive filename matching.

Confirmation is a good suggestion; I've updated #149 to describe a --prompt (and --no-prompt) mode because they're both related to --dry-run.
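
A possible shape for that confirmation step, as a sketch (the flag name
models the hypothetical -f/--no-prompt bypass, not a decided design):

def confirm_upload(datasets, narratives, no_prompt: bool = False) -> bool:
    # Summarize the plan, then ask before proceeding.
    if no_prompt:
        return True
    print(f"Please confirm: {len(datasets)} dataset(s) and "
          f"{len(narratives)} narrative(s) will be uploaded.")
    return input("Proceed? [y/N] ").strip().lower() == "y"
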

Comment on lines +132 to +135
# I plan to extend the /charon/getAvailable API endpoint (or maybe
# switch to a new endpoint) in the future to include the "sidecars"
# field listing the available sidecars for each dataset, so that
# this code only has to try to fetch what is reported to exist.
Contributor:

This would also be helpful for nextstrain remote ls to add an option to list all files for a dataset (similar to how the remote list for s3 lists all objects in the bucket).

Member Author:

Yep. I imagined ls output to maybe initially surface this like:

$ nextstrain remote ls …
…/beta-cov
…/ncov/19B (tip-frequencies, root-sequence)
…/ncov/19B/2021-02-11 (tip-frequencies, root-sequence)
…/zika (tip-frequencies)

similar to how download does.

But there's also the possibility of jumping right to something fancier, e.g. with rich, as well as doing something machine-readable, e.g. JSON or YAML output mode.

Comment on lines +262 to +276
# If we're recursively downloading under a path
# which is itself a dataset that exists, e.g.
# `nextstrain remote download -r a/b` when the
# following datasets exist:
#
# a/b
# a/b/x
# a/b/y
# a/b/z
#
# then we need to include a little of the given
# path in the local filenames, e.g. strip only "a",
# leaving "b". Otherwise, e.g. if a/b doesn't
# exist itself only a/b/x, etc, then we can strip
# "a/b" entirely.
Contributor:

Hmm, this behavior was very unexpected for me when I was testing the recursive download.

Flu datasets might be a very specific use case:

  1. run remote download -r groups/test/flu /tmp/testing
  2. run remote download -r groups/test/flu/seasonal /tmp/testing

I expected the second command to just overwrite the existing files within /tmp/testing. Instead I got two sets of the same files but with slightly different names because the prefixes were stripped.

Member Author:

Well, "very unexpected" is not what we want! 😂 It's a good point that it's maybe a bit weird that downloading the same datasets two different ways produces different local filenames. I wonder if that situation is likely to crop up in typical usage as opposed to testing? I'd liked keeping filenames shorter than longer (e.g. instead of using the whole URL path) and picking a place to shorten them at that was under the user's control rather than picked/hardcoded by us. But maybe it's too confusing.

There's lots of brainstorming in my notes about possible behaviours here, and there were one or two decent discussions in Slack a while back, but nothing very conclusive. Maybe I should go back and summarize it all in writing and bring it back for wider consideration as part of (or separate from) this PR? Definitely would appreciate more input here.

Contributor:

(Ah, I just realized that I completely missed the flu example in the download help docs)

I actually prefer the opposite: keep the whole URL path so that the user can see exactly what they downloaded without having to inspect the file. They can easily rename the files locally.

Another problem is overwriting since the path prefixes are used to scope/filter downloads.
If I want to download H3N2 and H1N1pdm builds to the same directory and I run

  1. download nextstrain.org/flu/seasonal/h3n2/
  2. download nextstrain.org/flu/seasonal/h1n1pdm/

The second command would just overwrite my first download 😞

Contributor:

(Relevant Slack discussion from a while back)

Member Author:

I actually prefer the opposite: keep the whole URL path so that the user can see exactly what they downloaded without having to inspect the file.

I see the appeal of keeping the whole URL (though I think doing the / → _ transformation on the whole URL is weird), but I guess I don't get why someone wouldn't know what they just downloaded.

They can easily rename the files locally.

For a few files, yes. I'm not so confident in this being a reasonable expectation if someone does nextstrain remote download -r nextstrain.org/ncov (or worse, for flu) and ends up with 42 files (2 databases (GISAID, Open) × 7 regions × 3 files (main, root-sequence, tip-frequencies)).

One of the reasons for download using the shortest local filenames that are still distinguishing is that it helps play nice with upload when handling multiple datasets. Although upload doesn't care about local filename if uploading a single dataset, the filename still matters for multiple uploads and becomes a suffix on the given destination URL.

I'm imagining that a download + upload combo would be used for copying files between Groups, but maybe that would be less common than "download, do secondary analyses, never upload back", for which you'd tend to want longer more descriptive filenames. (The primary usage is still "run a build, upload files", and download doesn't matter for that.)

Member Author:

Another problem is overwriting since the path prefixes are used to scope/filter downloads. If I want to download H3N2 and H1N1pdm builds to the same directory and I run

1. download nextstrain.org/flu/seasonal/h3n2/

2. download nextstrain.org/flu/seasonal/h1n1pdm/

The second command would just overwrite my first download

Yeah, this is not ideal. The workaround is downloading each into a specific directory (or, in the nearish future, giving a filename prefix instead of a dir).

This comes back to the Slack discussion from the fall, where you pointed out that providing a toggle between two different ways of naming local files may be best. What is better for the "download then re-upload" workflow is not better for the "download and analyze" workflow and vice versa.

@tsibley tsibley (Member, Author) commented Feb 11, 2022

  1. It would be nice to have a --dry-run option for these commands.

Agreed. I'll make a new issue for that!

  2. Should user credentials be renewed before running each command?
    Currently, if I remove myself from a Cognito group, I can still run commands for the group as long as I don't renew my local tokens.

Maybe? This staleness is expected and part of Cognito's authn model because it issues JWTs. We can tune the maximum duration of the staleness, however. The access and id tokens are currently valid for an hour after creation and the refresh token for 30 days, which means there's up to an hour of staleness baked into an id token. We can tune it as low as 5 minutes, although this would mean very frequent background renewals.

The server side will still accept a valid token even if the CLI forces renewal before every command, so forcing a renewal keeps honest users honest ("oh, I guess I can't do that anymore") but doesn't prevent savvy/bad users from manually using the pre-renewal token ("oh, you say I can't do that anymore? I'll just use this other token that's still valid and says I can!").

We actually have a longer staleness issue for web sessions right now because we don't check Cognito for updates once a session is established and we don't force periodic reauth. We should reconsider some choices there and fix that as well.

I think the more comprehensive way to deal with this staleness is to track on the server side when things like group membership changes happen and record a timestamp that says "don't accept tokens created before X for user Y", e.g. implement a basic authn revocation system.
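
The core of such a check might be as simple as this sketch (illustrative
Python, although the real server is Node; iat and sub are standard JWT
claims):

def token_acceptable(claims: dict, not_before_by_user: dict) -> bool:
    # Reject tokens issued before the user's recorded "not before"
    # timestamp, which would be bumped whenever e.g. group membership
    # changes.
    issued_at = claims["iat"]                       # seconds since epoch
    not_before = not_before_by_user.get(claims["sub"])
    return not_before is None or issued_at >= not_before
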

Other than tweaking the token lifetimes, which is an ~easy knob, I think we can make the other improvements as part of further work to manage group memberships/roles within the nextstrain.org UI and API.

@tsibley tsibley (Member, Author) commented Feb 11, 2022

Issue for a --dry-run mode → #149
