Nextstrain.org remote #141
Conversation
Diff between this and #140 for easier review: 6d33886...trs/remote/nextstrain.org
These functions are intended to be used by other commands internally and stderr avoids polluting those commands' stdout.
Renewal happens automatically when needed, but it can be useful to renew early to, for example, refresh group membership information.
Missed doing this as part of the behaviour change from post-action return to pre-action yield in "remote.s3: Rework delete() to have a similar API to the other commands" (cc86a1a).
Using Path is a bit of a fib and means unnecessarily sprinkling Path() calls around. A future improvement might be defining a new RemotePath type for this use, if it'd help with interop between RemoteModule implementations and the remote commands. For interpretability alone, it didn't seem worth it.
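For illustration, a hypothetical sketch of what such a type might look like (nothing like this exists in the PR; only the name comes from the paragraph above):

```python
from pathlib import PurePosixPath

# Hypothetical RemotePath sketch: a distinct type so signatures can say
# "remote path" without pretending it's a local filesystem Path.
class RemotePath(PurePosixPath):
    """The path component of a remote URL, e.g. on nextstrain.org."""

print(RemotePath("groups/blab/ncov").parent)   # groups/blab
```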
This helps with some kinds of messages where the interpolation expands to multiple lines that would otherwise interfere with dedent.
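A minimal sketch of the problem and the fix, assuming a dedent-then-interpolate helper (names here are illustrative, not the PR's actual helper):

```python
from textwrap import dedent

# If a multi-line value is interpolated *before* dedent, its unindented
# lines destroy the common leading whitespace dedent looks for, so the
# message keeps its source indentation.  Dedenting first avoids that.
def format_msg(template: str, **values) -> str:
    return dedent(template).format(**values)

files = "\n".join(["a.json", "b.json"])
print(format_msg("""
    Unexpected files:

    {files}
    """, files = files))
```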
I ran into a couple of new asserts during testing and realized handling them centrally would be ~easy and a better UX. If we're doing a good job, this is unlikely to be seen by users, but no one's perfect.
To be used with API requests to nextstrain.org. Exposed as an Authorization header value instead of the bare id token on the principle of isolating and centralizing code which directly handles tokens.
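Something like the following, assuming the Bearer scheme (the scheme and function name are illustrative, not confirmed by the PR text):

```python
# Sketch only: return a ready-to-use Authorization header value rather
# than exposing the bare id token to callers.
def http_authorization(id_token: str) -> str:
    return f"Bearer {id_token}"

headers = {"Authorization": http_authorization("<id-token>")}
```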
Uses the new RESTful API for download (GET), upload (PUT), delete (DELETE) operations and the Charon API ("getAvailable") for listing.
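A rough sketch of those request shapes (the dataset URL is made up; the media type is the one used elsewhere in this PR; treat this as illustrative rather than the actual client code):

```python
import requests

base    = "https://nextstrain.org"
headers = {"Authorization": "Bearer <id-token>"}

# Download (GET) a dataset's main JSON by asking for its media type.
dataset = requests.get(
    f"{base}/groups/example/my-dataset",
    headers = {**headers, "Accept": "application/vnd.nextstrain.dataset.main+json"})

# Upload (PUT) and delete (DELETE) use the same URL with different
# methods; listing goes through the Charon API instead.
available = requests.get(f"{base}/charon/getAvailable", headers = headers)
```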
Moves information on our two supported remotes to separate doc pages since we can't reasonably stuff everything into --help. As a future improvement, I'd love to see a way to read this documentation at the CLI without having to open a web page or have internet access, e.g. a new command like `nextstrain help remotes/s3` or extending the --help-all pattern to `nextstrain remote` to include description and not just options.
Looks good to me with some basic testing! Learned a few things while looking over this too 🙂
Let me know if there's anything specific you'd like me to test locally.
@victorlin Thanks for giving it a whirl! Aside from "does it work?" testing, I think the next thing to focus on is how the command behaviour/UX feels. Does it feel right? Is there anything that feels off or that you reached for but didn't work/exist? Slightly tangentially, one missing thing I'd like to add support for (and which I think @jameshadfield also wants) is specifying a constant filename prefix for local files when doing downloads.
My plan is to wait to merge this until nextstrain/nextstrain.org#452 gets deployed, as otherwise it won't work as advertised.
Very cool to see this work so well! I mostly left comments about the general "feel" of the commands, so nothing blocking.
A couple general comments I have:
- It would be nice to have a `--dry-run` option for these commands.
- Should user credentials be renewed before running each command? Currently, if I remove myself from a Cognito group, I can still run commands for the group as long as I don't renew my local tokens.
```python
default_sidecars = ["root-sequence", "tip-frequencies"]

self.subresources = [
    SubResource("application/vnd.nextstrain.dataset.main+json", ".json", primary = True),
```
It makes me a little nervous to see all of the hardcoded media types throughout this file...can we house these in a dictionary or enum class?
Maybe? My thinking here was that media types are the kind of constants where the value is the name and vice versa, like, say, HTTP methods. That is, they're different from constants where the value is not synonymous with the name.
For this type of self-describing constant, I think they're more clearly used directly rather than indirectly behind a translation layer (dict/enum) that gives them some other name. My thinking goes, why have multiple names, varying across code bases, for a thing when you can have one that's consistent across codebases?
One benefit of indirection with an enum (but not dict) is that it would let the type checker tell us if we made a typo, but that feels like a marginal gain not worth the indirection to me. (The code will also likely fail to work with a typo in the media type, so would be noticeable.)
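For concreteness, the enum indirection being weighed would look something like this (class and member names are illustrative):

```python
from enum import Enum

# str-valued enum: members compare equal to the bare media type strings,
# but a typo in a member name is caught by the type checker, while a
# typo inside the string itself still only surfaces at request time.
class MediaType(str, Enum):
    DATASET_MAIN = "application/vnd.nextstrain.dataset.main+json"

assert MediaType.DATASET_MAIN == "application/vnd.nextstrain.dataset.main+json"
```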
For this type of self-describing constant, I think they're more clearly used directly rather than indirectly behind a translation layer (dict/enum) that gives them some other name
That's fair. I just personally have a hard time remembering long strings 😆 I can always refer to the Media Types docs.
```python
# conventions, we might want to identify files using partial content
# inspection (e.g. does the file contain the expected top-level keys
# for a given sidecar type?) or full schema validation (e.g. try to
# validate all schemas and see which passes).
```
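As a sketch of the partial content inspection idea (the key checked here is an assumption about sidecar shape, not the CLI's actual rule):

```python
import json
from pathlib import Path

def looks_like_tip_frequencies(path: Path) -> bool:
    """Guess a sidecar type from top-level keys instead of its filename."""
    try:
        data = json.loads(path.read_text())
    except (OSError, ValueError):
        return False
    return isinstance(data, dict) and "pivots" in data
```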
Full schema validation here would be really cool! It would be nice to give users feedback that their dataset will not be visualized properly because their JSON doesn't match our schema.
Agreed! Part of why I gave our existing schemas proper, resolvable ids (e.g. https://nextstrain.org/schemas/dataset/v2) is so we could leverage those here eventually. :-)
(I'd also like to do server-side validation.)
```python
if unknowns:
    # XXX TODO: Handle group-logo.png and group-overview.md by making
    # appropriate API requests to the group endpoint (which doesn't yet
    # exist).
    #   -trs, 23 Sept 2021
    raise UserError(f"""
        Only datasets (v2) and narratives are currently supported for
        upload to nextstrain.org, but other files were given:

        {{files}}
        """, files = file_list(unknowns))
```
I wonder if it would be more user friendly to upload all valid files but print out a warning that the `unknowns` were excluded from the upload instead of raising an error. I'm thinking of the use case where I want to upload everything within a `data/` directory, but I don't want to figure out the glob pattern to exclude the bad files since I'm not as command line savvy.
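In code, the suggested warn-and-skip behaviour might look roughly like this (a sketch of the reviewer's alternative, not the PR's behaviour; names are illustrative):

```python
import sys

def upload_known(files: list, unknowns: set) -> list:
    # Warn about and skip unrecognized files instead of raising an error.
    if unknowns:
        print("Skipping unrecognized files:",
              *sorted(map(str, unknowns)), sep = "\n  ", file = sys.stderr)
    return [f for f in files if f not in unknowns]
```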
Nod. That's fair. This might be too cautious. My thinking was that the presence of unexpected files indicates the issue might be larger than just a few skip-able files, e.g. imagining a `data/*` glob picking up a `README.md` that we think is a narrative because we don't inspect contents, or picking up an `auspice_config.json` that we think is a dataset (again because we don't inspect contents). So, I thought, better to abort and ask the user to be more specific than proceed and potentially leave the user with a mess to clean up? 🤷‍♂️
Uploading unexpected files technically wouldn't "break" anything, which is why I thought this is overly cautious.
I do agree it's not nice to leave a mess for the user to clean up. This makes me think it's insufficient that we only catch this problem sometimes. Maybe we should check this systematically and not just rely on the coincidence that they included an unknown file.
It would be nice to get confirmation from the user when uploading files. This can be a list of the datasets and narratives or even just a summary of "Please confirm X datasets and X narratives will be uploaded". We can always add a `-f` option to allow users to bypass this confirmation step.
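A sketch of that confirmation step (wording and names are illustrative, not part of this PR):

```python
def confirm(prompt: str, force: bool = False) -> bool:
    if force:   # e.g. set by a -f flag to bypass the prompt
        return True
    return input(f"{prompt} [y/N] ").strip().lower() in {"y", "yes"}

# Counts here would come from scanning the files to be uploaded.
if not confirm("Please confirm 3 datasets and 1 narrative will be uploaded."):
    raise SystemExit("Upload aborted.")
```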
It wouldn't "break" things in the sense of "this didn't work", but it breaks things in the sense that there'd be garbage datasets/narratives listed on the Groups page, private info might be leaked, etc. And I think it breaks an expectation that these commands won't let you upload just any old files.
I agree the detection can be more systematic, hence my comments about using our schemas to do detection pre-upload instead of naive filename matching.
Confirmation is a good suggestion; I've updated #149 to describe a `--prompt` (and `--no-prompt`) mode because they're both related to `--dry-run`.
```python
# I plan to extend the /charon/getAvailable API endpoint (or maybe
# switch to a new endpoint) in the future to include the "sidecars"
# field listing the available sidecars for each dataset, so that
# this code only has to try to fetch what is reported to exist.
```
This would also be helpful for `nextstrain remote ls` to add an option to list all files for a dataset (similar to how the remote list for s3 lists all objects in the bucket).
Yep. I imagined `ls` output to maybe initially surface this like:

```
$ nextstrain remote ls …
…/beta-cov
…/ncov/19B             (tip-frequencies, root-sequence)
…/ncov/19B/2021-02-11  (tip-frequencies, root-sequence)
…/zika                 (tip-frequencies)
```

similar to how `download` does.
But there's also the possibility of jumping right to something fancier, e.g. with rich, as well as doing something machine-readable, e.g. JSON or YAML output mode.
```python
# If we're recursively downloading under a path
# which is itself a dataset that exists, e.g.
# `nextstrain remote download -r a/b` when the
# following datasets exist:
#
#     a/b
#     a/b/x
#     a/b/y
#     a/b/z
#
# then we need to include a little of the given
# path in the local filenames, e.g. strip only "a",
# leaving "b".  Otherwise, e.g. if a/b doesn't
# exist itself, only a/b/x, etc., then we can strip
# "a/b" entirely.
```
Hmm, this behavior was very unexpected for me when I was testing the recursive download.
Flu datasets might be a very specific use case:
- run `remote download -r groups/test/flu /tmp/testing`
- run `remote download -r groups/test/flu/seasonal /tmp/testing`

I expected the second command to just overwrite the existing files within `/tmp/testing`. Instead I got two sets of the same files but with slightly different names because the prefixes were stripped.
Well, "very unexpected" is not what we want! 😂 It's a good point that it's maybe a bit weird that downloading the same datasets two different ways produces different local filenames. I wonder if that situation is likely to crop up in typical usage as opposed to testing? I'd liked keeping filenames shorter than longer (e.g. instead of using the whole URL path) and picking a place to shorten them at that was under the user's control rather than picked/hardcoded by us. But maybe it's too confusing.
There's lots of brainstorming in my notes about possible behaviours here, and there were one or two decent discussions in Slack a while back, but nothing very conclusive. Maybe I should go back and summarize it all in writing and bring it back for wider consideration as part of (or separate from) this PR? Definitely would appreciate more input here.
(Ah, I just realized that I completely missed the flu example in the download help docs)
I actually prefer the opposite: keep the whole URL path so that the user can see exactly what they downloaded without having to inspect the file. They can easily rename the files locally.
Another problem is overwriting since the path prefixes are used to scope/filter downloads.
If I want to download H3N2 and H1N1pdm builds to the same directory and I run

1. `download nextstrain.org/flu/seasonal/h3n2/`
2. `download nextstrain.org/flu/seasonal/h1n1pdm/`

the second command would just overwrite my first download 😞
(Relevant Slack discussion from a while back)
I actually prefer the opposite: keep the whole URL path so that the user can see exactly what they downloaded without having to inspect the file.
I see the appeal of keeping the whole URL (though I think doing the `/` → `_` transformation on the whole URL is weird), but I guess I don't get why someone wouldn't know what they just downloaded.
They can easily rename the files locally.
For a few files, yes. I'm not so confident in this being a reasonable expectation if someone does `nextstrain remote download -r nextstrain.org/ncov` (or worse, for flu) and ends up with 42 files (2 databases (GISAID, Open) × 7 regions × 3 files (main, root-sequence, tip-frequencies)).
One of the reasons for `download` using the shortest local filenames that are still distinguishing is that it helps play nice with `upload` when handling multiple datasets. Although `upload` doesn't care about the local filename if uploading a single dataset, the filename still matters for multiple uploads and becomes a suffix on the given destination URL.

I'm imagining that a download + upload combo would be used for copying files between Groups, but maybe that would be less common than "download, do secondary analyses, never upload back", for which you'd tend to want longer, more descriptive filenames. (The primary usage is still "run a build, upload files", and `download` doesn't matter for that.)
Another problem is overwriting since the path prefixes are used to scope/filter downloads. If I want to download H3N2 and H1N1pdm builds to the same directory and I run
1. download nextstrain.org/flu/seasonal/h3n2/
2. download nextstrain.org/flu/seasonal/h1n1pdm/
The second command would just overwrite my first download
Yeah, this is not ideal. The workaround is downloading each into a specific directory (or, in the nearish future, giving a filename prefix instead of a dir).
This comes back to the Slack discussion from the fall, where you pointed out that providing a toggle between two different ways of naming local files may be best. What is better for the "download then re-upload" workflow is not better for the "download and analyze" workflow and vice versa.
…ed in: A great suggestion by @joverlee521 during review. Thanks!
Agreed. I'll make a new issue for that!
Maybe? This staleness is expected and part of Cognito's authn model because it issues JWTs. We can tune the maximum duration of the staleness, however. The access and id tokens are currently valid for an hour after creation and the refresh token for 30 days, which means there's up to an hour of staleness baked into an id token. We can tune it as low as 5 minutes, although this would mean very frequent background renewals.

The server side will still accept a valid token even if the CLI forces renewal before every command, so forcing a renewal keeps honest users honest ("oh, I guess I can't do that anymore") but doesn't prevent savvy/bad users from manually using the pre-renewal token ("oh, you say I can't do that anymore? I'll just use this other token that's still valid and says I can!").

We actually have a longer staleness issue for web sessions right now because we don't check Cognito for updates once a session is established and we don't force periodic reauth. We should reconsider some choices there and fix that as well.

I think the more comprehensive way to deal with this staleness is to track on the server side when things like group membership changes happen and record a timestamp that says "don't accept tokens created before X for user Y", e.g. implement a basic authn revocation system. Other than tweaking the token lifetimes, which is an ~easy knob, I think we can make the other improvements as part of further work to manage group memberships/roles within the nextstrain.org UI and API.
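The revocation idea in miniature (purely illustrative; nothing like this exists yet): record a per-user cutoff when membership changes and reject tokens issued before it, regardless of their expiry.

```python
import time

not_before: dict = {}   # user -> unix time of their last membership change

def accept_token(user: str, issued_at: float) -> bool:
    # Reject tokens issued before the user's cutoff, even if unexpired.
    return issued_at >= not_before.get(user, 0)

not_before["alice"] = time.time()                     # membership changed
assert not accept_token("alice", time.time() - 60)    # stale token rejected
```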
Issue for a …
Builds upon #140 to add nextstrain.org support to the `nextstrain remote` family of commands.

Kicking the tires

Works well in my extensive local testing against nextstrain/nextstrain.org#452. To do so yourself, run the nextstrain.org server from that PR, then in another shell point the `nextstrain remote` commands from this PR at that local server before running them. As a confirmation, you should see `localhost:5000` in the output of `nextstrain remote ls nextstrain.org`.