review of nextstrain remote behavior #169

jameshadfield · 2022-04-06T05:30:20Z

This is essentially a review of the nextstrain remote functionality and how it interacts with data stored in multi-tenant groups. It comes out of a review of nextstrain/docs.nextstrain.org#104 which documents this functionality.

Single-asset vs multi-asset differences

The behavior for uploading a single narrative or single dataset + associated sidecars is great. I think it's intuitive the way the CLI detects that the provided files belong to the same asset (i.e. dataset + sidecar(s)) and then follows the pattern of [single asset] -> [specified destination] or [single asset] -> [group URL + filename converted to URL], depending on whether a base groups URL or a specific URL corresponding to an asset is provided. This is nice.

However, in multi-asset mode things get confusing. This is because the CLI now uses a different pattern: [files corresponding to multiple assets] -> [provided destination + asset filenames joined together]. This is similar to the AWS CLI, and I always find this unintuitive and confusing.

Multi-asset mode is also limiting in that you can't specify the destination for each asset, and you also can't upload narrative + dataset assets together.
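To make the two patterns concrete, here is a rough sketch of how I read the current behavior. It is not the CLI's actual code: the function names, the sidecar suffix list (taken from the examples in this issue) and the "three path parts means a bare group URL" test are all assumptions for illustration, and narratives are ignored.

```python
from collections import defaultdict

# Assumed sidecar suffixes, taken from the examples in this issue.
SIDECAR_SUFFIXES = ("_tip-frequencies", "_root-sequence", "_measurements")


def asset_name(filename: str) -> str:
    """Map a local dataset file to its asset name, e.g. a_measurements.json -> a."""
    stem = filename.removesuffix(".json")
    for suffix in SIDECAR_SUFFIXES:
        if stem.endswith(suffix):
            return stem.removesuffix(suffix)
    return stem


def plan_destinations(destination: str, files: list[str]) -> dict[str, list[str]]:
    """Return {remote URL: [local files]} for the two patterns described above."""
    assets = defaultdict(list)
    for f in files:
        assets[asset_name(f)].append(f)

    # Assumption: a bare group URL is written without a scheme and has exactly
    # three parts, i.e. nextstrain.org/groups/<name>.
    is_group_base = len(destination.strip("/").split("/")) == 3

    if len(assets) == 1 and not is_group_base:
        # Single asset + specific URL: the destination is used as given.
        (only_files,) = assets.values()
        return {destination: only_files}

    # Otherwise the destination acts as a prefix and the filename supplies the
    # rest of the URL (underscores become slashes).
    prefix = destination.rstrip("/")
    return {f"{prefix}/{name.replace('_', '/')}": fs for name, fs in assets.items()}
```

The surprise is the last branch: once a second asset is added, the same destination argument silently changes meaning from "the URL" to "a prefix".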

suggested solution

Force the user to be explicit, via the syntax:

nextstrain remote upload nextstrain.org/groups/<name>  a.json a_measurements.json \
                         nextstrain.org/groups/<name>/specific/name  b.json \
                         nextstrain.org/groups/<name>/narratives/my-name c.md

Benefits:

One syntax to rule them all. You don’t have to remember when a prefix is going to be used or not.
Allows uploading narrative assets and dataset assets at the same time
Explicit about what you want to happen
More versatile - you can rename files as needed
Everything still happens in one command, which will help if/when we do cloudfront invalidation ($$$)

Downsides:

More verbose. Often for multi-asset uploads this is being scripted so it’s not too bad.
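A minimal sketch of how the proposed argument list might be parsed, purely for illustration. The function name is hypothetical, and the rule "anything starting with nextstrain.org/ is a destination" is an assumed heuristic, not something the CLI does today.

```python
def parse_upload_args(args: list[str]) -> dict[str, list[str]]:
    """Group arguments as {destination URL: [files]} for the alternating syntax."""
    plan: dict[str, list[str]] = {}
    current = None
    for arg in args:
        # Assumed heuristic: arguments starting with the site name are destinations.
        if arg.startswith("nextstrain.org/"):
            current = arg
            plan.setdefault(current, [])
        elif current is None:
            raise ValueError(f"file {arg!r} given before any destination")
        else:
            plan[current].append(arg)

    # Every destination must be followed by at least one file.
    empty = [dest for dest, files in plan.items() if not files]
    if empty:
        raise ValueError(f"no files given for: {', '.join(empty)}")
    return plan
```

Note that a local path which happens to start with nextstrain.org/ would be misread as a destination under this heuristic.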

Existence checks

Currently you can overwrite a file with no warning or log that it exists. We should consider if this is the right move, rather than defaulting to not overwriting files and requiring --force or similar. (I think we should require this.)
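Roughly what I have in mind, as a sketch only. The function and its `force` parameter are hypothetical, and how the set of existing remote resources is obtained is left out.

```python
def check_overwrites(existing: set[str], uploads: list[str], force: bool = False) -> list[str]:
    """Return the resources an upload would overwrite; refuse unless force is set."""
    conflicts = sorted(set(uploads) & existing)
    if conflicts and not force:
        raise FileExistsError(
            "refusing to overwrite existing resources: " + ", ".join(conflicts)
        )
    # With force (or no conflicts) the caller can log what is being replaced.
    return conflicts
```

Even if the default stays as it is, the returned list would be enough to log a warning for each resource being replaced.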

Sidecars without main dataset files

Currently you can upload a sidecar without the main dataset existing. We should check that the main dataset exists in the group (or exists in the files being uploaded) before being able to upload sidecars.

Relatedly, you can't delete a sidecar file if the main dataset doesn't exist. However, I think this should be addressed by preventing the sidecar upload in the first place.
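The check could look something like the sketch below. The names are hypothetical, the sidecar suffixes are assumed from the examples in this issue, and remote datasets are represented simply by their filename stems.

```python
# Assumed sidecar suffixes, taken from the examples in this issue.
SIDECAR_SUFFIXES = ("_tip-frequencies", "_root-sequence", "_measurements")


def split_sidecar(filename: str) -> tuple[str, bool]:
    """Return (main dataset stem, is_sidecar) for a local JSON file."""
    stem = filename.removesuffix(".json")
    for suffix in SIDECAR_SUFFIXES:
        if stem.endswith(suffix):
            return stem.removesuffix(suffix), True
    return stem, False


def orphan_sidecars(remote_datasets: set[str], upload_files: list[str]) -> list[str]:
    """List sidecars whose main dataset is neither in the group nor being uploaded."""
    parsed = [(f, *split_sidecar(f)) for f in upload_files]
    uploaded_mains = {main for _, main, is_sidecar in parsed if not is_sidecar}
    return [
        f
        for f, main, is_sidecar in parsed
        if is_sidecar and main not in uploaded_mains and main not in remote_datasets
    ]
```

An upload would be rejected whenever this returns a non-empty list.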

Push and pull between files and assets

One of the tensions which the CLI tries to abstract away is the differences between a set of files representing the asset (i.e. main JSON + sidecars) and the asset itself. In some circumstances it's really nice to simply reason about an asset, but in others we need to know which underlying files are there. I only really found this tension problematic in one place:

nextstrain remote list should convey which assets have which sidecar files. This could be done in a compact format such as

$ nextstrain remote list nextstrain.org/groups/<name>
https://nextstrain.org/groups/<name>/a
https://nextstrain.org/groups/<name>/b [tip-frequencies, root-sequence]
https://nextstrain.org/groups/<name>/c [measurements]
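A sketch of how the underlying files could be collapsed into that compact listing. This is illustrative only: the function name is hypothetical, the sidecar kinds are assumed from the example output above, and sidecars without a main dataset are simply skipped.

```python
# Assumed sidecar kinds, in the order they should be displayed.
SIDECAR_KINDS = ("tip-frequencies", "root-sequence", "measurements")


def format_listing(base_url: str, filenames: list[str]) -> list[str]:
    """Collapse dataset files into one line per asset, annotated with its sidecars."""
    stems = {f.removesuffix(".json") for f in filenames if f.endswith(".json")}
    sidecars: dict[str, list[str]] = {}
    mains = []
    for stem in sorted(stems):
        for kind in SIDECAR_KINDS:
            if stem.endswith("_" + kind):
                sidecars.setdefault(stem.removesuffix("_" + kind), []).append(kind)
                break
        else:
            mains.append(stem)

    lines = []
    for stem in mains:
        url = f"{base_url.rstrip('/')}/{stem.replace('_', '/')}"
        kinds = [k for k in SIDECAR_KINDS if k in sidecars.get(stem, [])]
        lines.append(f"{url} [{', '.join(kinds)}]" if kinds else url)
    return lines
```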

upload and download default behavior is different

This is most easily seen with an example:

nextstrain remote upload nextstrain.org/groups/<name>  a_b.json
# available as https://nextstrain.org/groups/<name>/a/b
nextstrain remote download nextstrain.org/groups/<name>/a/b  .
# file downloaded is b.json rather than a_b.json

I know this has been discussed from time to time, but I really think the behavior here should be to save the asset as a_b.json. This has two main benefits for me:

Descriptive. Saving nextstrain.org/groups/<name>/flu/seasonal/h3n2/ha/2y as 2y.json is going to lead to cognitive disassociation between the file and the URL.
Closes the circle. Allows users to (e.g.) download an asset, modify it locally, and re-upload it without having to remember that they have to change the upload destination.
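The round trip I'm after can be sketched as a pair of inverse mappings. The function names are hypothetical, and `path` here means the part of the URL after the group name.

```python
def url_path_to_filename(path: str) -> str:
    """Proposed download naming: a/b -> a_b.json."""
    return path.strip("/").replace("/", "_") + ".json"


def filename_to_url_path(filename: str) -> str:
    """Upload naming as in the example above: a_b.json -> a/b."""
    return filename.removesuffix(".json").replace("_", "/")


# Download then re-upload lands on the same URL path.
path = "flu/seasonal/h3n2/ha/2y"
assert filename_to_url_path(url_path_to_filename(path)) == path
```

With the current behavior (saving as 2y.json) the second mapping would send a re-upload to /2y instead of the original path.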

Minor notes

Things I noticed during testing, but are probably too minor to address. Listed here for completeness.

Slight conflict between delete and list

nextstrain remote list nextstrain.org/groups/blab-private/james lists all datasets under that URL prefix
nextstrain remote rm nextstrain.org/groups/blab-private/james doesn't work. (And I'm really happy about this!)

Multi-asset delete

I found myself wishing you could do things like:

nextstrain remote delete nextstrain.org/groups/<name>/a nextstrain.org/groups/<name>/b

(Wildcard expansion also doesn't work, but I see this as a feature not a bug.)


tsibley · 2022-04-08T19:41:52Z

Thanks for the review, @jameshadfield. Really appreciate more 👀 and 🧠 on this stuff! There's a lot here and I'm away for a bit starting Real Soon Now, so I'll come back to this more fully when I return. But I thought I'd jot down a few first thoughts as I read through now. I expect this issue will want to get broken out into several sub-issues with concise descriptions of what to change once we have some discussion and reach consensus.

Re: the suggested syntax for multi-resource uploads:

nextstrain remote upload nextstrain.org/groups/<name>  a.json a_measurements.json \
                         nextstrain.org/groups/<name>/specific/name  b.json \
                         nextstrain.org/groups/<name>/narratives/my-name c.md

I don't think this syntax is all that appealing for several reasons:

It introduces significant semantics on argument order that go beyond what's typical for most command-line programs. Departing from the typical conventions can be ok in exceptional situations, but I don't think its benefits warrant that here given my other concerns.
It introduces ambiguity between files and remote URLs, particularly if using the shortened remote URL forms of groups/<name>/…. Heuristics are possible, of course, but won't always be right and it isn't clear what a user's escape hatch is if the heuristics guess wrong.
The benefits you list are essentially all retained if you split the single invocation into multiple invocations.

The exception is CloudFront invalidation, but there is no reason to and no plan to add that for Groups. For our core and staging remotes, it's not clear when/if we'll support uploading via the nextstrain.org remote. When/if we do, I'd prefer to start using proper caching headers instead which can make invalidations basically unnecessary and other parts of our codebase simpler too; this is on my low-priority todo (one example).

The one downside you list, verbosity, is increased slightly with multiple invocations, but the same mitigating factor applies: these are often scripted so verbosity doesn't matter as much.
Multiple invocations instead of a single invocation are much easier to explain, as there's just one form of the command.
Each upload happens independently anyway, so single vs. multiple command invocations don't meaningfully change the overhead.

All that said, I do agree that the UX of uploading multiple resources in a single invocation can be improved! I've sketched some ideas out before in my (public) notes and review comments, Slack, etc. and will try to pull them together and summarize here when I'm back.

One such improvement suggested also by @joverlee521 is --dry-run and --prompt modes, which I described in #149.

Re: existence checks.

I understand the desire to prevent accidental overwrites and am open to reconsidering the default, but I think routinely overwriting a dataset with an updated version is a very common pattern and something that shouldn't require an extra flag (much less a flag scarily-named --force; --update would be better). Instead, I would prefer a way to opt-into overwrite protection, e.g. nextstrain remote upload --if-not-exists. Separately, prompting on overwrite could also be tied into --prompt from above.

Uploading a sidecar without a main dataset is indeed a corner case that we could improve. I would want to implement it server-side not client-side but didn't because it wasn't critical and there's some nuance given race conditions introduced by existence pre-checks.

Re: surfacing which sidecars exist.

Definitely. Something like that was already my plan, and @joverlee521 and I discussed it in the nextstrain.org remote review.

Re: upload vs. download behaviour.

I agree there's more consideration to do here. @joverlee521 and I most recently discussed this a bit in the prior review, and there were many other discussions around this elsewhere beforehand.

I see the upsides to your preferred behaviour, though there are downsides too. I think it's worth a larger team discussion over the baseline/typical usage patterns we expect (for both ourselves and others) and what behaviour would best support those and remain straightforward to reason about. It might be as you suggest! but that's not clear to me right now.

At this point, such discussion is probably best done as a scheduled, real-time meeting instead of async writing. Let's do that when I'm back!

Re: minor notes:

remote rm -r will recursively delete the same set that ls lists, but yes, I agree good to keep this off by default.
Multi-resource delete (and list) could easily be added, either with multi-arg or wildcards or both. I'd probably just implement multi-arg to start. Someone should feel free to do this and submit a PR for it!

tsibley · 2022-09-26T21:41:38Z

@jameshadfield A few weeks ago I changed the behaviour of nextstrain remote download with respect to the local filenames it produces (see changelog), which addresses some of the concerns above.

tsibley · 2022-09-27T23:18:10Z

I just now opened a discussion issue for nextstrain remote upload behaviour.

tsibley · 2022-10-28T23:50:12Z

Closing as discussion has been had here. We can re-open if more discussion is needed or open new issues.

nextstrain-bot added this to Nextstrain planning (archived) Apr 6, 2022

nextstrain-bot moved this to New in Nextstrain planning (archived) Apr 6, 2022

jameshadfield mentioned this issue Apr 7, 2022

Overhaul Nextstrain Groups documentation nextstrain/docs.nextstrain.org#104

Merged

victorlin moved this from New to Backlog in Nextstrain planning (archived) Apr 27, 2022

tsibley mentioned this issue Aug 1, 2022

remote download: Use more of the resource path in the local filename #213

Merged

tsibley mentioned this issue Sep 27, 2022

RFD: remote upload's use of filenames #221

Closed

tsibley added the proposal Proposals that warrant further discussion label Sep 27, 2022

tsibley closed this as completed Oct 28, 2022

Repository owner moved this from Backlog to Done in Nextstrain planning (archived) Oct 28, 2022
