Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

build: Add --exclude-from-upload and --exclude-from-download options #370

Merged
merged 1 commit into from
May 29, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
22 changes: 22 additions & 0 deletions CHANGES.md
Original file line number Diff line number Diff line change
Expand Up @@ -24,6 +24,15 @@ installation method we recommend in the [Nextstrain installation
documentation](https://docs.nextstrain.org/page/install.html). In that case, a
supported Python version is always bundled with `nextstrain`.

## Features

* `nextstrain build` now supports two new options when using the AWS Batch
runtime: [`--exclude-from-upload`][] and [`--exclude-from-download`][]. The
former is useful for avoiding the upload of large, ancillary files not needed
by the build. The latter exists to parallel the former and make it easier to
exclude files from both upload and download.
([#370](https://github.com/nextstrain/cli/pull/370))

## Improvements

* The Conda runtime now uses Micromamba 1.5.8 (upgraded from 1.1.0) to manage
Expand All @@ -32,6 +41,19 @@ supported Python version is always bundled with `nextstrain`.
the Conda runtime.
([#367](https://github.com/nextstrain/cli/pull/367))

## Bug fixes

* The [`--download`][] option of `nextstrain build` now supports passing _only_
negated patterns (e.g. `!…` and `!(…)`). All files which _don't_ match the
negated patterns will be downloaded. Previously, no files were downloaded
unless at least one positive pattern was given.
([#370](https://github.com/nextstrain/cli/pull/370))


[`--exclude-from-upload`]: https://docs.nextstrain.org/projects/cli/en/__NEXT__/commands/build/#cmdoption-nextstrain-build-exclude-from-upload
[`--exclude-from-download`]: https://docs.nextstrain.org/projects/cli/en/__NEXT__/commands/build/#cmdoption-nextstrain-build-exclude-from-download
[`--download`]: https://docs.nextstrain.org/projects/cli/en/__NEXT__/commands/build/#cmdoption-nextstrain-build-download


# 8.3.0 (30 April 2024)

Expand Down
54 changes: 54 additions & 0 deletions doc/commands/build.rst
Original file line number Diff line number Diff line change
Expand Up @@ -97,13 +97,67 @@ options
wildcards (``**``), extended globbing (``@(…)``, ``+(…)``, etc.),
and negation (``!…``).

Patterns should be relative to the build directory.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Potentially tangential to this PR

Reading this code my understanding is that only the build directory is uploaded, and so patterns are relative to that. How does this work when the build directory is (e.g.) ingest/ but within that the Snakefile refers back to a parent directory?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, that's a great point. The directory here that matters—the one that's uploaded and one that patterns must be relative to—is influenced by the nextstrain-pathogen.yaml file. For example, running nextstrain build ingest/ in a directory with nextstrain-pathogen.yaml will require patterns be relative to the directory containing the marker file, not the ingest/ directory, i.e. to exclude something in ingest/ would require the pattern to be ingest/x/y/z not just x/y/z.

I'll update the docs here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I want to leave the option docs/help as-is for now for this PR. I'm realizing there's larger documentation reworking to do to clarify terminology here as a result of #355 (something I'd postponed in that PR). Writing a clear help snippet for these pattern options will be much easier and less inconsistent/confusing once the larger terminology shift and doc updates happen. I should prioritize that.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

@jameshadfield jameshadfield May 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 totally forgot about this subtlety. Is this behaviour mirrored in what directories are bind-mounted into the docker container (when running locally)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep.

Before #355 the directory given to nextstrain build ended up as /nextstrain/build in the container and /nextstrain/build was the fixed working directory for the container process.

After #355, the /nextstrain/build mount is either the directory given to nextstrain build or, if present, the closest ancestor of the given directory which contains a nextstrain-pathogen.yaml. The working directory for the container process is the /nextstrain/build path corresponding to the directory given to nextstrain build.





.. option:: --no-download

Do not download any files from the remote build when it completes. Currently only supported when also using :option:`--aws-batch`.

.. option:: --exclude-from-download <pattern>

Exclude files matching ``<pattern>`` from being downloaded from
the remote build. Equivalent to passing a negated pattern to
:option:`--download`. That is, the following are equivalent::

--exclude-from-download 'xyz'
--download '!xyz'

Refer to :option:`--download` for usage details, but note that this
option doesn't support already-negated patterns (e.g. ``!…`` or
``!(…)``).

This option exists to parallel :option:`--exclude-from-upload`.




.. option:: --exclude-from-upload <pattern>

Exclude files matching ``<pattern>`` from being uploaded as part of
the remote build. Shell-style advanced globbing is supported, but
be sure to escape wildcards or quote the whole pattern so your
shell doesn't expand them. May be passed more than once.
Currently only supported when also using :option:`--aws-batch`.
Default is to upload the entire pathogen build directory (except
for some ancillary files which are always excluded).

Note that files excluded from upload may still be downloaded from
the remote build, e.g. if they're created by it, and if downloaded
will overwrite the local files. Use :option:`--no-download` to
avoid downloading any files, or :option:`--exclude-from-download`
to avoid downloading specific files, e.g.::

--exclude-from-upload 'results/**' \
--exclude-from-download 'results/**'

Your shell's brace expansion can also be used to shorten this, e.g.::

--exclude-from-{up,down}load='results/**'

Besides basic glob features like single-part wildcards (``*``),
character classes (``[…]``), and brace expansion (``{…, …}``),
several advanced globbing features are also supported: multi-part
wildcards (``**``), extended globbing (``@(…)``, ``+(…)``, etc.),
and negation (``!…``).

Patterns should be relative to the build directory.




.. option:: --no-logs

Do not show the log messages of the remote build. Currently only supported when also using :option:`--aws-batch`. Default is to show all log messages, even when attaching to a completed build.
Expand Down
62 changes: 62 additions & 0 deletions nextstrain/cli/command/build.py
Original file line number Diff line number Diff line change
Expand Up @@ -104,6 +104,8 @@ def register_parser(subparser):
wildcards (``**``), extended globbing (``@(…)``, ``+(…)``, etc.),
and negation (``!…``).

Patterns should be relative to the build directory.

{SKIP_AUTO_DEFAULT_IN_HELP}
"""),
default = True,
Expand All @@ -117,6 +119,66 @@ def register_parser(subparser):
dest = "download",
action = "store_false")

parser.add_argument(
"--exclude-from-download",
metavar = "<pattern>",
help = dedent(f"""\
Exclude files matching ``<pattern>`` from being downloaded from
the remote build. Equivalent to passing a negated pattern to
:option:`--download`. That is, the following are equivalent::

--exclude-from-download 'xyz'
--download '!xyz'

Refer to :option:`--download` for usage details, but note that this
option doesn't support already-negated patterns (e.g. ``!…`` or
``!(…)``).

This option exists to parallel :option:`--exclude-from-upload`.

{SKIP_AUTO_DEFAULT_IN_HELP}
"""),
dest = "download",
type = lambda pattern: f"!{pattern}",
action = AppendOverwriteDefault)

parser.add_argument(
"--exclude-from-upload",
metavar = "<pattern>",
help = dedent(f"""\
Exclude files matching ``<pattern>`` from being uploaded as part of
the remote build. Shell-style advanced globbing is supported, but
be sure to escape wildcards or quote the whole pattern so your
shell doesn't expand them. May be passed more than once.
Currently only supported when also using :option:`--aws-batch`.
Default is to upload the entire pathogen build directory (except
for some ancillary files which are always excluded).

Note that files excluded from upload may still be downloaded from
the remote build, e.g. if they're created by it, and if downloaded
will overwrite the local files. Use :option:`--no-download` to
avoid downloading any files, or :option:`--exclude-from-download`
to avoid downloading specific files, e.g.::

--exclude-from-upload 'results/**' \\
--exclude-from-download 'results/**'

Your shell's brace expansion can also be used to shorten this, e.g.::

--exclude-from-{{up,down}}load='results/**'

Besides basic glob features like single-part wildcards (``*``),
character classes (``[…]``), and brace expansion (``{{…, …}}``),
several advanced globbing features are also supported: multi-part
wildcards (``**``), extended globbing (``@(…)``, ``+(…)``, etc.),
and negation (``!…``).

Patterns should be relative to the build directory.

{SKIP_AUTO_DEFAULT_IN_HELP}
"""),
action = "append")

# A --logs option doesn't make much sense right now for most of our
# runtimes, but I can see how it might in the future. So we're ready if
# that future comes to pass, set up --no-logs as if there's a --logs option
Expand Down
2 changes: 1 addition & 1 deletion nextstrain/cli/runner/aws_batch/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -213,7 +213,7 @@ def run(opts, argv, working_volume = None, extra_env: Env = {}, cpus: int = None
print_stage("Uploading %s to S3" % local_workdir)

bucket = s3.bucket(opts.s3_bucket)
remote_workdir = s3.upload_workdir(local_workdir, bucket, run_id)
remote_workdir = s3.upload_workdir(local_workdir, bucket, run_id, opts.exclude_from_upload)

print("uploaded:", s3.object_url(remote_workdir))

Expand Down
23 changes: 21 additions & 2 deletions nextstrain/cli/runner/aws_batch/s3.py
Original file line number Diff line number Diff line change
Expand Up @@ -38,17 +38,20 @@ def object_from_url(s3url: str) -> S3Object:
return bucket(url.netloc).Object(key)


def upload_workdir(workdir: Path, bucket: S3Bucket, run_id: str) -> S3Object:
def upload_workdir(workdir: Path, bucket: S3Bucket, run_id: str, patterns: List[str] = None) -> S3Object:
"""
Upload a ZIP archive of the local *workdir* to the remote S3 *bucket* for
the given *run_id*.

An optional list of *patterns* (shell-style advanced globs) can be passed
to selectively exclude part of the local *workdir* from being uploaded.

Returns the S3.Object instance of the uploaded archive.
"""

remote_workdir = bucket.Object(run_id + ".zip")

excluded = path_matcher([
always_excluded = path_matcher([
# Jobs don't use .git, so save the bandwidth/space/time. It may also
# contain information in history that shouldn't be uploaded.
".git/",
Expand All @@ -65,6 +68,13 @@ def upload_workdir(workdir: Path, bucket: S3Bucket, run_id: str) -> S3Object:
"__pycache__/",
])

if patterns:
deselected = glob_matcher(patterns, root = workdir)
else:
deselected = lambda path: False

excluded = lambda path: always_excluded(path) or deselected(path)

# Stream writes directly to the remote ZIP file
remote_file: Any
with fsspec.open(object_url(remote_workdir), "wb", auto_mkdir = False) as remote_file:
Expand All @@ -86,6 +96,15 @@ def download_workdir(remote_workdir: S3Object, workdir: Path, patterns: List[str
to selectively download only part of the remote workdir.
"""

# XXX TODO: Consider extending excluded patterns with the globs from any
# --exclude-from-upload options given, so that files excluded from upload
# are by default also excluded from download. That behaviour would seems
# sometimes useful but other times confusing. See also my discussion with
# Trevor and James about this.¹
# -trs, 23 May 2024
#
# ¹ <https://bedfordlab.slack.com/archives/C0K3GS3J8/p1715750350029879?thread_ts=1715695506.612949&cid=C0K3GS3J8>

excluded = path_matcher([
# Jobs don't use .git and it may also contain information that
# shouldn't be uploaded.
Expand Down
20 changes: 15 additions & 5 deletions nextstrain/cli/util.py
Original file line number Diff line number Diff line change
Expand Up @@ -13,7 +13,7 @@
from shlex import quote as shquote
from shutil import which
from textwrap import dedent, indent
from wcmatch.glob import globmatch, GLOBSTAR, EXTGLOB, BRACE, MATCHBASE, NEGATE
from wcmatch.glob import globmatch, GLOBSTAR, EXTGLOB, BRACE, MATCHBASE, NEGATE, NEGATEALL, REALPATH
from .__version__ import __version__
from .debug import debug
from .types import RunnerModule, RunnerTestResults, RunnerTestResultStatus
Expand Down Expand Up @@ -537,20 +537,23 @@ def split_image_name(name: str, implicit_latest: bool = True) -> Tuple[str, Opti
return (repository, tag)


def glob_matcher(patterns: Sequence[str]) -> Callable[[Union[str, Path]], bool]:
def glob_matcher(patterns: Sequence[str], *, root: Path = None) -> Callable[[Union[str, Path]], bool]:
"""
Generate a function which matches a string or path-like object against the
list of Bash-like glob *patterns*.

If a *root* path is provided, paths will be matched against *patterns* as
real filesystem paths relative to the given *root* directory.

See :func:`glob_match` for supported pattern features.
"""
def matcher(path: Union[str, Path]) -> bool:
return glob_match(path, patterns)
return glob_match(path, patterns, root = root)

return matcher


def glob_match(path: Union[str, Path], patterns: Union[str, Sequence[str]]) -> bool:
def glob_match(path: Union[str, Path], patterns: Union[str, Sequence[str]], *, root: Path = None) -> bool:
"""
Test if *path* matches any of the glob *patterns*.

Expand All @@ -559,10 +562,17 @@ def glob_match(path: Union[str, Path], patterns: Union[str, Sequence[str]]) -> b
globbing features are also supported: multi-part wildcards (``**``),
extended globbing (``@(…)``, ``+(…)``, etc.), basename matching for
patterns containing only a single path part, and negation (``!…``).
Passing only negated patterns is explicitly supported and will match
anything that doesn't match the given patterns.

If a *root* path is provided, *path* will be matched against *patterns* as
real filesystem paths relative to the given *root* directory.

Implemented with with :func:`wcmatch.glob.globmatch`.
"""
return globmatch(path, patterns, flags = GLOBSTAR | BRACE | EXTGLOB | MATCHBASE | NEGATE)
if root:
path = Path(path).resolve(strict = True).relative_to(root)
return globmatch(path, patterns, flags = GLOBSTAR | BRACE | EXTGLOB | MATCHBASE | NEGATE | NEGATEALL | (REALPATH if root else 0), root_dir = root)


def runner_tests_ok(tests: RunnerTestResults) -> bool:
Expand Down
29 changes: 29 additions & 0 deletions tests/command-build.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,29 @@
from nextstrain.cli import make_parser


def pytest_build_download_options():
parser = make_parser()

opts = parser.parse_args(["build", "."])
assert opts.download is True

opts = parser.parse_args(["build", "--no-download", "."])
assert opts.download is False

opts = parser.parse_args(["build", "--download", "x", "."])
assert opts.download == ["x"]

opts = parser.parse_args(["build", "--download", "x", "--download", "y", "."])
assert opts.download == ["x", "y"]

opts = parser.parse_args(["build", "--exclude-from-download", "z", "."])
assert opts.download == ["!z"]

opts = parser.parse_args(["build", "--exclude-from-download", "z", "--exclude-from-download", "a", "."])
assert opts.download == ["!z", "!a"]

opts = parser.parse_args(["build", "--download", "y", "--exclude-from-download", "z", "."])
assert opts.download == ["y", "!z"]

opts = parser.parse_args(["build", "--download", "y", "--download", "!z", "."])
assert opts.download == ["y", "!z"]
Loading