Use nextstrain/ingest #164

Merged
merged 5 commits into from
Jul 31, 2023
2 changes: 1 addition & 1 deletion .github/workflows/fetch-and-ingest.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -73,4 +73,4 @@ jobs:

- name: notify_pipeline_failed
if: ${{ failure() }}
run: ./ingest/bin/notify-on-job-fail
run: ./ingest/vendored/notify-on-job-fail Ingest nextstrain/monkeypox
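The new invocation passes two positional arguments: a job name (`Ingest`) and an `owner/repo` string (`nextstrain/monkeypox`). As a rough sketch of how the vendored script appears to use them, the function below builds a simplified version of the failure message without sending anything to Slack. `build_failure_message` is a hypothetical name used only for illustration, and the AWS Batch console link from the real script is left out.

```sh
# Hypothetical helper: build (but do not send) a simplified failure message.
# Reads AWS_BATCH_JOB_ID and GITHUB_RUN_ID from the environment, if set.
build_failure_message() (
    job_name="${1:?A job name is required as the first argument}"
    github_repo="${2:?A GitHub repository is required as the second argument}"

    message="❌ ${job_name} job has FAILED 😞 "

    # Prefer the AWS Batch job when both identifiers are present.
    if [ -n "${AWS_BATCH_JOB_ID:-}" ]; then
        message="${message}See AWS Batch job \`${AWS_BATCH_JOB_ID}\` for error details. "
    elif [ -n "${GITHUB_RUN_ID:-}" ]; then
        message="${message}See GitHub Action <https://github.com/${github_repo}/actions/runs/${GITHUB_RUN_ID}|${GITHUB_RUN_ID}> for error details. "
    fi

    printf '%s\n' "$message"
)
```

Calling `build_failure_message Ingest nextstrain/monkeypox` mirrors the argument order used in the workflow step above.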
2 changes: 1 addition & 1 deletion .github/workflows/rebuild-all.yaml
Expand Up @@ -9,6 +9,6 @@ jobs:
steps:
- uses: actions/checkout@v3
- name: Repository Dispatch
run: ./ingest/bin/trigger monkeypox rebuild
run: ./ingest/vendored/trigger monkeypox rebuild
env:
PAT_GITHUB_DISPATCH: ${{ secrets.GH_TOKEN_NEXTSTRAIN_BOT_WORKFLOW_DISPATCH }}
4 changes: 2 additions & 2 deletions bin/notify-on-deploy
Expand Up @@ -5,11 +5,11 @@ set -euo pipefail
: "${SLACK_CHANNELS:?The SLACK_CHANNELS environment variable is required.}"

base="$(realpath "$(dirname "$0")/..")"
ingest_bin="$base/ingest/bin"
ingest_vendored="$base/ingest/vendored"

deployment_url="${1:?A deployment url is required as the first argument.}"
slack_ts_file="${2:?A Slack thread timestamp file is required as the second argument.}"

echo "Notifying Slack about deployed builds."
"$ingest_bin"/notify-slack "Deployed this build to $deployment_url" \
"$ingest_vendored"/notify-slack "Deployed this build to $deployment_url" \
--thread-ts="$(cat "$slack_ts_file")"
4 changes: 2 additions & 2 deletions bin/notify-on-error
Expand Up @@ -8,7 +8,7 @@ set -euo pipefail
: "${GITHUB_RUN_ID:=}"

base="$(realpath "$(dirname "$0")/..")"
ingest_bin="$base/ingest/bin"
ingest_vendored="$base/ingest/vendored"

slack_ts_file="${1:-}"

Expand All @@ -26,6 +26,6 @@ elif [[ -n "${GITHUB_RUN_ID}" ]]; then
message+="See GitHub Action <https://github.com/nextstrain/monkeypox/actions/runs/${GITHUB_RUN_ID}?check_suite_focus=true|${GITHUB_RUN_ID}> for error details."
fi

"$ingest_bin"/notify-slack "$message" \
"$ingest_vendored"/notify-slack "$message" \
--thread-ts="$thread_ts" \
--broadcast
4 changes: 2 additions & 2 deletions bin/notify-on-start
Expand Up @@ -8,7 +8,7 @@ set -euo pipefail
: "${GITHUB_RUN_ID:=}"

base="$(realpath "$(dirname "$0")/..")"
ingest_bin="$base/ingest/bin"
ingest_vendored="$base/ingest/vendored"

build_name="${1:?A build name is required as the first argument.}"
slack_ts_output="${2:?A Slack thread timestamp file is required as the second argument}"
Expand All @@ -29,7 +29,7 @@ if [[ -n "${AWS_BATCH_JOB_ID}" ]]; then
message+=" Follow along in your local \`monkeypox\` repo with: "'```'"nextstrain build --aws-batch --no-download --attach ${AWS_BATCH_JOB_ID} . "'```'
fi

"$ingest_bin"/notify-slack "$message" --output="$slack_response"
"$ingest_vendored"/notify-slack "$message" --output="$slack_response"

echo "Saving Slack thread timestamp to '$slack_ts_output'."

4 changes: 2 additions & 2 deletions bin/notify-on-success
Expand Up @@ -5,10 +5,10 @@ set -euo pipefail
: "${SLACK_CHANNELS:?The SLACK_CHANNELS environment variable is required.}"

base="$(realpath "$(dirname "$0")/..")"
ingest_bin="$base/ingest/bin"
ingest_vendored="$base/ingest/vendored"

slack_ts_file="${1:?A Slack thread timestamp file is required as the first argument.}"

echo "Notifying Slack about successful build."
"$ingest_bin"/notify-slack "✅ This pipeline has successfully finished 🎉" \
"$ingest_vendored"/notify-slack "✅ This pipeline has successfully finished 🎉" \
--thread-ts="$(cat "$slack_ts_file")"
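The four `bin/notify-on-*` scripts all locate the vendored helpers the same way: take the script's own directory, go up one level to the repository root, then append `ingest/vendored`. A minimal sketch of that resolution follows. `vendored_dir_for` is a hypothetical helper, and it uses `cd` with `pwd -P` in place of `realpath` so that it also runs where `realpath` is not installed.

```sh
# Hypothetical helper: given the path of a script in <repo>/bin, print the
# absolute path of <repo>/ingest/vendored.
vendored_dir_for() (
    script="$1"
    # Equivalent in spirit to: base="$(realpath "$(dirname "$0")/..")"
    base="$(cd "$(dirname "$script")/.." && pwd -P)"
    printf '%s\n' "$base/ingest/vendored"
)
```

Inside a real script, the argument would be `"$0"`.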
13 changes: 13 additions & 0 deletions ingest/README.md
Expand Up @@ -85,3 +85,16 @@ These are optional environment variables used in our automated pipeline for prov

GenBank sequences and metadata are fetched via NCBI Virus.
The exact URL used to fetch data is constructed in `bin/genbank-url`.

## `ingest/vendored`

This repository uses [`git subrepo`](https://github.com/ingydotnet/git-subrepo) to manage copies of ingest scripts in `ingest/vendored`, from [nextstrain/ingest](https://github.com/nextstrain/ingest). To pull new changes from the central ingest repository, first install `git subrepo`, then run:

```sh
git subrepo pull ingest/vendored
```

Changes should not be pushed using `git subrepo push`.

1. For pathogen-specific changes, make them in this repository via a pull request.
2. For pathogen-agnostic changes, make them on [nextstrain/ingest](https://github.com/nextstrain/ingest) via pull request there, then use `git subrepo pull` to add those changes to this repository.
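
To confirm that a pull actually moved the vendored copy forward, one option is to compare the `commit` value recorded in `ingest/vendored/.gitrepo` before and after. The sketch below assumes the `key = value` layout shown in that file; `subrepo_field` is a hypothetical helper, not part of `git subrepo`.

```sh
# Hypothetical helper: print the value of a key (e.g. "commit" or "branch")
# from a .gitrepo file. Comment lines starting with ";" never match.
subrepo_field() (
    gitrepo="$1"
    key="$2"
    # Strip "<whitespace>key = " and print only the lines where that matched.
    sed -n "s/^[[:space:]]*${key}[[:space:]]*=[[:space:]]*//p" "$gitrepo"
)

# Example usage, run from the repository root:
#   before="$(subrepo_field ingest/vendored/.gitrepo commit)"
#   git subrepo pull ingest/vendored
#   after="$(subrepo_field ingest/vendored/.gitrepo commit)"
#   [ "$before" = "$after" ] && echo "No new upstream commits."
```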
4 changes: 2 additions & 2 deletions ingest/bin/download-from-s3
Expand Up @@ -2,7 +2,7 @@
# Originally copied from nextstrain/ncov-ingest repo
set -euo pipefail

bin="$(dirname "$0")"
vendored="$(dirname "$0")"/../vendored

main() {
local src="${1:?A source s3:// URL is required as the first argument.}"
Expand All @@ -13,7 +13,7 @@ main() {
local key="${s3path#*/}"

local src_hash dst_hash no_hash=0000000000000000000000000000000000000000000000000000000000000000
dst_hash="$("$bin/sha256sum" < "$dst" || true)"
dst_hash="$("$vendored/sha256sum" < "$dst" || true)"
src_hash="$(aws s3api head-object --bucket "$bucket" --key "$key" --query Metadata.sha256sum --output text 2>/dev/null || echo "$no_hash")"

echo "[ INFO] Downloading $src → $dst"
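The context lines in this hunk rely on shell parameter expansion to pull the object key out of an `s3://` URL. A small sketch of the same expansions is shown here; the function name `split_s3_url` and the bucket name in the usage note are made up for illustration.

```sh
# Hypothetical helper: print the bucket on the first line and the key on the
# second line for a URL of the form s3://<bucket>/<key>.
split_s3_url() (
    s3path="${1#s3://}"     # drop the leading "s3://"
    bucket="${s3path%%/*}"  # everything before the first "/"
    key="${s3path#*/}"      # everything after the first "/"
    printf '%s\n%s\n' "$bucket" "$key"
)
```

For example, `split_s3_url s3://example-bucket/files/metadata.tsv.gz` prints `example-bucket` and then `files/metadata.tsv.gz`.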
7 changes: 4 additions & 3 deletions ingest/bin/notify-on-diff
Expand Up @@ -6,6 +6,7 @@ set -euo pipefail
: "${SLACK_CHANNELS:?The SLACK_CHANNELS environment variable is required.}"

bin="$(dirname "$0")"
vendored="$(dirname "$0")"/../vendored

src="${1:?A source file is required as the first argument.}"
dst="${2:?A destination s3:// URL is required as the second argument.}"
Expand All @@ -16,7 +17,7 @@ diff="$(mktemp -t diff-XXXXXX)"
trap "rm -f '$dst_local' '$diff'" EXIT

# if the file is not already present, just exit
"$bin"/s3-object-exists "$dst" || exit 0
"$vendored"/s3-object-exists "$dst" || exit 0

"$bin"/download-from-s3 "$dst" "$dst_local"

Expand All @@ -26,10 +27,10 @@ diff "$dst_local" "$src" > "$diff" || diff_exit_code=$?

if [[ "$diff_exit_code" -eq 1 ]]; then
echo "Notifying Slack about diff."
"$bin"/notify-slack --upload "$src.diff" < "$diff"
"$vendored"/notify-slack --upload "$src.diff" < "$diff"
elif [[ "$diff_exit_code" -gt 1 ]]; then
echo "Notifying Slack about diff failure"
"$bin"/notify-slack "Diff failed for $src"
"$vendored"/notify-slack "Diff failed for $src"
else
echo "No change in $src."
fi
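Because the script runs under `set -e`, the `diff ... || diff_exit_code=$?` form is what keeps a non-zero `diff` status from aborting it. The sketch below isolates that pattern and the three-way branch on the exit status (0 for identical, 1 for different, greater than 1 for trouble). `classify_diff` is a hypothetical name, and it prints a label instead of calling `notify-slack`.

```sh
set -eu

# Hypothetical helper: print "unchanged", "changed", or "error" for two files.
classify_diff() (
    code=0
    # "|| code=$?" records the status without tripping set -e.
    diff "$1" "$2" > /dev/null 2>&1 || code=$?

    if [ "$code" -eq 0 ]; then
        echo unchanged
    elif [ "$code" -eq 1 ]; then
        echo changed
    else
        echo error
    fi
)
```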
21 changes: 0 additions & 21 deletions ingest/bin/notify-on-job-fail

This file was deleted.

5 changes: 3 additions & 2 deletions ingest/bin/notify-on-record-change
Expand Up @@ -6,13 +6,14 @@ set -euo pipefail
: "${SLACK_CHANNELS:?The SLACK_CHANNELS environment variable is required.}"

bin="$(dirname "$0")"
vendored="$(dirname "$0")"/../vendored

src="${1:?A source ndjson file is required as the first argument.}"
dst="${2:?A destination ndjson s3:// URL is required as the second argument.}"
source_name=${3:?A record source name is required as the third argument.}

# if the file is not already present, just exit
"$bin"/s3-object-exists "$dst" || exit 0
"$vendored"/s3-object-exists "$dst" || exit 0

s3path="${dst#s3://}"
bucket="${s3path%%/*}"
Expand Up @@ -51,4 +52,4 @@ fi

slack_message+=" (Total record count: $src_record_count)"

"$bin"/notify-slack "$slack_message"
"$vendored"/notify-slack "$slack_message"
6 changes: 3 additions & 3 deletions ingest/bin/trigger-on-new-data
Expand Up @@ -3,7 +3,7 @@ set -euo pipefail

: "${PAT_GITHUB_DISPATCH:?The PAT_GITHUB_DISPATCH environment variable is required.}"

bin="$(dirname "$0")"
vendored="$(dirname "$0")"/../vendored

metadata="${1:?A metadata upload output file is required as the first argument.}"
sequences="${2:?A sequence FASTA upload output file is required as the second argument.}"
Expand All @@ -17,14 +17,14 @@ slack_message=""
# grep exit status 0 for found match, 1 for no match, 2 if an error occurred
if [[ $new_metadata -eq 1 || $new_sequences -eq 1 ]]; then
slack_message="Triggering new builds due to updated metadata and/or sequences"
"$bin"/trigger "monkeypox" "rebuild"
"$vendored"/trigger "monkeypox" "rebuild"
elif [[ $new_metadata -eq 0 && $new_sequences -eq 0 ]]; then
slack_message="Skipping trigger of rebuild: Both metadata TSV and sequences FASTA are identical to S3 files."
else
slack_message="Skipping trigger of rebuild: Unable to determine if data has been updated."
fi


if ! "$bin"/notify-slack "$slack_message"; then
if ! "$vendored"/notify-slack "$slack_message"; then
echo "Notifying Slack failed, but exiting with success anyway."
fi
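The branch above keys off two `grep` exit statuses that are computed in lines not shown in this hunk. As a sketch of the decision table only, the function below takes the two statuses as arguments and prints the outcome; `decide_rebuild` is a hypothetical name, and no trigger or Slack call is made.

```sh
# Hypothetical helper: map two grep exit statuses to a rebuild decision.
#   1 (no match) on either input -> trigger
#   0 (match) on both inputs     -> skip
#   anything else                -> unknown
decide_rebuild() (
    new_metadata="$1"
    new_sequences="$2"

    if [ "$new_metadata" -eq 1 ] || [ "$new_sequences" -eq 1 ]; then
        echo trigger
    elif [ "$new_metadata" -eq 0 ] && [ "$new_sequences" -eq 0 ]; then
        echo skip
    else
        echo unknown
    fi
)
```

Note that, as in the script, a status of 1 on one input still triggers a rebuild even if the other input reported an error.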
8 changes: 4 additions & 4 deletions ingest/bin/upload-to-s3
Expand Up @@ -2,7 +2,7 @@
# Originally copied from nextstrain/ncov-ingest repo
set -euo pipefail

bin="$(dirname "$0")"
vendored="$(dirname "$0")"/../vendored

main() {
local quiet=0
Expand All @@ -26,7 +26,7 @@ main() {
local key="${s3path#*/}"

local src_hash dst_hash no_hash=0000000000000000000000000000000000000000000000000000000000000000
src_hash="$("$bin/sha256sum" < "$src")"
src_hash="$("$vendored/sha256sum" < "$src")"
dst_hash="$(aws s3api head-object --bucket "$bucket" --key "$key" --query Metadata.sha256sum --output text 2>/dev/null || echo "$no_hash")"

if [[ $src_hash != "$dst_hash" ]]; then
Expand All @@ -46,7 +46,7 @@ main() {

if [[ -n $cloudfront_domain ]]; then
echo "Creating CloudFront invalidation for $cloudfront_domain/$key"
if ! "$bin"/cloudfront-invalidate "$cloudfront_domain" "/$key"; then
if ! "$vendored"/cloudfront-invalidate "$cloudfront_domain" "/$key"; then
echo "CloudFront invalidation failed, but exiting with success anyway."
fi
fi
Expand All @@ -56,7 +56,7 @@ main() {
exit 0
fi

if ! "$bin"/notify-slack "Updated $dst available."; then
if ! "$vendored"/notify-slack "Updated $dst available."; then
echo "Notifying Slack failed, but exiting with success anyway."
fi
else
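The upload is skipped when the local file's SHA-256 matches the hash stored in the S3 object's metadata, with an all-zero placeholder standing in when no remote hash can be read. The sketch below shows that comparison in isolation. It assumes the coreutils `sha256sum` command as a stand-in for the vendored `sha256sum` script (whose contents are not part of this diff), and `upload_decision` is a hypothetical name.

```sh
# Placeholder used when the remote object has no recorded hash.
no_hash=0000000000000000000000000000000000000000000000000000000000000000

# Hypothetical helper: print "upload" when the local hash differs from the
# remote hash (second argument, defaulting to the placeholder), else "skip".
upload_decision() (
    src="$1"
    remote_hash="${2:-$no_hash}"
    # coreutils prints "<hash>  -" for stdin; keep only the hash.
    src_hash="$(sha256sum < "$src" | cut -d ' ' -f 1)"

    if [ "$src_hash" != "$remote_hash" ]; then
        echo upload
    else
        echo skip
    fi
)
```

In the real script the remote hash comes from `aws s3api head-object`, which this sketch leaves out.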
13 changes: 13 additions & 0 deletions ingest/vendored/.github/workflows/ci.yaml
@@ -0,0 +1,13 @@
name: CI

on:
- push
- pull_request
- workflow_dispatch

jobs:
shellcheck:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- uses: nextstrain/.github/actions/shellcheck@master
12 changes: 12 additions & 0 deletions ingest/vendored/.gitrepo
@@ -0,0 +1,12 @@
; DO NOT EDIT (unless you know what you are doing)
;
; This subdirectory is a git "subrepo", and this file is maintained by the
; git-subrepo command. See https://github.com/ingydotnet/git-subrepo#readme
;
[subrepo]
remote = https://github.com/nextstrain/ingest
branch = main
commit = 9082700fdbb99007d5e852e657cedf3723d13181
parent = 5c461dc7e90cd70c1f16b193f82fd1666d4c95e2
method = merge
cmdver = 0.4.6
60 changes: 60 additions & 0 deletions ingest/vendored/README.md
@@ -0,0 +1,60 @@
# ingest

Shared internal tooling for pathogen data ingest. Used by our individual
pathogen repos which produce Nextstrain builds. Expected to be vendored by
each pathogen repo using `git subtree` (or `git subrepo`).

Some tools may only live here temporarily before finding a permanent home in
`augur curate` or Nextstrain CLI. Others may happily live out their days here.

## History

Much of this tooling originated in
[ncov-ingest](https://github.com/nextstrain/ncov-ingest) and was passaged thru
[monkeypox's ingest/](https://github.com/nextstrain/monkeypox/tree/@/ingest/).
It subsequently proliferated from [monkeypox][] to other pathogen repos
([rsv][], [zika][], [dengue][], [hepatitisB][], [forecasts-ncov][]) primarily
thru copying. To [counter that
proliferation](https://bedfordlab.slack.com/archives/C7SDVPBLZ/p1688577879947079),
this repo was made.

[monkeypox]: https://github.com/nextstrain/monkeypox
[rsv]: https://github.com/nextstrain/rsv
[zika]: https://github.com/nextstrain/zika/pull/24
[dengue]: https://github.com/nextstrain/dengue/pull/10
[hepatitisB]: https://github.com/nextstrain/hepatitisB
[forecasts-ncov]: https://github.com/nextstrain/forecasts-ncov

## Elsewhere

The creation of this repo, in both the abstract and concrete, and the general
approach to "ingest" has been discussed in various internal places, including:

- https://github.com/nextstrain/private/issues/59
- @joverlee521's [workflows document](https://docs.google.com/document/d/1rLWPvEuj0Ayc8MR0O1lfRJZfj9av53xU38f20g8nU_E/edit#heading=h.4g0d3mjvb89i)
- [5 July 2023 Slack thread](https://bedfordlab.slack.com/archives/C7SDVPBLZ/p1688577879947079)
- [6 July 2023 team meeting](https://docs.google.com/document/d/1FPfx-ON5RdqL2wyvODhkrCcjgOVX3nlXgBwCPhIEsco/edit)
- _…many others_

## Scripts

Scripts for supporting ingest workflow automation that don’t really belong in any of our existing tools.

- [notify-on-job-fail](notify-on-job-fail) - Send Slack message with details about failed workflow job on GitHub Actions and/or AWS Batch
- [notify-on-job-start](notify-on-job-start) - Send Slack message with details about workflow job on GitHub Actions and/or AWS Batch
- [notify-slack](notify-slack) - Send message or file to Slack
- [s3-object-exists](s3-object-exists) - Used to prevent 404 errors during S3 file comparisons in the notify-* scripts
- [trigger](trigger) - Triggers downstream GitHub Actions via the GitHub API using repository_dispatch events.

Potential Nextstrain CLI scripts

- [sha256sum](sha256sum) - Used to check if files are identical in upload-to-s3 and download-from-s3 scripts.
- [cloudfront-invalidate](cloudfront-invalidate) - CloudFront invalidation is already supported in the [nextstrain remote command for S3 files](https://github.com/nextstrain/cli/blob/a5dda9c0579ece7acbd8e2c32a4bbe95df7c0bce/nextstrain/cli/remote/s3.py#L104).
This exists as a separate script to support CloudFront invalidation when using the upload-to-s3 script.

Potential augur curate scripts

- [merge-user-metadata](merge-user-metadata) - Merges user annotations with NDJSON records
- [transform-authors](transform-authors) - Abbreviates full author lists to '<first author> et al.'
- [transform-field-names](transform-field-names) - Rename fields of NDJSON records
- [transform-genbank-location](transform-genbank-location) - Parses `location` field with the expected pattern `"<country_value>[:<region>][, <locality>]"` based on [GenBank's country field](https://www.ncbi.nlm.nih.gov/genbank/collab/country/)
File renamed without changes.
File renamed without changes.
23 changes: 23 additions & 0 deletions ingest/vendored/notify-on-job-fail
@@ -0,0 +1,23 @@
#!/bin/bash
set -euo pipefail

: "${SLACK_TOKEN:?The SLACK_TOKEN environment variable is required.}"
: "${SLACK_CHANNELS:?The SLACK_CHANNELS environment variable is required.}"

: "${AWS_BATCH_JOB_ID:=}"
: "${GITHUB_RUN_ID:=}"

bin="$(dirname "$0")"
job_name="${1:?A job name is required as the first argument}"
github_repo="${2:?A GitHub repository with owner and repository name is required as the second argument}"

echo "Notifying Slack about failed ${job_name} job."
message="❌ ${job_name} job has FAILED 😞 "

if [[ -n "${AWS_BATCH_JOB_ID}" ]]; then
message+="See AWS Batch job \`${AWS_BATCH_JOB_ID}\` (<https://console.aws.amazon.com/batch/v2/home?region=us-east-1#jobs/detail/${AWS_BATCH_JOB_ID}|link>) for error details. "
elif [[ -n "${GITHUB_RUN_ID}" ]]; then
message+="See GitHub Action <https://github.com/${github_repo}/actions/runs/${GITHUB_RUN_ID}?check_suite_focus=true|${GITHUB_RUN_ID}> for error details. "
fi

"$bin"/notify-slack "$message"
Original file line number Diff line number Diff line change
Expand Up @@ -8,17 +8,20 @@ set -euo pipefail
: "${GITHUB_RUN_ID:=}"

bin="$(dirname "$0")"
job_name="${1:?A job name is required as the first argument}"
github_repo="${2:?A GitHub repository with owner and repository name is required as the second argument}"
build_dir="${3:-ingest}"

echo "Notifying Slack about started ingest job."
message="🐵 Monkeypox ingest job has started."
echo "Notifying Slack about started ${job_name} job."
message="${job_name} job has started."

if [[ -n "${GITHUB_RUN_ID}" ]]; then
message+=" The job was submitted by GitHub Action <https://github.com/nextstrain/monkeypox/actions/runs/${GITHUB_RUN_ID}?check_suite_focus=true|${GITHUB_RUN_ID}>."
message+=" The job was submitted by GitHub Action <https://github.com/${github_repo}/actions/runs/${GITHUB_RUN_ID}?check_suite_focus=true|${GITHUB_RUN_ID}>."
fi

if [[ -n "${AWS_BATCH_JOB_ID}" ]]; then
message+=" The job was launched as AWS Batch job \`${AWS_BATCH_JOB_ID}\` (<https://console.aws.amazon.com/batch/v2/home?region=us-east-1#jobs/detail/${AWS_BATCH_JOB_ID}|link>)."
message+=" Follow along in your local \`monkeypox\` repo with: "'```'"nextstrain build --aws-batch --no-download --attach ${AWS_BATCH_JOB_ID} ingest/"'```'
message+=" Follow along in your local clone of ${github_repo} with: "'```'"nextstrain build --aws-batch --no-download --attach ${AWS_BATCH_JOB_ID} ${build_dir}"'```'
fi

"$bin"/notify-slack "$message"
2 changes: 0 additions & 2 deletions ingest/bin/notify-slack → ingest/vendored/notify-slack
@@ -1,5 +1,4 @@
#!/bin/bash
# Originally copied from nextstrain/ncov-ingest repo
set -euo pipefail

: "${SLACK_TOKEN:?The SLACK_TOKEN environment variable is required.}"
Expand Down Expand Up @@ -38,7 +37,6 @@ if [[ "$upload" == 1 ]]; then
--form-string title="$text" \
--form-string filename="$text" \
--form-string thread_ts="$thread_ts" \
--form-string reply_broadcast="$broadcast" \
--form file=@/dev/stdin \
--form filetype=text \
--fail --silent --show-error \