This repository has been archived by the owner on Jul 21, 2023. It is now read-only.

Retry job creation in workflow manager as appropriate #151

Closed
tgeoghegan opened this issue Nov 4, 2020 · 0 comments · Fixed by #258


tgeoghegan commented Nov 4, 2020

While doing some testing, I had a workflow-manager job fail with this log output:

2020/11/04 17:50:10 starting /workflow-manager version timg/robust-fake-ingestor+c156cd89 - Wed 04 Nov 2020 05:29:30 PM UTC. Args: [--is-first=false --k8s-namespace narnia --k8s-service-account narnia-ingestor-2-workflow-manager --ingestor-input s3://us-west-1/prio-hatfield-narnia-ingestor-2-ingestion --ingestor-identity arn:aws:iam::338276578713:role/prio-hatfield-narnia-ingestor-2-bucket-role --own-validation-input gs://prio-hatfield-narnia-ingestor-2-own-validation --peer-validation-input s3://us-west-1/prio-hatfield-narnia-ingestor-2-peer-validation --peer-validation-identity arn:aws:iam::338276578713:role/prio-hatfield-narnia-ingestor-2-bucket-role --bsk-secret-name hatfield-narnia-ingestor-2-batch-signing-key --pdks-secret-name hatfield-narnia-ingestion-packet-decryption-key --intake-batch-config-map narnia-ingestor-2-intake-batch-config --aggregate-config-map narnia-ingestor-2-aggregate-config --facilitator-image us.gcr.io/prio-atredis-oct-2020/prio-facilitator:latest]
2020/11/04 17:50:10 looking for ready batches in s3://prio-hatfield-narnia-ingestor-2-ingestion as arn:aws:iam::338276578713:role/prio-hatfield-narnia-ingestor-2-bucket-role
2020/11/04 17:50:14 fetched token from http://metadata.google.internal:80/computeMetadata/v1/instance/service-accounts/default/identity?audience=sts.amazonaws.com/338276578713
2020/11/04 17:50:14 ready: kittens-seen/2020/11/04/17/47/a023beed-a03c-4122-a063-f94f558e7110
2020/11/04 17:50:14 ready: kittens-seen/2020/11/04/17/48/8f0af102-caa2-4fcf-9650-80ab2b3e615b
2020/11/04 17:50:14 ready: kittens-seen/2020/11/04/17/49/a5065a9c-0fb6-4532-95a8-bd0e4c31cb46
2020/11/04 17:50:14 ready: kittens-seen/2020/11/04/17/43/d2e1d3d8-e2b3-4455-beb5-aedec8975285
2020/11/04 17:50:14 ready: kittens-seen/2020/11/04/17/44/ca47585c-8efc-432f-b8eb-c3c62fc31d10
2020/11/04 17:50:14 ready: kittens-seen/2020/11/04/17/45/a48ac90a-f016-4a92-a0f1-e801d0eb145e
2020/11/04 17:50:14 ready: kittens-seen/2020/11/04/17/46/b8e844d7-3919-490f-bb31-597fe9aa7741
2020/11/04 17:50:14 starting 7 jobs
2020/11/04 17:50:14 starting job for batch "kittens-seen/2020/11/04/17/47/a023beed-a03c-4122-a063-f94f558e7110" with args [intake-batch --aggregation-id kittens-seen --batch-id a023beed-a03c-4122-a063-f94f558e7110 --date 2020/11/04/17/47]
2020/11/04 17:50:14 Created job "i-batch-a023beed-a03c-4122-a063-f94f558e7110": "b95fbfb2-4f9d-4702-a49e-e12f48025a15"
2020/11/04 17:50:14 starting job for batch "kittens-seen/2020/11/04/17/48/8f0af102-caa2-4fcf-9650-80ab2b3e615b" with args [intake-batch --aggregation-id kittens-seen --batch-id 8f0af102-caa2-4fcf-9650-80ab2b3e615b --date 2020/11/04/17/48]
2020/11/04 17:50:15 starting job for batch "kittens-seen/2020/11/04/17/48/8f0af102-caa2-4fcf-9650-80ab2b3e615b": creating job: Operation cannot be fulfilled on resourcequotas "gke-resource-quotas": the object has been modified; please apply your changes to the latest version and try again

Searching for that error message led me to some relevant GitHub Issues: 1 2.

This specific error seems to be a Kubernetes bug, but we could check for transient errors of this nature and simply retry, as was done in this pull request on Argo (hi CNCF friends!).
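To make the retry idea concrete, here is a minimal, standard-library-only sketch in Go. It is not the workflow-manager's actual code: `retryTransient`, `isTransient`, and `errConflict` are hypothetical names invented for illustration, and matching on the error string is an assumption made only to keep the example self-contained. Against the real Kubernetes client, `apierrors.IsConflict` from `k8s.io/apimachinery/pkg/api/errors` or `retry.RetryOnConflict` from `k8s.io/client-go/util/retry` would likely be a better fit.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
	"time"
)

// errConflict is a hypothetical stand-in that mimics the message in the log above.
var errConflict = errors.New(`Operation cannot be fulfilled on resourcequotas "gke-resource-quotas": the object has been modified; please apply your changes to the latest version and try again`)

// isTransient reports whether err looks like an optimistic-concurrency conflict.
// Assumption: string matching is used here only to avoid external dependencies.
func isTransient(err error) bool {
	return err != nil && strings.Contains(err.Error(), "the object has been modified")
}

// retryTransient calls create up to attempts times, doubling the delay after
// each transient failure. Non-transient errors are returned immediately.
func retryTransient(attempts int, delay time.Duration, create func() error) error {
	if attempts < 1 {
		attempts = 1
	}
	var err error
	for i := 0; i < attempts; i++ {
		if err = create(); err == nil || !isTransient(err) {
			return err
		}
		if i < attempts-1 {
			time.Sleep(delay)
			delay *= 2
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", attempts, err)
}

func main() {
	calls := 0
	// Simulated job creation: fails with a conflict twice, then succeeds.
	err := retryTransient(5, 10*time.Millisecond, func() error {
		calls++
		if calls < 3 {
			return errConflict
		}
		return nil
	})
	fmt.Println(calls, err) // prints "3 <nil>"
}
```

Capping the number of attempts keeps a persistent failure from stalling the remaining batches, and returning non-transient errors immediately avoids hiding real misconfiguration behind retries.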

@tgeoghegan tgeoghegan added the p1 Must be fixed for the corresponding milestone to be reached. label Nov 4, 2020
@tgeoghegan tgeoghegan added this to the Production readiness milestone Nov 4, 2020
@aaomidi aaomidi self-assigned this Dec 1, 2020