fix(controller): Improve resilience to transient errors. Fixes #3791 and #3217 #3800
Conversation
// we must check to see if the pod exists rather than just optimistically creating the pod and seeing if we get
// an `AlreadyExists` error, because we won't get that error if there are not enough resources
obj, exists, err := woc.controller.podInformer.GetStore().Get(cache.ExplicitKey(pod.Namespace + "/" + pod.Name))
This solution depends on the informer being correct when queried here.
If the informer is incorrect, i.e. it believes that a pod has not been created when it in fact has, then the workflow will error.
An alternative solution is to call podInterface.Get(pod.Name) and check what error it returns (i.e. is it apierr.IsNotFound(..)?).
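The "query the API directly" alternative can be sketched as follows. This is a minimal, self-contained illustration of the check-then-create pattern, not the controller's real code: `podStore` is a hypothetical map-backed stand-in for the Kubernetes pod client, and `errNotFound` stands in for the error that `apierr.IsNotFound` would match.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound stands in for the "not found" error the real Kubernetes
// client returns (matched by apierr.IsNotFound in client code).
var errNotFound = errors.New("not found")

// podStore is a hypothetical map-backed stand-in for the pod API.
type podStore map[string]struct{}

func (s podStore) Get(name string) error {
	if _, ok := s[name]; !ok {
		return errNotFound
	}
	return nil
}

func (s podStore) Create(name string) { s[name] = struct{}{} }

// ensurePod asks the store (the "API") directly instead of trusting a
// possibly stale informer cache: create only when Get reports "not found".
func ensurePod(s podStore, name string) (created bool) {
	if err := s.Get(name); errors.Is(err, errNotFound) {
		s.Create(name)
		return true
	}
	return false
}

func main() {
	s := podStore{}
	fmt.Println(ensurePod(s, "my-pod")) // first call creates the pod
	fmt.Println(ensurePod(s, "my-pod")) // second call sees it already exists
}
```

The trade-off the reviewer raises still applies: going to the API avoids informer staleness, but each reconciliation costs an extra API request.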
@@ -56,6 +56,7 @@ AUTH_MODE := client
 endif
 K3D := $(shell if [ "`which kubectl`" != '' ] && [ "`kubectl config current-context`" = "k3s-default" ]; then echo true; else echo false; fi)
 LOG_LEVEL := debug
+UPPERIO_DB_DEBUG := 0
Disabled the SQL logging by default (I found it confusing).
@@ -2,6 +2,6 @@ apiVersion: kustomize.config.k8s.io/v1beta1
 kind: Kustomization

 resources:
-  - minio-pod.yaml
+  - minio-deployment.yaml
Changing to a Deployment allows us to disable MinIO by scaling it down.
@@ -0,0 +1,9 @@
+apiVersion: v1
+kind: PersistentVolumeClaim
We need a PVC so the data survives being scaled down and up.
@@ -5,7 +5,7 @@ metadata:
 data:
   containerRuntimeExecutor: pns
   executor: |
-    imagePullPolicy: Never
+    imagePullPolicy: IfNotPresent
Changes the default image pull policy for tests.
// * Fails the `reapplyUpdate` pre-condition - it can never recover.
// * It will double the number of Kubernetes API requests.
if woc.orig.ResourceVersion != woc.wf.ResourceVersion {
	panic("cannot re-apply update with unequal original and modified resource versions")
Instead of panicking, can we re-queue the workflow and return? Otherwise, log monitors (like Splunk) will keep reporting the panic.
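The "re-queue and return" pattern the reviewer suggests can be sketched like this. It is a simplified, self-contained illustration, not the controller's actual code: the `workflow` struct and the `requeue` callback are hypothetical stand-ins for the real Workflow object and workqueue.

```go
package main

import "fmt"

// workflow is a hypothetical stand-in for the controller's Workflow object,
// reduced to the two resource versions the pre-condition compares.
type workflow struct {
	name            string
	origVersion     string
	modifiedVersion string
}

// reapplyUpdate returns an error and re-queues instead of panicking when the
// pre-condition fails, so log monitors don't keep reporting a crash.
func reapplyUpdate(wf workflow, requeue func(string)) error {
	if wf.origVersion != wf.modifiedVersion {
		requeue(wf.name) // put the workflow back on the queue for a retry
		return fmt.Errorf("resource version conflict for %q, re-queued", wf.name)
	}
	return nil // versions match; safe to re-apply the update
}

func main() {
	var queue []string
	wf := workflow{name: "wf-1", origVersion: "100", modifiedVersion: "101"}
	if err := reapplyUpdate(wf, func(n string) { queue = append(queue, n) }); err != nil {
		fmt.Println(err)
	}
	fmt.Println("queue:", queue)
}
```

The design point is that a failed pre-condition becomes a recoverable event (an error plus a retry) rather than a process-level panic.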
Broken up into smaller PRs that will be easier to review and to revert if/when needed.
Checklist:
"fix(controller): Updates such and such. Fixes #1234"
Fixes #3791
Fixes #3217
Fixes #3645
See #1913
See #3812
See #3745
Depends on #3846
Changes
- Tolerate `exceeded quota` errors. Fixes #3791 #3851
- Use `IsTransientErr` to tolerate transient errors.