feat(controller): Retry archiving later on error. Fixes #3786 #3862

Merged
merged 42 commits into from
Sep 18, 2020

Conversation

alexec
Contributor

@alexec alexec commented Aug 25, 2020

Checklist:

  • Either (a) I've created an enhancement proposal and discussed it with the community, (b) this is a bug fix, or (c) this is a chore.
  • The title of the PR is (a) conventional, (b) states what changed, and (c) suffixes the related issues number. E.g. "fix(controller): Updates such and such. Fixes #1234".
  • I've signed the CLA.
  • I have written unit and/or e2e tests for my change. PRs without these are unlikely to be merged.
  • My builds are green. Try syncing with master if they are not.
  • My organization is added to USERS.md.

Fixes #3786

@alexec alexec changed the title from "feat(controller): Retry archiving later on error." to "feat(controller): Retry archiving later on error. Fixes #3837" Aug 25, 2020
@alexec alexec marked this pull request as ready for review August 26, 2020 21:03
@alexec alexec marked this pull request as draft August 27, 2020 19:24
@alexec alexec added this to the v2.11 milestone Aug 28, 2020
@alexec alexec marked this pull request as ready for review August 28, 2020 20:57
@alexec
Contributor Author

alexec commented Aug 28, 2020

TestEventOnWorkflowSuccess

@alexec
Contributor Author

alexec commented Aug 28, 2020

TestEventOnNodeFail

@alexec alexec added the P3 label Aug 31, 2020
@alexec alexec modified the milestones: v2.11, v2.12 Aug 31, 2020
@alexec
Contributor Author

alexec commented Sep 4, 2020

FAIL: TestFunctionalSuite/TestEventOnNodeFail

@alexec
Contributor Author

alexec commented Sep 4, 2020

Events are occurring out of order:

controller | time="2020-09-04T01:39:52Z" level=info msg="Marking workflow completed" namespace=argo workflow=failed-step-event-wf7tt
controller | time="2020-09-04T01:39:52Z" level=info msg="Marking workflow as pending archiving" namespace=argo workflow=failed-step-event-wf7tt
controller | time="2020-09-04T01:39:52Z" level=info msg="Checking daemoned children of " namespace=argo workflow=failed-step-event-wf7tt
controller | time="2020-09-04T01:39:52Z" level=debug msg="Event(v1.ObjectReference{Kind:\"Workflow\", Namespace:\"argo\", Name:\"failed-step-event-wf7tt\", UID:\"6e23f453-b98b-4836-8fb9-f13447d91c44\", APIVersion:\"argoproj.io/v1alpha1\", ResourceVersion:\"3869\", FieldPath:\"\"}): type: 'Warning' reason: 'WorkflowFailed' failed with exit code 1"
controller | time="2020-09-04T01:39:52Z" level=debug msg="Event(v1.ObjectReference{Kind:\"Workflow\", Namespace:\"argo\", Name:\"failed-step-event-wf7tt\", UID:\"6e23f453-b98b-4836-8fb9-f13447d91c44\", APIVersion:\"argoproj.io/v1alpha1\", ResourceVersion:\"3869\", FieldPath:\"\"}): type: 'Warning' reason: 'WorkflowNodeFailed' Failed node failed-step-event-wf7tt: failed with exit code 1"
controller | time="2020-09-04T01:39:52Z" level=info msg="Workflow update successful" namespace=argo phase=Failed resourceVersion=3873 workflow=failed-step-event-wf7tt
controller | time="2020-09-04T01:39:53Z" level=info msg="Labeled pod completed" namespace=argo pod=failed-step-event-wf7tt
controller | time="2020-09-04T01:39:53Z" level=info msg="archiving workflow" namespace=argo uid=6e23f453-b98b-4836-8fb9-f13447d91c44 workflow=failed-step-event-wf7tt
controller | time="2020-09-04T01:39:53Z" level=info msg="archiving workflow" namespace=argo uid=6e23f453-b98b-4836-8fb9-f13447d91c44 workflow=failed-step-event-wf7tt
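
For readers following the log, the sequence is: the workflow is marked as pending archiving, the update succeeds, and only then is it archived. The sketch below illustrates that "mark now, archive later, retry on error" idea in isolation. It is not the controller's code: `workflow`, `PendingArchival`, `archiver`, `flakyArchive`, and `processPending` are hypothetical names invented for this illustration, and the real controller presumably uses its own queueing rather than a plain slice.

```go
package main

import (
	"errors"
	"fmt"
)

// workflow is a hypothetical stand-in for the real Workflow object.
type workflow struct {
	Name            string
	PendingArchival bool // hypothetical marker for "pending archiving"
}

// archiver is a hypothetical interface for the archive store.
type archiver interface {
	Archive(wf *workflow) error
}

// flakyArchive fails a fixed number of times before succeeding.
type flakyArchive struct{ failuresLeft int }

func (f *flakyArchive) Archive(wf *workflow) error {
	if f.failuresLeft > 0 {
		f.failuresLeft--
		return errors.New("archive unavailable")
	}
	return nil
}

// processPending makes one archiving attempt per pending workflow.
// Workflows whose attempt fails keep their marker and are returned
// so that a later pass can retry them.
func processPending(a archiver, queue []*workflow) []*workflow {
	var retry []*workflow
	for _, wf := range queue {
		if !wf.PendingArchival {
			continue
		}
		if err := a.Archive(wf); err != nil {
			retry = append(retry, wf)
			continue
		}
		wf.PendingArchival = false
	}
	return retry
}

func main() {
	wf := &workflow{Name: "failed-step-event-wf7tt", PendingArchival: true}
	store := &flakyArchive{failuresLeft: 2}

	queue := []*workflow{wf}
	passes := 0
	for len(queue) > 0 {
		queue = processPending(store, queue)
		passes++
	}
	fmt.Println(passes, wf.PendingArchival)
}
```

The point of keeping the marker on the object is that a failed attempt loses nothing: the workflow stays flagged until an attempt succeeds.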

@alexec
Contributor Author

alexec commented Sep 4, 2020

Added diagnostics to print when we send a message.

@alexec alexec marked this pull request as ready for review September 6, 2020 17:20
  entrypoint: run-archie
  templates:
    - name: run-archie
      container:
        image: argoproj/argosay:v2`).
		When().
		SubmitWorkflow().
		WaitForWorkflow().
		WaitForWorkflow(fixtures.ToBeArchived).
Contributor Author

add message

assert.Equal(t, "failed with exit code 1", e.Message)
2,
func(t *testing.T, es []corev1.Event) {
for _, e := range es {
Contributor Author

These now appear out of order, which I'm guessing has something to do with goroutine scheduling.
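
If event order cannot be relied on, one option is to compare the events as an unordered set. The following is a minimal sketch of that idea, not the assertion used in this PR: `event` is a hypothetical stand-in for `corev1.Event` that keeps only the reason and message, and `sameEvents` is a helper invented for this example. The sample reasons and messages are taken from the controller log in this thread.

```go
package main

import (
	"fmt"
	"reflect"
	"sort"
)

// event is a hypothetical stand-in for corev1.Event, keeping only
// the fields compared in this sketch.
type event struct {
	Reason  string
	Message string
}

// sortEvents orders events by reason, then by message, in place.
func sortEvents(s []event) {
	sort.Slice(s, func(i, j int) bool {
		if s[i].Reason != s[j].Reason {
			return s[i].Reason < s[j].Reason
		}
		return s[i].Message < s[j].Message
	})
}

// sameEvents reports whether got and want hold the same events,
// ignoring the order in which they were observed.
func sameEvents(got, want []event) bool {
	if len(got) != len(want) {
		return false
	}
	// Copy before sorting so the callers' slices are left untouched.
	g := append([]event(nil), got...)
	w := append([]event(nil), want...)
	sortEvents(g)
	sortEvents(w)
	return reflect.DeepEqual(g, w)
}

func main() {
	want := []event{
		{Reason: "WorkflowNodeFailed", Message: "Failed node failed-step-event-wf7tt: failed with exit code 1"},
		{Reason: "WorkflowFailed", Message: "failed with exit code 1"},
	}
	// The same two events, observed in the opposite order.
	got := []event{want[1], want[0]}
	fmt.Println(sameEvents(got, want))
}
```

An alternative is to switch on `e.Reason` inside the loop and assert each message individually, which also avoids depending on arrival order.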

@alexec alexec requested a review from jessesuen as a code owner September 15, 2020 00:44
@alexec alexec mentioned this pull request Sep 15, 2020
@sarabala1979
Member

@alexec Can we set up some time to go through changes?

@alexec
Contributor Author

alexec commented Sep 17, 2020

@sarabala1979 ready for review.

Member

@sarabala1979 sarabala1979 left a comment

LGTM. The workflow key lock will be refactored in the controller-level parallelism PR.

@alexec alexec merged commit 5461d54 into argoproj:master Sep 18, 2020
@alexec alexec deleted the archiving branch September 18, 2020 17:07
Development

Successfully merging this pull request may close these issues.

More robust workflow archiving
2 participants