
argo 3.3.10: terminate workflow then retry hits error: Workflow operation error #10285

Closed
yeicandoit opened this issue Dec 28, 2022 · 11 comments
Labels: P3 Low priority, type/bug

yeicandoit (Contributor) commented Dec 28, 2022

Pre-requisites

  • I have double-checked my configuration
  • I can confirm the issue exists when I tested with :latest
  • I'd like to contribute the fix myself (see contributing guide)

What happened/what you expected to happen?

argo3.3.10-debug.mp4

While the workflow is running, stop/terminate it, then retry it. The workflow then fails with the error: Workflow operation error.
I expect the workflow to re-run correctly.

Version

v3.3.10

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: dag-diamond-1
spec:
  entrypoint: diamond
  templates:
  - name: echo
    inputs:
      parameters:
      - name: message
    container:
      image: alpine:3.7
      command: [sleep, "600s"]
  - name: diamond
    dag:
      tasks:
      - name: A
        template: echo
        arguments:
          parameters: [{name: message, value: A}]
      - name: B
        dependencies: [A]
        template: echo
        arguments:
          parameters: [{name: message, value: B}]
      - name: C
        dependencies: [A]
        template: echo
        arguments:
          parameters: [{name: message, value: C}]
      - name: D
        dependencies: [B, C]
        template: echo
        arguments:
          parameters: [{name: message, value: D}]

Logs from the workflow controller

time="2022-12-28T10:37:43.765Z" level=info msg="Create pods 201"
time="2022-12-28T10:37:43.766Z" level=info msg="Created pod: dag-diamond-1-lkhtv.A (dag-diamond-1-lkhtv-1695834569)" namespace=enos workflow=dag-diamond-1-lkhtv
time="2022-12-28T10:37:43.766Z" level=error msg="Recovered from panic" namespace=enos r="runtime error: invalid memory address or nil pointer dereference" stack="goroutine 154 [running]:\nruntime/debug.Stack()\n\t/usr/local/go/src/runtime/debug/stack.go:24 +0x65\ngithub.com/argoproj/argo-workflows/v3/workflow/controller.(*wfOperationCtx).operate.func2()\n\t/go/src/github.com/argoproj/argo-workflows/workflow/controller/operator.go:194 +0xd4\npanic({0x1d506c0, 0x3432820})\n\t/usr/local/go/src/runtime/panic.go:1047 +0x266\ngithub.com/argoproj/argo-workflows/v3/workflow/controller.(*wfOperationCtx).executeTemplate(0xc0009f86c0, {0x2339ae8, 0xc00005a018}, {0xc00106a480, 0x15}, {0x231a750, 0xc000b2c5a0}, 0x0, {{0x0, 0x0, ...}, ...}, ...)\n\t/go/src/github.com/argoproj/argo-workflows/workflow/controller/operator.go:1965 +0x32c5\ngithub.com/argoproj/argo-workflows/v3/workflow/controller.(*wfOperationCtx).executeDAGTask(0xc0009f86c0, {0x2339ae8, 0xc00005a018}, 0xc000441880, {0xc000d22841, 0x1})\n\t/go/src/github.com/argoproj/argo-workflows/workflow/controller/dag.go:513 +0x1888\ngithub.com/argoproj/argo-workflows/v3/workflow/controller.(*wfOperationCtx).executeDAGTask(0xc0009f86c0, {0x2339ae8, 0xc00005a018}, 0xc000441880, {0xc000f84661, 0x1})\n\t/go/src/github.com/argoproj/argo-workflows/workflow/controller/dag.go:438 +0x1f25\ngithub.com/argoproj/argo-workflows/v3/workflow/controller.(*wfOperationCtx).executeDAGTask(0xc0009f86c0, {0x2339ae8, 0xc00005a018}, 0xc000441880, {0x33da960, 0x1})\n\t/go/src/github.com/argoproj/argo-workflows/workflow/controller/dag.go:438 +0x1f25\ngithub.com/argoproj/argo-workflows/v3/workflow/controller.(*wfOperationCtx).executeDAG(0xc0009f86c0, {0x2339ae8, 0xc00005a018}, {0xc00106a0c0, 0x6561393034313600}, 0xc000be0c40, {0xc0003759a0, 0x19}, 0xc00103a480, {0x231b9e0, ...}, ...)\n\t/go/src/github.com/argoproj/argo-workflows/workflow/controller/dag.go:244 +0x433\ngithub.com/argoproj/argo-workflows/v3/workflow/controller.(*wfOperationCtx).executeTemplate(0xc0009f86c0, {0x2339ae8, 0xc00005a018}, {0xc00106a0c0, 0x13}, {0x231b9e0, 0xc0009f8780}, 0xc000be0c00, {{0x0, 0x0, ...}, ...}, ...)\n\t/go/src/github.com/argoproj/argo-workflows/workflow/controller/operator.go:1906 +0x232c\ngithub.com/argoproj/argo-workflows/v3/workflow/controller.(*wfOperationCtx).operate(0xc0009f86c0, {0x2339ae8, 0xc00005a018})\n\t/go/src/github.com/argoproj/argo-workflows/workflow/controller/operator.go:350 +0x16a8\ngithub.com/argoproj/argo-workflows/v3/workflow/controller.(*WorkflowController).processNextItem(0xc0004ea400, {0x2339ae8, 0xc00005a018})\n\t/go/src/github.com/argoproj/argo-workflows/workflow/controller/controller.go:756 +0x8ee\ngithub.com/argoproj/argo-workflows/v3/workflow/controller.(*WorkflowController).runWorker(0x0)\n\t/go/src/github.com/argoproj/argo-workflows/workflow/controller/controller.go:678 +0x9e\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x7f630fcd7730)\n\t/go/pkg/mod/k8s.io/apimachinery@v0.24.3/pkg/util/wait/wait.go:155 +0x67\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil(0x0, {0x22f6ee0, 0xc000346840}, 0x1, 0xc000099500)\n\t/go/pkg/mod/k8s.io/apimachinery@v0.24.3/pkg/util/wait/wait.go:156 +0xb6\nk8s.io/apimachinery/pkg/util/wait.JitterUntil(0x0, 0x3b9aca00, 0x0, 0x0, 0x0)\n\t/go/pkg/mod/k8s.io/apimachinery@v0.24.3/pkg/util/wait/wait.go:133 +0x89\nk8s.io/apimachinery/pkg/util/wait.Until(0x0, 0x0, 0x0)\n\t/go/pkg/mod/k8s.io/apimachinery@v0.24.3/pkg/util/wait/wait.go:90 +0x25\ncreated by 
github.com/argoproj/argo-workflows/v3/workflow/controller.(*WorkflowController).Run\n\t/go/src/github.com/argoproj/argo-workflows/workflow/controller/controller.go:294 +0x1a6c\n" workflow=dag-diamond-1-lkhtv
time="2022-12-28T10:37:43.766Z" level=info msg="Updated phase Running -> Error" namespace=enos workflow=dag-diamond-1-lkhtv
time="2022-12-28T10:37:43.766Z" level=info msg="Updated message  -> runtime error: invalid memory address or nil pointer dereference" namespace=enos workflow=dag-diamond-1-lkhtv
time="2022-12-28T10:37:43.766Z" level=info msg="Marking workflow completed" namespace=enos workflow=dag-diamond-1-lkhtv
time="2022-12-28T10:37:43.766Z" level=info msg="Checking daemoned children of " namespace=enos workflow=dag-diamond-1-lkhtv

Logs from your workflow's wait container

Retry failed, so there is no wait-container log.
sarabala1979 self-assigned this Jan 5, 2023
sarabala1979 added the P1 High priority label Jan 5, 2023
sarabala1979 (Member) commented:

@yeicandoit can you try on v3.4.4?

sarabala1979 added the problem/more information needed label and removed the P1 High priority label Jan 5, 2023
yeicandoit (Contributor, Author) commented Jan 9, 2023

argo3.4.4-debug.mp4

Hi @sarabala1979,
I have tried v3.4.4. While the workflow is running, stop/terminate it, then retry it; the workflow still fails with the error: Workflow operation error.

Checking workflow-controller.log, I see another error:

time="2023-01-09T06:44:35.880Z" level=error msg="Mark error node" error="task 'dag-diamond-1-tjz7j.A' errored: no Node found by the name of ; wf.Status.Nodes=map[:{ID: Name: DisplayName: Type: TemplateName: TemplateRef:nil TemplateScope: Phase: BoundaryID: Message: StartedAt:2023-01-09 06:44:35.869380936 +0000 UTC FinishedAt:0001-01-01 00:00:00 +0000 UTC EstimatedDuration:0 Progress: ResourcesDuration: PodIP: Daemoned: Inputs:nil Outputs:&Outputs{Parameters:[]Parameter{},Artifacts:[]Artifact{Artifact{Name:main-logs,Path:,Mode:nil,From:,ArtifactLocation:ArtifactLocation{ArchiveLogs:nil,S3:&S3Artifact{S3Bucket:S3Bucket{Endpoint:,Bucket:,Region:,Insecure:nil,AccessKeySecret:nil,SecretKeySecret:nil,RoleARN:,UseSDKCreds:false,CreateBucketIfNotPresent:nil,EncryptionOptions:nil,},Key:dag-diamond-1-tjz7j/dag-diamond-1-tjz7j-echo-581138205/main.log,},Git:nil,HTTP:nil,Artifactory:nil,HDFS:nil,Raw:nil,OSS:nil,GCS:nil,Azure:nil,},GlobalName:,Archive:nil,Optional:false,SubPath:,RecurseMode:false,FromExpression:,ArtifactGC:nil,Deleted:false,},},Result:nil,ExitCode:nil,} Children:[] OutboundNodes:[] HostNodeName: MemoizationStatus:nil SynchronizationStatus:nil} dag-diamond-1-tjz7j:{ID:dag-diamond-1-tjz7j Name:dag-diamond-1-tjz7j DisplayName:dag-diamond-1-tjz7j Type:DAG TemplateName:diamond TemplateRef:nil TemplateScope:local/dag-diamond-1-tjz7j Phase:Running BoundaryID: Message: StartedAt:2023-01-09 06:44:35 +0000 UTC FinishedAt:0001-01-01 00:00:00 +0000 UTC EstimatedDuration:0 Progress:0/1 ResourcesDuration: PodIP: Daemoned: Inputs:nil Outputs:nil Children:[] OutboundNodes:[] HostNodeName: MemoizationStatus:nil SynchronizationStatus:nil} dag-diamond-1-tjz7j-581138205:{ID: Name: DisplayName: Type: TemplateName: TemplateRef:nil TemplateScope: Phase: BoundaryID: Message: StartedAt:0001-01-01 00:00:00 +0000 UTC FinishedAt:0001-01-01 00:00:00 +0000 UTC EstimatedDuration:0 Progress: ResourcesDuration: PodIP: Daemoned: Inputs:nil Outputs:&Outputs{Parameters:[]Parameter{},Artifacts:[]Artifact{Artifact{Name:main-logs,Path:,Mode:nil,From:,ArtifactLocation:ArtifactLocation{ArchiveLogs:nil,S3:&S3Artifact{S3Bucket:S3Bucket{Endpoint:,Bucket:,Region:,Insecure:nil,AccessKeySecret:nil,SecretKeySecret:nil,RoleARN:,UseSDKCreds:false,CreateBucketIfNotPresent:nil,EncryptionOptions:nil,},Key:dag-diamond-1-tjz7j/dag-diamond-1-tjz7j-echo-581138205/main.log,},Git:nil,HTTP:nil,Artifactory:nil,HDFS:nil,Raw:nil,OSS:nil,GCS:nil,Azure:nil,},GlobalName:,Archive:nil,Optional:false,SubPath:,RecurseMode:false,FromExpression:,ArtifactGC:nil,Deleted:false,},},Result:nil,ExitCode:nil,} Children:[] OutboundNodes:[] HostNodeName: MemoizationStatus:nil SynchronizationStatus:nil}]" namespace=enos nodeName=dag-diamond-1-tjz7j.A workflow=dag-diamond-1-tjz7j

time="2023-01-09T06:44:35.880Z" level=info msg="node phase -> Error" namespace=enos workflow=dag-diamond-1-tjz7j


See the attached argo.log for the complete workflow-controller log:
argo.log

yeicandoit (Contributor, Author) commented:

@sarabala1979 please check this issue, thanks very much

sarabala1979 (Member) commented:

@yeicandoit It looks like an edge-case bug: the node is not initialized in the retry case. Would you like to fix this issue?

sarabala1979 added the P3 Low priority label and removed the problem/more information needed label Jan 29, 2023
yeicandoit (Contributor, Author) commented:

@sarabala1979 OK, I will try to fix it.

linyao22 (Contributor) commented:

Hi @sarabala1979 @yeicandoit, do you mind sharing a bit more info on this issue? We encountered the "no Node found by the name of ;" error as well. I wonder in which case this gets triggered, and whether there is something we can do as a quick workaround. Thanks!

stale bot commented Mar 25, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If this is a mentoring request, please provide an update here. Thank you for your contributions.

stale bot added the problem/stale label Mar 25, 2023
yeicandoit (Contributor, Author) commented Mar 27, 2023

Hi @sarabala1979 @yeicandoit Do you mind share a bit more info on this issue? We encountered no Node found by the name of ; error as well. Wonder in which case this gets triggered, and whether there's something we can do as quick workaround. Thanks!

While the workflow is running, stop/terminate it, then retry it; the workflow will fail with the error: Workflow operation error.
The following comment describes this issue clearly. I have not found a way to avoid it:
#10285 (comment)

stale bot removed the problem/stale label Mar 27, 2023
yeicandoit (Contributor, Author) commented Apr 9, 2023

I think I found the root cause: taskresult.go fetches the old node from the workflow status by node ID, and it should first check whether that node actually exists.

func (woc *wfOperationCtx) taskResultReconciliation() {
	objs, _ := woc.controller.taskResultInformer.GetIndexer().ByIndex(indexes.WorkflowIndex, woc.wf.Namespace+"/"+woc.wf.Name)
	woc.log.WithField("numObjs", len(objs)).Info("Task-result reconciliation")
	for _, obj := range objs {
		result := obj.(*wfv1.WorkflowTaskResult)
		nodeID := result.Name
		old := woc.wf.Status.Nodes[nodeID] // <-- fetched without checking that the node exists
		new := old.DeepCopy()
		// ...

The line old := woc.wf.Status.Nodes[nodeID] should be changed to:

old, exist := woc.wf.Status.Nodes[nodeID]
if !exist {
    continue
}

https://github.com/argoproj/argo-workflows/blob/master/workflow/controller/taskresult.go
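
For illustration, here is a minimal standalone sketch of the proposed check, using simplified stand-in types (a nodeStatus struct and a plain map) rather than the real wfv1.NodeStatus and controller fields. It shows the Go behaviour behind the bug: reading a missing key from a map returns the zero value rather than failing, so without the comma-ok check the reconciliation continues with an empty node. The node IDs used below are taken from the v3.4.4 log above.

package main

import "fmt"

// nodeStatus is a simplified stand-in for wfv1.NodeStatus.
type nodeStatus struct {
	ID   string
	Name string
}

func main() {
	// Stand-in for woc.wf.Status.Nodes after terminate + retry: only the DAG
	// root node remains in the workflow status.
	nodes := map[string]nodeStatus{
		"dag-diamond-1-tjz7j": {ID: "dag-diamond-1-tjz7j", Name: "dag-diamond-1-tjz7j"},
	}

	// Node IDs reported by the task-result informer; the second one no longer
	// has a corresponding node in the workflow status.
	resultIDs := []string{"dag-diamond-1-tjz7j", "dag-diamond-1-tjz7j-581138205"}

	for _, nodeID := range resultIDs {
		// The comma-ok form detects the missing key; a plain nodes[nodeID]
		// read would silently return a zero-value nodeStatus instead.
		old, exists := nodes[nodeID]
		if !exists {
			fmt.Printf("skipping task result %q: node not found in workflow status\n", nodeID)
			continue
		}
		fmt.Printf("reconciling node %q\n", old.Name)
	}
}

Skipping task results whose node ID is not present in wf.Status.Nodes keeps the controller from operating on such zero-value nodes, which is consistent with the entries with empty ID and Name seen in the v3.4.4 log above.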

yeicandoit added a commit to yeicandoit/argo-workflows that referenced this issue Apr 12, 2023
fix issue that stop/terminate workflow then retry meets error: Workflow operation error (argoproj#10285)

Signed-off-by: yeicandoit <410342333@qq.com>
yeicandoit (Contributor, Author) commented Apr 12, 2023

@sarabala1979 I have submitted pull request #10886; please check, thanks.

linyao22 (Contributor) commented:

Sorry, I just saw this. Thank you so much for the fix!

terrytangyuan pushed a commit that referenced this issue May 25, 2023
Signed-off-by: yeicandoit <410342333@qq.com>
dpadhiar pushed a commit to dpadhiar/argo-workflows that referenced this issue May 9, 2024
Signed-off-by: yeicandoit <410342333@qq.com>
Signed-off-by: Dillen Padhiar <dillen_padhiar@intuit.com>