panic in ml-pipeline-viewer-controller-deployment #2253

Closed
mattnworb opened this issue Sep 27, 2019 · 9 comments
Labels: area/frontend, priority/p2, status/triaged

@mattnworb (Contributor) commented Sep 27, 2019

What happened:
ml-pipeline-viewer-controller-deployment / gcr.io/ml-pipeline/viewer-crd-controller:0.1.25 crashes due to a panic

What did you expect to happen:
For the controller to not panic 😄

What steps did you take:
Create enough Viewer instances that viewer-crd-controller tries to clean up the oldest one when a new Viewer is created. Instead of deleting the oldest instance, the controller panics and crashes in a loop, preventing new TensorBoard instances from being launched.

Anything else you would like to add:

The problem can be remediated by deleting all viewer instances (kubectl -n kubeflow delete viewer --all), but eventually as more instances are launched, the controller will panic again.

Here is the full log output from a crash:

W0923 20:46:10.541524       1 client_config.go:552] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
I0923 20:46:10.917341       1 main.go:91] Starting controller for the Viewer CRD
I0923 20:46:11.221172       1 reconciler.go:72] Reconcile request: kubeflow/viewer-2ed1f6e77d4b3f87638e4d538a94e6d02547adc2
I0923 20:46:11.235848       1 reconciler.go:84] Got instance: &{TypeMeta:{Kind: APIVersion:} ObjectMeta:{Name:viewer-2ed1f6e77d4b3f87638e4d538a94e6d02547adc2 GenerateName: Namespace:kubeflow SelfLink:/apis/kubeflow.org/v1beta1/namespaces/kubeflow/viewers/viewer-2ed1f6e77d4b3f87638e4d538a94e6d02547adc2 UID:a4192871-ce6b-11e9-84e1-42010a8401ce ResourceVersion:20937680 Generation:1 CreationTimestamp:2019-09-03 16:55:32 +0000 UTC DeletionTimestamp:<nil> DeletionGracePeriodSeconds:<nil> Labels:map[] Annotations:map[] OwnerReferences:[] Initializers:nil Finalizers:[] ClusterName:} Spec:{Type:tensorboard TensorboardSpec:{LogDir:gs://xxxxxx/Trainer/output/5673/20190903T162556.841158-990bccf18ee5/serving_model_dir} PodTemplateSpec:{ObjectMeta:{Name: GenerateName: Namespace: SelfLink: UID: ResourceVersion: Generation:0 CreationTimestamp:0001-01-01 00:00:00 +0000 UTC DeletionTimestamp:<nil> DeletionGracePeriodSeconds:<nil> Labels:map[] Annotations:map[] OwnerReferences:[] Initializers:nil Finalizers:[] ClusterName:} Spec:{Volumes:[{Name:gcp-credentials VolumeSource:{HostPath:nil EmptyDir:nil GCEPersistentDisk:nil AWSElasticBlockStore:nil GitRepo:nil Secret:nil NFS:nil ISCSI:nil Glusterfs:nil PersistentVolumeClaim:nil RBD:nil FlexVolume:nil Cinder:nil CephFS:nil Flocker:nil DownwardAPI:nil FC:nil AzureFile:nil ConfigMap:nil VsphereVolume:nil Quobyte:nil AzureDisk:nil PhotonPersistentDisk:nil Projected:nil PortworxVolume:nil ScaleIO:nil StorageOS:nil}}] InitContainers:[] Containers:[{Name: Image: Command:[] Args:[] WorkingDir: Ports:[] EnvFrom:[] Env:[{Name:GOOGLE_APPLICATION_CREDENTIALS Value:/secret/gcp-credentials/user-gcp-sa.json ValueFrom:nil}] Resources:{Limits:map[] Requests:map[]} VolumeMounts:[{Name:gcp-credentials ReadOnly:true MountPath:/secret/gcp-credentials/user-gcp-sa.json SubPath: MountPropagation:<nil>}] VolumeDevices:[] LivenessProbe:nil ReadinessProbe:nil Lifecycle:nil TerminationMessagePath: TerminationMessagePolicy: ImagePullPolicy: SecurityContext:nil Stdin:false StdinOnce:false TTY:false}] RestartPolicy: TerminationGracePeriodSeconds:<nil> ActiveDeadlineSeconds:<nil> DNSPolicy: NodeSelector:map[] ServiceAccountName: DeprecatedServiceAccount: AutomountServiceAccountToken:<nil> NodeName: HostNetwork:false HostPID:false HostIPC:false ShareProcessNamespace:<nil> SecurityContext:nil ImagePullSecrets:[] Hostname: Subdomain: Affinity:nil SchedulerName: Tolerations:[] HostAliases:[] PriorityClassName: Priority:<nil> DNSConfig:nil ReadinessGates:[]}}}}
E0923 20:46:11.259837       1 runtime.go:66] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20180621070125-103fd098999d/pkg/util/runtime/runtime.go:72
/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20180621070125-103fd098999d/pkg/util/runtime/runtime.go:65
/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20180621070125-103fd098999d/pkg/util/runtime/runtime.go:51
/usr/local/go/src/runtime/panic.go:522
/usr/local/go/src/runtime/panic.go:82
/usr/local/go/src/runtime/signal_unix.go:390
/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.0.0-20181121180216-5558165425ef/pkg/client/interfaces.go:154
/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.0.0-20181121180216-5558165425ef/pkg/client/typed_client.go:75
/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.0.0-20181121180216-5558165425ef/pkg/client/client.go:121
/src/github.com/kubeflow/pipelines/backend/src/crd/controller/viewer/reconciler/reconciler.go:276
/src/github.com/kubeflow/pipelines/backend/src/crd/controller/viewer/reconciler/reconciler.go:94
/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.0.0-20181121180216-5558165425ef/pkg/internal/controller/controller.go:213
/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.0.0-20181121180216-5558165425ef/pkg/internal/controller/controller.go:158
/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20180621070125-103fd098999d/pkg/util/wait/wait.go:133
/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20180621070125-103fd098999d/pkg/util/wait/wait.go:134
/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20180621070125-103fd098999d/pkg/util/wait/wait.go:88
/usr/local/go/src/runtime/asm_amd64.s:1337
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0xeb2858]

goroutine 224 [running]:
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
	/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20180621070125-103fd098999d/pkg/util/runtime/runtime.go:58 +0x105
panic(0x1142100, 0x1e37850)
	/usr/local/go/src/runtime/panic.go:522 +0x1b5
sigs.k8s.io/controller-runtime/pkg/client.(*DeleteOptions).ApplyOptions(0xc000757f60, 0xc000011358, 0x1, 0x1, 0x0)
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.0.0-20181121180216-5558165425ef/pkg/client/interfaces.go:154 +0x38
sigs.k8s.io/controller-runtime/pkg/client.(*typedClient).Delete(0xc000140d00, 0x1457220, 0xc000050078, 0x14376c0, 0xc000a92000, 0xc000011358, 0x1, 0x1, 0x1, 0xc000011358)
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.0.0-20181121180216-5558165425ef/pkg/client/typed_client.go:75 +0x1d7
sigs.k8s.io/controller-runtime/pkg/client.(*client).Delete(0xc000140d00, 0x1457220, 0xc000050078, 0x14376c0, 0xc000a92000, 0xc000011358, 0x1, 0x1, 0xc00039ae00, 0x1ddeac6)
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.0.0-20181121180216-5558165425ef/pkg/client/client.go:121 +0x11b
github.com/kubeflow/pipelines/backend/src/crd/controller/viewer/reconciler.(*Reconciler).maybeDeleteOldestViewer(0xc000375f80, 0xc0005310d0, 0xb, 0x11, 0xc000933ce0)
	/src/github.com/kubeflow/pipelines/backend/src/crd/controller/viewer/reconciler/reconciler.go:276 +0x273
github.com/kubeflow/pipelines/backend/src/crd/controller/viewer/reconciler.(*Reconciler).Reconcile(0xc000375f80, 0xc00035c020, 0x8, 0xc0007c61e0, 0x2f, 0x1e4ad20, 0xb9, 0xc0004695c0, 0xc0004695c0)
	/src/github.com/kubeflow/pipelines/backend/src/crd/controller/viewer/reconciler/reconciler.go:94 +0x3e9
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc0004c80a0, 0x0)
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.0.0-20181121180216-5558165425ef/pkg/internal/controller/controller.go:213 +0x213
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1()
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.0.0-20181121180216-5558165425ef/pkg/internal/controller/controller.go:158 +0x36
k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1(0xc0002eeed0)
	/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20180621070125-103fd098999d/pkg/util/wait/wait.go:133 +0x54
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc0002eeed0, 0x3b9aca00, 0x0, 0x1, 0xc0000ba120)
	/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20180621070125-103fd098999d/pkg/util/wait/wait.go:134 +0xf8
k8s.io/apimachinery/pkg/util/wait.Until(0xc0002eeed0, 0x3b9aca00, 0xc0000ba120)
	/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20180621070125-103fd098999d/pkg/util/wait/wait.go:88 +0x4d
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.0.0-20181121180216-5558165425ef/pkg/internal/controller/controller.go:157 +0x311
@jessiezcc added the priority/p2 and status/triaged labels Sep 27, 2019
@jessiezcc added this to the M11 milestone Sep 27, 2019

@xiaogaozi commented Dec 3, 2019

Same here, we encounter this bug as well. Our only workaround is to delete all viewers and then restart the pod.

@andrewsmartin (Contributor) commented

Hi, just wanted to ask if there are any updates on this or fixes in progress. We run into this quite regularly due to the high volume of TensorBoard instances on a shared cluster, and we have to clean up old instances manually whenever this happens.

@neuromage (Contributor) commented

/assign @jingzhang36

@mattnworb (Contributor, Author) commented

I believe the panic is caused by passing nil as the last argument in

return r.Client.Delete(context.Background(), oldest, nil)

(this line is present in the master branch as well)

The godoc for the sigs.k8s.io/controller-runtime/pkg/client.Client interface: https://godoc.org/sigs.k8s.io/controller-runtime/pkg/client#Writer

// Delete deletes the given obj from Kubernetes cluster.
Delete(ctx context.Context, obj runtime.Object, opts ...DeleteOption) error

DeleteOption is defined as type DeleteOptionFunc func(*DeleteOptions).

Eventually client.Client passes the opts ...DeleteOption argument into https://github.com/kubernetes-sigs/controller-runtime/blob/5e1fc314d87051fda5790d102a2572e60f3fe2df/pkg/client/interfaces.go#L150-L157

and so the panic occurs when optFunc(o) is executed, since optFunc is nil at that point.
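
For illustration, here is a minimal, self-contained sketch of that failure mode. It mimics the DeleteOptionFunc pattern with local stand-in types rather than importing controller-runtime (options, optionFunc, and applyOptions below are hypothetical names, not the real API):

package main

import "fmt"

// options stands in for client.DeleteOptions, and optionFunc stands in for
// client.DeleteOptionFunc (func(*DeleteOptions)). Both are local stand-ins,
// not the real controller-runtime types.
type options struct{}

type optionFunc func(*options)

// applyOptions mirrors the behaviour described above: it invokes every
// function in the variadic slice without checking for nil entries.
func applyOptions(o *options, opts ...optionFunc) {
	for _, f := range opts {
		f(o) // nil pointer dereference if f is nil
	}
}

func main() {
	// Delete(ctx, obj) with no options: opts is an empty slice, nothing is called.
	applyOptions(&options{})
	fmt.Println("no options: ok")

	// Delete(ctx, obj, nil): the single nil becomes []optionFunc{nil}, and
	// invoking that nil function panics, matching the trace above.
	defer func() {
		if r := recover(); r != nil {
			fmt.Println("explicit nil option panicked:", r)
		}
	}()
	applyOptions(&options{}, nil)
}

Dropping the explicit nil (or passing a real option function) leaves the variadic slice empty or non-nil, so nothing nil is ever invoked and no panic occurs.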

@mattnworb (Contributor, Author) commented

reconciler_test.go has a test that should cover the maybeDeleteOldestViewer logic, but the fake client that controller-runtime gives you via sigs.k8s.io/controller-runtime/pkg/client/fake.NewClient() ignores the opts ...DeleteOption argument altogether 🙃

https://github.com/kubernetes-sigs/controller-runtime/blob/5558165425ef09b6e5c3f61d3457fbeca0d6d579/pkg/client/fake/client.go#L128
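
To illustrate why such a test stays green, here is a rough sketch contrasting a delete that applies its options, as the real typed client does, with one that silently ignores them, as the fake client does. The names (deleteOptions, deleteOption, realDelete, fakeDelete) are hypothetical stand-ins, not the actual controller-runtime or fake-client code:

package main

import "fmt"

// Hypothetical stand-ins for illustration only.
type deleteOptions struct{}

type deleteOption func(*deleteOptions)

// realDelete mimics the typed client: it applies every option it receives, so
// an explicit nil option is invoked and dereferences a nil function pointer.
func realDelete(opts ...deleteOption) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("panic: %v", r)
		}
	}()
	o := &deleteOptions{}
	for _, f := range opts {
		f(o)
	}
	return nil
}

// fakeDelete mimics a fake client that accepts the opts argument but never
// applies it, so a nil option goes unnoticed.
func fakeDelete(opts ...deleteOption) error {
	return nil
}

func main() {
	fmt.Println("fake client, nil option:", fakeDelete(nil)) // <nil>: a test would pass
	fmt.Println("real client, nil option:", realDelete(nil)) // panic surfaced as error
}

Running this prints a nil error for the fake path and a recovered nil-pointer panic for the real path, which is why the unit test can pass while the deployed controller crashes.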

mattnworb added a commit to mattnworb/pipelines that referenced this issue Jan 16, 2020
addresses issue kubeflow#2253. Passing `nil` as an argument to the varargs
`...DeleteOption` parameter of `client.Delete()` will panic, since
controller-runtime will iterate over the DeleteOption array and invoke each
entry (DeleteOption is defined as
`type DeleteOptionFunc func(*DeleteOptions)`).

@descampsk commented

Raised the same issue here: #2757

@mattnworb (Contributor, Author) commented

I’ve submitted a fix in #2853

k8s-ci-robot pushed a commit that referenced this issue Jan 17, 2020

@Bobgy (Contributor) commented Jan 22, 2020

It seems this is already fixed in #2853
/close

Please reopen if it's not the case

@k8s-ci-robot (Contributor) commented

@Bobgy: Closing this issue.

In response to this:

It seems this is already fixed in #2853
/close

Please reopen if it's not the case

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Jeffwan pushed a commit to Jeffwan/pipelines that referenced this issue Dec 9, 2020