panic in ml-pipeline-viewer-controller-deployment #2253

Closed
mattnworb opened this issue Sep 27, 2019 · 9 comments
Labels: area/frontend, priority/p2, status/triaged

@mattnworb (Contributor) commented Sep 27, 2019

What happened:
ml-pipeline-viewer-controller-deployment / gcr.io/ml-pipeline/viewer-crd-controller:0.1.25 crashes due to a panic

What did you expect to happen:
For the controller to not panic 😄

What steps did you take:
Create enough Viewer instances that viewer-crd-controller tries to clean up the oldest one when a new Viewer is created. Instead of deleting the oldest instance, the controller panics and crashes in a loop, preventing new TensorBoard instances from being launched.

Anything else you would like to add:

The problem can be remediated by deleting all viewer instances (kubectl -n kubeflow delete viewer --all), but eventually as more instances are launched, the controller will panic again.

Here is the full log output from a crash:

W0923 20:46:10.541524       1 client_config.go:552] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
I0923 20:46:10.917341       1 main.go:91] Starting controller for the Viewer CRD
I0923 20:46:11.221172       1 reconciler.go:72] Reconcile request: kubeflow/viewer-2ed1f6e77d4b3f87638e4d538a94e6d02547adc2
I0923 20:46:11.235848       1 reconciler.go:84] Got instance: &{TypeMeta:{Kind: APIVersion:} ObjectMeta:{Name:viewer-2ed1f6e77d4b3f87638e4d538a94e6d02547adc2 GenerateName: Namespace:kubeflow SelfLink:/apis/kubeflow.org/v1beta1/namespaces/kubeflow/viewers/viewer-2ed1f6e77d4b3f87638e4d538a94e6d02547adc2 UID:a4192871-ce6b-11e9-84e1-42010a8401ce ResourceVersion:20937680 Generation:1 CreationTimestamp:2019-09-03 16:55:32 +0000 UTC DeletionTimestamp:<nil> DeletionGracePeriodSeconds:<nil> Labels:map[] Annotations:map[] OwnerReferences:[] Initializers:nil Finalizers:[] ClusterName:} Spec:{Type:tensorboard TensorboardSpec:{LogDir:gs://xxxxxx/Trainer/output/5673/20190903T162556.841158-990bccf18ee5/serving_model_dir} PodTemplateSpec:{ObjectMeta:{Name: GenerateName: Namespace: SelfLink: UID: ResourceVersion: Generation:0 CreationTimestamp:0001-01-01 00:00:00 +0000 UTC DeletionTimestamp:<nil> DeletionGracePeriodSeconds:<nil> Labels:map[] Annotations:map[] OwnerReferences:[] Initializers:nil Finalizers:[] ClusterName:} Spec:{Volumes:[{Name:gcp-credentials VolumeSource:{HostPath:nil EmptyDir:nil GCEPersistentDisk:nil AWSElasticBlockStore:nil GitRepo:nil Secret:nil NFS:nil ISCSI:nil Glusterfs:nil PersistentVolumeClaim:nil RBD:nil FlexVolume:nil Cinder:nil CephFS:nil Flocker:nil DownwardAPI:nil FC:nil AzureFile:nil ConfigMap:nil VsphereVolume:nil Quobyte:nil AzureDisk:nil PhotonPersistentDisk:nil Projected:nil PortworxVolume:nil ScaleIO:nil StorageOS:nil}}] InitContainers:[] Containers:[{Name: Image: Command:[] Args:[] WorkingDir: Ports:[] EnvFrom:[] Env:[{Name:GOOGLE_APPLICATION_CREDENTIALS Value:/secret/gcp-credentials/user-gcp-sa.json ValueFrom:nil}] Resources:{Limits:map[] Requests:map[]} VolumeMounts:[{Name:gcp-credentials ReadOnly:true MountPath:/secret/gcp-credentials/user-gcp-sa.json SubPath: MountPropagation:<nil>}] VolumeDevices:[] LivenessProbe:nil ReadinessProbe:nil Lifecycle:nil TerminationMessagePath: TerminationMessagePolicy: ImagePullPolicy: SecurityContext:nil Stdin:false StdinOnce:false TTY:false}] RestartPolicy: TerminationGracePeriodSeconds:<nil> ActiveDeadlineSeconds:<nil> DNSPolicy: NodeSelector:map[] ServiceAccountName: DeprecatedServiceAccount: AutomountServiceAccountToken:<nil> NodeName: HostNetwork:false HostPID:false HostIPC:false ShareProcessNamespace:<nil> SecurityContext:nil ImagePullSecrets:[] Hostname: Subdomain: Affinity:nil SchedulerName: Tolerations:[] HostAliases:[] PriorityClassName: Priority:<nil> DNSConfig:nil ReadinessGates:[]}}}}
E0923 20:46:11.259837       1 runtime.go:66] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20180621070125-103fd098999d/pkg/util/runtime/runtime.go:72
/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20180621070125-103fd098999d/pkg/util/runtime/runtime.go:65
/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20180621070125-103fd098999d/pkg/util/runtime/runtime.go:51
/usr/local/go/src/runtime/panic.go:522
/usr/local/go/src/runtime/panic.go:82
/usr/local/go/src/runtime/signal_unix.go:390
/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.0.0-20181121180216-5558165425ef/pkg/client/interfaces.go:154
/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.0.0-20181121180216-5558165425ef/pkg/client/typed_client.go:75
/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.0.0-20181121180216-5558165425ef/pkg/client/client.go:121
/src/github.com/kubeflow/pipelines/backend/src/crd/controller/viewer/reconciler/reconciler.go:276
/src/github.com/kubeflow/pipelines/backend/src/crd/controller/viewer/reconciler/reconciler.go:94
/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.0.0-20181121180216-5558165425ef/pkg/internal/controller/controller.go:213
/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.0.0-20181121180216-5558165425ef/pkg/internal/controller/controller.go:158
/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20180621070125-103fd098999d/pkg/util/wait/wait.go:133
/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20180621070125-103fd098999d/pkg/util/wait/wait.go:134
/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20180621070125-103fd098999d/pkg/util/wait/wait.go:88
/usr/local/go/src/runtime/asm_amd64.s:1337
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0xeb2858]

goroutine 224 [running]:
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
	/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20180621070125-103fd098999d/pkg/util/runtime/runtime.go:58 +0x105
panic(0x1142100, 0x1e37850)
	/usr/local/go/src/runtime/panic.go:522 +0x1b5
sigs.k8s.io/controller-runtime/pkg/client.(*DeleteOptions).ApplyOptions(0xc000757f60, 0xc000011358, 0x1, 0x1, 0x0)
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.0.0-20181121180216-5558165425ef/pkg/client/interfaces.go:154 +0x38
sigs.k8s.io/controller-runtime/pkg/client.(*typedClient).Delete(0xc000140d00, 0x1457220, 0xc000050078, 0x14376c0, 0xc000a92000, 0xc000011358, 0x1, 0x1, 0x1, 0xc000011358)
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.0.0-20181121180216-5558165425ef/pkg/client/typed_client.go:75 +0x1d7
sigs.k8s.io/controller-runtime/pkg/client.(*client).Delete(0xc000140d00, 0x1457220, 0xc000050078, 0x14376c0, 0xc000a92000, 0xc000011358, 0x1, 0x1, 0xc00039ae00, 0x1ddeac6)
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.0.0-20181121180216-5558165425ef/pkg/client/client.go:121 +0x11b
github.com/kubeflow/pipelines/backend/src/crd/controller/viewer/reconciler.(*Reconciler).maybeDeleteOldestViewer(0xc000375f80, 0xc0005310d0, 0xb, 0x11, 0xc000933ce0)
	/src/github.com/kubeflow/pipelines/backend/src/crd/controller/viewer/reconciler/reconciler.go:276 +0x273
github.com/kubeflow/pipelines/backend/src/crd/controller/viewer/reconciler.(*Reconciler).Reconcile(0xc000375f80, 0xc00035c020, 0x8, 0xc0007c61e0, 0x2f, 0x1e4ad20, 0xb9, 0xc0004695c0, 0xc0004695c0)
	/src/github.com/kubeflow/pipelines/backend/src/crd/controller/viewer/reconciler/reconciler.go:94 +0x3e9
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc0004c80a0, 0x0)
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.0.0-20181121180216-5558165425ef/pkg/internal/controller/controller.go:213 +0x213
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1()
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.0.0-20181121180216-5558165425ef/pkg/internal/controller/controller.go:158 +0x36
k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1(0xc0002eeed0)
	/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20180621070125-103fd098999d/pkg/util/wait/wait.go:133 +0x54
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc0002eeed0, 0x3b9aca00, 0x0, 0x1, 0xc0000ba120)
	/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20180621070125-103fd098999d/pkg/util/wait/wait.go:134 +0xf8
k8s.io/apimachinery/pkg/util/wait.Until(0xc0002eeed0, 0x3b9aca00, 0xc0000ba120)
	/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20180621070125-103fd098999d/pkg/util/wait/wait.go:88 +0x4d
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.0.0-20181121180216-5558165425ef/pkg/internal/controller/controller.go:157 +0x311
@jessiezcc added the priority/p2 and status/triaged labels Sep 27, 2019
@jessiezcc added this to the M11 milestone Sep 27, 2019

@xiaogaozi commented Dec 3, 2019

Same here, we encounter this bug as well. Our only workaround is to delete all viewers and then restart the pod.

@andrewsmartin (Contributor) commented

Hi, just wanted to ask if there are any updates on this or fixes in progress. We run into this quite regularly due to the high volume of TensorBoard instances on a shared cluster, and we have to clean up old instances manually whenever this happens.

@neuromage (Contributor) commented

/assign @jingzhang36

@mattnworb (Contributor, Author) commented

I believe the panic is caused by passing nil as the last argument in

return r.Client.Delete(context.Background(), oldest, nil)

(this line is present in the master branch as well)

The godoc for the sigs.k8s.io/controller-runtime/pkg/client.Client interface: https://godoc.org/sigs.k8s.io/controller-runtime/pkg/client#Writer

// Delete deletes the given obj from Kubernetes cluster.
Delete(ctx context.Context, obj runtime.Object, opts ...DeleteOption) error

DeleteOption is defined as type DeleteOptionFunc func(*DeleteOptions).

Eventually client.Client passes the opts ...DeleteOption argument into https://github.com/kubernetes-sigs/controller-runtime/blob/5e1fc314d87051fda5790d102a2572e60f3fe2df/pkg/client/interfaces.go#L150-L157

and so the panic occurs when optFunc(o) is executed, since optFunc is nil at that point.
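
For illustration, here is a minimal, self-contained sketch of that failure mode. It mimics the DeleteOptionFunc pattern with local stand-in types rather than importing controller-runtime (options, optionFunc, and applyOptions below are hypothetical names, not the real API):

package main

import "fmt"

// options stands in for client.DeleteOptions, and optionFunc stands in for
// client.DeleteOptionFunc (func(*DeleteOptions)). Both are local stand-ins,
// not the real controller-runtime types.
type options struct{}

type optionFunc func(*options)

// applyOptions mirrors the behaviour described above: it invokes every
// function in the variadic slice without checking for nil entries.
func applyOptions(o *options, opts ...optionFunc) {
	for _, f := range opts {
		f(o) // nil pointer dereference if f is nil
	}
}

func main() {
	// Delete(ctx, obj) with no options: opts is an empty slice, nothing is called.
	applyOptions(&options{})
	fmt.Println("no options: ok")

	// Delete(ctx, obj, nil): the single nil becomes []optionFunc{nil}, and
	// invoking that nil function panics, matching the trace above.
	defer func() {
		if r := recover(); r != nil {
			fmt.Println("explicit nil option panicked:", r)
		}
	}()
	applyOptions(&options{}, nil)
}

Dropping the explicit nil (or passing a real option function) leaves the variadic slice empty or non-nil, so nothing nil is ever invoked and no panic occurs.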

@mattnworb (Contributor, Author) commented

reconciler_test.go has a test that should cover the maybeDeleteOldestViewer logic, but the fake client that controller-runtime gives you via sigs.k8s.io/controller-runtime/pkg/client/fake.NewClient() ignores the opts ...DeleteOption argument altogether 🙃

https://github.com/kubernetes-sigs/controller-runtime/blob/5558165425ef09b6e5c3f61d3457fbeca0d6d579/pkg/client/fake/client.go#L128
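
To illustrate why such a test stays green, here is a rough sketch contrasting a delete that applies its options, as the real typed client does, with one that silently ignores them, as the fake client does. The names (deleteOptions, deleteOption, realDelete, fakeDelete) are hypothetical stand-ins, not the actual controller-runtime or fake-client code:

package main

import "fmt"

// Hypothetical stand-ins for illustration only.
type deleteOptions struct{}

type deleteOption func(*deleteOptions)

// realDelete mimics the typed client: it applies every option it receives, so
// an explicit nil option is invoked and dereferences a nil function pointer.
func realDelete(opts ...deleteOption) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("panic: %v", r)
		}
	}()
	o := &deleteOptions{}
	for _, f := range opts {
		f(o)
	}
	return nil
}

// fakeDelete mimics a fake client that accepts the opts argument but never
// applies it, so a nil option goes unnoticed.
func fakeDelete(opts ...deleteOption) error {
	return nil
}

func main() {
	fmt.Println("fake client, nil option:", fakeDelete(nil)) // <nil>: a test would pass
	fmt.Println("real client, nil option:", realDelete(nil)) // panic surfaced as error
}

Running this prints a nil error for the fake path and a recovered nil-pointer panic for the real path, which is why the unit test can pass while the deployed controller crashes.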

mattnworb added a commit to mattnworb/pipelines that referenced this issue Jan 16, 2020
addresses issue kubeflow#2253. Passing `nil` as an argument to the varargs
`...DeleteOption` parameter of `client.Delete()` will panic, since
controller-runtime will iterate over the DeleteOption array and invoke each
entry (DeleteOption is defined as
`type DeleteOptionFunc func(*DeleteOptions)`).

@descampsk commented

Raised the same issue here: #2757

@mattnworb (Contributor, Author) commented

I’ve submitted a fix in #2853

k8s-ci-robot pushed a commit that referenced this issue Jan 17, 2020

@Bobgy (Contributor) commented Jan 22, 2020

It seems this is already fixed in #2853
/close

Please reopen if it's not the case

@k8s-ci-robot (Contributor) commented

@Bobgy: Closing this issue.

In response to this:

It seems this is already fixed in #2853
/close

Please reopen if it's not the case

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Jeffwan pushed a commit to Jeffwan/pipelines that referenced this issue Dec 9, 2020