[backend] Pipeline first step stuck in running state even after completing #7132

Closed
RobinKa opened this issue Jan 1, 2022 · 3 comments

RobinKa commented Jan 1, 2022

Hey, I hope this is the right place to post this issue. I'm new to Kubeflow and Kubernetes, so please let me know what else would be useful to know.

Environment

  • How did you deploy Kubeflow Pipelines (KFP): Installed Kubeflow on Kubernetes 1.19 with manifests, see below
  • KFP version: 1.7.0
  • KFP SDK version: build version dev_local
  • Server specs: 8 CPUs, 16GB RAM, 240GB SSD (Hetzner Cloud CPX41), Ubuntu 20.04

Steps to reproduce

  1. Install Kubeflow on Kubernetes 1.19 (I used k3s) with manifests; the full setup script is in the materials below
  2. Go to Kubeflow dashboard
  3. Start a Pipeline run for [Tutorial] DSL - Control structures
  4. The first step completes successfully (e.g. it logs "tails"), but stays stuck in the running state

Terminating the run does nothing. I also tried running other pipelines and the result is the same.

Expected result

The pipeline step should complete and run the rest of the pipeline.

Materials and Reference

Setup on Ubuntu 20.04 server from scratch

sudo apt update -y && sudo apt upgrade -y

# Install docker
sudo apt install -y ca-certificates curl gnupg lsb-release
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg
echo \
  "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/ubuntu \
  $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt update
sudo apt install -y docker-ce docker-ce-cli containerd.io

# Install k3s 1.19 (I tried 1.20 too which had the same issue, but 1.21 is too new for manifests)
export INSTALL_K3S_VERSION="v1.19.16%2Bk3s1"
curl -sfL https://get.k3s.io | sh -

# Get Kustomize 3.2.0
cd /opt/
wget https://github.com/kubernetes-sigs/kustomize/releases/download/v3.2.0/kustomize_3.2.0_linux_amd64
chmod +x kustomize_3.2.0_linux_amd64
ln -s /opt/kustomize_3.2.0_linux_amd64 /usr/bin/kustomize

# Install Kubeflow using manifests
git clone https://github.com/kubeflow/manifests.git
cd manifests
while ! kustomize build example | kubectl apply -f -; do echo "Retrying to apply resources"; sleep 10; done

# Portforward Kubeflow dashboard in new tmux session
tmux new -d -s kubeflow-dashboard-portforward "kubectl port-forward svc/istio-ingressgateway -n istio-system 8080:80"

kubectl get pods output

image

kubectl logs conditional-execution-pipeline-with-exit-handler-scjtr-3243716801 -c wait -n kubeflow-user-example-com

...
time="2022-01-01T15:31:50.462Z" level=info msg="listed containers" containers="map[]"
time="2022-01-01T15:31:51.462Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=kubeflow-user-example-com --filter=label=io.kubernetes.pod.name=conditional-execution-pipeline-with-exit-handler-scjtr-3243716801"
time="2022-01-01T15:31:51.498Z" level=info msg="listed containers" containers="map[]"
time="2022-01-01T15:31:52.498Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=kubeflow-user-example-com --filter=label=io.kubernetes.pod.name=conditional-execution-pipeline-with-exit-handler-scjtr-3243716801"
time="2022-01-01T15:31:52.525Z" level=info msg="listed containers" containers="map[]"
time="2022-01-01T15:31:53.525Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=kubeflow-user-example-com --filter=label=io.kubernetes.pod.name=conditional-execution-pipeline-with-exit-handler-scjtr-3243716801"
... (keeps going)
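
In the log above, every `docker ps` poll is followed by `containers="map[]"`, which suggests the wait sidecar never finds the pod's containers through Docker. A minimal sketch for counting those empty listings in a saved log; `count_empty_listings` is a hypothetical helper name, not part of Argo or KFP:

```shell
# Hypothetical helper: count log lines where the wait sidecar listed zero containers.
# Reads the log on stdin and prints the number of matching lines.
count_empty_listings() {
  # -F matches the text literally, so the brackets in map[] need no escaping.
  # grep exits non-zero when the count is 0; "|| true" keeps callers from failing.
  grep -cF 'msg="listed containers" containers="map[]"' || true
}

# Example with two sample lines in the shape shown above (timestamps shortened).
printf '%s\n' \
  'time="T1" level=info msg="listed containers" containers="map[]"' \
  'time="T2" level=info msg="docker ps --all --no-trunc"' \
  | count_empty_listings
```

Piping `kubectl logs <pod> -c wait -n kubeflow-user-example-com` into the function should give a count that keeps growing while the step is stuck.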

Step Events tab

kind: EventList
apiVersion: v1
metadata:
  selfLink: /api/v1/namespaces/kubeflow-user-example-com/events
  resourceVersion: '27545'
items:
  - metadata:
      name: >-
        conditional-execution-pipeline-with-exit-handler-scjtr-3243716801.16c62dd37b36b69a
      namespace: kubeflow-user-example-com
      selfLink: >-
        /api/v1/namespaces/kubeflow-user-example-com/events/conditional-execution-pipeline-with-exit-handler-scjtr-3243716801.16c62dd37b36b69a
      uid: 76728849-0449-4142-a4a1-bf839192d0a2
      resourceVersion: '6986'
      creationTimestamp: '2022-01-01T15:05:00Z'
      managedFields:
        - manager: k3s
          operation: Update
          apiVersion: events.k8s.io/v1
          time: '2022-01-01T15:05:00Z'
          fieldsType: FieldsV1
          fieldsV1:
            'f:action': {}
            'f:eventTime': {}
            'f:note': {}
            'f:reason': {}
            'f:regarding':
              'f:apiVersion': {}
              'f:kind': {}
              'f:name': {}
              'f:namespace': {}
              'f:resourceVersion': {}
              'f:uid': {}
            'f:reportingController': {}
            'f:reportingInstance': {}
            'f:type': {}
    involvedObject:
      kind: Pod
      namespace: kubeflow-user-example-com
      name: conditional-execution-pipeline-with-exit-handler-scjtr-3243716801
      uid: 542e569f-178b-42eb-a7e7-d07ea643178d
      apiVersion: v1
      resourceVersion: '6983'
    reason: Scheduled
    message: >-
      Successfully assigned
      kubeflow-user-example-com/conditional-execution-pipeline-with-exit-handler-scjtr-3243716801
      to ubuntu-2gb-fsn1-2
    source: {}
    firstTimestamp: null
    lastTimestamp: null
    type: Normal
    eventTime: '2022-01-01T15:05:00.551652Z'
    action: Binding
    reportingComponent: default-scheduler
    reportingInstance: default-scheduler-ubuntu-2gb-fsn1-2
  - metadata:
      name: >-
        conditional-execution-pipeline-with-exit-handler-scjtr-3243716801.16c62dd39bb346f0
      namespace: kubeflow-user-example-com
      selfLink: >-
        /api/v1/namespaces/kubeflow-user-example-com/events/conditional-execution-pipeline-with-exit-handler-scjtr-3243716801.16c62dd39bb346f0
      uid: 72582b0d-a333-4284-8f04-d90e11547f28
      resourceVersion: '6997'
      creationTimestamp: '2022-01-01T15:05:01Z'
      managedFields:
        - manager: k3s
          operation: Update
          apiVersion: v1
          time: '2022-01-01T15:05:01Z'
          fieldsType: FieldsV1
          fieldsV1:
            'f:count': {}
            'f:firstTimestamp': {}
            'f:involvedObject':
              'f:apiVersion': {}
              'f:fieldPath': {}
              'f:kind': {}
              'f:name': {}
              'f:namespace': {}
              'f:resourceVersion': {}
              'f:uid': {}
            'f:lastTimestamp': {}
            'f:message': {}
            'f:reason': {}
            'f:source':
              'f:component': {}
              'f:host': {}
            'f:type': {}
    involvedObject:
      kind: Pod
      namespace: kubeflow-user-example-com
      name: conditional-execution-pipeline-with-exit-handler-scjtr-3243716801
      uid: 542e569f-178b-42eb-a7e7-d07ea643178d
      apiVersion: v1
      resourceVersion: '6984'
      fieldPath: 'spec.containers{wait}'
    reason: Pulling
    message: >-
      Pulling image
      "gcr.io/ml-pipeline/argoexec:v3.1.6-patch-license-compliance"
    source:
      component: kubelet
      host: ubuntu-2gb-fsn1-2
    firstTimestamp: '2022-01-01T15:05:01Z'
    lastTimestamp: '2022-01-01T15:05:01Z'
    count: 1
    type: Normal
    eventTime: null
    reportingComponent: ''
    reportingInstance: ''
  - metadata:
      name: >-
        conditional-execution-pipeline-with-exit-handler-scjtr-3243716801.16c62dd4e0da1c08
      namespace: kubeflow-user-example-com
      selfLink: >-
        /api/v1/namespaces/kubeflow-user-example-com/events/conditional-execution-pipeline-with-exit-handler-scjtr-3243716801.16c62dd4e0da1c08
      uid: 7ea0cdf6-001a-4432-8e74-bd5a1e80bfcb
      resourceVersion: '7122'
      creationTimestamp: '2022-01-01T15:05:06Z'
      managedFields:
        - manager: k3s
          operation: Update
          apiVersion: v1
          time: '2022-01-01T15:05:06Z'
          fieldsType: FieldsV1
          fieldsV1:
            'f:count': {}
            'f:firstTimestamp': {}
            'f:involvedObject':
              'f:apiVersion': {}
              'f:fieldPath': {}
              'f:kind': {}
              'f:name': {}
              'f:namespace': {}
              'f:resourceVersion': {}
              'f:uid': {}
            'f:lastTimestamp': {}
            'f:message': {}
            'f:reason': {}
            'f:source':
              'f:component': {}
              'f:host': {}
            'f:type': {}
    involvedObject:
      kind: Pod
      namespace: kubeflow-user-example-com
      name: conditional-execution-pipeline-with-exit-handler-scjtr-3243716801
      uid: 542e569f-178b-42eb-a7e7-d07ea643178d
      apiVersion: v1
      resourceVersion: '6984'
      fieldPath: 'spec.containers{wait}'
    reason: Pulled
    message: >-
      Successfully pulled image
      "gcr.io/ml-pipeline/argoexec:v3.1.6-patch-license-compliance" in
      5.455115342s
    source:
      component: kubelet
      host: ubuntu-2gb-fsn1-2
    firstTimestamp: '2022-01-01T15:05:06Z'
    lastTimestamp: '2022-01-01T15:05:06Z'
    count: 1
    type: Normal
    eventTime: null
    reportingComponent: ''
    reportingInstance: ''
  - metadata:
      name: >-
        conditional-execution-pipeline-with-exit-handler-scjtr-3243716801.16c62dd4f2b7ba3b
      namespace: kubeflow-user-example-com
      selfLink: >-
        /api/v1/namespaces/kubeflow-user-example-com/events/conditional-execution-pipeline-with-exit-handler-scjtr-3243716801.16c62dd4f2b7ba3b
      uid: b52af29f-96d3-4da0-900b-73fce909d9fe
      resourceVersion: '7123'
      creationTimestamp: '2022-01-01T15:05:06Z'
      managedFields:
        - manager: k3s
          operation: Update
          apiVersion: v1
          time: '2022-01-01T15:05:06Z'
          fieldsType: FieldsV1
          fieldsV1:
            'f:count': {}
            'f:firstTimestamp': {}
            'f:involvedObject':
              'f:apiVersion': {}
              'f:fieldPath': {}
              'f:kind': {}
              'f:name': {}
              'f:namespace': {}
              'f:resourceVersion': {}
              'f:uid': {}
            'f:lastTimestamp': {}
            'f:message': {}
            'f:reason': {}
            'f:source':
              'f:component': {}
              'f:host': {}
            'f:type': {}
    involvedObject:
      kind: Pod
      namespace: kubeflow-user-example-com
      name: conditional-execution-pipeline-with-exit-handler-scjtr-3243716801
      uid: 542e569f-178b-42eb-a7e7-d07ea643178d
      apiVersion: v1
      resourceVersion: '6984'
      fieldPath: 'spec.containers{wait}'
    reason: Created
    message: Created container wait
    source:
      component: kubelet
      host: ubuntu-2gb-fsn1-2
    firstTimestamp: '2022-01-01T15:05:06Z'
    lastTimestamp: '2022-01-01T15:05:06Z'
    count: 1
    type: Normal
    eventTime: null
    reportingComponent: ''
    reportingInstance: ''
  - metadata:
      name: >-
        conditional-execution-pipeline-with-exit-handler-scjtr-3243716801.16c62dd4f7efebb2
      namespace: kubeflow-user-example-com
      selfLink: >-
        /api/v1/namespaces/kubeflow-user-example-com/events/conditional-execution-pipeline-with-exit-handler-scjtr-3243716801.16c62dd4f7efebb2
      uid: 260352e6-ab20-4ae7-b522-0f00614d0e6b
      resourceVersion: '7128'
      creationTimestamp: '2022-01-01T15:05:06Z'
      managedFields:
        - manager: k3s
          operation: Update
          apiVersion: v1
          time: '2022-01-01T15:05:06Z'
          fieldsType: FieldsV1
          fieldsV1:
            'f:count': {}
            'f:firstTimestamp': {}
            'f:involvedObject':
              'f:apiVersion': {}
              'f:fieldPath': {}
              'f:kind': {}
              'f:name': {}
              'f:namespace': {}
              'f:resourceVersion': {}
              'f:uid': {}
            'f:lastTimestamp': {}
            'f:message': {}
            'f:reason': {}
            'f:source':
              'f:component': {}
              'f:host': {}
            'f:type': {}
    involvedObject:
      kind: Pod
      namespace: kubeflow-user-example-com
      name: conditional-execution-pipeline-with-exit-handler-scjtr-3243716801
      uid: 542e569f-178b-42eb-a7e7-d07ea643178d
      apiVersion: v1
      resourceVersion: '6984'
      fieldPath: 'spec.containers{wait}'
    reason: Started
    message: Started container wait
    source:
      component: kubelet
      host: ubuntu-2gb-fsn1-2
    firstTimestamp: '2022-01-01T15:05:06Z'
    lastTimestamp: '2022-01-01T15:05:06Z'
    count: 1
    type: Normal
    eventTime: null
    reportingComponent: ''
    reportingInstance: ''
  - metadata:
      name: >-
        conditional-execution-pipeline-with-exit-handler-scjtr-3243716801.16c62dd4f83995c4
      namespace: kubeflow-user-example-com
      selfLink: >-
        /api/v1/namespaces/kubeflow-user-example-com/events/conditional-execution-pipeline-with-exit-handler-scjtr-3243716801.16c62dd4f83995c4
      uid: eb424731-c7df-43c5-a08c-5da8ace8a81f
      resourceVersion: '7129'
      creationTimestamp: '2022-01-01T15:05:06Z'
      managedFields:
        - manager: k3s
          operation: Update
          apiVersion: v1
          time: '2022-01-01T15:05:06Z'
          fieldsType: FieldsV1
          fieldsV1:
            'f:count': {}
            'f:firstTimestamp': {}
            'f:involvedObject':
              'f:apiVersion': {}
              'f:fieldPath': {}
              'f:kind': {}
              'f:name': {}
              'f:namespace': {}
              'f:resourceVersion': {}
              'f:uid': {}
            'f:lastTimestamp': {}
            'f:message': {}
            'f:reason': {}
            'f:source':
              'f:component': {}
              'f:host': {}
            'f:type': {}
    involvedObject:
      kind: Pod
      namespace: kubeflow-user-example-com
      name: conditional-execution-pipeline-with-exit-handler-scjtr-3243716801
      uid: 542e569f-178b-42eb-a7e7-d07ea643178d
      apiVersion: v1
      resourceVersion: '6984'
      fieldPath: 'spec.containers{main}'
    reason: Pulled
    message: 'Container image "python:3.7" already present on machine'
    source:
      component: kubelet
      host: ubuntu-2gb-fsn1-2
    firstTimestamp: '2022-01-01T15:05:06Z'
    lastTimestamp: '2022-01-01T15:05:06Z'
    count: 1
    type: Normal
    eventTime: null
    reportingComponent: ''
    reportingInstance: ''
  - metadata:
      name: >-
        conditional-execution-pipeline-with-exit-handler-scjtr-3243716801.16c62dd4fad8f088
      namespace: kubeflow-user-example-com
      selfLink: >-
        /api/v1/namespaces/kubeflow-user-example-com/events/conditional-execution-pipeline-with-exit-handler-scjtr-3243716801.16c62dd4fad8f088
      uid: f22d88e1-0d43-4a99-b44e-f3a1d44cf524
      resourceVersion: '7130'
      creationTimestamp: '2022-01-01T15:05:06Z'
      managedFields:
        - manager: k3s
          operation: Update
          apiVersion: v1
          time: '2022-01-01T15:05:06Z'
          fieldsType: FieldsV1
          fieldsV1:
            'f:count': {}
            'f:firstTimestamp': {}
            'f:involvedObject':
              'f:apiVersion': {}
              'f:fieldPath': {}
              'f:kind': {}
              'f:name': {}
              'f:namespace': {}
              'f:resourceVersion': {}
              'f:uid': {}
            'f:lastTimestamp': {}
            'f:message': {}
            'f:reason': {}
            'f:source':
              'f:component': {}
              'f:host': {}
            'f:type': {}
    involvedObject:
      kind: Pod
      namespace: kubeflow-user-example-com
      name: conditional-execution-pipeline-with-exit-handler-scjtr-3243716801
      uid: 542e569f-178b-42eb-a7e7-d07ea643178d
      apiVersion: v1
      resourceVersion: '6984'
      fieldPath: 'spec.containers{main}'
    reason: Created
    message: Created container main
    source:
      component: kubelet
      host: ubuntu-2gb-fsn1-2
    firstTimestamp: '2022-01-01T15:05:06Z'
    lastTimestamp: '2022-01-01T15:05:06Z'
    count: 1
    type: Normal
    eventTime: null
    reportingComponent: ''
    reportingInstance: ''
  - metadata:
      name: >-
        conditional-execution-pipeline-with-exit-handler-scjtr-3243716801.16c62dd4ff8dd0c4
      namespace: kubeflow-user-example-com
      selfLink: >-
        /api/v1/namespaces/kubeflow-user-example-com/events/conditional-execution-pipeline-with-exit-handler-scjtr-3243716801.16c62dd4ff8dd0c4
      uid: 798f6d2a-d3eb-4d2d-b449-7eeb6d7c285c
      resourceVersion: '7133'
      creationTimestamp: '2022-01-01T15:05:07Z'
      managedFields:
        - manager: k3s
          operation: Update
          apiVersion: v1
          time: '2022-01-01T15:05:07Z'
          fieldsType: FieldsV1
          fieldsV1:
            'f:count': {}
            'f:firstTimestamp': {}
            'f:involvedObject':
              'f:apiVersion': {}
              'f:fieldPath': {}
              'f:kind': {}
              'f:name': {}
              'f:namespace': {}
              'f:resourceVersion': {}
              'f:uid': {}
            'f:lastTimestamp': {}
            'f:message': {}
            'f:reason': {}
            'f:source':
              'f:component': {}
              'f:host': {}
            'f:type': {}
    involvedObject:
      kind: Pod
      namespace: kubeflow-user-example-com
      name: conditional-execution-pipeline-with-exit-handler-scjtr-3243716801
      uid: 542e569f-178b-42eb-a7e7-d07ea643178d
      apiVersion: v1
      resourceVersion: '6984'
      fieldPath: 'spec.containers{main}'
    reason: Started
    message: Started container main
    source:
      component: kubelet
      host: ubuntu-2gb-fsn1-2
    firstTimestamp: '2022-01-01T15:05:07Z'
    lastTimestamp: '2022-01-01T15:05:07Z'
    count: 1
    type: Normal
    eventTime: null
    reportingComponent: ''
    reportingInstance: ''

Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.


RobinKa commented Jan 1, 2022

Using the pns executor instead makes everything work as described here

kustomize build apps/pipeline/upstream/env/platform-agnostic-multi-user-pns | kubectl apply -f -

So I assume I made a mistake in my Docker setup? The manifests README does not say much about Docker, though.
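
One way to check this: k3s runs pods with its embedded containerd unless it is started with `--docker`, so installing Docker on the host does not by itself make the pods visible to `docker ps`. A minimal sketch, assuming `kubectl` access to the cluster; `classify_runtime` is a hypothetical helper name introduced only for illustration:

```shell
# Hypothetical helper: classify a node's containerRuntimeVersion string.
classify_runtime() {
  case "$1" in
    docker://*)     echo "docker" ;;
    containerd://*) echo "containerd" ;;
    *)              echo "unknown" ;;
  esac
}

# Read the runtime reported by the first node, if kubectl is available.
if command -v kubectl >/dev/null 2>&1; then
  runtime=$(kubectl get nodes \
    -o jsonpath='{.items[0].status.nodeInfo.containerRuntimeVersion}' 2>/dev/null || true)
  echo "node runtime: $(classify_runtime "$runtime")"
fi
```

If this reports `containerd`, the Docker executor has nothing to list, which would be consistent with the pns executor working here.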

@zijianjoy
Collaborator

Hello @RobinKa, you can switch over to the emissary executor, since that is going to be the default executor going forward. See #5714.

RobinKa closed this as completed Apr 13, 2022
@ketangangal

Even with the emissary executor set up properly, a component will sometimes get stuck.
