Installer freezes node on updating_container_ld_cache #80

Same as #71. Running on cos. Cluster version: 1.10.5-gke.0. Here's the log dump: https://gist.github.com/heroic/5bdc756732a8ec5d5081227d1cbb2048

Comments
@mindprince Any clues? |
What's the problem here? The installation seems to be complete in the log.
|
@mindprince The container I am using on this node doesn't have a /usr/local/nvidia. Shouldn't that be exposed to all containers after the installer is done via nvidia-gpu-device-plugin? |
Not to all containers. Only those that request GPUs. See
https://cloud.google.com/kubernetes-engine/docs/concepts/gpus
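For illustration, a minimal GPU-requesting pod might look something like this (a sketch only; the pod name and image are hypothetical, and it assumes a GKE cluster with a GPU node pool and the driver installer from this repo already running):
```
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test                  # hypothetical name
spec:
  containers:
  - name: gpu-test
    image: nvidia/cuda:9.0-base   # hypothetical image, for illustration only
    command: ["sleep", "infinity"]
    resources:
      limits:
        nvidia.com/gpu: 1         # requesting a GPU is what triggers the /usr/local/nvidia mount
```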
|
Yep. I am looking in the GPU pod itself. Here's the pod's YAML, and here's what's contained in /usr/local (both quoted in the reply below).
|
@mindprince Found the issue! Closing this! Thanks for bearing with me! |
Why do you think this is a GPU pod?
This pod is not requesting any GPUs; its spec has `resources: {}`.
You want to specify something like:
```
resources:
  limits:
    nvidia.com/gpu: 2
```
in your pod spec. Then you will see the libraries in /usr/local/nvidia/lib64.
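As a rough check (hypothetical placeholder pod name; assumes the pod has been re-created with a GPU limit as above), something like this should list the mounted driver libraries:
```
# Exec into the GPU-requesting pod and list the libraries the device plugin mounts in.
kubectl exec -it <gpu-pod-name> -n openfaas-fn -- ls /usr/local/nvidia/lib64
```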
On Tue, Jul 17, 2018 at 3:08 PM Amit Kumar wrote:
Yep. I am looking in the GPU pod itself. Here's the pod's YAML:
```
apiVersion: v1
kind: Pod
metadata:
  annotations:
    prometheus.io.scrape: "false"
  creationTimestamp: 2018-07-17T21:29:07Z
  generateName: ultron-776bb98fbb-
  labels:
    faas_function: ultron
    pod-template-hash: "3326654966"
    uid: "861317054"
  name: ultron-776bb98fbb-7zxhv
  namespace: openfaas-fn
  ownerReferences:
  - apiVersion: extensions/v1beta1
    blockOwnerDeletion: true
    controller: true
    kind: ReplicaSet
    name: ultron-776bb98fbb
    uid: 0688f010-8a06-11e8-a53c-42010a80009d
  resourceVersion: "50748"
  selfLink: /api/v1/namespaces/openfaas-fn/pods/ultron-776bb98fbb-7zxhv
  uid: 6fc5e972-8a08-11e8-a53c-42010a80009d
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: faas_function
            operator: In
            values:
            - ultron
        topologyKey: kubernetes.io/hostname
  containers:
  - env:
    - name: read_timeout
      value: 300s
    - name: write_timeout
      value: 300s
    - name: ack_wait
      value: 300s
    - name: exec_timeout
      value: 300s
    image: asia.gcr.io/galaxycard-490d9/ultron
    imagePullPolicy: Always
    livenessProbe:
      exec:
        command:
        - cat
        - /tmp/.lock
      failureThreshold: 3
      initialDelaySeconds: 3
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 1
    name: ultron
    ports:
    - containerPort: 8080
      protocol: TCP
    readinessProbe:
      exec:
        command:
        - cat
        - /tmp/.lock
      failureThreshold: 3
      initialDelaySeconds: 3
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 1
    resources: {}
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: default-token-fwk55
      readOnly: true
  dnsPolicy: ClusterFirst
  nodeName: gke-faas-pool-4-cpu-8-ram-26f8627d-g1ms
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - name: default-token-fwk55
    secret:
      defaultMode: 420
      secretName: default-token-fwk55
```
and here's what's contained in /usr/local:
```
***@***.***:/home/app# ls /usr/local
bin etc games include lib man sbin share src
```
|