Installer freezes node on updating_container_ld_cache #80

Same as #71. Running on cos. Cluster version: 1.10.5-gke.0. Here's the log dump: https://gist.github.com/heroic/5bdc756732a8ec5d5081227d1cbb2048

Comments
@mindprince Any clues? |
What's the problem here? The installation seems to be complete in the log.
|
@mindprince The container I am using on this node doesn't have a /usr/local/nvidia. Shouldn't that be exposed to all containers after the installer is done via nvidia-gpu-device-plugin? |
Not to all containers. Only those that request GPUs. See
https://cloud.google.com/kubernetes-engine/docs/concepts/gpus
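For illustration, a minimal GPU-requesting pod might look something like this (a sketch only; the pod name and image are hypothetical, and it assumes a GKE cluster with a GPU node pool and the driver installer from this repo already running):
```
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test                  # hypothetical name
spec:
  containers:
  - name: gpu-test
    image: nvidia/cuda:9.0-base   # hypothetical image, for illustration only
    command: ["sleep", "infinity"]
    resources:
      limits:
        nvidia.com/gpu: 1         # requesting a GPU is what triggers the /usr/local/nvidia mount
```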
|
Yep. I am looking in the GPU pod itself. Here's the pod's YAML, and here's what's contained in /usr/local (both quoted in the reply below).
|
@mindprince Found the issue! Closing this! Thanks for bearing with me! |
Why do you think this is a GPU pod?
This pod is not requesting any GPUs; its spec has `resources: {}`.
You want to specify something like:
```
resources:
  limits:
    nvidia.com/gpu: 2
```
in your pod spec. Then you will see the libraries in /usr/local/nvidia/lib64.
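As a rough check (hypothetical placeholder pod name; assumes the pod has been re-created with a GPU limit as above), something like this should list the mounted driver libraries:
```
# Exec into the GPU-requesting pod and list the libraries the device plugin mounts in.
kubectl exec -it <gpu-pod-name> -n openfaas-fn -- ls /usr/local/nvidia/lib64
```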
On Tue, Jul 17, 2018 at 3:08 PM Amit Kumar wrote:
Yep. I am looking in the GPU pod itself. Here's the pod's YAML:
```
apiVersion: v1
kind: Pod
metadata:
  annotations:
    prometheus.io.scrape: "false"
  creationTimestamp: 2018-07-17T21:29:07Z
  generateName: ultron-776bb98fbb-
  labels:
    faas_function: ultron
    pod-template-hash: "3326654966"
    uid: "861317054"
  name: ultron-776bb98fbb-7zxhv
  namespace: openfaas-fn
  ownerReferences:
  - apiVersion: extensions/v1beta1
    blockOwnerDeletion: true
    controller: true
    kind: ReplicaSet
    name: ultron-776bb98fbb
    uid: 0688f010-8a06-11e8-a53c-42010a80009d
  resourceVersion: "50748"
  selfLink: /api/v1/namespaces/openfaas-fn/pods/ultron-776bb98fbb-7zxhv
  uid: 6fc5e972-8a08-11e8-a53c-42010a80009d
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: faas_function
            operator: In
            values:
            - ultron
        topologyKey: kubernetes.io/hostname
  containers:
  - env:
    - name: read_timeout
      value: 300s
    - name: write_timeout
      value: 300s
    - name: ack_wait
      value: 300s
    - name: exec_timeout
      value: 300s
    image: asia.gcr.io/galaxycard-490d9/ultron
    imagePullPolicy: Always
    livenessProbe:
      exec:
        command:
        - cat
        - /tmp/.lock
      failureThreshold: 3
      initialDelaySeconds: 3
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 1
    name: ultron
    ports:
    - containerPort: 8080
      protocol: TCP
    readinessProbe:
      exec:
        command:
        - cat
        - /tmp/.lock
      failureThreshold: 3
      initialDelaySeconds: 3
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 1
    resources: {}
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: default-token-fwk55
      readOnly: true
  dnsPolicy: ClusterFirst
  nodeName: gke-faas-pool-4-cpu-8-ram-26f8627d-g1ms
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - name: default-token-fwk55
    secret:
      defaultMode: 420
      secretName: default-token-fwk55
```
and here's what's contained in /usr/local:
```
***@***.***:/home/app# ls /usr/local
bin etc games include lib man sbin share src
```
|