Error running TF training job on KF 0.3.5 using GPUs #686

Closed
sagravat opened this issue Jan 15, 2019 · 3 comments

@sagravat

I'm suddenly having an issue running a TF/Keras training job on KF using GPUs (this was not happening before the new year). My Docker container runs fine on the Google Deep Learning VM, but it dies without any obvious errors when running on KF. I suspect my container is referencing a different CUDA driver version than the one deployed on the node pool.
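
To check this, the kind of sanity script I'd run inside the container looks roughly like the sketch below (TF 1.x APIs assumed; the file name is just for illustration):

```python
# check_gpu.py (hypothetical name): confirm which CUDA build TensorFlow was
# compiled against and whether a GPU is actually visible inside the container.
import tensorflow as tf
from tensorflow.python.client import device_lib

print("TensorFlow version:", tf.VERSION)          # TF 1.x
print("Built with CUDA:", tf.test.is_built_with_cuda())
print("GPU device:", tf.test.gpu_device_name())   # empty string if no GPU is visible

# List every device TensorFlow can see, with the memory it thinks is available.
for dev in device_lib.list_local_devices():
    print(dev.name, dev.device_type, dev.memory_limit)
```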

I've tried building my container with both tensorflow/tensorflow:latest-gpu and gcr.io/ml-pipeline/ml-pipeline-kubeflow-tf-trainer-gpu:d3c4add0a95e930c70a330466d0923827784eb9a.

Here's the output of my job. There is no obvious error in the log, but the KF job reports a failure with exit code 139 (which typically means the process was terminated by SIGSEGV).

```
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Sep__1_21:08:03_CDT_2017
Cuda compilation tools, release 9.0, V9.0.176
Downloading data from
https://github.com/fchollet/deep-learning-models/releases/download/v0.1/vgg16_weights_tf_dim_ordering_tf_kernels_notop.h5

2019-01-15 05:49:31.189450: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-01-15 05:49:31.297537: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-01-15 05:49:31.298239: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1411] Found device 0 with properties:
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:00:04.0
totalMemory: 11.17GiB freeMemory: 11.10GiB
2019-01-15 05:49:31.298281: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1490] Adding visible gpu devices: 0
2019-01-15 05:49:31.630491: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-01-15 05:49:31.630598: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] 0
2019-01-15 05:49:31.630616: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0: N
2019-01-15 05:49:31.631019: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1103] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10757 MB memory) -> physical GPU (device: 0, name: Tesla K80, pci bus id: 0000:00:04.0, compute capability: 3.7)
```

Is there an example of a Docker container that runs a Keras model on GPU that is known to work?
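
For context, the kind of stripped-down script I'd want to validate first is roughly the sketch below (random data and a tiny model, nothing specific to my actual job):

```python
# Minimal tf.keras GPU smoke test (sketch): a tiny model on random data, so it
# should fit easily in K80 memory. Assumes a tensorflow-gpu 1.x base image.
import numpy as np
import tensorflow as tf
from tensorflow import keras

x = np.random.rand(256, 32).astype("float32")
y = np.random.randint(0, 2, size=(256,))

model = keras.Sequential([
    keras.layers.Dense(64, activation="relu", input_shape=(32,)),
    keras.layers.Dense(2, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x, y, epochs=2, batch_size=32)
print("GPU device:", tf.test.gpu_device_name() or "none")
```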

@sagravat
Author

The issue turned out to be an out-of-memory error on the GPU. I was using a K80 on the K8s cluster but a V100 on the Deep Learning VM, so I didn't notice it during local testing. It would be good to make this error more obvious to the user.
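
For anyone hitting the same thing, one common mitigation (a rough TF 1.x sketch, not necessarily what I ended up doing) is to stop TensorFlow from grabbing nearly all GPU memory up front and to shrink the batch size until the model fits on the K80's ~12 GB:

```python
# Sketch: let TensorFlow allocate GPU memory on demand instead of pre-allocating
# almost all of it at startup. Assumes TF 1.x with tf.keras.
import tensorflow as tf
from tensorflow import keras

config = tf.ConfigProto()
config.gpu_options.allow_growth = True
# Alternatively, cap the fraction of GPU memory the process may use:
# config.gpu_options.per_process_gpu_memory_fraction = 0.9
keras.backend.set_session(tf.Session(config=config))
```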

@paveldournov
Contributor

@sagravat - can you please share how you diagnosed and resolved the issue? Which logs did you analyze? This would help us figure out how to expose more error details.

@vicaire
Contributor

vicaire commented Mar 26, 2019

resolving in favor of #677

vicaire closed this as completed Mar 26, 2019