Error running TF training job on KF 0.3.5 using GPUs #686

Closed
sagravat opened this issue Jan 15, 2019 · 3 comments

@sagravat

I'm suddenly having an issue running a TF/Keras training job on KF using GPUs (this was not happening before the new year). My Docker container runs fine on the Google Deep Learning VM, but it dies without any obvious errors when running on KF. I suspect my container is referencing a different CUDA driver version than the one deployed on the node pool.
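
To check this, the kind of sanity script I'd run inside the container looks roughly like the sketch below (TF 1.x APIs assumed; the file name is just for illustration):

```python
# check_gpu.py (hypothetical name): confirm which CUDA build TensorFlow was
# compiled against and whether a GPU is actually visible inside the container.
import tensorflow as tf
from tensorflow.python.client import device_lib

print("TensorFlow version:", tf.VERSION)          # TF 1.x
print("Built with CUDA:", tf.test.is_built_with_cuda())
print("GPU device:", tf.test.gpu_device_name())   # empty string if no GPU is visible

# List every device TensorFlow can see, with the memory it thinks is available.
for dev in device_lib.list_local_devices():
    print(dev.name, dev.device_type, dev.memory_limit)
```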

I've tried building my container with both tensorflow/tensorflow:latest-gpu and gcr.io/ml-pipeline/ml-pipeline-kubeflow-tf-trainer-gpu:d3c4add0a95e930c70a330466d0923827784eb9a.

Here's the output of my job. There is no obvious error in the log, but the KF job reports a failure with exit code 139 (which typically means the process was terminated by SIGSEGV).

```
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Sep__1_21:08:03_CDT_2017
Cuda compilation tools, release 9.0, V9.0.176
Downloading data from
https://github.com/fchollet/deep-learning-models/releases/download/v0.1/vgg16_weights_tf_dim_ordering_tf_kernels_notop.h5

2019-01-15 05:49:31.189450: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-01-15 05:49:31.297537: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-01-15 05:49:31.298239: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1411] Found device 0 with properties:
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:00:04.0
totalMemory: 11.17GiB freeMemory: 11.10GiB
2019-01-15 05:49:31.298281: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1490] Adding visible gpu devices: 0
2019-01-15 05:49:31.630491: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-01-15 05:49:31.630598: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] 0
2019-01-15 05:49:31.630616: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0: N
2019-01-15 05:49:31.631019: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1103] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10757 MB memory) -> physical GPU (device: 0, name: Tesla K80, pci bus id: 0000:00:04.0, compute capability: 3.7)
```

Is there an example of a Docker container that runs a Keras model on GPU that is known to work?
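
For context, the kind of stripped-down script I'd want to validate first is roughly the sketch below (random data and a tiny model, nothing specific to my actual job):

```python
# Minimal tf.keras GPU smoke test (sketch): a tiny model on random data, so it
# should fit easily in K80 memory. Assumes a tensorflow-gpu 1.x base image.
import numpy as np
import tensorflow as tf
from tensorflow import keras

x = np.random.rand(256, 32).astype("float32")
y = np.random.randint(0, 2, size=(256,))

model = keras.Sequential([
    keras.layers.Dense(64, activation="relu", input_shape=(32,)),
    keras.layers.Dense(2, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x, y, epochs=2, batch_size=32)
print("GPU device:", tf.test.gpu_device_name() or "none")
```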

@sagravat
Author

The issue turned out to be an out-of-memory error on the GPU. I was using a K80 on the K8s cluster but a V100 on the Deep Learning VM, so I didn't notice it during local testing. It would be good to make this error more obvious to the user.
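
For anyone hitting the same thing, one common mitigation (a rough TF 1.x sketch, not necessarily what I ended up doing) is to stop TensorFlow from grabbing nearly all GPU memory up front and to shrink the batch size until the model fits on the K80's ~12 GB:

```python
# Sketch: let TensorFlow allocate GPU memory on demand instead of pre-allocating
# almost all of it at startup. Assumes TF 1.x with tf.keras.
import tensorflow as tf
from tensorflow import keras

config = tf.ConfigProto()
config.gpu_options.allow_growth = True
# Alternatively, cap the fraction of GPU memory the process may use:
# config.gpu_options.per_process_gpu_memory_fraction = 0.9
keras.backend.set_session(tf.Session(config=config))
```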

@paveldournov
Contributor

@sagravat - can you please share how you diagnosed and resolved the issue? Which logs did you analyze? This would help us figure out how to expose more error details.

@vicaire
Contributor

vicaire commented Mar 26, 2019

resolving in favor of #677

vicaire closed this as completed Mar 26, 2019