Stuck in 'start task failed' on Standard_NC6 nodes #231

hieuhc · 2018-07-20T08:26:29Z

Problem Description

I have two pools of Standard_NC6 low-priority VMs. They had been running fine for some time, but today they started failing with a 'start task failed' error. I rebooted a few times but still get the same error.

rmmod: ERROR: Module nouveau is not currently loaded

WARNING: nvidia-installer was forced to guess the X library path '/usr/lib'
         and X module path '/usr/lib/xorg/modules'; these paths were not
         queryable from the system.  If X fails to find the NVIDIA X driver
         module, please install the `pkg-config` utility and the X.Org
         SDK/development package for your distribution and reinstall the
         driver.

I have read this, but the failure does not look transient to me: I have rebooted many times and even deleted and recreated the pools.

Steps to Reproduce

It seems random to me. The pools had been running fine until they got stuck in this state.

Expected Results

The pools keep running and stay stable.

Actual Results

The nodes are suddenly stuck in 'start task failed'.

Additional Logs

[stdout.txt](https://github.com/Azure/batch-shipyard/files/2213111/stdout.txt)
[stderr.txt](https://github.com/Azure/batch-shipyard/files/2213112/stderr.txt)

Additional Comments

Also, why did these VMs get restarted while they were running fine?

alfpark · 2018-07-20T15:39:17Z

It appears that the new nvidia-docker2 18.06 package has broken the installation. Although nvidia-docker2 is pinned, its dependency nvidia-container-runtime is not:

$ apt-get install nvidia-docker2=2.0.3+docker18.03.1-1
Reading package lists... Done
Building dependency tree
Reading state information... Done
Some packages could not be installed. This may mean that you have
requested an impossible situation or if you are using the unstable
distribution that some required packages have not yet been created
or been moved out of Incoming.
The following information may help to resolve the situation:

The following packages have unmet dependencies:
 nvidia-docker2 : Depends: nvidia-container-runtime (= 2.0.0+docker18.03.1-1) but 2.0.0+docker18.06.0-1 is to be installed

I'll have to pin the dependent package installation as well to work around this issue. I'll hotfix this and release as soon as possible.
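
For anyone who needs to reproduce the workaround by hand before the hotfix lands, the pinning described above might look roughly like the sketch below. It is not the content of the actual fix. The two version strings are copied from the apt output above; the function name `pinned_install_cmd` and the `APPLY` switch are hypothetical and exist only for this illustration. By default the script only prints the command it would run.

```shell
#!/usr/bin/env bash
# Sketch only: pin nvidia-docker2 together with the exact
# nvidia-container-runtime version it depends on.
# Version strings come from the apt-get error output in this thread.
NVIDIA_DOCKER2_PKG="nvidia-docker2=2.0.3+docker18.03.1-1"
NVIDIA_RUNTIME_PKG="nvidia-container-runtime=2.0.0+docker18.03.1-1"

# Hypothetical helper: print the apt-get command that pins both packages.
pinned_install_cmd() {
    printf 'apt-get install -y %s %s\n' \
        "${NVIDIA_DOCKER2_PKG}" "${NVIDIA_RUNTIME_PKG}"
}

# Dry run by default. Set APPLY=1 (as root) to actually install.
if [ "${APPLY:-0}" = "1" ]; then
    apt-get install -y "${NVIDIA_DOCKER2_PKG}" "${NVIDIA_RUNTIME_PKG}"
else
    pinned_install_cmd
fi
```

Naming both packages with explicit versions in a single apt-get call lets the resolver satisfy the `(= 2.0.0+docker18.03.1-1)` constraint instead of selecting the newer 18.06 build of the runtime. Running `apt-cache policy nvidia-container-runtime` beforehand shows which versions the configured repositories currently offer.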

hieuhc changed the title ~~Stuck in start task failed on Standard_NC6 nodes~~ Stuck in 'start task failed' on Standard_NC6 nodes Jul 20, 2018

alfpark added defect gpu labels Jul 20, 2018

alfpark closed this as completed in 85db792 Jul 20, 2018
