This repository has been archived by the owner on Mar 20, 2023. It is now read-only.

Stuck in 'start task failed' on Standard_NC6 nodes #231

Closed
hieuhc opened this issue Jul 20, 2018 · 1 comment

hieuhc commented Jul 20, 2018

Problem Description

I have two pools of Standard_NC6 low-priority VMs. They had been running fine for some time, but today the nodes hit a 'start task failed' error. I rebooted them a few times and still get the same error.

rmmod: ERROR: Module nouveau is not currently loaded

WARNING: nvidia-installer was forced to guess the X library path '/usr/lib'
         and X module path '/usr/lib/xorg/modules'; these paths were not
         queryable from the system.  If X fails to find the NVIDIA X driver
         module, please install the `pkg-config` utility and the X.Org
         SDK/development package for your distribution and reinstall the
         driver.

I have read this, but to me the failure does not look transient, since I have rebooted many times. I even tried deleting and recreating the pools.

Steps to Reproduce

It seems random to me. The nodes had been running fine until they got stuck in this state.

Expected Results

The pools run fine and stable.

Actual Results

The nodes are suddenly stuck in 'start task failed'.

Additional Logs

[stdout.txt](https://github.com/Azure/batch-shipyard/files/2213111/stdout.txt)
[stderr.txt](https://github.com/Azure/batch-shipyard/files/2213112/stderr.txt)

Additional Comments

Also, why did these VMs get restarted while they were running fine?

@hieuhc hieuhc changed the title Stuck in start task failed on Standard_NC6 nodes Stuck in 'start task failed' on Standard_NC6 nodes Jul 20, 2018

alfpark commented Jul 20, 2018

It appears that the new nvidia-docker2 18.06 package has broken the installation. Although nvidia-docker2 is pinned, its dependency nvidia-container-runtime is not:

$ apt-get install nvidia-docker2=2.0.3+docker18.03.1-1
Reading package lists... Done
Building dependency tree
Reading state information... Done
Some packages could not be installed. This may mean that you have
requested an impossible situation or if you are using the unstable
distribution that some required packages have not yet been created
or been moved out of Incoming.
The following information may help to resolve the situation:

The following packages have unmet dependencies:
 nvidia-docker2 : Depends: nvidia-container-runtime (= 2.0.0+docker18.03.1-1) but 2.0.0+docker18.06.0-1 is to be installed

I'll have to pin the dependent package installation as well to work around this issue. I'll hotfix this and release as soon as possible.
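
For anyone who needs to unblock a node by hand in the meantime, pinning both packages could look roughly like the sketch below. The two version strings are copied from the apt output above; the `pinned_install_cmd` helper and the `-y` flag are illustrative assumptions, not necessarily what the hotfix will do.

```shell
#!/usr/bin/env bash
# Versions copied from the apt error output above. The "docker18.03.1"
# suffix has to match on both packages for the dependency to resolve.
NVIDIA_DOCKER_VERSION="2.0.3+docker18.03.1-1"
NVIDIA_RUNTIME_VERSION="2.0.0+docker18.03.1-1"

# Hypothetical helper (not part of batch-shipyard): prints an install
# command that pins nvidia-docker2 together with its dependency.
pinned_install_cmd() {
    echo "apt-get install -y nvidia-docker2=${NVIDIA_DOCKER_VERSION} nvidia-container-runtime=${NVIDIA_RUNTIME_VERSION}"
}

# Dry run: only print the command so it can be reviewed first.
pinned_install_cmd
```

The script only prints the command. To apply it, run the printed line as root on the affected node.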
