
Cannot terminate running tasks #308

Closed
hieuhc opened this issue Aug 30, 2019 · 14 comments

@hieuhc
Contributor

hieuhc commented Aug 30, 2019

Problem Description

I cannot terminate running tasks in a job with jobs tasks term. The command hangs for a long time if I use --wait. When I terminate the task via the Azure Portal, it is marked with the Completed state, but when I log in to the node I can see that the container is still running.

Batch Shipyard Version

3.7.1

Steps to Reproduce

  • Submit a long-running task.
  • Try to terminate the task with the jobs tasks term command (see the sketch below).
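
For reference, this is roughly the failing invocation, as a minimal sketch: the --jobid/--taskid flags and the SHIPYARD_CONFIGDIR environment variable are assumptions from the CLI docs, and the ids are placeholders.

export SHIPYARD_CONFIGDIR=/path/to/config   # assumed way to point the CLI at the config files
shipyard jobs tasks term --jobid job-dev --taskid task-taskID --wait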

Expected Results

The task is terminated with the Completed state, and no Docker container is left running when logging in to the node.

Actual Results

The task cannot be terminated using the command.

Additional Comments

I also wonder what should be expected when I specify max_task_retries as -1. To be able to terminate this kind of task, I had to terminate it manually in the Azure Portal, then log in to the node and run docker rm -f on the container.
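
The manual workaround, as a rough sketch (the --nodeid selector is an assumption from the CLI docs; the ids are placeholders):

# after terminating the task in the Azure Portal:
shipyard pool ssh --nodeid <node-id>   # log in to the affected node
docker ps                              # find the still-running task container
docker rm -f <container-id>            # force-remove it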

@alfpark
Collaborator

alfpark commented Aug 30, 2019

A few questions:

  1. Are you using a native mode pool?
  2. Did you try with --force?

Also, please see the pool nodes ps and pool nodes zap commands.
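
Roughly, as a sketch (the --nodeid selector is an assumption; check the command help for the exact flags):

shipyard pool nodes ps                  # list running containers on the pool's nodes
shipyard pool nodes zap --nodeid <id>   # kill all containers on a given node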

@hieuhc
Contributor Author

hieuhc commented Aug 30, 2019

Hi @alfpark, thanks for your quick reply. The pool specification is below, which I think is not in native mode. Should I convert all the pools we have so far to this mode and redeploy all the tasks?

vm_configuration:
  platform_image:
    publisher: Canonical
    offer: UbuntuServer
    sku: 16.04-LTS
    version: latest
vm_size: Standard_NC6

Also I tried with --force and got the same result.

If I understand correctly, pool nodes zap kills all containers, but I only want to kill some specific containers. In general, can you recommend a way to redeploy tasks when we have a new version of a Docker image? Currently we kill all tasks using the current version, then use jobs add to redeploy. Any plans to integrate with Azure DevOps pipelines for CI/CD?

@alfpark
Collaborator

alfpark commented Aug 30, 2019

It's possible that the termination signal is not being properly propagated within the running container. Do you have a provisioned SSH user? If not, then Shipyard cannot kill these containers properly in non-native mode (even with --force).

It may be cleaner to use a native mode pool for your workload. Native mode pools inherently understand Docker tasks, so the task/job termination experience is cleaner for containers like yours where the termination signal is not propagated properly to child processes in the container. Take a look here to see if it applies to your use case: https://github.com/Azure/batch-shipyard/blob/master/docs/97-faq.md#what-is-native-under-pool-platform_image-and-custom_image
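
For reference, a minimal sketch of what that would look like in your pool specification, extending your snippet above with the native flag (everything else unchanged):

vm_configuration:
  platform_image:
    publisher: Canonical
    offer: UbuntuServer
    sku: 16.04-LTS
    version: latest
    native: true
vm_size: Standard_NC6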

For redeploying on a new image: I assume you accidentally left out the pool images update command between the task kill and jobs add? That is typically the recommended pattern (unless you use rolling pools and utilize job live migration).
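
In other words, the typical sequence is roughly (a sketch; ids and config options elided):

shipyard jobs tasks term ...   # stop the tasks running the old image
shipyard pool images update    # re-pull the updated image on the pool's nodes
shipyard jobs add ...          # resubmit the tasks against the new image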

The DevOps task is intriguing... please open a new issue for that so we can track it as a proper feature request.

@hieuhc
Contributor Author

hieuhc commented Aug 30, 2019

You are right that we use pool images update in between, but I am curious: do we still need that if we use allow_run_on_missing_image: true?

I can open a new feature request for DevOps after gathering some requirements.

Thanks for your hint on native mode. We will consider recreating our pools with it for better stability. In the meantime: we did create an SSH user for the pool in the case above. I can issue pool ssh and pool images update --ssh, and pool user add reports that user X already exists on the node. But I still cannot jobs tasks term; maybe there is something I am missing here?

@alfpark
Collaborator

alfpark commented Aug 30, 2019

You are right that we use pool images update in between, but I am curious: do we still need that if we use allow_run_on_missing_image: true?

Yes, allow_run_on_missing_image does not re-pull an image. The behavior is the same as if you had a local image and ran docker run on it: it only allows an image to be pulled if it isn't already present.
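
The analogous plain Docker behavior, for illustration:

docker run some_docker_image:latest    # reuses the locally cached :latest if present; no re-pull
docker pull some_docker_image:latest   # only an explicit pull refreshes the tag on the node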

When you issue jobs tasks term (with or without --force), do you get any INFO logs about a command being executed on a node (e.g., 2019-08-30 20:21:21.988 INFO convoy.crypto:connect_or_exec_ssh_command:211 executing command on node x:y with key id_rsa_shipyard)?

@hieuhc
Contributor Author

hieuhc commented Aug 30, 2019

I can only see this:

2019-08-30 19:56:39.067 INFO - Terminating task: task-taskID
2019-08-30 19:56:39.068 DEBUG - waiting for task task-taskID in job job-dev to terminate

@alfpark
Collaborator

alfpark commented Aug 30, 2019

Thanks for confirming - this is most likely a regression. I have a fix that should be landing shortly in develop. You can test either by pulling the develop branch or using the develop-cli Docker image once the DevOps build completes.

Note that if you're moving from 3.7.1 to develop, you'll need to upgrade (if not using the CLI Docker image): https://github.com/Azure/batch-shipyard/blob/master/docs/01-batch-shipyard-installation.md#upgrading-to-new-releases
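
A sketch of the two options:

# option 1: pull the develop branch into an existing clone and upgrade per the doc above
git -C batch-shipyard pull   # assumes the clone is already on the develop branch
# option 2: pull the develop CLI Docker image once the build completes
docker pull mcr.microsoft.com/azure-batch/shipyard:develop-cli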

@alfpark alfpark added defect and removed question labels Aug 30, 2019
@alfpark alfpark self-assigned this Aug 30, 2019
alfpark added a commit that referenced this issue Aug 30, 2019
- SSH side-channel docker kill signal was not being sent as Docker tasks
were not being detected properly
- Also fix issue with pool images update not executing if block on
images is false
- Resolves #308
@hieuhc
Contributor Author

hieuhc commented Aug 31, 2019

@alfpark Can you please confirm that the new version of the Docker image mcr.microsoft.com/azure-batch/shipyard:develop-cli has been deployed? I have just tested and it seems I still cannot terminate the task in non-native mode.

@alfpark
Collaborator

alfpark commented Aug 31, 2019

Did you redeploy your pool and jobs using the develop-cli image? Also, were you able to observe any INFO logs, per the above comment, about a command over SSH (i.e., convoy.crypto:connect_or_exec_ssh_command)?

@hieuhc
Contributor Author

hieuhc commented Sep 2, 2019

Sorry, I should have elaborated better. I used develop-cli to redeploy a test pool. An attempt to terminate the task did end the task in the portal UI, but the actual container was still running when I SSH'd into the node. I could not see the INFO log you suggested.

@alfpark
Collaborator

alfpark commented Sep 2, 2019

Sorry, I'm still a bit unclear here. I understand you re-deployed your test pool with develop-cli. Did you:

  1. Submit your job with the develop-cli?
  2. Execute jobs tasks term with the develop-cli?

If yes to both, can you please elaborate on your job (or post a redacted jobs.yaml) and also a screenshot from the portal UI or Batch Explorer of your task command line and environment variables (redact as necessary)?

@hieuhc
Contributor Author

hieuhc commented Sep 3, 2019

My answer is yes to both. Below is the jobs.yaml file:

job_specifications:
- allow_run_on_missing_image: true
  id: job-id
  tasks:
  - command: bash -c "cd /workspace && python -m service.main.processor"
    max_task_retries: -1
    environment_variables:
      COMPUTE_CONTEXT: GPU
      CONFIG_FILE_RUNNER: service/config/job_processor.config.dev.yml
    gpus: all
    docker_image: some_docker_image:latest
    remove_container_after_exit: true

I am attaching some screenshots of our sample job and task.

redacted

alfpark added a commit that referenced this issue Sep 3, 2019
@alfpark
Collaborator

alfpark commented Sep 3, 2019

@hieuhc Thanks for the detailed report. You may want to consider rotating your ACR credentials, as a credential was leaked in the screenshot above (I have edited out the screenshot).

It looks like there was another defect with infinite retry tasks and termination. Please try the new develop-cli image once the DevOps build completes.

@hieuhc
Contributor Author

hieuhc commented Sep 4, 2019

Hi, I can confirm the newest fix has resolved the issue. Much appreciated for the help.

@hieuhc hieuhc closed this as completed Sep 4, 2019