
Cannot terminate running tasks #308

Closed
hieuhc opened this issue Aug 30, 2019 · 14 comments

@hieuhc
Contributor

hieuhc commented Aug 30, 2019

Problem Description

I cannot terminate running tasks in a job with jobs tasks term. The command hangs for a long time if I use --wait. When I terminate the task via the Azure Portal, it is marked with the Completed state, but when I log in to the node I can see that the container is still running.

Batch Shipyard Version

3.7.1

Steps to Reproduce

  • Submit a long-running task.
  • Try to terminate the task with the jobs tasks term command (see the sketch below).
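
For reference, this is roughly the failing invocation, as a minimal sketch: the --jobid/--taskid flags and the SHIPYARD_CONFIGDIR environment variable are assumptions from the CLI docs, and the ids are placeholders.

export SHIPYARD_CONFIGDIR=/path/to/config   # assumed way to point the CLI at the config files
shipyard jobs tasks term --jobid job-dev --taskid task-taskID --wait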

Expected Results

The task is terminated with the Completed state, and no Docker container is left running when logging in to the node.

Actual Results

The task cannot be terminated using the command.

Additional Comments

I also wonder what should be expected when I specify max_task_retries as -1. To be able to terminate this kind of task, I had to terminate it manually in the Azure Portal, then log in to the node and run docker rm -f on the container.
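
The manual workaround, as a rough sketch (the --nodeid selector is an assumption from the CLI docs; the ids are placeholders):

# after terminating the task in the Azure Portal:
shipyard pool ssh --nodeid <node-id>   # log in to the affected node
docker ps                              # find the still-running task container
docker rm -f <container-id>            # force-remove it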

@alfpark
Collaborator

alfpark commented Aug 30, 2019

A few questions:

  1. Are you using a native mode pool?
  2. Did you try with --force?

Also, please see the pool nodes ps and pool nodes zap commands.
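
Roughly, as a sketch (the --nodeid selector is an assumption; check the command help for the exact flags):

shipyard pool nodes ps                  # list running containers on the pool's nodes
shipyard pool nodes zap --nodeid <id>   # kill all containers on a given node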

@hieuhc
Contributor Author

hieuhc commented Aug 30, 2019

Hi @alfpark, thanks for your quick reply. The pool specification is below, which I think is not in native mode. Should I convert all the pools we have so far to this mode and redeploy all the tasks?

vm_configuration:
  platform_image:
    publisher: Canonical
    offer: UbuntuServer
    sku: 16.04-LTS
    version: latest
vm_size: Standard_NC6

Also I tried with --force and got the same result.

If I understand correctly, pool nodes zap kills all containers, but I only want to kill some specific containers. In general, can you recommend a way to redeploy tasks when we have a new version of a Docker image? Currently we kill all tasks using the current version, then use jobs add to redeploy. Any plans to integrate with Azure DevOps pipelines for CI/CD?

@alfpark
Collaborator

alfpark commented Aug 30, 2019

It's possible that the termination signal is not being properly propagated within the running container. Do you have a provisioned SSH user? If not, then Shipyard cannot kill these containers properly in non-native mode (even with --force).

It may be cleaner to use a native mode pool for your workload. Native mode pools inherently understand Docker tasks, so the task/job termination experience is cleaner for containers like yours where the termination signal is not propagated properly to child processes in the container. Take a look here to see if it applies to your use case: https://github.com/Azure/batch-shipyard/blob/master/docs/97-faq.md#what-is-native-under-pool-platform_image-and-custom_image
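
For reference, a minimal sketch of what that would look like in your pool specification, extending your snippet above with the native flag (everything else unchanged):

vm_configuration:
  platform_image:
    publisher: Canonical
    offer: UbuntuServer
    sku: 16.04-LTS
    version: latest
    native: true
vm_size: Standard_NC6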

For redeploying on a new image: I assume you accidentally left out the pool images update command between the task kill and jobs add? That is typically the recommended pattern (unless you use rolling pools and utilize job live migration).
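
In other words, the typical sequence is roughly (a sketch; ids and config options elided):

shipyard jobs tasks term ...   # stop the tasks running the old image
shipyard pool images update    # re-pull the updated image on the pool's nodes
shipyard jobs add ...          # resubmit the tasks against the new image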

The DevOps task is intriguing... please open a new issue for that so we can track it as a proper feature request.

@hieuhc
Contributor Author

hieuhc commented Aug 30, 2019

You are right that we use pool images update in between, but I am curious: do we still need that if we use allow_run_on_missing_image: true?

I can open a new feature request for DevOps after gathering some requirements.

Thanks for your hint on native mode. We will consider recreating our pools with it for better stability. In the meantime: we did create an SSH user for the pool in the case above. I can issue pool ssh and pool images update --ssh, and pool user add reports that user X already exists on the node. But I still cannot jobs tasks term; maybe there is something I am missing here?

@alfpark
Collaborator

alfpark commented Aug 30, 2019

You are right that we use pool images update in between, but I am curious: do we still need that if we use allow_run_on_missing_image: true?

Yes, allow_run_on_missing_image does not re-pull an image. The behavior is the same as if you had a local image and ran docker run on it: it only allows an image to be pulled if it isn't already present.
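
The analogous plain Docker behavior, for illustration:

docker run some_docker_image:latest    # reuses the locally cached :latest if present; no re-pull
docker pull some_docker_image:latest   # only an explicit pull refreshes the tag on the node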

When you issue jobs tasks term (with or without --force), do you get any INFO logs about a command being executed on a node (e.g., 2019-08-30 20:21:21.988 INFO convoy.crypto:connect_or_exec_ssh_command:211 executing command on node x:y with key id_rsa_shipyard)?

@hieuhc
Contributor Author

hieuhc commented Aug 30, 2019

I can only see this:

2019-08-30 19:56:39.067 INFO - Terminating task: task-taskID
2019-08-30 19:56:39.068 DEBUG - waiting for task task-taskID in job job-dev to terminate

@alfpark
Collaborator

alfpark commented Aug 30, 2019

Thanks for confirming - this is most likely a regression. I have a fix that should be landing shortly in develop. You can test either by pulling the develop branch or using the develop-cli Docker image once the DevOps build completes.

Note that if you're moving from 3.7.1 to develop, you'll need to upgrade (if not using the CLI Docker image): https://github.com/Azure/batch-shipyard/blob/master/docs/01-batch-shipyard-installation.md#upgrading-to-new-releases
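
A sketch of the two options:

# option 1: pull the develop branch into an existing clone and upgrade per the doc above
git -C batch-shipyard pull   # assumes the clone is already on the develop branch
# option 2: pull the develop CLI Docker image once the build completes
docker pull mcr.microsoft.com/azure-batch/shipyard:develop-cli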

@alfpark alfpark added defect and removed question labels Aug 30, 2019
@alfpark alfpark self-assigned this Aug 30, 2019
alfpark added a commit that referenced this issue Aug 30, 2019
- SSH side-channel docker kill signal was not being sent as Docker tasks
were not being detected properly
- Also fix issue with pool images update not executing if block on
images is false
- Resolves #308
@hieuhc
Contributor Author

hieuhc commented Aug 31, 2019

@alfpark Can you please confirm that the new version of the Docker image mcr.microsoft.com/azure-batch/shipyard:develop-cli has been deployed? I have just tested and it seems I still cannot terminate the task in non-native mode.

@alfpark
Collaborator

alfpark commented Aug 31, 2019

Did you redeploy your pool and jobs using the develop-cli image? Also, were you able to observe any INFO logs, per the above comment, about a command over SSH (i.e., convoy.crypto:connect_or_exec_ssh_command)?

@hieuhc
Contributor Author

hieuhc commented Sep 2, 2019

Sorry, I should have elaborated better. I used develop-cli to redeploy a test pool. An attempt to terminate the task did end the task in the portal UI, but the actual container was still running when I SSH'd into the node. I could not see the INFO log you suggested.

@alfpark
Collaborator

alfpark commented Sep 2, 2019

Sorry, I'm still a bit unclear here. I understand you re-deployed your test pool with develop-cli. Did you:

  1. Submit your job with the develop-cli?
  2. Execute jobs tasks term with the develop-cli?

If yes to both, can you please elaborate on your job (or post a redacted jobs.yaml) and also a screenshot from the portal UI or Batch Explorer of your task command line and environment variables (redact as necessary)?

@hieuhc
Contributor Author

hieuhc commented Sep 3, 2019

My answer is yes to both. Below is the jobs.yaml file:

job_specifications:
- allow_run_on_missing_image: true
  id: job-id
  tasks:
  - command: bash -c "cd /workspace && python -m service.main.processor"
    max_task_retries: -1
    environment_variables:
      COMPUTE_CONTEXT: GPU
      CONFIG_FILE_RUNNER: service/config/job_processor.config.dev.yml
    gpus: all
    docker_image: some_docker_image:latest
    remove_container_after_exit: true

I am attaching some screenshots of our sample job and task.

redacted

alfpark added a commit that referenced this issue Sep 3, 2019
@alfpark
Collaborator

alfpark commented Sep 3, 2019

@hieuhc Thanks for the detailed report. You may want to consider rotating your ACR credentials, as a credential was leaked in the screenshot above (I have edited out the screenshot).

It looks like there was another defect with infinite retry tasks and termination. Please try the new develop-cli image once the DevOps build completes.

@hieuhc
Contributor Author

hieuhc commented Sep 4, 2019

Hi, I can confirm the newest fix has resolved the issue. Much appreciated for the help.

@hieuhc hieuhc closed this as completed Sep 4, 2019