diff --git a/contributing/RUNNER-AND-SHIM.md b/contributing/RUNNER-AND-SHIM.md index 460901782..edccdc14a 100644 --- a/contributing/RUNNER-AND-SHIM.md +++ b/contributing/RUNNER-AND-SHIM.md @@ -8,21 +8,35 @@ If the cloud provider has VM capabilities, `dstack` runs `dstack-shim` on the ho ## dstack-shim -`dstack-shim` works in cycles, allowing to run a different container once the job is finished. +`dstack-shim` works with _tasks_. Essentially, a task is a `dstack-shim`-specific part of `dstack`'s job, namely a Docker container with its associated data. `dstack-shim` is able to process multiple tasks in parallel. -- STEP 1: Wait for Docker image info (+ registry auth credentials if needed) from the dstack server -- STEP 2: Pull the Docker image -- STEP 3: Run the container - - Wait for container exit - - Or wait for the interruption signal from the dstack server -- STEP 4: Go to STEP 1 +A task is identified by a unique ID assigned by the `dstack` server. A task has a state: a status, allocated resources, data on disk (container, even if stopped, runner logs), etc. `dstack-shim` keeps a task in memory and its data on disk until the `dstack` server requests removal. The `dstack` server should periodically remove old tasks to clean up storage. Currently, the server removes a task right after termination request, but this is subject to change. + +A lifecycle of a task is as follows: + +- Wait for a task submission from the `dstack` server (image ref, registry credentials if needed, user, resource constraints, network mode, etc.) +- Allocate GPU resources, find and mount volumes, pull the image +- Run the container and + - either wait for container to exit + - or wait for the termination request from the `dstack` server +- Deallocate GPU resources, unmount volumes + +A container is started in either `host` or `bridge` network mode depending on the instance and the job: + +- If the instance is shared (split into GPU blocks), network mode is set to `bridge` to avoid port conflicts +- …unless there are multiple jobs in multinode mode — in that case, the instance is never shared (the jobs takes the whole instance), and network mode is set `host` + + **NOTE**: `host` networking mode would allow jobs to use any port at any moment for internal communication. For example, during distributed PyTorch training. +- If the instance is not shared by multiple jobs (i.e. GPU blocks feature is not used), network mode is `host` to avoid unnecessary overhead + +In `bridge` mode, container ports are mapped to ephemeral host ports. `dstack-shim` stores port mapping as a part of task's state. Currently, the default `bridge` network is used for all containers, but this could be changed in the future to improve container isolation. All communication between the `dstack` server and `dstack-shim` happens via REST API through an SSH tunnel. `dstack-shim` doesn't collect logs. Usually, it is run from a `cloud-init` user-data script. + The entrypoint for the container: - Installs `openssh-server` - Adds project and user public keys to `~/.ssh/authorized_keys` -- Downloads `dstack-runner` from the public S3 bucket -- Starts `dstack-runner` +- Starts `sshd` and `dstack-runner` ## dstack-runner @@ -32,10 +46,10 @@ The entrypoint for the container: - STEP 2: Wait for the code (tarball or diff) - STEP 3: Prepare the repo (clone git repo and apply the diff, or extract the archive) - STEP 4: Run the commands from the job spec - - Wait for the commands to exit - - Serve logs to the `dstack` server via HTTP - - Serve real-time logs to the CLI via WebSocket - - Wait for the signal to terminate the commands + - Wait for the commands to exit + - Serve logs to the `dstack` server via HTTP + - Serve real-time logs to the CLI via WebSocket + - Wait for the signal to terminate the commands - STEP 5: Wait until all logs are read by the server and the CLI. Or exit after a timeout All communication between the `dstack` server and `dstack-runner` happens via REST API through an SSH tunnel. `dstack-runner` collects the job logs and its own logs. Only the job logs are served via WebSocket. @@ -44,6 +58,4 @@ All communication between the `dstack` server and `dstack-runner` happens via RE `dstack` expects a running SSH server right next to the `dstack-runner`. It provides a secure channel for communication with the runner API and forwarding any ports without listening for `0.0.0.0`. The `dstack-gateway` also uses this SSH server for forwarding requests from public endpoints. -`dstack-shim` must also be running next to the SSH server. The `dstack` server connects to this SSH server for interacting with both `dstack-shim` and `dstack-runner` since we use `host` networking mode for the Docker container. The CLI uses this SSH server as a jump host because the user wants to connect to the container. - -> `host` networking mode would allow jobs to use any port at any moment for internal communication. For example, during distributed PyTorch training. +`dstack-shim` must also be running next to the SSH server. The `dstack` server connects to this SSH server for interacting with both `dstack-shim` and `dstack-runner`. The CLI uses this SSH server as a jump host because the user wants to connect to the container. diff --git a/runner/README.md b/runner/README.md index 1734783f8..d1c218ac4 100644 --- a/runner/README.md +++ b/runner/README.md @@ -92,6 +92,8 @@ These are nonexhaustive lists of external dependencies (executables, libraries) * `lsblk` * `mkfs.ext4` * (NVIDIA GPU SSH fleet instances only) `nvidia-smi` +* (AMD SSH fleet instances only) `docker` (used for `amd-smi` container) +* (Intel Gaudi SSH fleet instances only) `hl-smi` * ... Debian/Ubuntu packages: `mount` (`mount`, `umount`), `util-linux` (`mountpoint`, `lsblk`), `e2fsprogs` (`mkfs.ext4`)