Update contributing guide on shim (#2255)
un-def authored Feb 3, 2025
1 parent 1d2ae34 commit 1f74e37
Showing 2 changed files with 30 additions and 16 deletions.
44 changes: 28 additions & 16 deletions contributing/RUNNER-AND-SHIM.md
@@ -8,21 +8,35 @@ If the cloud provider has VM capabilities, `dstack` runs `dstack-shim` on the host

## dstack-shim

-`dstack-shim` works in cycles, allowing to run a different container once the job is finished.
+`dstack-shim` works with _tasks_. Essentially, a task is the `dstack-shim`-specific part of a `dstack` job, namely a Docker container with its associated data. `dstack-shim` can process multiple tasks in parallel.

-- STEP 1: Wait for Docker image info (+ registry auth credentials if needed) from the dstack server
-- STEP 2: Pull the Docker image
-- STEP 3: Run the container
-  - Wait for container exit
-  - Or wait for the interruption signal from the dstack server
-- STEP 4: Go to STEP 1
+A task is identified by a unique ID assigned by the `dstack` server. A task has a state: a status, allocated resources, data on disk (the container, even if stopped, and runner logs), etc. `dstack-shim` keeps a task in memory and its data on disk until the `dstack` server requests removal. The `dstack` server should periodically remove old tasks to clean up storage. Currently, the server removes a task right after the termination request, but this is subject to change.
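
To make this concrete, here is a rough Go sketch of what such a task record could look like; every type and field name below is hypothetical, not dstack's actual definition:

```go
package shim

import "time"

// TaskStatus is a coarse lifecycle status; the values are illustrative.
type TaskStatus string

const (
	TaskStatusPending    TaskStatus = "pending"
	TaskStatusRunning    TaskStatus = "running"
	TaskStatusTerminated TaskStatus = "terminated"
)

// Task is a hypothetical in-memory record that a shim-like process would
// keep until the server requests removal. On-disk data (the stopped
// container, runner logs) is referenced by ID rather than stored here.
type Task struct {
	ID          string      // unique ID assigned by the dstack server
	Status      TaskStatus  // current lifecycle status
	ContainerID string      // Docker container, kept even after it stops
	GPUIDs      []string    // GPU resources allocated to the task
	Ports       map[int]int // container port -> ephemeral host port (bridge mode)
	SubmittedAt time.Time   // used to age out old tasks
}
```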

+The lifecycle of a task is as follows (a Go sketch follows the list):

+- Wait for a task submission from the `dstack` server (image ref, registry credentials if needed, user, resource constraints, network mode, etc.)
+- Allocate GPU resources, find and mount volumes, pull the image
+- Run the container and
+  - either wait for the container to exit
+  - or wait for the termination request from the `dstack` server
+- Deallocate GPU resources, unmount volumes
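
Continuing in the same hypothetical package (imports elided), the lifecycle condenses into this hedged sketch; `Task` is the record from the previous sketch, and all helper functions are invented stand-ins rather than dstack-shim's real API:

```go
// processTask mirrors the lifecycle above: set up resources, run the
// container, then wait for either a natural exit or a termination
// request, and tear everything down. All helpers are hypothetical.
func processTask(ctx context.Context, t *Task, terminate <-chan struct{}) error {
	if err := allocateGPUs(t); err != nil {
		return err
	}
	defer deallocateGPUs(t)

	if err := mountVolumes(t); err != nil {
		return err
	}
	defer unmountVolumes(t)

	if err := pullImage(ctx, t); err != nil {
		return err
	}

	exited, err := runContainer(ctx, t) // channel closed when the container exits
	if err != nil {
		return err
	}

	select {
	case <-exited: // the container exited on its own
	case <-terminate: // the dstack server requested termination
		stopContainer(ctx, t)
	}
	return nil
}
```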

+A container is started in either `host` or `bridge` network mode, depending on the instance and the job:

+- If the instance is shared (split into GPU blocks), network mode is set to `bridge` to avoid port conflicts
+  - …unless there are multiple jobs in multinode mode; in that case, the instance is never shared (the job takes the whole instance), and network mode is set to `host`

+    **NOTE**: `host` networking mode would allow jobs to use any port at any moment for internal communication, for example, during distributed PyTorch training.
+- If the instance is not shared by multiple jobs (i.e., the GPU blocks feature is not used), network mode is `host` to avoid unnecessary overhead

+In `bridge` mode, container ports are mapped to ephemeral host ports. `dstack-shim` stores the port mapping as part of the task's state. Currently, the default `bridge` network is used for all containers, but this could be changed in the future to improve container isolation.
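
For a concrete picture of ephemeral port publishing, here is a minimal standalone sketch using the Docker Go SDK (`github.com/docker/docker/client`); the image name and error handling are placeholders, and this is not dstack-shim's actual code:

```go
package main

import (
	"context"
	"fmt"

	"github.com/docker/docker/api/types/container"
	"github.com/docker/docker/client"
)

func main() {
	ctx := context.Background()
	cli, err := client.NewClientWithOpts(client.FromEnv, client.WithAPIVersionNegotiation())
	if err != nil {
		panic(err)
	}

	// Bridge networking with all exposed ports published to ephemeral
	// host ports, the same effect as `docker run --network bridge -P`.
	resp, err := cli.ContainerCreate(ctx,
		&container.Config{Image: "ubuntu:22.04"}, // placeholder image
		&container.HostConfig{
			NetworkMode:     "bridge",
			PublishAllPorts: true,
		},
		nil, nil, "")
	if err != nil {
		panic(err)
	}
	if err := cli.ContainerStart(ctx, resp.ID, container.StartOptions{}); err != nil {
		panic(err)
	}

	// Read back the ephemeral mappings; a shim-like process would store
	// these as part of the task's state.
	info, err := cli.ContainerInspect(ctx, resp.ID)
	if err != nil {
		panic(err)
	}
	for port, bindings := range info.NetworkSettings.Ports {
		for _, b := range bindings {
			fmt.Printf("container %s -> host %s:%s\n", port, b.HostIP, b.HostPort)
		}
	}
}
```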

All communication between the `dstack` server and `dstack-shim` happens via REST API through an SSH tunnel. `dstack-shim` doesn't collect logs. Usually, it is run from a `cloud-init` user-data script.
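
As an illustration of that transport only (not dstack's actual client code; the host, user, credentials, and API port below are invented), an HTTP client can be routed through an SSH connection with `golang.org/x/crypto/ssh`:

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net"
	"net/http"

	"golang.org/x/crypto/ssh"
)

func main() {
	cfg := &ssh.ClientConfig{
		User:            "ubuntu",
		Auth:            []ssh.AuthMethod{ssh.Password("placeholder")}, // real setups use keys
		HostKeyCallback: ssh.InsecureIgnoreHostKey(),                   // acceptable only in a sketch
	}
	sshClient, err := ssh.Dial("tcp", "instance.example.com:22", cfg)
	if err != nil {
		panic(err)
	}
	defer sshClient.Close()

	// Route every HTTP connection through the SSH connection instead of
	// dialing the network directly; the dial happens on the remote host.
	httpClient := &http.Client{Transport: &http.Transport{
		DialContext: func(ctx context.Context, network, addr string) (net.Conn, error) {
			return sshClient.Dial(network, addr)
		},
	}}

	// Hypothetical endpoint and port, invented for this example.
	resp, err := httpClient.Get("http://localhost:10998/api/healthcheck")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	fmt.Println(string(body))
}
```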

The entrypoint for the container:
- Installs `openssh-server`
- Adds project and user public keys to `~/.ssh/authorized_keys`
- Downloads `dstack-runner` from the public S3 bucket
-- Starts `dstack-runner`
+- Starts `sshd` and `dstack-runner`

## dstack-runner

@@ -32,10 +46,10 @@ The entrypoint for the container:
- STEP 2: Wait for the code (tarball or diff)
- STEP 3: Prepare the repo (clone git repo and apply the diff, or extract the archive)
- STEP 4: Run the commands from the job spec
  - Wait for the commands to exit
  - Serve logs to the `dstack` server via HTTP
  - Serve real-time logs to the CLI via WebSocket
  - Wait for the signal to terminate the commands (see the sketch after this list)
- STEP 5: Wait until all logs are read by the server and the CLI, or exit after a timeout
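
The STEP 4 wait logic referenced above can be pictured with this hedged Go sketch; the `JobSpec` shape and the channel wiring are invented for illustration, not dstack-runner's real internals:

```go
package runner

import (
	"context"
	"io"
	"os/exec"
	"strings"
)

// JobSpec is a hypothetical, trimmed-down job description.
type JobSpec struct {
	Commands  []string  // shell commands from the job spec
	LogWriter io.Writer // sink that feeds the HTTP/WebSocket log endpoints
}

// executeJob runs the commands and waits for either a natural exit or a
// termination signal, mirroring the two "wait" branches of STEP 4.
func executeJob(ctx context.Context, spec JobSpec, terminate <-chan struct{}) error {
	ctx, cancel := context.WithCancel(ctx)
	defer cancel()

	cmd := exec.CommandContext(ctx, "/bin/sh", "-c", strings.Join(spec.Commands, " && "))
	cmd.Stdout = spec.LogWriter
	cmd.Stderr = spec.LogWriter
	if err := cmd.Start(); err != nil {
		return err
	}

	done := make(chan error, 1)
	go func() { done <- cmd.Wait() }()

	select {
	case err := <-done: // commands exited on their own
		return err
	case <-terminate: // server asked to terminate the commands
		cancel() // CommandContext kills the shell process
		<-done
		return nil
	}
}
```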

All communication between the `dstack` server and `dstack-runner` happens via REST API through an SSH tunnel. `dstack-runner` collects the job logs and its own logs. Only the job logs are served via WebSocket.
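
A toy version of the real-time log endpoint could look like the following; the `github.com/gorilla/websocket` library, the port, and the path are assumptions made for this sketch, not necessarily what dstack-runner uses:

```go
package main

import (
	"net/http"

	"github.com/gorilla/websocket"
)

var upgrader = websocket.Upgrader{}

// serveLogs upgrades the connection and pushes each job-log chunk to the
// CLI as it arrives, closing when the log channel is drained.
func serveLogs(logs <-chan []byte) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		conn, err := upgrader.Upgrade(w, r, nil)
		if err != nil {
			return
		}
		defer conn.Close()
		for chunk := range logs {
			if err := conn.WriteMessage(websocket.TextMessage, chunk); err != nil {
				return
			}
		}
	}
}

func main() {
	logs := make(chan []byte)
	http.HandleFunc("/logs_ws", serveLogs(logs)) // hypothetical path
	go func() { logs <- []byte("hello from the job\n") }()
	http.ListenAndServe("localhost:10999", nil) // hypothetical port
}
```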
@@ -44,6 +58,4 @@ All communication between the `dstack` server and `dstack-runner` happens via REST API

`dstack` expects a running SSH server right next to `dstack-runner`. It provides a secure channel for communicating with the runner API and for forwarding any ports without listening on `0.0.0.0`. `dstack-gateway` also uses this SSH server to forward requests from public endpoints.

-`dstack-shim` must also be running next to the SSH server. The `dstack` server connects to this SSH server for interacting with both `dstack-shim` and `dstack-runner` since we use `host` networking mode for the Docker container. The CLI uses this SSH server as a jump host because the user wants to connect to the container.
-
-> `host` networking mode would allow jobs to use any port at any moment for internal communication. For example, during distributed PyTorch training.
+`dstack-shim` must also be running next to the SSH server. The `dstack` server connects to this SSH server for interacting with both `dstack-shim` and `dstack-runner`. The CLI uses this SSH server as a jump host because the user wants to connect to the container.
2 changes: 2 additions & 0 deletions runner/README.md
@@ -92,6 +92,8 @@ These are nonexhaustive lists of external dependencies (executables, libraries)
* `lsblk`
* `mkfs.ext4`
* (NVIDIA GPU SSH fleet instances only) `nvidia-smi`
+* (AMD SSH fleet instances only) `docker` (used for `amd-smi` container)
+* (Intel Gaudi SSH fleet instances only) `hl-smi`
* ...

Debian/Ubuntu packages: `mount` (`mount`, `umount`), `util-linux` (`mountpoint`, `lsblk`), `e2fsprogs` (`mkfs.ext4`)
