Update contributing guide on shim (#2255)
un-def authored Feb 3, 2025
1 parent 1d2ae34 commit 1f74e37
Showing 2 changed files with 30 additions and 16 deletions.
44 changes: 28 additions & 16 deletions contributing/RUNNER-AND-SHIM.md
@@ -8,21 +8,35 @@ If the cloud provider has VM capabilities, `dstack` runs `dstack-shim` on the host

## dstack-shim

-`dstack-shim` works in cycles, allowing to run a different container once the job is finished.
+`dstack-shim` works with _tasks_. Essentially, a task is the `dstack-shim`-specific part of a `dstack` job, namely a Docker container with its associated data. `dstack-shim` can process multiple tasks in parallel.

-- STEP 1: Wait for Docker image info (+ registry auth credentials if needed) from the dstack server
-- STEP 2: Pull the Docker image
-- STEP 3: Run the container
-  - Wait for container exit
-  - Or wait for the interruption signal from the dstack server
-- STEP 4: Go to STEP 1
+A task is identified by a unique ID assigned by the `dstack` server. A task has a state: a status, allocated resources, data on disk (the container, even if stopped, and runner logs), etc. `dstack-shim` keeps a task in memory and its data on disk until the `dstack` server requests removal. The `dstack` server should periodically remove old tasks to clean up storage. Currently, the server removes a task right after the termination request, but this is subject to change.
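
To make this concrete, here is a rough Go sketch of what such a task record could look like; every type and field name below is hypothetical, not dstack's actual definition:

```go
package shim

import "time"

// TaskStatus is a coarse lifecycle status; the values are illustrative.
type TaskStatus string

const (
	TaskStatusPending    TaskStatus = "pending"
	TaskStatusRunning    TaskStatus = "running"
	TaskStatusTerminated TaskStatus = "terminated"
)

// Task is a hypothetical in-memory record that a shim-like process would
// keep until the server requests removal. On-disk data (the stopped
// container, runner logs) is referenced by ID rather than stored here.
type Task struct {
	ID          string      // unique ID assigned by the dstack server
	Status      TaskStatus  // current lifecycle status
	ContainerID string      // Docker container, kept even after it stops
	GPUIDs      []string    // GPU resources allocated to the task
	Ports       map[int]int // container port -> ephemeral host port (bridge mode)
	SubmittedAt time.Time   // used to age out old tasks
}
```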

+The lifecycle of a task is as follows (a Go sketch follows the list):

+- Wait for a task submission from the `dstack` server (image ref, registry credentials if needed, user, resource constraints, network mode, etc.)
+- Allocate GPU resources, find and mount volumes, pull the image
+- Run the container and
+  - either wait for the container to exit
+  - or wait for the termination request from the `dstack` server
+- Deallocate GPU resources, unmount volumes
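
Continuing in the same hypothetical package (imports elided), the lifecycle condenses into this hedged sketch; `Task` is the record from the previous sketch, and all helper functions are invented stand-ins rather than dstack-shim's real API:

```go
// processTask mirrors the lifecycle above: set up resources, run the
// container, then wait for either a natural exit or a termination
// request, and tear everything down. All helpers are hypothetical.
func processTask(ctx context.Context, t *Task, terminate <-chan struct{}) error {
	if err := allocateGPUs(t); err != nil {
		return err
	}
	defer deallocateGPUs(t)

	if err := mountVolumes(t); err != nil {
		return err
	}
	defer unmountVolumes(t)

	if err := pullImage(ctx, t); err != nil {
		return err
	}

	exited, err := runContainer(ctx, t) // channel closed when the container exits
	if err != nil {
		return err
	}

	select {
	case <-exited: // the container exited on its own
	case <-terminate: // the dstack server requested termination
		stopContainer(ctx, t)
	}
	return nil
}
```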

+A container is started in either `host` or `bridge` network mode, depending on the instance and the job:

+- If the instance is shared (split into GPU blocks), network mode is set to `bridge` to avoid port conflicts
+  - …unless there are multiple jobs in multinode mode; in that case, the instance is never shared (the job takes the whole instance), and network mode is set to `host`

+    **NOTE**: `host` networking mode would allow jobs to use any port at any moment for internal communication, for example, during distributed PyTorch training.
+- If the instance is not shared by multiple jobs (i.e., the GPU blocks feature is not used), network mode is `host` to avoid unnecessary overhead

+In `bridge` mode, container ports are mapped to ephemeral host ports. `dstack-shim` stores the port mapping as part of the task's state. Currently, the default `bridge` network is used for all containers, but this could be changed in the future to improve container isolation.
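
For a concrete picture of ephemeral port publishing, here is a minimal standalone sketch using the Docker Go SDK (`github.com/docker/docker/client`); the image name and error handling are placeholders, and this is not dstack-shim's actual code:

```go
package main

import (
	"context"
	"fmt"

	"github.com/docker/docker/api/types/container"
	"github.com/docker/docker/client"
)

func main() {
	ctx := context.Background()
	cli, err := client.NewClientWithOpts(client.FromEnv, client.WithAPIVersionNegotiation())
	if err != nil {
		panic(err)
	}

	// Bridge networking with all exposed ports published to ephemeral
	// host ports, the same effect as `docker run --network bridge -P`.
	resp, err := cli.ContainerCreate(ctx,
		&container.Config{Image: "ubuntu:22.04"}, // placeholder image
		&container.HostConfig{
			NetworkMode:     "bridge",
			PublishAllPorts: true,
		},
		nil, nil, "")
	if err != nil {
		panic(err)
	}
	if err := cli.ContainerStart(ctx, resp.ID, container.StartOptions{}); err != nil {
		panic(err)
	}

	// Read back the ephemeral mappings; a shim-like process would store
	// these as part of the task's state.
	info, err := cli.ContainerInspect(ctx, resp.ID)
	if err != nil {
		panic(err)
	}
	for port, bindings := range info.NetworkSettings.Ports {
		for _, b := range bindings {
			fmt.Printf("container %s -> host %s:%s\n", port, b.HostIP, b.HostPort)
		}
	}
}
```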

All communication between the `dstack` server and `dstack-shim` happens via REST API through an SSH tunnel. `dstack-shim` doesn't collect logs. Usually, it is run from a `cloud-init` user-data script.
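
As an illustration of that transport only (not dstack's actual client code; the host, user, credentials, and API port below are invented), an HTTP client can be routed through an SSH connection with `golang.org/x/crypto/ssh`:

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net"
	"net/http"

	"golang.org/x/crypto/ssh"
)

func main() {
	cfg := &ssh.ClientConfig{
		User:            "ubuntu",
		Auth:            []ssh.AuthMethod{ssh.Password("placeholder")}, // real setups use keys
		HostKeyCallback: ssh.InsecureIgnoreHostKey(),                   // acceptable only in a sketch
	}
	sshClient, err := ssh.Dial("tcp", "instance.example.com:22", cfg)
	if err != nil {
		panic(err)
	}
	defer sshClient.Close()

	// Route every HTTP connection through the SSH connection instead of
	// dialing the network directly; the dial happens on the remote host.
	httpClient := &http.Client{Transport: &http.Transport{
		DialContext: func(ctx context.Context, network, addr string) (net.Conn, error) {
			return sshClient.Dial(network, addr)
		},
	}}

	// Hypothetical endpoint and port, invented for this example.
	resp, err := httpClient.Get("http://localhost:10998/api/healthcheck")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	fmt.Println(string(body))
}
```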

The entrypoint for the container:
- Installs `openssh-server`
- Adds project and user public keys to `~/.ssh/authorized_keys`
- Downloads `dstack-runner` from the public S3 bucket
-- Starts `dstack-runner`
+- Starts `sshd` and `dstack-runner`

## dstack-runner

@@ -32,10 +46,10 @@ The entrypoint for the container:
- STEP 2: Wait for the code (tarball or diff)
- STEP 3: Prepare the repo (clone git repo and apply the diff, or extract the archive)
- STEP 4: Run the commands from the job spec
  - Wait for the commands to exit
  - Serve logs to the `dstack` server via HTTP
  - Serve real-time logs to the CLI via WebSocket
  - Wait for the signal to terminate the commands (see the sketch after this list)
- STEP 5: Wait until all logs are read by the server and the CLI, or exit after a timeout
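
The STEP 4 wait logic referenced above can be pictured with this hedged Go sketch; the `JobSpec` shape and the channel wiring are invented for illustration, not dstack-runner's real internals:

```go
package runner

import (
	"context"
	"io"
	"os/exec"
	"strings"
)

// JobSpec is a hypothetical, trimmed-down job description.
type JobSpec struct {
	Commands  []string  // shell commands from the job spec
	LogWriter io.Writer // sink that feeds the HTTP/WebSocket log endpoints
}

// executeJob runs the commands and waits for either a natural exit or a
// termination signal, mirroring the two "wait" branches of STEP 4.
func executeJob(ctx context.Context, spec JobSpec, terminate <-chan struct{}) error {
	ctx, cancel := context.WithCancel(ctx)
	defer cancel()

	cmd := exec.CommandContext(ctx, "/bin/sh", "-c", strings.Join(spec.Commands, " && "))
	cmd.Stdout = spec.LogWriter
	cmd.Stderr = spec.LogWriter
	if err := cmd.Start(); err != nil {
		return err
	}

	done := make(chan error, 1)
	go func() { done <- cmd.Wait() }()

	select {
	case err := <-done: // commands exited on their own
		return err
	case <-terminate: // server asked to terminate the commands
		cancel() // CommandContext kills the shell process
		<-done
		return nil
	}
}
```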

All communication between the `dstack` server and `dstack-runner` happens via REST API through an SSH tunnel. `dstack-runner` collects the job logs and its own logs. Only the job logs are served via WebSocket.
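
A toy version of the real-time log endpoint could look like the following; the `github.com/gorilla/websocket` library, the port, and the path are assumptions made for this sketch, not necessarily what dstack-runner uses:

```go
package main

import (
	"net/http"

	"github.com/gorilla/websocket"
)

var upgrader = websocket.Upgrader{}

// serveLogs upgrades the connection and pushes each job-log chunk to the
// CLI as it arrives, closing when the log channel is drained.
func serveLogs(logs <-chan []byte) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		conn, err := upgrader.Upgrade(w, r, nil)
		if err != nil {
			return
		}
		defer conn.Close()
		for chunk := range logs {
			if err := conn.WriteMessage(websocket.TextMessage, chunk); err != nil {
				return
			}
		}
	}
}

func main() {
	logs := make(chan []byte)
	http.HandleFunc("/logs_ws", serveLogs(logs)) // hypothetical path
	go func() { logs <- []byte("hello from the job\n") }()
	http.ListenAndServe("localhost:10999", nil) // hypothetical port
}
```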
@@ -44,6 +58,4 @@ All communication between the `dstack` server and `dstack-runner` happens via REST API

`dstack` expects a running SSH server right next to `dstack-runner`. It provides a secure channel for communicating with the runner API and for forwarding any ports without listening on `0.0.0.0`. `dstack-gateway` also uses this SSH server to forward requests from public endpoints.

-`dstack-shim` must also be running next to the SSH server. The `dstack` server connects to this SSH server for interacting with both `dstack-shim` and `dstack-runner` since we use `host` networking mode for the Docker container. The CLI uses this SSH server as a jump host because the user wants to connect to the container.
-
-> `host` networking mode would allow jobs to use any port at any moment for internal communication. For example, during distributed PyTorch training.
+`dstack-shim` must also be running next to the SSH server. The `dstack` server connects to this SSH server for interacting with both `dstack-shim` and `dstack-runner`. The CLI uses this SSH server as a jump host because the user wants to connect to the container.
2 changes: 2 additions & 0 deletions runner/README.md
@@ -92,6 +92,8 @@ These are nonexhaustive lists of external dependencies (executables, libraries)
* `lsblk`
* `mkfs.ext4`
* (NVIDIA GPU SSH fleet instances only) `nvidia-smi`
+* (AMD SSH fleet instances only) `docker` (used for `amd-smi` container)
+* (Intel Gaudi SSH fleet instances only) `hl-smi`
* ...

Debian/Ubuntu packages: `mount` (`mount`, `umount`), `util-linux` (`mountpoint`, `lsblk`), `e2fsprogs` (`mkfs.ext4`)
