Commit 8cf20af: update readme for tpu prov
danielvegamyhre committed May 22, 2024 · 1 parent b653bcf
Showing 1 changed file with 82 additions and 3 deletions: tpu-provisioner/README.md
Node Pools are cleaned up when no Pods are currently running on them.

## Setup

### Create a GKE Cluster with workload identity enabled

The TPU Provisioner requires workload identity to be enabled.

Refer to the [public docs](https://cloud.google.com/kubernetes-engine/docs/how-to/workload-identity) and follow
the steps to create a cluster with workload identity enabled.
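For example, a minimal sketch of creating such a cluster (the cluster name and region are illustrative; the linked docs cover the full set of options):

```sh
# A sketch only; CLUSTER_NAME and the region are illustrative values, and the
# public docs linked above cover the full set of cluster-creation options.
gcloud container clusters create ${CLUSTER_NAME} \
    --region us-central1 \
    --workload-pool=${PROJECT_ID}.svc.id.goog
```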

Note that if you plan to [preload container images via secondary boot disks](https://cloud.google.com/kubernetes-engine/docs/how-to/data-container-image-preloading#create-cluster-secondary-disk) to reduce pod startup latency, you'll
need to use the COS node image family and enable image streaming, as described in the linked docs.

### Install JobSet

TPU Provisioner dynamically provisions TPU slices for [JobSets](https://jobset.sigs.k8s.io). JobSet is a Kubernetes-native API
for running distributed ML training workloads and is the recommended solution for TPU Multislice training. However, it
is generic and can be used for any batch workload as well (GPUs, CPUs, etc.).

Follow the [installation steps](https://jobset.sigs.k8s.io/docs/installation/) to install the latest release of JobSet
in your cluster.
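For example, a sketch of installing a pinned release with `kubectl` (the version shown is illustrative; use the latest one from the installation page):

```sh
# Illustrative version; check the JobSet installation docs for the current release.
JOBSET_VERSION=v0.5.2
kubectl apply --server-side -f "https://github.com/kubernetes-sigs/jobset/releases/download/${JOBSET_VERSION}/manifests.yaml"
```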

### Permissions

Create the TPU Provisioner Service Account, which will be the default service account used
by the TPU Provisioner controller.

```sh
gcloud iam service-accounts create tpu-provisioner
export PROVISIONER_SERVICE_ACCOUNT=tpu-provisioner@${PROJECT_ID}.iam.gserviceaccount.com
```

Create the GKE Node Service Account, which will be used for node pool operations.

```sh
gcloud iam service-accounts create k8s-node
export NODE_SERVICE_ACCOUNT=k8s-node@${PROJECT_ID}.iam.gserviceaccount.com
```

Give both Service Accounts permission to administer GKE clusters (`roles/container.clusterAdmin`):

```bash
gcloud projects add-iam-policy-binding $PROJECT_ID --member="serviceAccount:${PROVISIONER_SERVICE_ACCOUNT}" --role='roles/container.clusterAdmin'
```

And the same for the GKE Node Service Account:

```bash
gcloud projects add-iam-policy-binding $PROJECT_ID --member="serviceAccount:${NODE_SERVICE_ACCOUNT}" --role='roles/container.clusterAdmin'
```
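To double-check the bindings, you can list the roles granted to each service account (a sketch using standard `gcloud` filtering):

```sh
# List the project-level roles held by the provisioner service account;
# repeat with ${NODE_SERVICE_ACCOUNT} to check the node service account.
gcloud projects get-iam-policy ${PROJECT_ID} \
    --flatten="bindings[].members" \
    --format="table(bindings.role)" \
    --filter="bindings.members:${PROVISIONER_SERVICE_ACCOUNT}"
```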

Bind the GCP Service Account to the Kubernetes Service Account that will be attached to the controller Pod.

```sh
gcloud iam service-accounts add-iam-policy-binding ${PROVISIONER_SERVICE_ACCOUNT} \
    --role roles/iam.workloadIdentityUser \
    --member "serviceAccount:${PROJECT_ID}.svc.id.goog[tpu-provisioner-system/tpu-provisioner-controller-manager]"
```

Add an IAM policy binding that allows the TPU Provisioner service account to impersonate the `k8s-node` service account:

```bash
gcloud iam service-accounts add-iam-policy-binding k8s-node@${PROJECT_ID}.iam.gserviceaccount.com --member="serviceAccount:${PROVISIONER_SERVICE_ACCOUNT}" --role='roles/iam.serviceAccountUser'
```
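To confirm the impersonation binding took effect, you can inspect the node service account's IAM policy (a sketch):

```sh
# Shows the policy on the k8s-node service account; the provisioner service
# account should appear under roles/iam.serviceAccountUser.
gcloud iam service-accounts get-iam-policy ${NODE_SERVICE_ACCOUNT}
```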

### Deployment directory setup

TPU Provisioner deployment configurations are defined at the per-cluster level, using config files that live in
a directory structure as follows:

`${REPO_ROOT}/deploy/${PROJECT_ID}/${CLUSTER_NAME}`

You will need to create the `deploy/${PROJECT_ID}/${CLUSTER_NAME}` directory for each cluster you deploy
the provisioner on.

Next, copy the files from `deploy/example-project/example-cluster` into your new `deploy/${PROJECT_ID}/${CLUSTER_NAME}`
directory and update the templated values in the YAML files to match your own project and cluster.
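For example, a minimal sketch of these two steps (assuming you are at the repo root and have `PROJECT_ID` and `CLUSTER_NAME` exported):

```sh
# Create the per-cluster deployment directory and seed it from the example.
mkdir -p deploy/${PROJECT_ID}/${CLUSTER_NAME}
cp -r deploy/example-project/example-cluster/. deploy/${PROJECT_ID}/${CLUSTER_NAME}/
```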

### Building and Deploying the Controller

Build and push your image. For example:
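A possible sketch, assuming the standard Kubebuilder Makefile targets are present (the image path below is a hypothetical Artifact Registry location; substitute your own):

```sh
# Hypothetical image location; replace with your own registry and repository.
export IMG=us-docker.pkg.dev/${PROJECT_ID}/tpu-provisioner/tpu-provisioner:latest

# docker-build and docker-push are the Makefile targets Kubebuilder generates.
make docker-build docker-push IMG=${IMG}
```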
Deploy the controller:

```sh
kubectl apply --server-side -k ./deploy/${PROJECT_ID}/${CLUSTER_NAME}
```
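To check that the controller came up (the namespace matches the Workload Identity binding above):

```sh
# The controller manager runs in the tpu-provisioner-system namespace.
kubectl get pods -n tpu-provisioner-system
```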


## Run an example

After deploying the TPU provisioner on your cluster following the steps above, you can run an example workload to
test that the configurations are set up correctly.

There are 2 things to keep in mind here:

1. You need sufficient quota for whatever TPU machine type you intend to run your workload on.
2. TPU Provisioner operates on [JobSets](https://jobset.sigs.k8s.io) so you'll need to deploy your workload as a JobSet.
See these [JobSet examples](https://jobset.sigs.k8s.io/docs/tasks/) to get started.

This repo includes a simple distributed JAX workload on TPU v4 machines that can be used to verify
that your setup is correct.

To apply it, simply run: `kubectl apply -f examples/jobset.yaml` (note: you can tweak the JobSet configuration
to define the TPU machine type, number of TPU slices, and their topology).

Next, run `kubectl get pods` to ensure pods have been created - you should see some pending pods.

These pending pods should trigger node pool creation requests for TPU v4 slices of 2x2x2 topology.

Within a few minutes, the node pool creation operations should complete and you should see the pods
transition from `Pending` to `Running`. In the container logs, you should see the total TPU device count.
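A few commands that may help when verifying the example end to end (a sketch; `REGION` and the pod name are placeholders for your own values):

```sh
# Watch pods move from Pending to Running as node pools are provisioned.
kubectl get pods -w

# List node pools to see the dynamically created TPU slices
# (assumes CLUSTER_NAME and REGION are set for your cluster).
gcloud container node-pools list --cluster ${CLUSTER_NAME} --region ${REGION}

# Once a worker pod is Running, check its logs for the total TPU device count.
kubectl logs <pod-name>
```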

## Development

This project is written in Go and uses the [Kubebuilder](https://book.kubebuilder.io/) tool.

You’ll need a Kubernetes cluster to run against. For local development and quick manual testing, you can do the following:

Impersonate the Service Account created above, for example:
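One way to do this is with application-default credentials and service account impersonation (a sketch; adapt it to your own workflow):

```sh
# Obtain application-default credentials that impersonate the provisioner
# service account (assumes PROVISIONER_SERVICE_ACCOUNT is exported as above).
gcloud auth application-default login \
    --impersonate-service-account=${PROVISIONER_SERVICE_ACCOUNT}
```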
