Set the default GKE cluster type for ray to GKE Autopilot. (#618)
Also add instructions to use a standard cluster if preferred.
roberthbailey committed Apr 29, 2024
1 parent b4203e5 commit 68d8589
Showing 3 changed files with 56 additions and 15 deletions.
58 changes: 47 additions & 11 deletions applications/ray/README.md
@@ -3,22 +3,58 @@
This repository contains a Terraform template for running [Ray](https://www.ray.io/) on Google Kubernetes Engine.
See the [Ray on GKE](/ray-on-gke/) directory to see additional guides and references.

## Prerequisites

1. A GCP project with the following APIs enabled:
- container.googleapis.com
- iap.googleapis.com (required when using authentication with Identity-Aware Proxy)

2. A functional GKE cluster.
- To create a new standard or autopilot cluster, follow the instructions in [`infrastructure/README.md`](https://github.com/GoogleCloudPlatform/ai-on-gke/blob/main/infrastructure/README.md)
- Alternatively, you can set the `create_cluster` variable to true in `workloads.tfvars` to provision a new GKE cluster. This will default to creating a GKE Autopilot cluster; if you want to provision a standard cluster you must also set `autopilot_cluster` to false.

3. This module can optionally use Identity-Aware Proxy (IAP) to protect access to the Ray dashboard. It expects the brand and the OAuth consent screen to be configured in your organization. You can check the details here: [OAuth consent screen](https://console.cloud.google.com/apis/credentials/consent)

4. Preinstall the following on your computer:
* Terraform
* gcloud CLI

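As a minimal sketch of the second option in step 2, the following `workloads.tfvars` fragment asks Terraform to provision a new standard cluster rather than the default Autopilot one. The two variable names come from this module; treat the fragment as an illustration to adapt, not a complete configuration.

```hcl
# Sketch only: have Terraform create the GKE cluster instead of reusing one.
create_cluster = true

# Autopilot is the default; set this to false to provision a standard cluster.
autopilot_cluster = false
```

Leaving `autopilot_cluster` unset (or `true`) while `create_cluster = true` yields an Autopilot cluster.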
## Installation

Preinstall the following on your computer:
* Terraform
* Gcloud
### Configure Inputs

> **_NOTE:_** Terraform keeps state metadata in a local file called `terraform.tfstate`. Deleting the file may cause some resources to not be cleaned up correctly even if you delete the cluster. We suggest using `terraform destroy` before reapplying/reinstalling.
1. If needed, clone the repo
```
git clone https://github.com/GoogleCloudPlatform/ai-on-gke
cd ai-on-gke/applications/ray
```

1. If needed, git clone https://github.com/GoogleCloudPlatform/ai-on-gke
2. Edit `workloads.tfvars` with your GCP settings.

**Important Note:**
If using this with the Jupyter module (`applications/jupyter/`), use the same Kubernetes namespace
for both, i.e. set `kubernetes_namespace` to the same value as in `applications/jupyter/workloads.tfvars`.

| Variable | Description | Required |
|-----------------------------|----------------------------------------------------------------------------------------------------------------|:--------:|
| project_id | GCP project ID | Yes |
| cluster_name | GKE Cluster Name | Yes |
| cluster_location | GCP Region | Yes |
| kubernetes_namespace | The namespace in which Ray and the other resources will be installed. | Yes |
| gcs_bucket | GCS bucket to be used for Ray storage | Yes |
| create_service_account | Create service accounts used for Workload Identity mapping | Yes |

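For illustration, a filled-in `workloads.tfvars` covering the required variables in the table might look like the sketch below. Every value is a placeholder or an assumption: the project ID, cluster name, and bucket name are hypothetical, while `us-central1` and `ml` simply mirror the values that ship with this module.

```hcl
# Hypothetical values; replace each with your own settings.
project_id             = "my-gcp-project"  # placeholder project ID
cluster_name           = "my-ray-cluster"  # placeholder cluster name
cluster_location       = "us-central1"
kubernetes_namespace   = "ml"              # keep in sync with applications/jupyter if used together
gcs_bucket             = "my-ray-bucket"   # placeholder bucket name
create_service_account = true              # assumed value; choose per your Workload Identity setup
```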

### Install

> **_NOTE:_** Terraform keeps state metadata in a local file called `terraform.tfstate`. Deleting the file may cause some resources to not be cleaned up correctly even if you delete the cluster. We suggest using `terraform destroy` before reapplying/reinstalling.
2. `cd applications/ray`
3. Ensure your gcloud application default credentials are in place.
```
gcloud auth application-default login
```

3. Find the name and location of the GKE cluster you want to use.
Run `gcloud container clusters list --project=<your GCP project>` to see all the available clusters.
_Note: If you created the GKE cluster via the infrastructure repo, you can get the cluster info from `platform.tfvars`_
4. Run `terraform init`

4. Edit `workloads.tfvars` with your environment specific variables and configurations.
5. Run `terraform apply --var-file=./workloads.tfvars`.

5. Run `terraform init && terraform apply --var-file workloads.tfvars`
4 changes: 2 additions & 2 deletions applications/ray/variables.tf
@@ -39,7 +39,7 @@ variable "ray_version" {
variable "kubernetes_namespace" {
type = string
description = "Kubernetes namespace where resources are deployed"
default = "myray"
default = "ml"
}

variable "enable_grafana_on_ray_dashboard" {
@@ -105,7 +105,7 @@ variable "private_cluster" {

variable "autopilot_cluster" {
type = bool
default = false
default = true
}

variable "cpu_pools" {
9 changes: 7 additions & 2 deletions applications/ray/workloads.tfvars
@@ -17,11 +17,16 @@
## Need to pull these variables from tf output from previous platform stage
project_id = "<your project ID>"

## this is required for terraform to connect to GKE master and deploy workloads
create_cluster = false # this flag will create a new standard public gke cluster in default network
## This is required for terraform to connect to GKE cluster and deploy workloads.
cluster_name = "<cluster name>"
cluster_location = "us-central1"

## If terraform should create a new GKE cluster, fill in this section as well.
## By default, a public autopilot GKE cluster will be created in the default network.
## Set the autopilot_cluster variable to false to create a standard cluster instead.
create_cluster = false
autopilot_cluster = true

#######################################################
#### APPLICATIONS
#######################################################