
[Serve][Doc] Add loadbalancer and monitoring section for K8s #36775

Merged
merged 8 commits into from
Jun 29, 2023

Conversation

sihanwang41
Contributor

Why are these changes needed?

Related issue number

Checks

  • I've signed off every commit (using the `-s` flag, i.e., `git commit -s`) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Sihan Wang <sihanwang41@gmail.com>
@sihanwang41 sihanwang41 marked this pull request as ready for review June 23, 2023 21:58
doc/source/serve/production-guide/kubernetes.md (outdated; resolved)
You can set up ingress to expose your Serve application to the internet. See [this example](https://github.com/ray-project/kuberay/blob/master/ray-operator/config/samples/ray_v1alpha1_rayservice-alb-ingress.yaml).

:::{note}
- Ray Serve provisions an HTTP proxy on every node, allowing you to use `/-/routes` as the endpoint for node health checks.
Contributor

Suggested change
- Ray Serve provisions an HTTP proxy on every node, allowing you to use `/-/routes` as the endpoint for node health checks.
- Ray Serve runs an HTTP proxy on every node. Each proxy exposes a `/healthz` endpoint that you can query to check its health.
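The suggested `/healthz` endpoint can back a Kubernetes pod probe. A minimal sketch, assuming the proxy listens on the default port 8000 (the timing values here are illustrative, not from this PR):

```yaml
# Readiness probe against the Serve HTTP proxy's /healthz endpoint.
# Assumes the default proxy port 8000; tune delays and thresholds as needed.
readinessProbe:
  httpGet:
    path: /healthz
    port: 8000
  initialDelaySeconds: 10
  periodSeconds: 5
  failureThreshold: 3
```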


:::{note}
- Ray Serve provisions an HTTP proxy on every node, allowing you to use `/-/routes` as the endpoint for node health checks.
- Ray Serve uses port 8000 as the default HTTP proxy traffic port. You can change the port by setting `http_options` in the Serve config. Learn more details [here](serve-multi-application)
Contributor

Suggested change
- Ray Serve uses port 8000 as the default HTTP proxy traffic port. You can change the port by setting `http_options` in the Serve config. Learn more details [here](serve-multi-application)
- Ray Serve uses port 8000 as the default HTTP proxy traffic port. You can change the port by setting `http_options` in the Serve config. See [Deploy Multiple Applications](serve-multi-application) for more details.
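For illustration, overriding the default proxy port in the Serve config looks roughly like this (a sketch; `8080` is an arbitrary example value):

```yaml
# Serve config snippet: move the HTTP proxy off the default port 8000.
http_options:
  host: 0.0.0.0
  port: 8080
```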

Signed-off-by: Sihan Wang <sihanwang41@gmail.com>
@@ -255,6 +255,21 @@ To enable autoscaling in a KubeRay Cluster, you need to set `enableInTreeAutosca
In most use cases, enabling Kubernetes autoscaling is recommended to fully utilize the resources in your cluster. If you are using GKE, you can use an Autopilot Kubernetes cluster; for instructions, see [Create an Autopilot Cluster](https://cloud.google.com/kubernetes-engine/docs/how-to/creating-an-autopilot-cluster). On EKS, you can enable Kubernetes cluster autoscaling with the Cluster Autoscaler; for details, see [Cluster Autoscaler on AWS](https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/aws/README.md). To understand the relationship between Kubernetes autoscaling and Ray autoscaling, see [Ray Autoscaler with Kubernetes Cluster Autoscaler](kuberay-autoscaler-with-ray-autoscaler).
:::
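As a sketch of the `enableInTreeAutoscaling` flag discussed above (the names and replica counts are illustrative, and most required fields are omitted):

```yaml
apiVersion: ray.io/v1alpha1
kind: RayCluster
metadata:
  name: example-cluster
spec:
  # Turns on the Ray autoscaler sidecar managed by KubeRay.
  enableInTreeAutoscaling: true
  workerGroupSpecs:
    - groupName: workers
      minReplicas: 0
      maxReplicas: 10
      # replicas, rayStartParams, and the pod template are omitted in this sketch.
```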

## Load balancer
Set up ingress to expose your Serve application to the internet. See [this configuration](https://github.com/ray-project/kuberay/blob/master/ray-operator/config/samples/ray_v1alpha1_rayservice-alb-ingress.yaml).
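A bare-bones sketch of such an ingress, modeled loosely on the linked ALB sample (the service name and port are assumptions; a RayService typically exposes a `*-serve-svc` service on port 8000):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: rayservice-ingress
  annotations:
    # "internet-facing" exposes the ALB publicly; use "internal" to keep it private.
    alb.ingress.kubernetes.io/scheme: internet-facing
spec:
  ingressClassName: alb
  rules:
    - http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: rayservice-sample-serve-svc
                port:
                  number: 8000
```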
Contributor

What if you want an internal service (not exposed to the internet)?

Contributor Author

@sihanwang41 sihanwang41 Jun 27, 2023

Rephrased. Basically, whether your ingress can be reached externally depends on the network configuration, not on the ingress itself.

Contributor

@akshay-anyscale akshay-anyscale left a comment

The `ray_v1alpha1_rayservice.yaml` needs to be updated to use `serveConfigV2`.

@sihanwang41
Contributor Author

ray_v1alpha1_rayservice.yaml

Hi @akshay-anyscale, the corresponding link can be updated after the new KubeRay version is released. The `ray_v1alpha1_rayservice.yaml` in the latest master already uses the `serveConfigV2` format. Normally we audit the doc versions every time there is a new KubeRay release.

Signed-off-by: Sihan Wang <sihanwang41@gmail.com>
@akshay-anyscale
Contributor

> ray_v1alpha1_rayservice.yaml
>
> Hi @akshay-anyscale, the corresponding link can be updated after the new KubeRay version is released. The `ray_v1alpha1_rayservice.yaml` in the latest master already uses the `serveConfigV2` format. Normally we audit the doc versions every time there is a new KubeRay release.

Got it, don't those changes need to be done before the Ray 2.6 release? CC: @kevin85421

- Ray Serve uses port 8000 as the default HTTP proxy traffic port. You can change the port by setting `http_options` in the Serve config. See [Deploy Multiple Applications](serve-multi-application) for more details.
:::

## Monitoring
Contributor

Let's also add links here for how to hook up third-party observability providers.

Contributor

Let's also add how to get to the kuberay logs for troubleshooting

Contributor Author

Added a link for logging with Loki and some troubleshooting notes.
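For the troubleshooting notes, pulling the KubeRay operator logs typically looks like this (a sketch; the deployment name, namespace, and `rayservice-sample` are assumptions that depend on how the operator and RayService were installed):

```shell
# Tail the KubeRay operator logs (names assume a default Helm install).
kubectl logs deployment/kuberay-operator --tail=100

# Inspect a RayService's status and recent events.
kubectl describe rayservice rayservice-sample
```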

Contributor

Thanks for updating! Do we have anything we recommend for metrics too? By the way, I noticed this guide as well: https://docs.ray.io/en/latest/cluster/kubernetes/user-guides/logging.html. @scottsun94 which one do you usually recommend?

Contributor

@scottsun94 scottsun94 Jun 27, 2023

For log aggregation tools, we list a few here: Fluent Bit, Vector, Fluentd, Filebeat, and Promtail. We use Fluent Bit as an example, but any of them should work; it depends on the users' preferences.

The Serve doc has instructions for how to use Promtail with Loki, which is good, but it doesn't expand on how to persist logs using Promtail. Here is my suggestion for how to reorganize the links in this section:

  • People need to learn how to export metrics first (if they want to see them in the Ray dashboard, Grafana, CloudWatch, or whatever monitoring tool they prefer).
  • Then, learn how to configure and use the dashboard.
  • Lastly, log persistence.

Learn how to export metrics and set up Grafana [here](https://ray-project.github.io/kuberay/guidance/prometheus-grafana/).
Learn more about how to configure and manage the Dashboard [here](observability-configure-manage-dashboard).
Learn about the Ray Serve Dashboard [here](serve-monitoring).
Learn about the [Ray Serve logs](serve-logging) and how to [persist logs](https://docs.ray.io/en/latest/cluster/kubernetes/user-guides/logging.html) on Kubernetes clusters.

Contributor Author

Updated!

Signed-off-by: Sihan Wang <sihanwang41@gmail.com>
Signed-off-by: Sihan Wang <sihanwang41@gmail.com>
@sihanwang41
Contributor Author

> ray_v1alpha1_rayservice.yaml
>
> Hi @akshay-anyscale, the corresponding link can be updated after the new KubeRay version is released. The `ray_v1alpha1_rayservice.yaml` in the latest master already uses the `serveConfigV2` format. Normally we audit the doc versions every time there is a new KubeRay release.
>
> Got it, don't those changes need to be done before the Ray 2.6 release? CC: @kevin85421

Yeah, we can update the doc link after the KubeRay 0.6.0 version is out. (We don't want to use a nightly link in the 2.6 release.)

- Learn more about how to configure and manage Dashboard [here](observability-configure-manage-dashboard).
- Learn about the Ray Serve Dashboard [here](serve-monitoring).
- Learn how to set up Grafana for the Dashboard [here](https://ray-project.github.io/kuberay/guidance/prometheus-grafana/).
- Learn about the [Ray Serve logs](serve-logging) and how to [persist logs](https://docs.ray.io/en/latest/cluster/kubernetes/user-guides/logging.html) on Kubernetes clusters.
Contributor

@scottsun94 scottsun94 Jun 28, 2023

Suggested change
- Learn about the [Ray Serve logs](serve-logging) and how to [persistent logs](https://docs.ray.io/en/latest/cluster/kubernetes/user-guides/logging.html) on Kubernetes clusters.
- Learn about the [Ray Serve logs](serve-logging) and how to [persist logs](kuberay-logging) on Kubernetes.

I shouldn't use the absolute URLs. Does this anchor work?

Signed-off-by: Sihan Wang <sihanwang41@gmail.com>
Signed-off-by: Sihan Wang <sihanwang41@gmail.com>
@rkooo567 rkooo567 merged commit cf0bdd6 into ray-project:master Jun 29, 2023
arvind-chandra pushed a commit to lmco/ray that referenced this pull request Aug 31, 2023
…ject#36775)

Signed-off-by: Sihan Wang <sihanwang41@gmail.com>
Signed-off-by: e428265 <arvind.chandramouli@lmco.com>