diff --git a/doc/source/_toc.yml b/doc/source/_toc.yml index e8efe1c2da94..a9e6993a7482 100644 --- a/doc/source/_toc.yml +++ b/doc/source/_toc.yml @@ -256,11 +256,13 @@ parts: sections: - file: serve/getting_started - file: serve/key-concepts + - file: serve/develop-and-deploy - file: serve/model_composition - file: serve/deploy-many-models/index sections: - file: serve/deploy-many-models/multi-app - file: serve/deploy-many-models/model-multiplexing + - file: serve/configure-serve-deployment - file: serve/http-guide - file: serve/production-guide/index title: Production Guide diff --git a/doc/source/serve/advanced-guides/deploy-vm.md b/doc/source/serve/advanced-guides/deploy-vm.md index d03700875108..e86bf8650294 100644 --- a/doc/source/serve/advanced-guides/deploy-vm.md +++ b/doc/source/serve/advanced-guides/deploy-vm.md @@ -36,27 +36,7 @@ The message `Sent deploy request successfully!` means: * It will start a new Serve application if one hasn't already started. * The Serve application will deploy the deployments from your deployment graph, updated with the configurations from your config file. -It does **not** mean that your Serve application, including your deployments, has already started running successfully. This happens asynchronously as the Ray cluster attempts to update itself to match the settings from your config file. Check out the [next section](serve-in-production-inspecting) to learn more about how to get the current status. - -## Adding a runtime environment - -The import path (e.g., `fruit:deployment_graph`) must be importable by Serve at runtime. -When running locally, this might be in your current working directory. -However, when running on a cluster you also need to make sure the path is importable. -You can achieve this either by building the code into the cluster's container image (see [Cluster Configuration](kuberay-config) for more details) or by using a `runtime_env` with a [remote URI](remote-uris) that hosts the code in remote storage. - -As an example, we have [pushed a copy of the FruitStand deployment graph to GitHub](https://github.com/ray-project/test_dag/blob/40d61c141b9c37853a7014b8659fc7f23c1d04f6/fruit.py). You can use this config file to deploy the `FruitStand` deployment graph to your own Ray cluster even if you don't have the code locally: - -```yaml -import_path: fruit:deployment_graph - -runtime_env: - working_dir: "https://github.com/ray-project/serve_config_examples/archive/HEAD.zip" -``` - -:::{note} -As a side note, you could also package your deployment graph into a standalone Python package that can be imported using a [PYTHONPATH](https://docs.python.org/3.10/using/cmdline.html#envvar-PYTHONPATH) to provide location independence on your local machine. However, it's still best practice to use a `runtime_env`, to ensure consistency across all machines in your cluster. -::: +It does **not** mean that your Serve application, including your deployments, has already started running successfully. This happens asynchronously as the Ray cluster attempts to update itself to match the settings from your config file. See [Inspect an application](serve-in-production-inspecting) for how to get the current status. (serve-in-production-remote-cluster)= @@ -74,7 +54,11 @@ As an example, the address for the local cluster started by `ray start --head` i $ serve deploy config_file.yaml -a http://127.0.0.1:52365 ``` -The Ray dashboard agent's default port is 52365. 
You can set it to a different value using the `--dashboard-agent-listen-port` argument when running `ray start`." +The Ray Dashboard agent's default port is 52365. To set it to a different value, use the `--dashboard-agent-listen-port` argument when running `ray start`. + +:::{note} +When running on a remote cluster, you need to ensure that the import path is accessible. See [Handle Dependencies](serve-handling-dependencies) for how to add a runtime environment. +::: :::{note} If the port 52365 (or whichever port you specify with `--dashboard-agent-listen-port`) is unavailable when Ray starts, the dashboard agent’s HTTP server will fail. However, the dashboard agent and Ray will continue to run. @@ -107,84 +91,6 @@ $ unset RAY_AGENT_ADDRESS Check for this variable in your environment to make sure you're using your desired Ray agent address. ::: -(serve-in-production-inspecting)= - -## Inspecting the application with `serve config` and `serve status` - -The Serve CLI also offers two commands to help you inspect your Serve application in production: `serve config` and `serve status`. -If you're working with a remote cluster, `serve config` and `serve status` also offer an `--address/-a` argument to access your cluster. Check out [the previous section](serve-in-production-remote-cluster) for more info on this argument. - -`serve config` gets the latest config file the Ray cluster received. This config file represents the Serve application's goal state. The Ray cluster will constantly attempt to reach and maintain this state by deploying deployments, recovering failed replicas, and more. - -Using the `fruit_config.yaml` example from [an earlier section](fruit-config-yaml): - -```console -$ ray start --head -$ serve deploy fruit_config.yaml -... - -$ serve config -import_path: fruit:deployment_graph - -runtime_env: {} - -deployments: - -- name: MangoStand - num_replicas: 2 - route_prefix: null -... -``` - -`serve status` gets your Serve application's current status. It's divided into two parts: the `app_status` and the `deployment_statuses`. - -The `app_status` contains three fields: -* `status`: a Serve application has four possible statuses: - * `"NOT_STARTED"`: no application has been deployed on this cluster. - * `"DEPLOYING"`: the application is currently carrying out a `serve deploy` request. It is deploying new deployments or updating existing ones. - * `"RUNNING"`: the application is at steady-state. It has finished executing any previous `serve deploy` requests, and it is attempting to maintain the goal state set by the latest `serve deploy` request. - * `"DEPLOY_FAILED"`: the latest `serve deploy` request has failed. -* `message`: provides context on the current status. -* `deployment_timestamp`: a unix timestamp of when Serve received the last `serve deploy` request. This is calculated using the `ServeController`'s local clock. - -The `deployment_statuses` contains a list of dictionaries representing each deployment's status. Each dictionary has three fields: -* `name`: the deployment's name. -* `status`: a Serve deployment has three possible statuses: - * `"UPDATING"`: the deployment is updating to meet the goal state set by a previous `deploy` request. - * `"HEALTHY"`: the deployment is at the latest requests goal state. - * `"UNHEALTHY"`: the deployment has either failed to update, or it has updated and has become unhealthy afterwards. This may be due to an error in the deployment's constructor, a crashed replica, or a general system or machine error. 
-* `message`: provides context on the current status. - -You can use the `serve status` command to inspect your deployments after they are deployed and throughout their lifetime. - -Using the `fruit_config.yaml` example from [an earlier section](fruit-config-yaml): - -```console -$ ray start --head -$ serve deploy fruit_config.yaml -... - -$ serve status -app_status: - status: RUNNING - message: '' - deployment_timestamp: 1655771534.835145 -deployment_statuses: -- name: MangoStand - status: HEALTHY - message: '' -- name: OrangeStand - status: HEALTHY - message: '' -- name: PearStand - status: HEALTHY - message: '' -- name: FruitMarket - status: HEALTHY - message: '' -- name: DAGDriver - status: HEALTHY - message: '' -``` +To inspect the status of the Serve application in production, see [Inspect an application](serve-in-production-inspecting). -`serve status` can also be used with KubeRay ({ref}`kuberay-index`), a Kubernetes operator for Ray Serve, to help deploy your Serve applications with Kubernetes. There's also work in progress to provide closer integrations between some of the features from this document, like `serve status`, with Kubernetes to provide a clearer Serve deployment story. +Make heavyweight code updates (like `runtime_env` changes) by starting a new Ray Cluster, updating your Serve config file, and deploying the file with `serve deploy` to the new cluster. Once the new deployment is finished, switch your traffic to the new cluster. diff --git a/doc/source/serve/advanced-guides/dyn-req-batch.md b/doc/source/serve/advanced-guides/dyn-req-batch.md index bff57f7c2d1e..666434eec9d3 100644 --- a/doc/source/serve/advanced-guides/dyn-req-batch.md +++ b/doc/source/serve/advanced-guides/dyn-req-batch.md @@ -44,7 +44,7 @@ end-before: __batch_params_update_end__ --- ``` -Use these methods in the `reconfigure` [method](serve-in-production-reconfigure) to control the `@serve.batch` parameters through your Serve configuration file. +Use these methods in the `reconfigure` [method](serve-user-config) to control the `@serve.batch` parameters through your Serve configuration file. ::: ## Streaming batched requests diff --git a/doc/source/serve/advanced-guides/inplace-updates.md b/doc/source/serve/advanced-guides/inplace-updates.md index c16fe89803cf..7e3a4d36c52f 100644 --- a/doc/source/serve/advanced-guides/inplace-updates.md +++ b/doc/source/serve/advanced-guides/inplace-updates.md @@ -14,8 +14,10 @@ Lightweight config updates modify running deployment replicas without tearing th Lightweight config updates are only possible for deployments that are included as entries under `deployments` in the config file. If a deployment is not included in the config file, replicas of that deployment will be torn down and brought up again each time you redeploy with `serve deploy`. ::: +(serve-updating-user-config)= + ## Updating User Config -Let's use the `FruitStand` deployment graph [from an earlier section](fruit-config-yaml) as an example. All the individual fruit deployments contain a `reconfigure()` method. This method allows us to issue lightweight updates to our deployments by updating the `user_config`. +Let's use the `FruitStand` deployment graph [from the production guide](fruit-config-yaml) as an example. All the individual fruit deployments contain a `reconfigure()` method. This method allows us to issue lightweight updates to our deployments by updating the `user_config`. First let's deploy the graph. 
Make sure to stop any previous Ray cluster using the CLI command `ray stop` for this example:
diff --git a/doc/source/serve/configure-serve-deployment.md b/doc/source/serve/configure-serve-deployment.md
new file mode 100644
index 000000000000..1fa8324b87d0
--- /dev/null
+++ b/doc/source/serve/configure-serve-deployment.md
@@ -0,0 +1,139 @@
+(serve-configure-deployment)=
+
+# Configure Ray Serve deployments
+
+The following parameters are configurable on a Ray Serve deployment. They are also documented in the [API reference](../serve/api/doc/ray.serve.deployment_decorator.rst).
+
+Configure these parameters either in the Serve config file or on the `@serve.deployment` decorator:
+
+- `name` - Name uniquely identifying this deployment within the application. If not provided, the name of the class or function is used.
+- `num_replicas` - Number of replicas to run that handle requests to this deployment. Defaults to 1.
+- `route_prefix` - Requests to paths under this HTTP path prefix are routed to this deployment. Defaults to `/`. This can only be set for the ingress (top-level) deployment of an application.
+- `ray_actor_options` - Options to pass to the Ray Actor decorator, such as resource requirements. Valid options are: `accelerator_type`, `memory`, `num_cpus`, `num_gpus`, `object_store_memory`, `resources`, and `runtime_env`. For more details, see [Resource management in Serve](serve-cpus-gpus).
+- `max_concurrent_queries` - Maximum number of queries that are sent to a replica of this deployment without receiving a response. Defaults to 100. This may be an important parameter to configure for [performance tuning](serve-perf-tuning).
+- `autoscaling_config` - Parameters to configure autoscaling behavior. If this is set, `num_replicas` cannot be set. For the configurable autoscaling parameters, see [Ray Serve Autoscaling](ray-serve-autoscaling).
+- `user_config` - Config to pass to the `reconfigure` method of the deployment. This can be updated dynamically without restarting the replicas of the deployment. The `user_config` must be fully JSON-serializable. For more details, see [Serve User Config](serve-user-config).
+- `health_check_period_s` - Duration between health check calls for the replica. Defaults to 10s. The health check is by default a no-op Actor call to the replica, but you can define your own health check by adding a `check_health` method to your deployment that raises an exception when the replica is unhealthy (see the sketch after this list).
+- `health_check_timeout_s` - Duration in seconds that replicas wait for a health check method to return before considering it failed. Defaults to 30s.
+- `graceful_shutdown_wait_loop_s` - Duration that replicas wait until there is no more work to be done before shutting down. Defaults to 2s.
+- `graceful_shutdown_timeout_s` - Duration to wait for a replica to gracefully shut down before being forcefully killed. Defaults to 20s.
+- `is_driver_deployment` - [EXPERIMENTAL] When set, exactly one replica of this deployment runs on every node (like a daemon set).
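+
+The following is a minimal sketch of a custom health check. The `HeavyModel` deployment and its `healthy` flag are hypothetical placeholders; replace the condition with whatever actually indicates that your replica can no longer serve traffic:
+
+```python
+from ray import serve
+
+
+@serve.deployment(health_check_period_s=10, health_check_timeout_s=30)
+class HeavyModel:
+    def __init__(self):
+        # Placeholder for real replica state, such as a database connection.
+        self.healthy = True
+
+    def check_health(self):
+        # Serve calls this method periodically; raising an exception marks
+        # this replica as unhealthy.
+        if not self.healthy:
+            raise RuntimeError("Replica lost its backing resources.")
+
+    def __call__(self, request) -> str:
+        return "ok"
+```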
+
+You can specify parameters in any of three ways:
+
+ - In the `@serve.deployment` decorator:
+
+```{literalinclude} ../serve/doc_code/configure_serve_deployment/model_deployment.py
+:start-after: __deployment_start__
+:end-before: __deployment_end__
+:language: python
+```
+
+ - Through the `.options()` method:
+
+```{literalinclude} ../serve/doc_code/configure_serve_deployment/model_deployment.py
+:start-after: __deployment_end__
+:end-before: __options_end__
+:language: python
+```
+
+ - In the YAML [Serve config file](serve-in-production-config-file):
+
+```yaml
+applications:
+
+- name: app1
+
+  route_prefix: /
+
+  import_path: configure_serve:translator_app
+
+  runtime_env: {}
+
+  deployments:
+
+  - name: Translator
+    num_replicas: 2
+    max_concurrent_queries: 100
+    graceful_shutdown_wait_loop_s: 2.0
+    graceful_shutdown_timeout_s: 20.0
+    health_check_period_s: 10.0
+    health_check_timeout_s: 30.0
+    ray_actor_options:
+      num_cpus: 0.2
+      num_gpus: 0.0
+```
+
+## Overriding deployment settings
+
+The order of priority is (from highest to lowest):
+
+1. Serve config file
+2. `.options()` call in the Python code
+3. `@serve.deployment` decorator in the Python code
+4. Serve defaults
+
+For example, if a deployment's `num_replicas` is specified in both the config file and the code, Serve uses the config file's value. If it's only specified in the code, Serve uses the code value. If it isn't specified anywhere, Serve uses the default of `num_replicas=1`.
+
+Keep in mind that this override order is applied separately to each individual parameter.
+For example, if a user has a deployment `ExampleDeployment` with the following decorator:
+
+```python
+@serve.deployment(
+    num_replicas=2,
+    max_concurrent_queries=15,
+)
+class ExampleDeployment:
+    ...
+```
+
+and the following config file:
+
+```yaml
+...
+
+deployments:
+
+  - name: ExampleDeployment
+    num_replicas: 5
+
+...
+```
+
+Serve sets `num_replicas=5`, using the config file value, and `max_concurrent_queries=15`, using the code value (because `max_concurrent_queries` wasn't specified in the config file). All other deployment settings use Serve defaults because the user didn't specify them in the code or the config.
+
+:::{tip}
+Remember that `ray_actor_options` counts as a single setting. The entire `ray_actor_options` dictionary in the config file overrides the entire `ray_actor_options` dictionary from the graph code. If there are individual options within `ray_actor_options` (e.g. `runtime_env`, `num_gpus`, `memory`) that are set in the code but not in the config, Serve still won't use the code settings if the config has a `ray_actor_options` dictionary. It treats these missing options as though the user never set them and uses defaults instead. This dictionary overriding behavior also applies to `user_config` and `autoscaling_config`.
+:::
+
+(serve-user-config)=
+## Dynamically changing parameters without restarting your replicas (`user_config`)
+
+You can use the `user_config` field to supply structured configuration for your deployment. You can pass arbitrary JSON-serializable objects to the YAML configuration. Serve then applies it to all running and future deployment replicas. Applying the user configuration *does not* restart the replica. This means you can use this field to dynamically:
+- adjust model weights and versions without restarting the cluster.
+- adjust traffic splitting percentage for your model composition graph.
+- configure any feature flag, A/B tests, and hyper-parameters for your deployments. + +To enable the `user_config` feature, you need to implement a `reconfigure` method that takes a JSON-serializable object (e.g., a Dictionary, List or String) as its only argument: + +```python +@serve.deployment +class Model: + def reconfigure(self, config: Dict[str, Any]): + self.threshold = config["threshold"] +``` + +If the `user_config` is set when the deployment is created (e.g., in the decorator or the Serve config file), this `reconfigure` method is called right after the deployment's `__init__` method, and the `user_config` is passed in as an argument. You can also trigger the `reconfigure` method by updating your Serve config file with a new `user_config` and reapplying it to your Ray cluster. See [In-place Updates](serve-inplace-updates) for more information. + +The corresponding YAML snippet is: + +```yaml +... +deployments: + - name: Model + user_config: + threshold: 1.5 +``` + + + diff --git a/doc/source/serve/develop-and-deploy.md b/doc/source/serve/develop-and-deploy.md new file mode 100644 index 000000000000..2e575a02448e --- /dev/null +++ b/doc/source/serve/develop-and-deploy.md @@ -0,0 +1,163 @@ +(serve-develop-and-deploy)= + +# Develop and deploy an ML application + +The flow for developing a Ray Serve application locally and deploying it in production covers the following steps: + +* Converting a Machine Learning model into a Ray Serve application +* Testing the application locally +* Building Serve config files for production deployment +* Deploying applications using a config file + +## Convert a model into a Ray Serve application + +This example uses a text-translation model: + +```{literalinclude} ../serve/doc_code/getting_started/models.py +:start-after: __start_translation_model__ +:end-before: __end_translation_model__ +:language: python +``` + +The Python file, called `model.py`, uses the `Translator` class to translate English text to French. + +- The `self.model` variable inside the `Translator`'s `__init__` method + stores a function that uses the [t5-small](https://huggingface.co/t5-small) + model to translate text. +- When `self.model` is called on English text, it returns translated French text + inside a dictionary formatted as `[{"translation_text": "..."}]`. +- The `Translator`'s `translate` method extracts the translated text by indexing into the dictionary. + +Copy and paste the script and run it locally. It translates `"Hello world!"` +into `"Bonjour Monde!"`. + +```console +$ python model.py + +Bonjour Monde! +``` + +Converting this model into a Ray Serve application with FastAPI requires three changes: +1. Import Ray Serve and Fast API dependencies +2. Add decorators for Serve deployment with FastAPI: `@serve.deployment` and `@serve.ingress(app)` +3. `bind` the `Translator` deployment to the arguments that are passed into its constructor + +For other HTTP options, see [Set Up FastAPI and HTTP](serve-set-up-fastapi-http). + +```{literalinclude} ../serve/doc_code/develop_and_deploy/model_deployment_with_fastapi.py +:start-after: __deployment_start__ +:end-before: __deployment_end__ +:language: python +``` + +Note that the code configures parameters for the deployment, such as `num_replicas` and `ray_actor_options`. These parameters help configure the number of copies of the deployment and the resource requirements for each copy. In this case, we set up 2 replicas of the model that take 0.2 CPUs and 0 GPUs each. 
For a complete guide on the configurable parameters on a deployment, see [Configure a Serve deployment](serve-configure-deployment). + +## Test a Ray Serve application locally + +To test locally, run the script with the `serve run` CLI command. This command takes in an import path formatted as `module:application`. Run the command from a directory containing a local copy of the script saved as `model.py`, so it can import the application: + +```console +$ serve run model:translator_app +``` + +This command runs the `translator_app` application and then blocks streaming logs to the console. You can kill it with `Ctrl-C`, which tears down the application. + +Now test the model over HTTP. Reach it at the following default URL: + +``` +http://127.0.0.1:8000/ +``` + +Send a POST request with JSON data containing the English text. This client script requests a translation for "Hello world!": + +```{literalinclude} ../serve/doc_code/develop_and_deploy/model_deployment_with_fastapi.py +:start-after: __client_function_start__ +:end-before: __client_function_end__ +:language: python +``` + +While a Ray Serve application is deployed, use the `serve status` CLI command to check the status of the application and deployment. For more details on the output format of `serve status`, see [Inspect Serve in production](serve-in-production-inspecting). + +```console +$ serve status +name: default +app_status: + status: RUNNING + message: '' + deployment_timestamp: 1687415211.531879 +deployment_statuses: +- name: default_Translator + status: HEALTHY + message: '' +``` + +## Build Serve config files for production deployment + +To deploy Serve applications in production, you need to generate a Serve config YAML file. A Serve config file is the single source of truth for the cluster, allowing you to specify system-level configuration and your applications in one place. It also allows you to declaratively update your applications. The `serve build` CLI command takes as input the import path and saves to an output file using the `-o` flag. You can specify all deployment parameters in the Serve config files. + +```console +$ serve build model:translator_app -o config.yaml +``` + +The serve build command adds a default application name that can be modified. The resulting Serve config file is: + +``` +# This file was generated using the `serve build` command on Ray v2.5.1. + +proxy_location: EveryNode + +http_options: + + host: 0.0.0.0 + + port: 8000 + +applications: + +- name: app1 + + route_prefix: / + + import_path: model:translator_app + + runtime_env: {} + + deployments: + + - name: Translator + num_replicas: 2 + ray_actor_options: + num_cpus: 0.2 + num_gpus: 0.0 +``` + +You can also use the Serve config file with `serve run` for local testing. For example: + +```console +$ serve run config.yaml +``` + +```console +$ serve status +name: app1 +app_status: + status: RUNNING + message: '' + deployment_timestamp: 1687630567.0700073 +deployment_statuses: +- name: app1_Translator + status: HEALTHY + message: '' +``` + +For more details, see [Serve Config Files](serve-in-production-config-file). + +## Deploy Ray Serve in production + +Deploy the Ray Serve application in production on Kubernetes using the [KubeRay] operator. Copy the YAML file generated in the previous step directly into the Kubernetes configuration. KubeRay supports zero-downtime upgrades, status reporting, and fault tolerance for your production application. See [Deploying on Kubernetes](serve-in-production-kubernetes) for more information. 
For production usage, consider implementing the recommended practice of setting up [head node fault tolerance](serve-e2e-ft-guide-gcs). + +## Monitor Ray Serve + +Use the Ray Dashboard to get a high-level overview of your Ray Cluster and Ray Serve application's states. The Ray Dashboard is available both during local testing and on a remote cluster in production. Ray Serve provides some in-built metrics and logging as well as utilities for adding custom metrics and logs in your application. For production deployments, exporting logs and metrics to your observability platforms is recommended. See [Monitoring](serve-monitoring) for more details. + +[KubeRay]: https://ray-project.github.io/kuberay/ diff --git a/doc/source/serve/doc_code/configure_serve_deployment/model_deployment.py b/doc/source/serve/doc_code/configure_serve_deployment/model_deployment.py new file mode 100644 index 000000000000..bbc7e0503b7f --- /dev/null +++ b/doc/source/serve/doc_code/configure_serve_deployment/model_deployment.py @@ -0,0 +1,64 @@ +# flake8: noqa + +# __deployment_start__ +import ray +from ray import serve +from fastapi import FastAPI + +from transformers import pipeline + +app = FastAPI() + + +@serve.deployment( + name="Translator", + route_prefix="/", + num_replicas=2, + ray_actor_options={"num_cpus": 0.2, "num_gpus": 0}, + max_concurrent_queries=100, + # autoscaling_config={"min_replicas": 1, "initial_replicas": 2, "max_replicas": 5, "target_num_ongoing_requests_per_replica": 10}, + # user_config={}, + health_check_period_s=10, + health_check_timeout_s=30, + graceful_shutdown_timeout_s=20, + graceful_shutdown_wait_loop_s=2, +) +@serve.ingress(app) +class Translator: + def __init__(self): + # Load model + self.model = pipeline("translation_en_to_fr", model="t5-small") + + @app.post("/") + def translate(self, text: str) -> str: + # Run inference + model_output = self.model(text) + + # Post-process output to return only the translation text + translation = model_output[0]["translation_text"] + + return translation + + +translator_app = Translator.bind() +# __deployment_end__ + +translator_app = Translator.options(ray_actor_options={}).bind() + +# __options_end__ +serve.run(translator_app) + +# __client_function_start__ +# File name: model_client.py +import requests + +response = requests.post("http://127.0.0.1:8000/", params={"text": "Hello world!"}) +french_text = response.json() + +print(french_text) +# __client_function_end__ + +assert french_text == "Bonjour monde!" 
+ +serve.shutdown() +ray.shutdown() diff --git a/doc/source/serve/doc_code/develop_and_deploy/model_deployment_with_fastapi.py b/doc/source/serve/doc_code/develop_and_deploy/model_deployment_with_fastapi.py new file mode 100644 index 000000000000..3d58185135b6 --- /dev/null +++ b/doc/source/serve/doc_code/develop_and_deploy/model_deployment_with_fastapi.py @@ -0,0 +1,50 @@ +# flake8: noqa + +# __deployment_start__ +import ray +from ray import serve +from fastapi import FastAPI + +from transformers import pipeline + +app = FastAPI() + + +@serve.deployment(num_replicas=2, ray_actor_options={"num_cpus": 0.2, "num_gpus": 0}) +@serve.ingress(app) +class Translator: + def __init__(self): + # Load model + self.model = pipeline("translation_en_to_fr", model="t5-small") + + @app.post("/") + def translate(self, text: str) -> str: + # Run inference + model_output = self.model(text) + + # Post-process output to return only the translation text + translation = model_output[0]["translation_text"] + + return translation + + +translator_app = Translator.bind() +# __deployment_end__ + +translator_app = Translator.options(ray_actor_options={}).bind() +serve.run(translator_app) + +# __client_function_start__ +# File name: model_client.py +import requests + +response = requests.post("http://127.0.0.1:8000/", params={"text": "Hello world!"}) +french_text = response.json() + +print(french_text) +# __client_function_end__ + +assert french_text == "Bonjour monde!" + +serve.shutdown() +ray.shutdown() diff --git a/doc/source/serve/index.md b/doc/source/serve/index.md index e80270bbe9e6..096c13d985bf 100644 --- a/doc/source/serve/index.md +++ b/doc/source/serve/index.md @@ -19,11 +19,11 @@ (rayserve-overview)= Ray Serve is a scalable model serving library for building online inference APIs. -Serve is framework-agnostic, so you can use a single toolkit to serve everything from deep learning models built with frameworks like PyTorch, Tensorflow, and Keras, to Scikit-Learn models, to arbitrary Python business logic. +Serve is framework-agnostic, so you can use a single toolkit to serve everything from deep learning models built with frameworks like PyTorch, Tensorflow, and Keras, to Scikit-Learn models, to arbitrary Python business logic. It has several features and performance optimizations for serving Large Language Models such as response streaming, dynamic request batching, multi-node/multi-GPU serving, etc. -Serve is particularly well suited for [model composition](serve-model-composition), enabling you to build a complex inference service consisting of multiple ML models and business logic all in Python code. +Ray Serve is particularly well suited for [model composition](serve-model-composition) and [many model serving](serve-deploy-many-models), enabling you to build a complex inference service consisting of multiple ML models and business logic all in Python code. -Serve is built on top of Ray, so it easily scales to many machines and offers flexible scheduling support such as fractional GPUs so you can share resources and serve many machine learning models at low cost. +Ray Serve is built on top of Ray, so it easily scales to many machines and offers flexible scheduling support such as fractional GPUs so you can share resources and serve many machine learning models at low cost. 
 ## Quickstart
@@ -155,7 +155,7 @@ Serve supports arbitrary Python code and therefore integrates well with the MLOp
 
 :::{dropdown} LLM developer
 :animate: fade-in-slide-down
 
-Serve enables you to rapidly prototype, develop, and deploy scalable LLM applications to production. Many large language model (LLM) applications combine prompt preprocessing, vector database lookups, LLM API calls, and response validation. Because Serve supports any arbitrary Python code, you can write all these steps as a single Python module, enabling rapid development and easy testing. You can then quickly deploy your Ray Serve LLM application to production, and each application step can independently autoscale to efficiently accommodate user traffic without wasting resources.
+Serve enables you to rapidly prototype, develop, and deploy scalable LLM applications to production. Many large language model (LLM) applications combine prompt preprocessing, vector database lookups, LLM API calls, and response validation. Because Serve supports any arbitrary Python code, you can write all these steps as a single Python module, enabling rapid development and easy testing. You can then quickly deploy your Ray Serve LLM application to production, and each application step can independently autoscale to efficiently accommodate user traffic without wasting resources. To improve the performance of your LLM applications, Ray Serve has features for batching and can integrate with any model optimization technique. Ray Serve also supports streaming responses, a key feature for chatbot-like applications.
 
 :::
 
diff --git a/doc/source/serve/key-concepts.md b/doc/source/serve/key-concepts.md
index 4ead551d44d4..13fe6ca91d72 100644
--- a/doc/source/serve/key-concepts.md
+++ b/doc/source/serve/key-concepts.md
@@ -30,6 +30,14 @@ handle = serve.run(my_first_deployment)
 print(ray.get(handle.remote())) # "Hello world!"
 ```
 
+(serve-key-concepts-application)=
+
+## Application
+
+An application is the unit of upgrade in a Ray Serve cluster. An application consists of one or more deployments. One of these deployments is considered the ["ingress" deployment](serve-key-concepts-ingress-deployment), which handles all inbound traffic.
+
+You can call an application via HTTP at the specified `route_prefix`, or in Python by retrieving a handle to the application by name.
+
 (serve-key-concepts-query-deployment)=
 
 ## ServeHandle (composing deployments)
@@ -112,7 +120,7 @@ class MostBasicIngress:
 
 ## Deployment Graph
 
-Building on top of the deployment concept, Ray Serve also provides a first-class API for composing multiple models into a graph structure and orchestrating the calls to each deployment automatically.
+Building on top of the deployment concept, Ray Serve also provides a first-class API for composing multiple models into a graph structure and orchestrating the calls to each deployment automatically. In this case, the `DAGDriver` is the ingress deployment.
 
 Here's a simple example combining a preprocess function and model.
diff --git a/doc/source/serve/model_composition.md b/doc/source/serve/model_composition.md
index 040b70acb2c6..f95c5b50d414 100644
--- a/doc/source/serve/model_composition.md
+++ b/doc/source/serve/model_composition.md
@@ -1,6 +1,6 @@
 (serve-model-composition)=
 
-# Deploy a Composition of Models
+# Deploy Compositions of Models
 
 This section helps you:
 
diff --git a/doc/source/serve/production-guide/best-practices.md b/doc/source/serve/production-guide/best-practices.md
index ae1e74315e84..99a67bd7d771 100644
--- a/doc/source/serve/production-guide/best-practices.md
+++ b/doc/source/serve/production-guide/best-practices.md
@@ -7,9 +7,88 @@ This section summarizes the best practices when deploying to production using th
 * Use `serve run` to manually test and improve your deployment graph locally.
 * Use `serve build` to create a Serve config file for your deployment graph.
 * Put your deployment graph's code in a remote repository and manually configure the `working_dir` or `py_modules` fields in your Serve config file's `runtime_env` to point to that repository.
-* Use `serve deploy` to deploy your graph and its deployments to your Ray cluster. After the deployment is finished, you can start serving traffic from your cluster.
 * Use `serve status` to track your Serve application's health and deployment progress.
 * Use `serve config` to check the latest config that your Serve application received. This is its goal state.
 * Make lightweight configuration updates (e.g. `num_replicas` or `user_config` changes) by modifying your Serve config file and redeploying it with `serve deploy`.
-* Make heavyweight code updates (e.g. `runtime_env` changes) by starting a new Ray cluster, updating your Serve config file, and deploying the file with `serve deploy` to the new cluster. Once the new deployment is finished, switch your traffic to the new cluster.
 
+(serve-in-production-inspecting)=
+
+## Inspect an application with `serve config` and `serve status`
+
+Two Serve CLI commands help you inspect a Serve application in production: `serve config` and `serve status`.
+If you have a remote cluster, `serve config` and `serve status` also have an `--address/-a` argument to access the cluster. See [VM deployment](serve-in-production-remote-cluster) for more information on this argument.
+
+`serve config` gets the latest config file that the Ray Cluster received. This config file represents the Serve application's goal state. The Ray Cluster constantly strives to reach and maintain this state by deploying deployments, recovering failed replicas, and performing other relevant actions.
+
+Using the `fruit_config.yaml` example from [an earlier section](fruit-config-yaml):
+
+```console
+$ ray start --head
+$ serve deploy fruit_config.yaml
+...
+
+$ serve config
+import_path: fruit:deployment_graph
+
+runtime_env: {}
+
+deployments:
+
+- name: MangoStand
+  num_replicas: 2
+  route_prefix: null
+...
+```
+
+`serve status` gets your Serve application's current status. The status has two parts per application: the `app_status` and the `deployment_statuses`.
+
+The `app_status` contains three fields:
+* `status`: A Serve application has four possible statuses:
+  * `"NOT_STARTED"`: No application has been deployed on this cluster.
+  * `"DEPLOYING"`: The application is currently carrying out a `serve deploy` request. It is deploying new deployments or updating existing ones.
+  * `"RUNNING"`: The application is at steady-state. 
It has finished executing any previous `serve deploy` requests, and is attempting to maintain the goal state set by the latest `serve deploy` request. + * `"DEPLOY_FAILED"`: The latest `serve deploy` request has failed. +* `message`: Provides context on the current status. +* `deployment_timestamp`: A UNIX timestamp of when Serve received the last `serve deploy` request. The timestamp is calculated using the `ServeController`'s local clock. + +The `deployment_statuses` contains a list of dictionaries representing each deployment's status. Each dictionary has three fields: +* `name`: The deployment's name. +* `status`: A Serve deployment has three possible statuses: + * `"UPDATING"`: The deployment is updating to meet the goal state set by a previous `deploy` request. + * `"HEALTHY"`: The deployment achieved the latest requests goal state. + * `"UNHEALTHY"`: The deployment has either failed to update, or has updated and has become unhealthy afterwards. This condition may be due to an error in the deployment's constructor, a crashed replica, or a general system or machine error. +* `message`: Provides context on the current status. + +Use the `serve status` command to inspect your deployments after they are deployed and throughout their lifetime. + +Using the `fruit_config.yaml` example from [an earlier section](fruit-config-yaml): + +```console +$ ray start --head +$ serve deploy fruit_config.yaml +... + +$ serve status +app_status: + status: RUNNING + message: '' + deployment_timestamp: 1655771534.835145 +deployment_statuses: +- name: MangoStand + status: HEALTHY + message: '' +- name: OrangeStand + status: HEALTHY + message: '' +- name: PearStand + status: HEALTHY + message: '' +- name: FruitMarket + status: HEALTHY + message: '' +- name: DAGDriver + status: HEALTHY + message: '' +``` + +For Kubernetes deployments with KubeRay, tighter integrations of `serve status` with Kubernetes are available. See [Getting the status of Serve applications in Kubernetes](serve-getting-status-kubernetes). \ No newline at end of file diff --git a/doc/source/serve/production-guide/config.md b/doc/source/serve/production-guide/config.md index ae9ba35fa106..9a56a3ce3539 100644 --- a/doc/source/serve/production-guide/config.md +++ b/doc/source/serve/production-guide/config.md @@ -11,58 +11,77 @@ This config file can be used with the [serve deploy](serve-in-production-deployi The file is written in YAML and has the following format: ```yaml -import_path: ... +http_options: -runtime_env: ... + host: ... -host: ... + port: ... -port: ... +applications: + +- name: ... + + route_prefix: ... + + import_path: ... + + runtime_env: ... -deployments: + deployments: - - name: ... - num_replicas: ... - ... + - name: ... + num_replicas: ... + ... - - name: - ... + - name: + ... ... ``` -The file contains the following fields: +The file contains `http_options` and `applications`. These are the `http_options`: -- An `import_path`, which is the path to your top-level Serve deployment (or the same path passed to `serve run`). The most minimal config file consists of only an `import_path`. -- A `runtime_env` that defines the environment that the application will run in. This is used to package application dependencies such as `pip` packages (see {ref}`Runtime Environments ` for supported fields). The `import_path` must be available _within_ the `runtime_env` if it's specified. 
The Serve config's `runtime_env` can only use [remote URIs](remote-uris) in its `working_dir` and `py_modules`; it cannot use local zip files or directories. - `host` and `port` are HTTP options that determine the host IP address and the port for your Serve application's HTTP proxies. These are optional settings and can be omitted. By default, the `host` will be set to `0.0.0.0` to expose your deployments publicly, and the port will be set to `8000`. If you're using Kubernetes, setting `host` to `0.0.0.0` is necessary to expose your deployments outside the cluster. + +These are the fields per application: + +- `name` - The names for each application are auto-generated by `serve build`. The name per application must be unique. +- `route_prefix` - An application can be called via HTTP at the specified route prefix. It defaults to `/`. The route prefix for each application must be unique +- An `import_path`, which is the path to your top-level Serve deployment (or the same path passed to `serve run`). The most minimal config file consists of only an `import_path`. +- A `runtime_env` that defines the environment that the application will run in. This is used to package application dependencies such as `pip` packages (see {ref}`Runtime Environments ` for supported fields). The `import_path` must be available _within_ the `runtime_env` if it's specified. The Serve config's `runtime_env` can only use [remote URIs](remote-uris) in its `working_dir` and `py_modules`; it cannot use local zip files or directories. [More details on runtime env](serve-runtime-env). - A list of `deployments`. This is optional and allows you to override the `@serve.deployment` settings specified in the deployment graph code. Each entry in this list must include the deployment `name`, which must match one in the code. If this section is omitted, Serve launches all deployments in the graph with the settings specified in the code. Below is an equivalent config for the [`FruitStand` example](serve-in-production-example): ```yaml -import_path: fruit:deployment_graph +applications: -runtime_env: {} +- name: app1 -deployments: + route_prefix: / + + import_path: fruit:deployment_graph + + runtime_env: {} - - name: FruitMarket - num_replicas: 2 + deployments: - - name: MangoStand - user_config: - price: 3 + - name: MangoStand + user_config: + price: 3 - - name: OrangeStand - user_config: - price: 2 + - name: OrangeStand + user_config: + price: 2 - - name: PearStand - user_config: - price: 4 + - name: PearStand + user_config: + price: 4 - - name: DAGDriver + - name: FruitMarket + num_replicas: 2 + + - name: DAGDriver ``` The file uses the same `fruit:deployment_graph` import path that was used with `serve run` and it has five entries in the `deployments` list– one for each deployment. All the entries contain a `name` setting and some other configuration options such as `num_replicas` or `user_config`. @@ -91,112 +110,44 @@ fruit_config.yaml The `fruit_config.yaml` file contains: ```yaml -import_path: fruit:deployment_graph - -runtime_env: {} - -host: 0.0.0.0 +http_options: -port: 8000 + host: 0.0.0.0 -deployments: + port: 8000 -- name: MangoStand - user_config: - price: 3 +applications: -- name: OrangeStand - user_config: - price: 2 +- name: app1 -- name: PearStand - user_config: - price: 4 - -- name: FruitMarket - num_replicas: 2 - -- name: DAGDriver route_prefix: / -``` -Note that the `runtime_env` field will always be empty when using `serve build` and must be set manually. 
+ import_path: fruit:deployment_graph -Additionally, `serve build` includes the default `host` and `port` in its -autogenerated files. You can modify these parameters to select a different host -and port. + runtime_env: {} -:::{tip} -You can use the `--kubernetes-format`/`-k` flag with `serve build` to print the Serve config in a format that can be copy-pasted directly into your [Kubernetes config](serve-in-production-kubernetes). -::: - -## Overriding deployment settings + deployments: -Settings from `@serve.deployment` can be overriden with this Serve config file. The order of priority is (from highest to lowest): + - name: MangoStand + user_config: + price: 3 -1. Config File -2. Deployment graph code (either through the `@serve.deployment` decorator or a `.set_options()` call) -3. Serve defaults + - name: OrangeStand + user_config: + price: 2 -For example, if a deployment's `num_replicas` is specified in the config file and their graph code, Serve will use the config file's value. If it's only specified in the code, Serve will use the code value. If the user doesn't specify it anywhere, Serve will use a default (which is `num_replicas=1`). + - name: PearStand + user_config: + price: 4 -Keep in mind that this override order is applied separately to each individual setting. -For example, if a user has a deployment `ExampleDeployment` with the following decorator: + - name: FruitMarket + num_replicas: 2 -```python -@serve.deployment( - num_replicas=2, - max_concurrent_queries=15, -) -class ExampleDeployment: - ... + - name: DAGDriver ``` -and the following config file: - -```yaml -... - -deployments: - - - name: ExampleDeployment - num_replicas: 5 - -... -``` - -Serve will set `num_replicas=5`, using the config file value, and `max_concurrent_queries=15`, using the code value (since `max_concurrent_queries` wasn't specified in the config file). All other deployment settings use Serve defaults since the user didn't specify them in the code or the config. - -:::{tip} -Remember that `ray_actor_options` counts as a single setting. The entire `ray_actor_options` dictionary in the config file overrides the entire `ray_actor_options` dictionary from the graph code. If there are individual options within `ray_actor_options` (e.g. `runtime_env`, `num_gpus`, `memory`) that are set in the code but not in the config, Serve still won't use the code settings if the config has a `ray_actor_options` dictionary. It will treat these missing options as though the user never set them and will use defaults instead. This dictionary overriding behavior also applies to `user_config`. -::: - -(serve-in-production-reconfigure)= - -## Dynamically adjusting parameters in deployment - -The `user_config` field can be used to supply structured configuration for your deployment. You can pass arbitrary JSON serializable objects to the YAML configuration. Serve will then apply it to all running and future deployment replicas. The application of user configuration *will not* restart the replica. This means you can use this field to dynamically: -- adjust model weights and versions without restarting the cluster. -- adjust traffic splitting percentage for your model composition graph. -- configure any feature flag, A/B tests, and hyper-parameters for your deployments. 
- -To enable the `user_config` feature, you need to implement a `reconfigure` method that takes a dictionary as its only argument: - -```python -@serve.deployment -class Model: - def reconfigure(self, config: Dict[str, Any]): - self.threshold = config["threshold"] -``` - -If the `user_config` is set when the deployment is created (e.g. in the decorator or the Serve config file), this `reconfigure` method is called right after the deployment's `__init__` method, and the `user_config` is passed in as an argument. You can also trigger the `reconfigure` method by updating your Serve config file with a new `user_config` and reapplying it to your Ray cluster. - -The corresponding YAML snippet is +Note that the `runtime_env` field will always be empty when using `serve build` and must be set manually. -```yaml -... -deployments: - - name: Model - user_config: - threshold: 1.5 -``` +Additionally, `serve build` includes the default `host` and `port` in its +autogenerated files. You can modify these parameters to select a different host +and port. \ No newline at end of file diff --git a/doc/source/serve/production-guide/handling-dependencies.md b/doc/source/serve/production-guide/handling-dependencies.md index a35ca351fd3f..b5de38dfb9d8 100644 --- a/doc/source/serve/production-guide/handling-dependencies.md +++ b/doc/source/serve/production-guide/handling-dependencies.md @@ -1,7 +1,30 @@ (serve-handling-dependencies)= # Handle Dependencies -Ray Serve supports serving deployments with different (possibly conflicting) +(serve-runtime-env)= +## Add a runtime environment + +The import path (e.g., `fruit:deployment_graph`) must be importable by Serve at runtime. +When running locally, this path might be in your current working directory. +However, when running on a cluster you also need to make sure the path is importable. +Build the code into the cluster's container image (see [Cluster Configuration](kuberay-config) for more details) or use a `runtime_env` with a [remote URI](remote-uris) that hosts the code in remote storage. + +As an example, we have [pushed a copy of the FruitStand deployment graph to GitHub](https://github.com/ray-project/test_dag/blob/40d61c141b9c37853a7014b8659fc7f23c1d04f6/fruit.py). You can use this config file to deploy the `FruitStand` deployment graph to your own Ray cluster even if you don't have the code locally: + +```yaml +import_path: fruit:deployment_graph + +runtime_env: + working_dir: "https://github.com/ray-project/serve_config_examples/archive/HEAD.zip" +``` + +:::{note} +You can also package a deployment graph into a standalone Python package that you can import using a [PYTHONPATH](https://docs.python.org/3.10/using/cmdline.html#envvar-PYTHONPATH) to provide location independence on your local machine. However, the best practice is to use a `runtime_env`, to ensure consistency across all machines in your cluster. +::: + +## Dependencies per deployment + +Ray Serve also supports serving deployments with different (and possibly conflicting) Python dependencies. For example, you can simultaneously serve one deployment that uses legacy Tensorflow 1 and another that uses Tensorflow 2. 
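+
+As a rough sketch of this pattern, each deployment can carry its own `runtime_env` through `ray_actor_options`. The deployment names and package pins below are placeholders, not tested versions:
+
+```python
+from ray import serve
+
+
+@serve.deployment(ray_actor_options={"runtime_env": {"pip": ["tensorflow==1.15.5"]}})
+class LegacyModel:
+    def __call__(self, request) -> str:
+        # Replicas of this deployment start in an environment with TF1 installed.
+        return "prediction from the TF1 model"
+
+
+@serve.deployment(ray_actor_options={"runtime_env": {"pip": ["tensorflow==2.12.0"]}})
+class ModernModel:
+    def __call__(self, request) -> str:
+        # Replicas of this deployment get their own environment with TF2.
+        return "prediction from the TF2 model"
+```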
diff --git a/doc/source/serve/production-guide/kubernetes.md b/doc/source/serve/production-guide/kubernetes.md index 5aa2c2e3aa71..dd766c66a387 100644 --- a/doc/source/serve/production-guide/kubernetes.md +++ b/doc/source/serve/production-guide/kubernetes.md @@ -105,6 +105,7 @@ $ curl -X POST -H 'Content-Type: application/json' localhost:8000 -d '["MANGO", 6 ``` +(serve-getting-status-kubernetes)= ## Getting the status of the application As the `RayService` is running, the `KubeRay` controller continually monitors it and writes relevant status updates to the CR. diff --git a/doc/source/serve/tutorials/index.md b/doc/source/serve/tutorials/index.md index 087da8d4e941..22737b9423a2 100644 --- a/doc/source/serve/tutorials/index.md +++ b/doc/source/serve/tutorials/index.md @@ -10,12 +10,12 @@ Ray Serve functionality and how to integrate different modeling frameworks. :name: serve-tutorials serve-ml-models -batch +stable-diffusion +text-classification +object-detection rllib gradio-integration +batch gradio-dag-visualization java -stable-diffusion -text-classification -object-detection ``` diff --git a/python/ray/serve/api.py b/python/ray/serve/api.py index 527cf96445a9..8dc14d1e6a5d 100644 --- a/python/ray/serve/api.py +++ b/python/ray/serve/api.py @@ -274,34 +274,34 @@ class MyDeployment: Args: name: Name uniquely identifying this deployment within the application. If not provided, the name of the class or function is used. - num_replicas: The number of replicas to run that handle requests to + num_replicas: Number of replicas to run that handle requests to this deployment. Defaults to 1. autoscaling_config: Parameters to configure autoscaling behavior. If this is set, `num_replicas` cannot be set. init_args: [DEPRECATED] These should be passed to `.bind()` instead. init_kwargs: [DEPRECATED] These should be passed to `.bind()` instead. route_prefix: Requests to paths under this HTTP path prefix are routed - to this deployment. Defaults to '/{name}'. This can only be set for the + to this deployment. Defaults to '/'. This can only be set for the ingress (top-level) deployment of an application. - ray_actor_options: Options to be passed to the Ray actor decorator, such as - resource requirements. Valid options are `accelerator_type`, `memory`, + ray_actor_options: Options to pass to the Ray Actor decorator, such as + resource requirements. Valid options are: `accelerator_type`, `memory`, `num_cpus`, `num_gpus`, `object_store_memory`, `resources`, and `runtime_env`. user_config: Config to pass to the reconfigure method of the deployment. This can be updated dynamically without restarting the replicas of the deployment. The user_config must be fully JSON-serializable. - max_concurrent_queries: The maximum number of queries that are sent to a + max_concurrent_queries: Maximum number of queries that are sent to a replica of this deployment without receiving a response. Defaults to 100. - health_check_period_s: How often the health check is called on the replica. - Defaults to 10s. The health check is by default a no-op actor call to the - replica, but you can define your own as a "check_health" method that raises - an exception when unhealthy. - health_check_timeout_s: How long to wait for a health check method to return - before considering it failed. Defaults to 30s. + health_check_period_s: Duration between health check calls for the replica. + Defaults to 10s. 
The health check is by default a no-op Actor call to the
+            replica, but you can define your own health check using the "check_health"
+            method in your deployment that raises an exception when unhealthy.
+        health_check_timeout_s: Duration in seconds that replicas wait for a health
+            check method to return before considering it failed. Defaults to 30s.
         graceful_shutdown_wait_loop_s: Duration that replicas wait until there is
-            no more work to be done before shutting down.
-        graceful_shutdown_timeout_s: Duration that a replica can be gracefully shutting
-            down before being forcefully killed.
+            no more work to be done before shutting down. Defaults to 2s.
+        graceful_shutdown_timeout_s: Duration to wait for a replica to gracefully
+            shut down before being forcefully killed. Defaults to 20s.
         is_driver_deployment: [EXPERIMENTAL] when set, exactly one replica of this
             deployment runs on every node (like a daemon set).