Adding new guides for developing and deploying an ML application in S… #36700

Merged: 1 commit, Jun 29, 2023
2 changes: 2 additions & 0 deletions doc/source/_toc.yml
@@ -256,11 +256,13 @@ parts:
sections:
- file: serve/getting_started
- file: serve/key-concepts
- file: serve/develop-and-deploy
- file: serve/model_composition
- file: serve/deploy-many-models/index
sections:
- file: serve/deploy-many-models/multi-app
- file: serve/deploy-many-models/model-multiplexing
- file: serve/configure-serve-deployment
- file: serve/http-guide
- file: serve/production-guide/index
title: Production Guide
110 changes: 8 additions & 102 deletions doc/source/serve/advanced-guides/deploy-vm.md
@@ -36,27 +36,7 @@ The message `Sent deploy request successfully!` means:
* It will start a new Serve application if one hasn't already started.
* The Serve application will deploy the deployments from your deployment graph, updated with the configurations from your config file.

It does **not** mean that your Serve application, including your deployments, has already started running successfully. This happens asynchronously as the Ray cluster attempts to update itself to match the settings from your config file. Check out the [next section](serve-in-production-inspecting) to learn more about how to get the current status.

## Adding a runtime environment

The import path (e.g., `fruit:deployment_graph`) must be importable by Serve at runtime.
When running locally, this might be in your current working directory.
However, when running on a cluster you also need to make sure the path is importable.
You can achieve this either by building the code into the cluster's container image (see [Cluster Configuration](kuberay-config) for more details) or by using a `runtime_env` with a [remote URI](remote-uris) that hosts the code in remote storage.

As an example, we have [pushed a copy of the FruitStand deployment graph to GitHub](https://github.com/ray-project/test_dag/blob/40d61c141b9c37853a7014b8659fc7f23c1d04f6/fruit.py). You can use this config file to deploy the `FruitStand` deployment graph to your own Ray cluster even if you don't have the code locally:

```yaml
import_path: fruit:deployment_graph

runtime_env:
working_dir: "https://github.com/ray-project/serve_config_examples/archive/HEAD.zip"
```

:::{note}
As a side note, you could also package your deployment graph into a standalone Python package that can be imported using a [PYTHONPATH](https://docs.python.org/3.10/using/cmdline.html#envvar-PYTHONPATH) to provide location independence on your local machine. However, it's still best practice to use a `runtime_env`, to ensure consistency across all machines in your cluster.
:::
It does **not** mean that your Serve application, including your deployments, has already started running successfully. This happens asynchronously as the Ray cluster attempts to update itself to match the settings from your config file. See [Inspect an application](serve-in-production-inspecting) for how to get the current status.

(serve-in-production-remote-cluster)=

@@ -74,7 +54,11 @@ As an example, the address for the local cluster started by `ray start --head` i
$ serve deploy config_file.yaml -a http://127.0.0.1:52365
```

The Ray dashboard agent's default port is 52365. You can set it to a different value using the `--dashboard-agent-listen-port` argument when running `ray start`."
The Ray Dashboard agent's default port is 52365. To set it to a different value, use the `--dashboard-agent-listen-port` argument when running `ray start`.

:::{note}
When running on a remote cluster, you need to ensure that the import path is accessible. See [Handle Dependencies](serve-handling-dependencies) for how to add a runtime environment.
:::

:::{note}
If the port 52365 (or whichever port you specify with `--dashboard-agent-listen-port`) is unavailable when Ray starts, the dashboard agent’s HTTP server will fail. However, the dashboard agent and Ray will continue to run.
@@ -107,84 +91,6 @@
$ unset RAY_AGENT_ADDRESS
Check for this variable in your environment to make sure you're using your desired Ray agent address.
:::

(serve-in-production-inspecting)=

## Inspecting the application with `serve config` and `serve status`

The Serve CLI also offers two commands to help you inspect your Serve application in production: `serve config` and `serve status`.
If you're working with a remote cluster, `serve config` and `serve status` also offer an `--address/-a` argument to access your cluster. Check out [the previous section](serve-in-production-remote-cluster) for more info on this argument.

`serve config` gets the latest config file the Ray cluster received. This config file represents the Serve application's goal state. The Ray cluster will constantly attempt to reach and maintain this state by deploying deployments, recovering failed replicas, and more.

Using the `fruit_config.yaml` example from [an earlier section](fruit-config-yaml):

```console
$ ray start --head
$ serve deploy fruit_config.yaml
...

$ serve config
import_path: fruit:deployment_graph

runtime_env: {}

deployments:

- name: MangoStand
  num_replicas: 2
  route_prefix: null
...
```

`serve status` gets your Serve application's current status. It's divided into two parts: the `app_status` and the `deployment_statuses`.

The `app_status` contains three fields:
* `status`: a Serve application has four possible statuses:
* `"NOT_STARTED"`: no application has been deployed on this cluster.
* `"DEPLOYING"`: the application is currently carrying out a `serve deploy` request. It is deploying new deployments or updating existing ones.
* `"RUNNING"`: the application is at steady-state. It has finished executing any previous `serve deploy` requests, and it is attempting to maintain the goal state set by the latest `serve deploy` request.
* `"DEPLOY_FAILED"`: the latest `serve deploy` request has failed.
* `message`: provides context on the current status.
* `deployment_timestamp`: a unix timestamp of when Serve received the last `serve deploy` request. This is calculated using the `ServeController`'s local clock.

The `deployment_statuses` contains a list of dictionaries representing each deployment's status. Each dictionary has three fields:
* `name`: the deployment's name.
* `status`: a Serve deployment has three possible statuses:
* `"UPDATING"`: the deployment is updating to meet the goal state set by a previous `deploy` request.
* `"HEALTHY"`: the deployment has reached the goal state set by the latest request.
* `"UNHEALTHY"`: the deployment has either failed to update, or it has updated and has become unhealthy afterwards. This may be due to an error in the deployment's constructor, a crashed replica, or a general system or machine error.
* `message`: provides context on the current status.

You can use the `serve status` command to inspect your deployments after they are deployed and throughout their lifetime.

Using the `fruit_config.yaml` example from [an earlier section](fruit-config-yaml):

```console
$ ray start --head
$ serve deploy fruit_config.yaml
...

$ serve status
app_status:
  status: RUNNING
  message: ''
  deployment_timestamp: 1655771534.835145
deployment_statuses:
- name: MangoStand
  status: HEALTHY
  message: ''
- name: OrangeStand
  status: HEALTHY
  message: ''
- name: PearStand
  status: HEALTHY
  message: ''
- name: FruitMarket
  status: HEALTHY
  message: ''
- name: DAGDriver
  status: HEALTHY
  message: ''
```
To inspect the status of the Serve application in production, see [Inspect an application](serve-in-production-inspecting).

You can also use `serve status` with KubeRay ({ref}`kuberay-index`), a Kubernetes operator for Ray Serve, to deploy your Serve applications with Kubernetes. Work is also in progress to integrate features from this document, like `serve status`, more closely with Kubernetes to provide a clearer Serve deployment story.

Make heavyweight code updates (like `runtime_env` changes) by starting a new Ray cluster, updating your Serve config file, and deploying the file with `serve deploy` to the new cluster. Once the new deployment is finished, switch your traffic to the new cluster.
2 changes: 1 addition & 1 deletion doc/source/serve/advanced-guides/dyn-req-batch.md
@@ -44,7 +44,7 @@ end-before: __batch_params_update_end__
---
```

Use these methods in the `reconfigure` [method](serve-in-production-reconfigure) to control the `@serve.batch` parameters through your Serve configuration file.
Use these methods in the `reconfigure` [method](serve-user-config) to control the `@serve.batch` parameters through your Serve configuration file.
:::

## Streaming batched requests
4 changes: 3 additions & 1 deletion doc/source/serve/advanced-guides/inplace-updates.md
@@ -14,8 +14,10 @@ Lightweight config updates modify running deployment replicas without tearing th
Lightweight config updates are only possible for deployments that are included as entries under `deployments` in the config file. If a deployment is not included in the config file, replicas of that deployment will be torn down and brought up again each time you redeploy with `serve deploy`.
:::

(serve-updating-user-config)=

## Updating User Config
Let's use the `FruitStand` deployment graph [from an earlier section](fruit-config-yaml) as an example. All the individual fruit deployments contain a `reconfigure()` method. This method allows us to issue lightweight updates to our deployments by updating the `user_config`.
Let's use the `FruitStand` deployment graph [from the production guide](fruit-config-yaml) as an example. All the individual fruit deployments contain a `reconfigure()` method. This method allows us to issue lightweight updates to our deployments by updating the `user_config`.

First let's deploy the graph. Make sure to stop any previous Ray cluster using the CLI command `ray stop` for this example:

139 changes: 139 additions & 0 deletions doc/source/serve/configure-serve-deployment.md
@@ -0,0 +1,139 @@
(serve-configure-deployment)=

# Configure Ray Serve deployments

The following parameters are configurable on a Ray Serve deployment. See also the [API reference](../serve/api/doc/ray.serve.deployment_decorator.rst).

Configure the following parameters either in the Serve config file, or on the `@serve.deployment` decorator:

- `name` - Name that uniquely identifies this deployment within the application. If not provided, the name of the class or function is used.
- `num_replicas` - Number of replicas that serve requests to this deployment. Defaults to 1.
- `route_prefix` - Requests to paths under this HTTP path prefix are routed to this deployment. Defaults to `/{name}`. This can only be set for the ingress (top-level) deployment of an application.
- `ray_actor_options` - Options to pass to the Ray Actor decorator, such as resource requirements. Valid options are `accelerator_type`, `memory`, `num_cpus`, `num_gpus`, `object_store_memory`, `resources`, and `runtime_env`. For more details, see [Resource management in Serve](serve-cpus-gpus).
- `max_concurrent_queries` - Maximum number of queries that are sent to a replica of this deployment without receiving a response. Defaults to 100. This may be an important parameter to configure for [performance tuning](serve-perf-tuning).
- `autoscaling_config` - Parameters to configure autoscaling behavior. If this is set, `num_replicas` cannot be set. For details on the configurable parameters, see [Ray Serve Autoscaling](ray-serve-autoscaling).
- `user_config` - Config to pass to the `reconfigure` method of the deployment. This can be updated dynamically without restarting the replicas of the deployment. The `user_config` must be fully JSON-serializable. For more details, see [Serve User Config](serve-user-config).
- `health_check_period_s` - Duration between health check calls for the replica. Defaults to 10s. The health check is by default a no-op Actor call to the replica, but you can define your own health check by adding a `check_health` method to your deployment that raises an exception when the replica is unhealthy.
- `health_check_timeout_s` - Duration in seconds that replicas wait for a health check method to return before considering it failed. Defaults to 30s.
- `graceful_shutdown_wait_loop_s` - Duration that replicas wait until there is no more work to be done before shutting down. Defaults to 2s.
- `graceful_shutdown_timeout_s` - Duration to wait for a replica to gracefully shut down before it is forcefully killed. Defaults to 20s.
- `is_driver_deployment` - [EXPERIMENTAL] When set, exactly one replica of this deployment runs on every node (like a daemon set).
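The custom health check mentioned for `health_check_period_s` can be sketched in plain Python. This is a minimal illustration of the contract, not real Serve code: `FakeConnection` and `MyModelReplica` are hypothetical stand-ins, and in a real deployment Serve itself calls `check_health` periodically.

```python
# Hypothetical stand-in for a dependency the replica relies on.
class FakeConnection:
    def __init__(self):
        self.connected = True

    def is_connected(self):
        return self.connected


# Sketch of a deployment class implementing the check_health contract:
# Serve periodically calls check_health(); raising any exception marks
# the replica unhealthy, while returning normally means healthy.
class MyModelReplica:
    def __init__(self):
        self.db = FakeConnection()

    def check_health(self):
        if not self.db.is_connected():
            raise RuntimeError("Lost connection to the database")


replica = MyModelReplica()
replica.check_health()  # healthy: returns without raising

replica.db.connected = False
try:
    replica.check_health()  # raises, so Serve would mark the replica unhealthy
except RuntimeError as err:
    print(err)
```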

There are three ways to specify parameters:

- In the `@serve.deployment` decorator:

```{literalinclude} ../serve/doc_code/configure_serve_deployment/model_deployment.py
:start-after: __deployment_start__
:end-before: __deployment_end__
:language: python
```

- Through `.options()`:

```{literalinclude} ../serve/doc_code/configure_serve_deployment/model_deployment.py
:start-after: __deployment_end__
:end-before: __options_end__
:language: python
```

- Using the YAML [Serve Config file](serve-in-production-config-file):

```yaml
applications:
- name: app1
  route_prefix: /
  import_path: configure_serve:translator_app
  runtime_env: {}
  deployments:
  - name: Translator
    num_replicas: 2
    max_concurrent_queries: 100
    graceful_shutdown_wait_loop_s: 2.0
    graceful_shutdown_timeout_s: 20.0
    health_check_period_s: 10.0
    health_check_timeout_s: 30.0
    ray_actor_options:
      num_cpus: 0.2
      num_gpus: 0.0
```

## Overriding deployment settings

The order of priority is (from highest to lowest):

1. Serve Config file
2. `.options()` call in Python code
3. `@serve.deployment` decorator in Python code
4. Serve defaults

For example, if a deployment's `num_replicas` is specified in both the config file and the graph code, Serve uses the config file's value. If it's only specified in the code, Serve uses the code value. If the user doesn't specify it anywhere, Serve uses the default, `num_replicas=1`.
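One way to picture this resolution is the following plain-Python sketch. It is illustrative only, not Serve internals; `resolve_param` and the `_UNSET` sentinel are made-up names for this example.

```python
_UNSET = object()  # sentinel meaning "the user never specified this"


def resolve_param(config_file=_UNSET, options_call=_UNSET,
                  decorator=_UNSET, default=None):
    """Return the effective value of one deployment parameter,
    checking sources from highest to lowest priority."""
    for value in (config_file, options_call, decorator):
        if value is not _UNSET:
            return value
    return default


# num_replicas set in both the config file and the decorator: config wins.
print(resolve_param(config_file=5, decorator=2, default=1))  # 5

# max_concurrent_queries set only in the decorator: the code value is used.
print(resolve_param(decorator=15, default=100))  # 15

# Unspecified everywhere: the Serve default applies.
print(resolve_param(default=1))  # 1
```

Note that the resolution runs once per parameter, which is why a value set only in the decorator survives even when other parameters are overridden by the config file.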

Keep in mind that this override order is applied separately to each individual parameter.
For example, if a user has a deployment `ExampleDeployment` with the following decorator:

```python
@serve.deployment(
num_replicas=2,
max_concurrent_queries=15,
)
class ExampleDeployment:
...
```

and the following config file:

```yaml
...
deployments:
- name: ExampleDeployment
  num_replicas: 5
...
```

Serve sets `num_replicas=5`, using the config file value, and `max_concurrent_queries=15`, using the code value (because `max_concurrent_queries` wasn't specified in the config file). All other deployment settings use Serve defaults because the user didn't specify them in the code or the config.

:::{tip}
Remember that `ray_actor_options` counts as a single setting. The entire `ray_actor_options` dictionary in the config file overrides the entire `ray_actor_options` dictionary from the graph code. If there are individual options within `ray_actor_options` (e.g. `runtime_env`, `num_gpus`, `memory`) that are set in the code but not in the config, Serve still won't use the code settings if the config has a `ray_actor_options` dictionary. It treats these missing options as though the user never set them and uses defaults instead. This dictionary overriding behavior also applies to `user_config` and `autoscaling_config`.
:::
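The wholesale replacement described in the tip can be illustrated in plain Python. The dictionaries below are hypothetical example values, not a real deployment's options.

```python
# Options set in the @serve.deployment decorator (hypothetical values).
decorator_opts = {"num_cpus": 2, "num_gpus": 1}

# ray_actor_options from the config file, which only sets num_cpus.
config_file_opts = {"num_cpus": 0.5}

# What Serve does: the config file's dictionary replaces the code's
# dictionary entirely, so num_gpus falls back to its default.
effective = config_file_opts
print(effective)  # {'num_cpus': 0.5}

# What Serve does NOT do: merge the dictionaries key by key, which
# would have kept num_gpus from the code.
merged = {**decorator_opts, **config_file_opts}
print(merged)
```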

(serve-user-config)=
## Dynamically changing parameters without restarting your replicas (`user_config`)

You can use the `user_config` field to supply structured configuration for your deployment. You can pass arbitrary JSON-serializable objects to the YAML configuration. Serve then applies the configuration to all running and future replicas of the deployment. Applying the user configuration *does not* restart the replica, so you can use this field to dynamically:
- adjust model weights and versions without restarting the cluster.
- adjust the traffic splitting percentage for your model composition graph.
- configure feature flags, A/B tests, and hyperparameters for your deployments.

To enable the `user_config` feature, you need to implement a `reconfigure` method that takes a JSON-serializable object (e.g., a dictionary, list, or string) as its only argument:

```python
from typing import Any, Dict

from ray import serve


@serve.deployment
class Model:
    def reconfigure(self, config: Dict[str, Any]):
        self.threshold = config["threshold"]
```

If the `user_config` is set when the deployment is created (e.g., in the decorator or the Serve config file), this `reconfigure` method is called right after the deployment's `__init__` method, and the `user_config` is passed in as an argument. You can also trigger the `reconfigure` method by updating your Serve config file with a new `user_config` and reapplying it to your Ray cluster. See [In-place Updates](serve-inplace-updates) for more information.

The corresponding YAML snippet is:

```yaml
...
deployments:
- name: Model
  user_config:
    threshold: 1.5
```
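The lifecycle described above can be mimicked in plain Python. This is a sketch with no Serve APIs: in a real deployment, Serve itself calls `reconfigure` right after `__init__` and again on each `user_config` update, without recreating the replica.

```python
from typing import Any, Dict


class Model:
    def __init__(self, user_config: Dict[str, Any]):
        self.threshold = None
        # Serve calls reconfigure right after __init__ when
        # user_config is set; this stand-in does it explicitly.
        self.reconfigure(user_config)

    def reconfigure(self, config: Dict[str, Any]):
        self.threshold = config["threshold"]


replica = Model({"threshold": 1.5})
print(replica.threshold)  # 1.5

# Reapplying the config file with a new user_config updates the
# same replica object in place -- no restart.
replica.reconfigure({"threshold": 2.0})
print(replica.threshold)  # 2.0
```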


