From 791291a5d11c092ef3be390b6153a60ecd0a36b2 Mon Sep 17 00:00:00 2001
From: Michele Pangrazzi
Date: Thu, 5 Dec 2024 12:47:54 +0100
Subject: [PATCH 1/6] Draft deployment guidelines

---
 docs/deployment_guidelines.md | 69 +++++++++++++++++++++++++++++++++++
 1 file changed, 69 insertions(+)
 create mode 100644 docs/deployment_guidelines.md

diff --git a/docs/deployment_guidelines.md b/docs/deployment_guidelines.md
new file mode 100644
index 0000000..7230c6d
--- /dev/null
+++ b/docs/deployment_guidelines.md
@@ -0,0 +1,69 @@
+# Hayhooks deployment guidelines
+
+This document describes how to deploy Hayhooks in a production environment.
+Since Hayhooks is a FastAPI application, it can be deployed in a variety of ways as well described in [its documentation](https://fastapi.tiangolo.com/deployment/concepts/?h=deploy).
+
+Following are some guidelines about deploying and running Haystack pipelines.
+
+## TL;DR
+
+- Use a single worker environment if you have mainly I/O operations in your pipeline and/or a low number of concurrent requests.
+- Use a multi-worker environment if you have mainly CPU-bound operations in your pipeline and/or a high number of concurrent requests.
+- In any case, use `HAYHOOKS_PIPELINES_DIR` to share pipeline definitions across workers (if possible).
+
+## Single worker environment
+
+In a single worker environment, you typically run the application using:
+
+```bash
+hayhooks run
+```
+
+command (or having a single Docker container running). This will launch a **single `uvicorn` worker** to serve the application.
+
+### Pipelines deployment (single worker)
+
+You can deploy a pipeline using:
+
+```bash
+hayhooks deploy
+```
+
+command or do a `POST /deploy` request.
+
+### Handling concurrent requests (single worker)
+
+The `run()` method of the pipeline instance is synchronous code, and it's executed using `run_in_threadpool` to avoid blocking the main async event loop.
+
+- If your pipeline is doing **mainly I/O operations** (like making HTTP requests, reading/writing files, etc.), the single worker should be able to handle concurrent requests.
+- If your pipeline is doing **mainly CPU-bound operations** (like computating embeddings), the GIL (Global Interpreter Lock) will prevent the worker from handling concurrent requests, so they will be queued.
+
+## Multiple workers environment
+
+### Single instance with multiple workers
+
+Currently, `hayhooks run` command does not support multiple `uvicorn` workers. However, you can run multiple instances of the application using directly the `uvicorn` command or [FastAPI CLI](https://fastapi.tiangolo.com/fastapi-cli/#fastapi-run) using `fastapi run` command.
+
+For example, if you enough cores to run 4 workers, you can use the following command:
+
+```bash
+fastapi run src/hayhooks/server/app.py --workers 4
+```
+
+This vertical scaling approach allows you to handle more concurrent requests (according the available resources).
+
+### Multiple single-worker instances behind a load balancer
+
+In a multi-worker environment (for example on a Kubernetes `Deployment`) you typically have a `LoadBalancer` Service which distributes the traffic to a number of `Pod`s running the application (using `hayhooks run` command).
+
+This horizontal scaling approach allows you to handle more concurrent requests.
+
+### Pipeline deployment (multiple workers)
+
+In both the above scenarios, **it's NOT recommended** to deploy a pipeline using the `hayhooks deploy` command (or `POST /deploy` request) as it will deploy the pipeline only on one of the workers, which is not ideal.
+
+Instead, you want to provide the env var `HAYHOOKS_PIPELINES_DIR` pointing to a shared folder where all the workers can read the pipeline definitions at startup and load them. This way, all the workers will have the same pipelines available and there will be no issues when calling the API to run a pipeline.
+
+### Handling concurrent requests (multiple workers)
+
+When having multiple workers and pipelines deployed using `HAYHOOKS_PIPELINES_DIR`, you will be able to handle concurrent requests as each worker will be able to run a pipeline independently. This should be enough to make your application scalable, according to your needs.

From 0cb6de504cbfb6b2403831120be6f228a9f691a9 Mon Sep 17 00:00:00 2001
From: Michele Pangrazzi
Date: Thu, 5 Dec 2024 14:12:06 +0100
Subject: [PATCH 2/6] Add note

---
 docs/deployment_guidelines.md | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/docs/deployment_guidelines.md b/docs/deployment_guidelines.md
index 7230c6d..41a8086 100644
--- a/docs/deployment_guidelines.md
+++ b/docs/deployment_guidelines.md
@@ -67,3 +67,5 @@ Instead, you want to provide the env var `HAYHOOKS_PIPELINES_DIR` pointing to a
 ### Handling concurrent requests (multiple workers)
 
 When having multiple workers and pipelines deployed using `HAYHOOKS_PIPELINES_DIR`, you will be able to handle concurrent requests as each worker will be able to run a pipeline independently. This should be enough to make your application scalable, according to your needs.
+
+Note that even in a multiple-workers environment the individual single workers will have the same GIL limitation discussed above, so if your pipeline is mainly CPU-bound, you will need to scale horizontally according to your needs.

From b848684d2d73f982648b2824ce60be542f71b6cc Mon Sep 17 00:00:00 2001
From: Michele Pangrazzi
Date: Fri, 6 Dec 2024 15:49:33 +0100
Subject: [PATCH 3/6] Update docs/deployment_guidelines.md

Co-authored-by: Julian Risch
---
 docs/deployment_guidelines.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/deployment_guidelines.md b/docs/deployment_guidelines.md
index 41a8086..f74d7af 100644
--- a/docs/deployment_guidelines.md
+++ b/docs/deployment_guidelines.md
@@ -36,7 +36,7 @@ command or do a `POST /deploy` request.
 The `run()` method of the pipeline instance is synchronous code, and it's executed using `run_in_threadpool` to avoid blocking the main async event loop.
 
 - If your pipeline is doing **mainly I/O operations** (like making HTTP requests, reading/writing files, etc.), the single worker should be able to handle concurrent requests.
-- If your pipeline is doing **mainly CPU-bound operations** (like computating embeddings), the GIL (Global Interpreter Lock) will prevent the worker from handling concurrent requests, so they will be queued.
+- If your pipeline is doing **mainly CPU-bound operations** (like computing embeddings), the GIL (Global Interpreter Lock) will prevent the worker from handling concurrent requests, so they will be queued.
 
 ## Multiple workers environment
 

From 45f8b7deac9e898920301107f316305cfe5733f1 Mon Sep 17 00:00:00 2001
From: Michele Pangrazzi
Date: Fri, 6 Dec 2024 15:49:39 +0100
Subject: [PATCH 4/6] Update docs/deployment_guidelines.md

Co-authored-by: Julian Risch
---
 docs/deployment_guidelines.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/deployment_guidelines.md b/docs/deployment_guidelines.md
index f74d7af..7d2f7a7 100644
--- a/docs/deployment_guidelines.md
+++ b/docs/deployment_guidelines.md
@@ -44,7 +44,7 @@ The `run()` method of the pipeline instance is synchronous code, and it's execut
 
 Currently, `hayhooks run` command does not support multiple `uvicorn` workers. However, you can run multiple instances of the application using directly the `uvicorn` command or [FastAPI CLI](https://fastapi.tiangolo.com/fastapi-cli/#fastapi-run) using `fastapi run` command.
 
-For example, if you enough cores to run 4 workers, you can use the following command:
+For example, if you have enough cores to run 4 workers, you can use the following command:
 
 ```bash
 fastapi run src/hayhooks/server/app.py --workers 4

From 8bf7e39d3b246b510bbe6de41b1564319a1f5b1c Mon Sep 17 00:00:00 2001
From: Michele Pangrazzi
Date: Fri, 6 Dec 2024 15:49:45 +0100
Subject: [PATCH 5/6] Update docs/deployment_guidelines.md

Co-authored-by: Julian Risch
---
 docs/deployment_guidelines.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/deployment_guidelines.md b/docs/deployment_guidelines.md
index 7d2f7a7..116f858 100644
--- a/docs/deployment_guidelines.md
+++ b/docs/deployment_guidelines.md
@@ -50,7 +50,7 @@ For example, if you have enough cores to run 4 workers, you can use the followin
 fastapi run src/hayhooks/server/app.py --workers 4
 ```
 
-This vertical scaling approach allows you to handle more concurrent requests (according the available resources).
+This vertical scaling approach allows you to handle more concurrent requests (depending on available resources).
 
 ### Multiple single-worker instances behind a load balancer
 

From e75099efa2a1c28c7ac1fde72cf4a367528fc061 Mon Sep 17 00:00:00 2001
From: Michele Pangrazzi
Date: Fri, 6 Dec 2024 15:50:43 +0100
Subject: [PATCH 6/6] Update docs/deployment_guidelines.md

Co-authored-by: Julian Risch
---
 docs/deployment_guidelines.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/deployment_guidelines.md b/docs/deployment_guidelines.md
index 116f858..cf8698d 100644
--- a/docs/deployment_guidelines.md
+++ b/docs/deployment_guidelines.md
@@ -54,7 +54,7 @@ This vertical scaling approach allows you to handle more concurrent requests (de
 
 ### Multiple single-worker instances behind a load balancer
 
-In a multi-worker environment (for example on a Kubernetes `Deployment`) you typically have a `LoadBalancer` Service which distributes the traffic to a number of `Pod`s running the application (using `hayhooks run` command).
+In a multi-worker environment (for example on a Kubernetes `Deployment`) you typically have a `LoadBalancer` Service, which distributes the traffic to a number of `Pod`s running the application (using `hayhooks run` command).
 
 This horizontal scaling approach allows you to handle more concurrent requests.
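
Illustration for the "Handling concurrent requests (single worker)" section introduced by this series: a minimal sketch, not Hayhooks code, of why a synchronous, I/O-bound `run()` offloaded to a thread pool can still serve overlapping requests. It assumes Python 3.9+ and uses the standard library's `asyncio.to_thread` as a stand-in for `run_in_threadpool`; `fake_pipeline_run`, `handle_request`, and `serve_concurrently` are hypothetical names, and `time.sleep` only simulates a blocking I/O wait.

```python
import asyncio
import time


def fake_pipeline_run(delay: float) -> str:
    """Hypothetical stand-in for a synchronous, I/O-bound pipeline run()."""
    time.sleep(delay)  # blocking wait; like real I/O, sleeping releases the GIL
    return "done"


async def handle_request(delay: float) -> str:
    # Offload the blocking call to a worker thread so the event loop stays free
    # (asyncio.to_thread is used here in place of run_in_threadpool).
    return await asyncio.to_thread(fake_pipeline_run, delay)


async def serve_concurrently(n_requests: int, delay: float):
    # Fire n_requests at once and measure the total wall-clock time.
    start = time.perf_counter()
    results = await asyncio.gather(
        *(handle_request(delay) for _ in range(n_requests))
    )
    return list(results), time.perf_counter() - start


# Four simulated requests, each blocking for 0.2 s.
results, elapsed = asyncio.run(serve_concurrently(4, 0.2))
print(results, f"{elapsed:.2f}s")
```

The four simulated requests should finish in roughly the time of one, because sleeping threads release the GIL. On a standard GIL-enabled CPython build, swapping the sleep for a pure-Python CPU-bound loop would be expected to push the total close to the sum of the individual durations, which is the queuing effect the guide describes.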