diff --git a/packages/web/docs/src/pages/docs/gateway/other-features/_meta.ts b/packages/web/docs/src/pages/docs/gateway/other-features/_meta.ts index 6c6dccc646..ca1337d76c 100644 --- a/packages/web/docs/src/pages/docs/gateway/other-features/_meta.ts +++ b/packages/web/docs/src/pages/docs/gateway/other-features/_meta.ts @@ -1,5 +1,6 @@ export default { index: 'Overview', + 'upstream-reliability': 'Upstream Reliability', performance: 'Performance/Cache', security: 'Security', 'header-propagation': 'Header Propagation', diff --git a/packages/web/docs/src/pages/docs/gateway/other-features/upstream-reliability.mdx b/packages/web/docs/src/pages/docs/gateway/other-features/upstream-reliability.mdx new file mode 100644 index 0000000000..110591a4f8 --- /dev/null +++ b/packages/web/docs/src/pages/docs/gateway/other-features/upstream-reliability.mdx @@ -0,0 +1,212 @@ +--- +description: Improve reliability of Hive Gateway with timeout and retry policy for upstream requests +--- + +# Upstream reliability + +Hive Gateway offers Retry and Timeout features to improve the reliability of the traffic between the +gateway and the subgraphs. + +## Retry mechanism + +A large range of errors happening between the gateway and the upstream subgraphs are temporary +errors. Which means a lot of errors can be solved by issuing the same request again after a short +delay. + +The Retry feature allows to automatically retry failed upstream requests, improving the ratio of +successful responses from the client point of view. + +```mermaid +sequenceDiagram + Client->>Gateway: query { field } + Gateway->>Subgraph: query { field } + Subgraph->>Gateway: 503 Service Unavailable + Note over Gateway: Wait for a short delay before trying again + Gateway->>Subgraph: query { field } + Subgraph->>Gateway: 503 Service Unavailable + Note over Gateway: Wait again for short a delay + Gateway->>Subgraph: query { field } + Subgraph->>Gateway: { data: { field: "value" } } + Gateway->>Client: { data: { field: "value" } } +``` + +### Global configuration + +This feature can be enabled for all subgraphs at once by using `upstreamRetry` option. + +The only mandatory option to provide is the number of maximum retries allowed. + +Once enabled, any upstream request execution failing with an exception will be retried, with a delay +of 1 second between each try. + +If the upstream is using the `fetch`, the request will be retried only if the status code is `429`, +`5XX`, or if the `Retry-After` header is present in the response. If the `Retry-After` header is +present, the delay between retries will respect it, otherwise it defaults to 1 second. + +```ts filename="gateway.config.ts" +import { defineConfig } from '@graphql-hive/gateway' + +export const gatewayConfig = defineConfig({ + upstreamRetry: { + maxRetries: 3 + } +}) +``` + +### Per-subgraph configuration + +Hive Gateway lets you define different retry policy for each subgraph by providing a function to the +`upstreamRetry` option + +This function will be called for each upstream request, which allows you to have full control of the +retry policy. + +```ts filename="gateway.config.ts" +import { defineConfig } from '@graphql-hive/gateway' + +export const gatewayConfig = defineConfig({ + upstreamRetry: ({ subgraphName }) => { + if (subgraphName == 'very-unreliable-subgraph') { + return { + maxRetries: 10 + } + } + + return { + maxRetries: 3 + } + } +}) +``` + +### Retry delay configuration + +The default delay to wait between tries. + +If the subgraph is using HTTP protocol, the delay is replaced by the response's `Retry-After` header +if present. + +The `Retry-After` can either contain a number of second or a date. + +```ts filename="gateway.config.ts" +import { defineConfig } from '@graphql-hive/gateway' + +export const gatewayConfig = defineConfig({ + upstreamRetry: { + maxRetries: 3, + retryDelay: 1000 // delay in milliseconds + } +}) +``` + +### Exponential backoff configuration + +To avoid adding pressure on a subgraph that is already potentially struggling with load, a standard +method is to use an exponential backoff. + +A backoff consist of increasing the delay between tries after each try, to give more time to the +server to recover from the error state. + +An exponential backoff consist of increasing the delay exponentially after each tries, by +multiplying the delay by a given rate instead of increasing it by a fixed amount. This is the +industry standard backoff strategy, because it is the most effective way to release the pressure on +the upstream service without entirely giving up from receiving a response. + +Hive Gateway implements an exponential backoff by default. You can change the backoff rate using +`retryDelayFactor` option. You can use a factor of `1` to disable backoff and have a fix retry +delay. + +```ts filename="gateway.config.ts" +import { defineConfig } from '@graphql-hive/gateway' + +export const gatewayConfig = defineConfig({ + upstreamRetry: { + maxRetries: 3, + retryDelayFactor: 1.25 // default value. Use a value of 1 to disable backoff. + } +}) +``` + +### Conditional retry logic + +Hive Gateway offer the possibility to skip retrying a specific upstream request, based on its +context or returned value. + +For example, you can make Hive Gateway always retry any response with an error HTTP status code (>= +400). + +```ts filename="gateway.config.ts" +import { defineConfig } from '@graphql-hive/gateway' + +export const gatewayConfig = defineConfig({ + upstreamRetry: { + maxRetries: 3, + shouldRetry: ({ response }) => response?.status >= 400 // Always retry any HTTP errors + } +}) +``` + +## Timeout management + +An upstream request timeout limit is the maximum time the gateway waits for a response from an +upstream subgraph before aborting the request. This ensures the gateway remains responsive and +prevents it from waiting indefinitely for a slow or unresponsive subgraphs. + +### Global configuration + +Hive Gateway allows to configure a global timeout for all subgraphs at once. + +```ts filename="gateway.config.ts" +import { defineConfig } from '@graphql-hive/gateway' + +export const gatewayConfig = defineConfig({ + // The maximum time in milliseconds to wait for a response from the upstream. + upstreamTimeout: 1000 +}) +``` + +### Per-subgraph configuration + +You can finely grain configure the timeout for each upstream execution request by providing a +function instead of a number. + +This function is called for each upstream request, allowing for complete control over the timeout +strategy. + +```ts filename="gateway.config.ts" +import { defineConfig } from '@graphql-hive/gateway' + +export const gatewayConfig = defineConfig({ + upstreamTimeout: ({ subgraphName }) => { + if (subgraphName == 'very-slow-subgraph') { + return 60_000 // Allow for longer request for this very slow server + } + return 30_000 + } +}) +``` + +## Combining retry and timeout features + +If Retry and Timeout are both enabled at the same time (which is recommended), the timeout will be +applied to each try. + +It also means that a timed-out request will be retried, until the maximum retry count is reached, +respecting the retry delay between tries. + +If you prefer to have a timeout spanning over the retries, to ensure a maximum execution time for an +execution request, you can manually add the plugin to your configuration. + +```ts filename="gateway.config.ts" +import { defineConfig, useUpstreamRetry, useUpstreamTimeout } from '@graphql-hive/gateway' + +export const gatewayConfig = defineConfig({ + plugins: () => [ + useUpstreamRetry({ + maxRetries: 10 + retryDelay: 1_000, + }), + useUpstreamTimeout(10_000) + ] +}) +```