Improve Analyze Performance and Stability #41930

Open
5 of 8 tasks
xuyifangreeneyes opened this issue Mar 3, 2023 · 0 comments
Labels: component/statistics, sig/planner, type/enhancement

xuyifangreeneyes commented Mar 3, 2023

Enhancement

Currently, when we use the analyze command to collect statistics, we run into several problems, especially for large tables:

  • Analyze is slow. Since analyze needs to scan the full table, the job may take hours or even days to finish for large tables.
  • Analyze may consume a lot of resources. Some users increase concurrency (for example, tidb_build_stats_concurrency and tidb_distsql_scan_concurrency) to speed up analyze. However, that can consume a lot of CPU, memory, and I/O on TiKV (when scanning the table and sampling) and a lot of CPU and memory on TiDB (when merging samples and building stats).
  • When the table has many columns, or some columns hold large values (such as text/blob/json columns), the samples may take up a lot of memory. While merging samples and building stats, TiDB may OOM, or analyze may be killed by the global memory control mechanism. Maybe we can skip collecting statistics for columns whose stats are barely used, such as json columns.
  • The execution of analyze is not fault-tolerant. If one analyze request to some region fails (for example, because the region is unavailable, or for other reasons), the whole analyze job fails and has to be rerun from the very beginning, which is unfriendly to users.
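
To make the last point concrete, here is a minimal Python sketch of what region-level fault tolerance could look like: results from regions that succeeded are kept, and only the failed regions are retried. This is an illustration only, not TiDB's implementation. The names `analyze_regions`, `scan_region`, and `max_attempts`, and the use of `ConnectionError` to stand for "region unavailable", are all assumptions made for this example.

```python
from typing import Callable, Dict, Iterable, List


def analyze_regions(
    regions: Iterable[str],
    scan_region: Callable[[str], List[int]],  # hypothetical per-region scan
    max_attempts: int = 3,                    # hypothetical retry budget
) -> Dict[str, List[int]]:
    """Collect per-region samples, retrying only the regions that failed."""
    results: Dict[str, List[int]] = {}
    pending = list(regions)
    for _ in range(max_attempts):
        failed = []
        for region in pending:
            try:
                # A successful region is never scanned again.
                results[region] = scan_region(region)
            except ConnectionError:
                # Assumed stand-in for "region unavailable".
                failed.append(region)
        if not failed:
            return results
        pending = failed
    raise RuntimeError(
        f"regions still failing after {max_attempts} attempts: {pending}"
    )
```

With this shape, a transient failure in one region costs one extra request for that region rather than a rescan of the whole table.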

Here is the related issue in the tikv repo:
tikv/tikv#14231

Tasks

Use faster murmur3 hash function for FMSketch calculation

Reduce encoding cost

Avoid FMSketch calculation for single-column index

Sample-based NDV calculation
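
For context on the FMSketch tasks above, here is a minimal Python sketch of the general technique: every value is hashed, only hashes whose low bits pass a mask are retained, and the mask widens whenever the retained set grows past a cap. It shows why hashing cost falls on every scanned value. This is not TiDB's code. The class layout and `max_size` are assumptions for illustration, and `hashlib.blake2b` is a standard-library stand-in for murmur3, which Python's standard library does not provide.

```python
import hashlib


class FMSketch:
    """Distinct-count sketch that retains only hashes passing a low-bit mask."""

    def __init__(self, max_size: int) -> None:
        self.max_size = max_size  # cap on the number of retained hashes
        self.mask = 0             # 0 means every hash passes the filter
        self.hashes = set()

    @staticmethod
    def _hash64(value: bytes) -> int:
        # Stand-in 64-bit hash (assumption); the task above proposes murmur3.
        digest = hashlib.blake2b(value, digest_size=8).digest()
        return int.from_bytes(digest, "little")

    def insert(self, value: bytes) -> None:
        h = self._hash64(value)
        if h & self.mask != 0:
            return  # filtered out by the current mask
        self.hashes.add(h)
        while len(self.hashes) > self.max_size:
            # Require one more zero low bit, then evict hashes that fail.
            self.mask = self.mask * 2 + 1
            self.hashes = {x for x in self.hashes if x & self.mask == 0}

    def ndv(self) -> int:
        # Each retained hash represents (mask + 1) distinct values on average.
        return (self.mask + 1) * len(self.hashes)


sketch = FMSketch(max_size=64)
for i in range(10_000):
    sketch.insert(str(i).encode())
print(sketch.ndv())  # an estimate of the 10,000 distinct values, not an exact count
```

Because `insert` hashes every value it sees, a cheaper hash function, or skipping the sketch where it is not needed, reduces work proportional to the number of rows scanned.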
