[Meta][Metricbeat] - Collect additional Elasticsearch node metrics for enhanced dashboards #42131

Open
2 of 5 tasks
VimCommando opened this issue Dec 20, 2024 · 7 comments

@VimCommando

VimCommando commented Dec 20, 2024

Metricbeat (as of 8.15.3) used for stack monitoring collection is still missing some helpful metrics for building comprehensive monitoring dashboards.

Here is a potential list of metrics to include from _nodes/stats:

jvm.threads.count
http.total_opened
process.open_file_descriptors
process.mem.total_virtual_in_bytes

transport.rx_count
transport.rx_size_in_bytes
transport.tx_count
transport.tx_size_in_bytes

ingest.total.count
ingest.total.time_in_millis
ingest.total.failed

indices.fielddata.evictions
indices.get.time_in_millis
indices.get.total
indices.merges.total
indices.merges.total_time_in_millis
indices.search.fetch_time_in_millis
indices.search.fetch_total
indices.search.query_time_in_millis
indices.search.query_total
indices.translog.operations
indices.translog.size_in_bytes

thread_pool.esql_worker.active
thread_pool.esql_worker.queue
thread_pool.esql_worker.rejected
thread_pool.flush.active
thread_pool.flush.queue
thread_pool.flush.rejected
thread_pool.force_merge.active
thread_pool.get.active
thread_pool.search.active
thread_pool.write.active
thread_pool.search_worker.active
thread_pool.search_worker.queue
thread_pool.search_worker.rejected
thread_pool.snapshot.active
thread_pool.snapshot.queue
thread_pool.snapshot.rejected
thread_pool.system_read.active
thread_pool.system_read.queue
thread_pool.system_read.rejected
thread_pool.system_write.active
thread_pool.system_write.queue
thread_pool.system_write.rejected

Some newer features, such as ES|QL (esql_worker) and intra-segment search parallelism (search_worker), were introduced in 8.x, and Metricbeat monitoring isn't capturing the relevant thread pools yet.
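
To make the dotted names above concrete, here is a minimal sketch of how such paths could be resolved against one node's entry in a node stats response. The `lookup` helper and all sample values are hypothetical and only illustrate the shape of the data; this is not how Metricbeat implements collection.

```python
from typing import Any, Optional


def lookup(stats: dict, dotted_path: str) -> Optional[Any]:
    """Resolve a dotted path such as 'thread_pool.search.active'.

    Returns None when any segment is missing, for example when the
    node does not expose that thread pool.
    """
    node: Any = stats
    for key in dotted_path.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


# Hypothetical, trimmed-down stats for a single node (values invented).
sample_node_stats = {
    "jvm": {"threads": {"count": 120}},
    "thread_pool": {
        "search": {"active": 3, "queue": 0, "rejected": 0},
        "write": {"active": 1, "queue": 4, "rejected": 0},
    },
}

wanted = [
    "jvm.threads.count",
    "thread_pool.search.active",
    "thread_pool.esql_worker.active",  # absent in this sample
]
collected = {path: lookup(sample_node_stats, path) for path in wanted}
```

Paths that are missing resolve to `None` rather than raising, so a collector built this way could skip thread pools that a given node version does not report.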

The average service time can also be helpful, for example the write time per document or query time per search. This is usually just a simple division like indices.write.time_in_millis / indices.write.total, but if it is calculated at ingest time, it is possible to sort by this metric in visualizations.
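
As a rough sketch of the division described above: node stats counters are cumulative since node start, so dividing the raw values gives a lifetime average. A per-interval average would subtract two consecutive samples first. The function name and the sample numbers below are hypothetical.

```python
from typing import Optional


def interval_average(
    prev_time_ms: int, prev_total: int, curr_time_ms: int, curr_total: int
) -> Optional[float]:
    """Average milliseconds per operation between two samples.

    Returns None when no operations happened in the interval, or when
    the counters went backwards (for example after a node restart).
    """
    ops = curr_total - prev_total
    elapsed_ms = curr_time_ms - prev_time_ms
    if ops <= 0 or elapsed_ms < 0:
        return None
    return elapsed_ms / ops


# Two hypothetical samples of a (time_in_millis, total) counter pair.
avg_ms = interval_average(
    prev_time_ms=1000, prev_total=100, curr_time_ms=1600, curr_total=300
)
```

Storing the result as its own field at collection time is what would make it sortable in a visualization.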

Tasks

@VimCommando
Author

The indices stats don't currently capture all ES|QL activity: elastic/elasticsearch#109673

@consulthys
Contributor

@VimCommando I've started tackling this to make sure it goes into 8.18.
I can see that some of the fields above are already in, for instance:

@consulthys
Contributor

> Modify the elasticsearch-2* index templates

Regarding the above task, the production and QA templates will need to be updated by the Control Plane team when they decide to upgrade their internal MB to 8.18, so this is not a prerequisite for the 8.18 FF.

@consulthys
Contributor

> The average service time can also be helpful, for example the write time per document or query time per search. This is usually just a simple division like indices.write.time_in_millis / indices.write.total, but if it is calculated at ingest time, it is possible to sort by this metric in visualizations.

@VimCommando can you list which ratios you'd like to have pre-computed?

@VimCommando
Author

> @VimCommando I've started tackling this to make sure it goes into 8.18. I can see that some of the fields above are already in, for instance:

Yes, those are correct. I may have included them based on the dashboard I was looking at, not the code.

> @VimCommando can you list which ratios you'd like to have pre-computed?

Each of these totals has a corresponding *_time_in_millis field; dividing that time by the total gives the average:

indices.flush.total
indices.get.total
indices.indexing.index_total
indices.merges.total
indices.refresh.total
indices.search.fetch_total
indices.search.query_total
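
A minimal sketch of how these averages could be derived from the `indices` object of a node's stats. The pairing of each total with its time field follows the node stats field names; the `RATIOS` table, the output key names, and the sample values are hypothetical.

```python
from typing import Dict, Optional, Tuple

# output name -> (section, total field, time field) inside `indices`
RATIOS: Dict[str, Tuple[str, str, str]] = {
    "flush": ("flush", "total", "total_time_in_millis"),
    "get": ("get", "total", "time_in_millis"),
    "index": ("indexing", "index_total", "index_time_in_millis"),
    "merge": ("merges", "total", "total_time_in_millis"),
    "refresh": ("refresh", "total", "total_time_in_millis"),
    "fetch": ("search", "fetch_total", "fetch_time_in_millis"),
    "query": ("search", "query_total", "query_time_in_millis"),
}


def average_times(indices: dict) -> Dict[str, Optional[float]]:
    """Return average milliseconds per operation for each ratio.

    A ratio is None when the counters are missing or the total is zero.
    """
    averages: Dict[str, Optional[float]] = {}
    for name, (section, total_key, time_key) in RATIOS.items():
        stats = indices.get(section, {})
        total = stats.get(total_key, 0)
        time_ms = stats.get(time_key)
        if total and time_ms is not None:
            averages[name] = time_ms / total
        else:
            averages[name] = None
    return averages


# Hypothetical sample (values invented for illustration).
sample_indices = {
    "get": {"total": 200, "time_in_millis": 50},
    "search": {
        "query_total": 10,
        "query_time_in_millis": 250,
        "fetch_total": 0,
        "fetch_time_in_millis": 0,
    },
}
averages = average_times(sample_indices)
```

Returning `None` for a zero total avoids a division by zero on idle nodes and keeps "no data" distinct from a genuine average of 0 ms.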

If we are not already capturing the bulk metrics, I'd add those too. The API already reports avg_time_in_millis and avg_size_in_bytes:

        "bulk": {
          "total_operations": 2456026837,
          "total_time_in_millis": 4086790047,
          "total_size_in_bytes": 57411051045801,
          "avg_time_in_millis": 0,
          "avg_size_in_bytes": 25162
        },

The indices.bulk.total_size_in_bytes is incredibly useful when trying to understand the raw, uncompressed ingest volume.
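
For illustration, the totals in the sample above can be turned into lifetime averages and an overall ingest volume. This is only a sketch using the numbers quoted in this comment; the variable names are hypothetical, and the derived values need not match the avg_* fields that Elasticsearch reports itself.

```python
# Bulk stats copied from the sample above.
bulk = {
    "total_operations": 2456026837,
    "total_time_in_millis": 4086790047,
    "total_size_in_bytes": 57411051045801,
    "avg_time_in_millis": 0,
    "avg_size_in_bytes": 25162,
}

# Lifetime averages derived from the cumulative totals.
avg_time_ms = bulk["total_time_in_millis"] / bulk["total_operations"]
avg_size_bytes = bulk["total_size_in_bytes"] / bulk["total_operations"]

# Raw, uncompressed ingest volume since the counters started, in TiB.
total_tib = bulk["total_size_in_bytes"] / 1024**4

print(f"avg time per bulk operation: {avg_time_ms:.3f} ms")
print(f"avg size per bulk operation: {avg_size_bytes:.0f} bytes")
print(f"total bulk ingest volume:    {total_tib:.1f} TiB")
```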

@consulthys
Contributor

> Each of these totals has a corresponding *_time_in_millis to divide with to get averages:
>
> indices.flush.total
> indices.get.total
> indices.indexing.index_total
> indices.merges.total
> indices.refresh.total
> indices.search.fetch_total
> indices.search.query_total

We'll add all of these averages.

> If we are not also capturing the bulk metrics, I'd also add those. It already reports avg_time_in_millis and avg_size_in_bytes:
>
>         "bulk": {
>           "total_operations": 2456026837,
>           "total_time_in_millis": 4086790047,
>           "total_size_in_bytes": 57411051045801,
>           "avg_time_in_millis": 0,
>           "avg_size_in_bytes": 25162
>         },
>
> The indices.bulk.total_size_in_bytes is incredibly useful when trying to understand the raw, uncompressed ingest volume.

Agreed. We're already capturing all bulk metrics; indices.bulk.total_size_in_bytes is captured in the field indices.bulk.total_size.bytes.

@consulthys
Contributor

@VimCommando by the way, it looks like bulk.avg_time_in_millis is always 0, are you seeing the same?
