[Meta][Metricbeat] - Collect additional Elasticsearch node metrics for enhanced dashboards #42131

VimCommando · 2024-12-20T02:40:15Z

Metricbeat (as of 8.15.3) used for stack monitoring collection is still missing some helpful metrics for building comprehensive monitoring dashboards.

Here is a potential list of metrics to include from _node/stats:

jvm.threads.count
http.total_opened
process.open_file_descriptors
process.mem.total_virtual_in_bytes

transport.rx_count
transport.rx_size_in_bytes
transport.tx_count
transport.tx_size_in_bytes

ingest.total.count
ingest.total.time_in_millis
ingest.total.failed

indices.fielddata.evictions
indices.get.time_in_millis
indices.get.total
indices.merges.total
indices.merges.total_time_in_millis
indices.search.fetch_time_in_millis
indices.search.fetch_total
indices.search.query_time_in_millis
indices.search.query_total
indices.translog.operations
indices.translog.size_in_bytes

thread_pool.esql_worker.active
thread_pool.esql_worker.queue
thread_pool.esql_worker.rejected
thread_pool.flush.active
thread_pool.flush.queue
thread_pool.flush.rejected
thread_pool.force_merge.active
thread_pool.get.active
thread_pool.search.active
thread_pool.write.active
thread_pool.search_worker.active
thread_pool.search_worker.queue
thread_pool.search_worker.rejected
thread_pool.snapshot.active
thread_pool.snapshot.queue
thread_pool.snapshot.rejected
thread_pool.system_read.active
thread_pool.system_read.queue
thread_pool.system_read.rejected
thread_pool.system_write.active
thread_pool.system_write.queue
thread_pool.system_write.rejected

Some newer features such as ES|QL (esql_worker) and intra-segment search parallelism (search_worker) have been introduced in 8.x and Metricbeat monitoring isn't capturing the relevant thread pools yet.

The average service time can also be helpful, for example the write time per document or query time per search. This is usually just a simple division like indices.write.time_in_millis / indices.write.total, but if it is calculated at ingest time, it is possible to sort by this metric in visualizations.
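As a rough illustration of the "simple division" mentioned above, here is a minimal Python sketch. The helper name `average_ms` and the sample values are hypothetical and are not taken from Metricbeat; the only assumption is that a cumulative time counter and its matching operation counter are both available.

```python
from typing import Optional


def average_ms(time_in_millis: int, total: int) -> Optional[float]:
    """Return the average milliseconds per operation, or None when nothing has run yet."""
    if total <= 0:
        return None  # avoid dividing by zero on an idle node
    return time_in_millis / total


# Made-up counter values, for illustration only.
sample = {"query_time_in_millis": 1500, "query_total": 600}
avg_query_ms = average_ms(sample["query_time_in_millis"], sample["query_total"])
print(avg_query_ms)  # 2.5
```

Storing a value like `avg_query_ms` alongside the raw counters is what would make it sortable in a visualization, at the cost of one extra field per ratio.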

Tasks


Modify the node_stats metric set of the elasticsearch Metricbeat module
Modify the default .monitoring-es-8-mb index template
Modify the node_stats data stream of the elasticsearch agent integration
Ask Control Plane to upgrade beats-runner to 8.18
Modify the elasticsearch-2* index templates

VimCommando · 2025-01-15T17:04:53Z

The indices stats don't currently capture all ES|QL activity: elastic/elasticsearch#109673

consulthys · 2025-01-22T07:52:00Z

@VimCommando I've started tackling this to make sure it goes into 8.18.
I can see that some of the fields above are already in, for instance:

indices.search.query_time_in_millis which is named indices.search.query_time.ms
indices.search.query_total which is named indices.search.query_total.count
Can you confirm that we're looking at the same fields?

consulthys · 2025-01-27T16:23:33Z

> Modify the elasticsearch-2* index templates

Regarding the above task, the production and QA templates will need to be updated by the Control Plane team specifically, when they decide to upgrade their internal MB to 8.18. So this is not a prerequisite for 8.18 FF.

consulthys · 2025-01-27T17:01:19Z

> The average service time can also be helpful, for example the write time per document or query time per search. This is usually just a simple division like indices.write.time_in_millis / indices.write.total, but if it is calculated at ingest time, it is possible to sort by this metric in visualizations.

@VimCommando can you list which ratios you'd like to have pre-computed?

VimCommando · 2025-01-27T19:45:44Z

> @VimCommando I've started tackling this to make sure it goes into 8.18. I can see that some of the fields above are already in, for instance:
>
> indices.search.query_time_in_millis which is named indices.search.query_time.ms
> indices.search.query_total which is named indices.search.query_total.count
>
> Can you confirm that we're looking at the same fields?

Yes, those are correct. I may have included them based on the dashboard I was looking at, not the code.

> @VimCommando can you list which ratios you'd like to have pre-computed?

Each of these totals has a corresponding *_time_in_millis counter; dividing the time by the total gives the average:

indices.flush.total
indices.get.total
indices.indexing.index_total
indices.merges.total
indices.refresh.total
indices.search.fetch_total
indices.search.query_total
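To make the pairing concrete, here is a minimal Python sketch that walks the `indices` section of a node stats document and computes one average per total listed above. The helper names (`lookup`, `average_times`) and the result layout are hypothetical. The time-counter names for flush, indexing, and refresh are not spelled out in this thread; they are filled in from the Elasticsearch node stats response as I understand it, so treat them as assumptions to verify.

```python
from typing import Any, Dict, Optional

# Total counter -> matching time counter, both relative to the "indices" section.
TOTAL_TO_TIME = {
    "flush.total": "flush.total_time_in_millis",            # assumed name
    "get.total": "get.time_in_millis",
    "indexing.index_total": "indexing.index_time_in_millis",  # assumed name
    "merges.total": "merges.total_time_in_millis",
    "refresh.total": "refresh.total_time_in_millis",        # assumed name
    "search.fetch_total": "search.fetch_time_in_millis",
    "search.query_total": "search.query_time_in_millis",
}


def lookup(doc: Dict[str, Any], dotted: str) -> Optional[Any]:
    """Follow a dotted path through nested dicts; return None if any part is missing."""
    node: Any = doc
    for part in dotted.split("."):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node


def average_times(indices: Dict[str, Any]) -> Dict[str, float]:
    """Return average milliseconds per operation, keyed by the total counter's path."""
    averages: Dict[str, float] = {}
    for total_path, time_path in TOTAL_TO_TIME.items():
        total = lookup(indices, total_path)
        time_ms = lookup(indices, time_path)
        # Skip pairs that are missing or have no operations yet (total of 0).
        if total and time_ms is not None:
            averages[total_path] = time_ms / total
    return averages


# Made-up values, for illustration only.
example = {
    "get": {"total": 4, "time_in_millis": 10},
    "search": {"query_total": 0, "query_time_in_millis": 0},
}
print(average_times(example))  # {'get.total': 2.5}
```

Pairs with a zero total are skipped rather than reported as 0, so an idle node does not look like an infinitely fast one.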

If we are not already capturing the bulk metrics, I'd add those too. The bulk section already reports avg_time_in_millis and avg_size_in_bytes:

        "bulk": {
          "total_operations": 2456026837,
          "total_time_in_millis": 4086790047,
          "total_size_in_bytes": 57411051045801,
          "avg_time_in_millis": 0,
          "avg_size_in_bytes": 25162
        },

The indices.bulk.total_size_in_bytes is incredibly useful when trying to understand the raw, uncompressed ingest volume.
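For comparison, here is a small Python sketch that derives lifetime averages from the bulk totals in the sample above. The variable names are hypothetical. The results do not match the reported avg_time_in_millis (0) and avg_size_in_bytes (25162), which suggests Elasticsearch computes those fields differently; this thread does not say how.

```python
import json

# The "bulk" object from the sample above, wrapped so it parses as standalone JSON.
bulk = json.loads("""
{
  "total_operations": 2456026837,
  "total_time_in_millis": 4086790047,
  "total_size_in_bytes": 57411051045801,
  "avg_time_in_millis": 0,
  "avg_size_in_bytes": 25162
}
""")

# Lifetime averages: cumulative total divided by the cumulative operation count.
lifetime_avg_time_ms = bulk["total_time_in_millis"] / bulk["total_operations"]
lifetime_avg_size_bytes = bulk["total_size_in_bytes"] / bulk["total_operations"]

print(round(lifetime_avg_time_ms, 2))   # 1.66
print(round(lifetime_avg_size_bytes))   # 23376
```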

consulthys · 2025-01-28T08:38:46Z

> Each of these totals has a corresponding *_time_in_millis counter; dividing the time by the total gives the average:
>
> indices.flush.total
> indices.get.total
> indices.indexing.index_total
> indices.merges.total
> indices.refresh.total
> indices.search.fetch_total
> indices.search.query_total

We'll add all of these averages.

> If we are not already capturing the bulk metrics, I'd add those too. The bulk section already reports avg_time_in_millis and avg_size_in_bytes:
>
>         "bulk": {
>           "total_operations": 2456026837,
>           "total_time_in_millis": 4086790047,
>           "total_size_in_bytes": 57411051045801,
>           "avg_time_in_millis": 0,
>           "avg_size_in_bytes": 25162
>         },
>
> The indices.bulk.total_size_in_bytes is incredibly useful when trying to understand the raw, uncompressed ingest volume.

Agreed. We're already capturing all bulk metrics; indices.bulk.total_size_in_bytes is captured in the field indices.bulk.total_size.bytes.

consulthys · 2025-01-28T13:40:37Z

@VimCommando by the way, it looks like bulk.avg_time_in_millis is always 0. Are you seeing the same?

VimCommando added Feature:Stack Monitoring Team:Monitoring Stack Monitoring team labels Dec 20, 2024

VimCommando assigned consulthys Dec 20, 2024

consulthys mentioned this issue Jan 24, 2025

[metricbeat] Gather more fields from _node/stats #42421

Merged

6 tasks

consulthys mentioned this issue Jan 28, 2025

[Stack Monitoring] Update monitoring mappings to add some new fields elastic/elasticsearch#121062

Merged

mergify bot mentioned this issue Jan 28, 2025

[8.x](backport #42421) [metricbeat] Gather more fields from _node/stats #42463

Open

6 tasks
