Skip to content
This repository was archived by the owner on Nov 15, 2023. It is now read-only.

Report tasks metrics to Prometheus #5619

Merged
merged 8 commits into from
Apr 15, 2020

Conversation

tomaka
Copy link
Contributor

@tomaka tomaka commented Apr 13, 2020

This PR adds Prometheus metrics to the sc_service::task_manager module.

There are four new metrics, each accepting a task name as label:

  • A counter incremented when we create a task.
  • A counter incremented when a task stops. The idea being to do started - stopped to know what's running.
  • A histogram with how long it took to run Future::poll. This can be used to figure out the CPU usage of the node.
  • A counter incremented every time we start polling a task. While this is kind of redundant with poll_duration_count, the idea is to do poll_started - poll_duration_count to detect tasks that are stuck.

I have consequently removed the futures-diagnose-exec tool, that was supposed to serve the same purpose but was totally undocumented and way more annoying to use on real nodes compared to Prometheus/Grafana.

@tomaka tomaka added A0-please_review Pull request needs code review. B2-breaksapi labels Apr 13, 2020
@tomaka tomaka added this to the 2.0 milestone Apr 13, 2020
@tomaka tomaka requested review from gnunicorn and mxinden April 13, 2020 18:27
@tomaka tomaka marked this pull request as ready for review April 13, 2020 18:29
Comment on lines +145 to +147
/// The task name is a `&'static str` as opposed to a `String`. The reason for that is that
/// in order to avoid memory consumption issues with the Prometheus metrics, the set of
/// possible task names has to be bounded.
Copy link
Contributor Author

@tomaka tomaka Apr 13, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might be overly restrictive, but I think that if we don't enforce this, it's only a matter of time before someone creates a task with a dynamic name.

It is possible to create a &'static str with Box::leak(format!("foo").into_boxed_str()), but I assume that if someone does that, it's not by accident.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very much in favor of this 👍

@tomaka
Copy link
Contributor Author

tomaka commented Apr 13, 2020

I was kind of worried about a performance degradation from having so many reports, but it seems fine.

Here's a CPU usage graph of my node. I deployed a new version of Polkadot with this PR around 18:40 UTC. You can't see any difference compared to the background noise.

Screenshot from 2020-04-13 20-49-55

@@ -312,8 +318,8 @@ impl<TBl, TCl, TSc, TNetStatus, TNet, TTxPool, TOc> Spawn for
&self,
future: FutureObj<'static, ()>
) -> Result<(), SpawnError> {
self.task_manager.scheduler().unbounded_send((Box::pin(future), From::from("unnamed")))
.map_err(|_| SpawnError::shutdown())
self.task_manager.spawn_handle().spawn("unnamed", future);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not about pr changes, but this is going to be used more in future, can we do better identifying what and from where spawned here?

Copy link
Contributor Author

@tomaka tomaka Apr 14, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can add an object that contains the task name in advance and implements Spawn, and pass it around.

For example, you'd do something like:

GrandPa::start(tasks_manager.build_spawner("grandpa"));

Copy link
Contributor

@gnunicorn gnunicorn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

Co-Authored-By: Bastian Köcher <bkchr@users.noreply.github.com>
@tomaka
Copy link
Contributor Author

tomaka commented Apr 14, 2020

Polkadot master seems broken. The error is unrelated to my changes, and the companion build was passing earlier.

Comment on lines +145 to +147
/// The task name is a `&'static str` as opposed to a `String`. The reason for that is that
/// in order to avoid memory consumption issues with the Prometheus metrics, the set of
/// possible task names has to be bounded.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very much in favor of this 👍

)?, registry)?,
tasks_started: register(CounterVec::new(
Opts::new(
"tasks_started_total",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"tasks_started_total",
"tasks_spawned_total",

Would make it easier to differentiate from the metric above.

Comment on lines 56 to 60
let before = Instant::now();
let outcome = Future::poll(this.inner, cx);
let after = Instant::now();

this.poll_duration.observe((after - before).as_secs_f64());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
let before = Instant::now();
let outcome = Future::poll(this.inner, cx);
let after = Instant::now();
this.poll_duration.observe((after - before).as_secs_f64());
let _timer = this.poll_duratoin.start_timer();
let outcome = Future::poll(this.inner, cx);
// _timer is dropped and thus automatically recorded in the Histogram.

@gnunicorn
Copy link
Contributor

unused imports let the CI fail.

@gnunicorn gnunicorn added A8-mergeoncegreen and removed A0-please_review Pull request needs code review. labels Apr 14, 2020
@gnunicorn gnunicorn assigned gnunicorn and unassigned tomaka Apr 14, 2020
@gnunicorn
Copy link
Contributor

icing while releasing.

@gnunicorn gnunicorn merged commit 5fcd533 into master Apr 15, 2020
@gnunicorn gnunicorn deleted the tka-prometheus-friendly-tasks-manager branch April 15, 2020 09:10
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants