Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cluster support #147

Merged
merged 1 commit into from
Sep 4, 2017
Merged

Cluster support #147

merged 1 commit into from
Sep 4, 2017

Conversation

zbjornson
Copy link
Collaborator

@zbjornson zbjornson commented Jul 30, 2017

Hello! This is an opening gambit for supporting aggregated metrics for use with node's cluster module. It's not ready for merge; looking for feedback first as to whether or not this would go in prom-client or a separate module that wraps prom-client.

It's only lightly tested, and there are no unit tests yet.

I made the changes completely isolated in a separate file that is not loaded by default. From reading issues #82 and #80, I wasn't sure if a PR that provides this functionality would be accepted, so I wanted the changes to work if the functionality was provided as a separate Node.js module. However, that means that the metric.aggregate stuff is a bit ugly. If this is something you want in prom-client, I could definitely add the aggregation mode to the metrics classes.

Ref #82
Ref #80

/cc @goofballLogic and @tommiv in case you're interested

@tommiv
Copy link

tommiv commented Jul 30, 2017

We actually implemented aggregated metrics via https://github.com/prometheus/statsd_exporter and "push" model statsd driver for node, but it'll be nice to eliminate an additional entity. Thank you.

@zbjornson zbjornson force-pushed the cluster branch 2 times, most recently from 8a9893e to add8f4b Compare August 1, 2017 01:54
@siimon
Copy link
Owner

siimon commented Aug 1, 2017

Thanks for your PR :)

Im not against adding cluster support, the thing I am worried about is supporting it in the long term since I lack the experience and time to get up to speed with clustering node processes.

However, but I understand that it's something that probably should be in prom-client since a lot of people apparently cluster out the node processes, so I don't see any reason to not merge it when it looks ready.

A thought, would it be interesting to apply some kind of label to the metric to indicate which node in the cluster the metric is from?

@tommiv
Copy link

tommiv commented Aug 1, 2017

A thought, would it be interesting to apply some kind of label to the metric to indicate which node in the cluster the metric is from?

This is definetely a good idea, we used similar tag in all our setups.

@zbjornson
Copy link
Collaborator Author

Thanks for the feedback. I'll work on cleaning it up. I'm happy to help with maintenance of this; I'm more familiar with the cluster module than I ever wanted to be, and I think most prod node.js servers use it.

As far as metric labels: This acts as an aggregator, so all worker processes appear as one. This seemed to be the preferred approach for the Python (gunicorn) and Rails clients; the concern discussed there was that exposing all workers separately would massively increase the number of time series, and percentage metrics like CPU and memory usage would be fractions of an unknown denominator.

Nonetheless, the example file this PR shows setting a metric label to a worker's ID, and the result is that it doesn't get aggregated, so you can keep a metric separate if you want to.

Does that seem reasonable? @tommiv satisfy your use case?

@tommiv
Copy link

tommiv commented Aug 1, 2017

Does that seem reasonable? @tommiv satisfy your use case?

I thought we're talking about a physical node address (or VM), not the cluster node id. But now I see I misread the previous message. We don't split up metrics to the cluster nodes, it makes no sense indeed.

Copy link
Collaborator

@SimenB SimenB left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing tests, but I think it's a great start, and belongs in this lib 😄

metricsServer.get('/cluster_metrics', (req, res) => {
promCluster.clusterMetrics((err, metrics) => {
//eslint-disable-next-line no-console
if (err) console.log(err);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Throw it?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Meh, it's an example

lib/cluster.js Outdated
process.send({
type: GET_METRICS_RES,
requestId: message.requestId,
metrics: Registry.globalRegistry.getMetricsAsJSON()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Registry should probably not be hard-coded

lib/cluster.js Outdated
* `registry.getMetricsAsJSON()`.
* @return {Registry} aggregated registry.
*/
Registry.aggregate = function(metricsArr) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have a merge, but that just includes different metrics and doesn't handle dupes?

If so, this function should be added next to merge, not live here

lib/cluster.js Outdated
* @param {Function} callback (err, metrics) => any
* @return {undefined} undefined
*/
Registry.prototype.clusterMetrics = function(callback) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't modify the prototype. Either add directly or use inheritance

lib/cluster.js Outdated
// define an `aggregate` method, or will be summed by default.
// TODO These are here for the sake of isolation, but obviously would be
// nicer if they were defined in the metrics/ files.
// TODO Might be nice to have multiple aggregates for some, like the
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already export them, right?

lib/cluster.js Outdated
}
});
});
byLabels.forEach(values => {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

result.values = byLabels.map etc

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

He's using a Map, I don't thing you can call .map directly,
he could convert it to an Array using Array.from an go from there.

lib/cluster.js Outdated
* @param {Function} aggregatorFn function to apply to values.
* @return {Function} aggregator function
*/
const AggregatorFactory = function(aggregatorFn) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this factory have to be within the if? Reviewing on mobile, so not too easy seeing if you close over anything

README.md Outdated

To solve this, you can aggregate all of the workers' metrics in the master process. See example folder for a sample usage. By default, custom metrics are summed across workers, and default metrics use sensible defaults. To use a different aggregation method, define an `aggregate()` function for the metric's name in the global registry in the *master* process. (See the `AggregatorFactory` in `lib/cluster.js` for details.)

Note: You must use the default global registry in cluster mode.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the current arrangement, the cluster master is the one that fulfills the metrics request by querying all of the workers and aggregating their local metrics. The workers need to know which registry to send their local metrics from, so the global registry is easy. I'll try to figure out a nice way to support other registries.

});

metricsServer.listen(3001);
//eslint-disable-next-line no-console
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can just add a eslintrc in examples/ which disables this rule

@@ -2,7 +2,8 @@
"plugins": ["prettier"],
"extends": "eslint:recommended",
"env": {
"node": true
"node": true,
"es6": true
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For Map?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep! Based on the other new language features you're already using, I don't think this changes the oldest supported version of node.js.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We support node 4 and up, and Map has been in node since 0.12, so we're good 🙂

@zbjornson
Copy link
Collaborator Author

Thanks very much for the review. Will work this weekend on making it nice and removing the parts that were awkward as a result of trying to keep it isolated.

@zbjornson
Copy link
Collaborator Author

Most comments resolved and most tests added. Need a few more tests, and there are still some parts of this that I don't love and that I want to think about better ways of doing, so please hang on another day or so. (Namely: (1) The global registry still has to be used. (2) The way aggregators are looked up is sort of ugly.)

README.md Outdated
@@ -278,6 +288,10 @@ If you need to get a reference to a previously registered metric, you can use `r
You can remove all metrics by calling `register.clear()`. You can also remove a single metric by calling
`register.removeSingleMetric(*name of metric*)`.

##### Cluster metrics

You can get aggregated metrics for all worker processes in a cluster by running `register.clusterMetrics((err, metrics) => {...})`, which will be called with an error, or with the aggregated metrics string for prometheus to consume.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@siimon do we want callback or promise api?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since cluster api is very much callback based, keeping the cb might be natural

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Easy enough to do both: return a promise and accept a callback.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, haven't had the time to look at this in detail. Having callbacks is a must have imo, but I'm fine with both aswell.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer Promise these days because of async-await, which is in node 8 which will be LTS in a couple of months. Makes for cleaner code.

That said, I don't feel strongly about it

lib/cluster.js Outdated
}
}

module.exports = AggregatorRegistry;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you move the export to the bottom?

if (values[0].metricName) {
valObj.metricName = values[0].metricName;
}
// TODO how do we aggregate timestamps? Average? Omit?
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

either keep the max or omit, I think. I'm not sure what makes the most sense...

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah not sure either, I'm more leaning towards omitting, since I rather provide no data at all than misguiding data

lib/histogram.js Outdated
@@ -64,6 +65,7 @@ class Histogram {

this.name = config.name;
this.help = config.help;
this.aggregate = config.aggregator || AGGREGATORS.SUM;
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couldn't you set this in the object assign step instead? Will be easier to see all the defaults if they are set on the same place

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Originally I had it there, but then it has to be repeated where the 2nd constructor form (formal arguments) is handled:

if (isObject(name)) {
  // once here
} else {
  config = {
    // again here
  }
}

/**
* Functions that can be used to aggregate metrics from multiple registries.
*/
exports.AGGREGATORS = {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Im thinking these might belong in a separate file? Not sure about the uppercase function names either

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think they fit fine in here. I also find uppercase odd, though 🙂

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AGGREGATORS.sum or AGGREGATORS.Sum or aggregators.sum?

Copy link
Collaborator

@SimenB SimenB left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is starting to look really good! 🙂

@@ -55,6 +56,12 @@
"no-shadow": "off",
"no-unused-expressions": "off"
}
},
{
"files": ["example/**/*.js"],
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

README.md Outdated
@@ -278,6 +288,10 @@ If you need to get a reference to a previously registered metric, you can use `r
You can remove all metrics by calling `register.clear()`. You can also remove a single metric by calling
`register.removeSingleMetric(*name of metric*)`.

##### Cluster metrics

You can get aggregated metrics for all worker processes in a cluster by running `register.clusterMetrics((err, metrics) => {...})`, which will be called with an error, or with the aggregated metrics string for prometheus to consume.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer Promise these days because of async-await, which is in node 8 which will be LTS in a couple of months. Makes for cleaner code.

That said, I don't feel strongly about it

lib/cluster.js Outdated
const DEFAULT_METRIC_AGGREGATORS = {
nodejs_version_info: AGGREGATORS.FIRST,
process_start_time_seconds: AGGREGATORS.OMIT,
nodejs_eventloop_lag_seconds: AGGREGATORS.AVERAGE
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

feels sorta icky to hard-code these. what about custom metrics? Can you filter out all non-sum metrics by checking the metric's type at aggregation-time?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

after doing Registry.globalRegistry.getSingleMetric(metricName); further down, figure out strategy. If type is not available, either do an instanceof or add a type function to the metrics

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Re: first comment) Yeah, this is my highest priority to clean up. These aggregation methods definitely shouldn't be defined twice; right now they're here and in lib/metrics/xxx.js. The factory functions exported from lib/metrics/ don't make it easy to retrieve the metric's aggregate/aggregator value, so I had been playing with calling those factory functions with a mocked registry, but chaos ensued. To be continued...

lib/cluster.js Outdated
const Registry = require('./registry');
const AGGREGATORS = require('./metricAggregators').AGGREGATORS;

const GET_METRICS_REQ = 'prom-client:getMetricsReq';
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be a Symbol instead of a string? Not sure how serialization works here

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not as far as I know, without using Symbol.for(string) (which I don't think is any better than just using the string).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, just the string is fine then 🙂

};
// Gather metrics by labels.
const byLabels = new Map();
metrics.forEach(metric => {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this part seems duplicated from aggregate. Can it be put in utils? (I realize the diff is .name vs .labels)

/**
* Functions that can be used to aggregate metrics from multiple registries.
*/
exports.AGGREGATORS = {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think they fit fine in here. I also find uppercase odd, though 🙂

@@ -16,7 +17,8 @@ module.exports = registry => {
const gauge = new Gauge({
name: NODEJS_EVENTLOOP_LAG,
help: 'Lag of event loop in seconds.',
registers: registry ? [registry] : undefined
registers: registry ? [registry] : undefined,
aggregator: AGGREGATORS.AVERAGE
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the Gauge class itself set this? So all get it for free? Or is there some use-case where it makes sense to not do so?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose there are some special-cases (like nodejs_version_info), but can a sensible default be set?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Everything defaults to sum via code in the Gauge/Histogram/Summary/Counter classes (example), but there are three metrics with non-default methods:

  • nodejs_version_info uses the first value (all values will be the same)
  • process_starttime_seconds is omitted (all values will be slightly different -- open to suggestions)
  • nodejs_eventloop_lag averages

lib/cluster.js Outdated

if (request.failed) return; // Callback already run with Error.

const registry = Registry.aggregate(request.responses);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this needs to refer to AggregatorRegistry.

}

metricsServer.get('/cluster_metrics', (req, res) => {
AggregatorRegistry.clusterMetrics((err, metrics) => {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clusterMetrics is not a static method and thus can only be called on an Instance - of which there is none in your example code

@D34THWINGS
Copy link

Can't wait to see this working, good job guys ! ❤️

@zbjornson
Copy link
Collaborator Author

Sorry for the delay folks. Will be pushing an update this weekend.

lib/cluster.js Outdated
type: GET_METRICS_RES,
requestId: message.requestId,
// TODO see if we can support the non-global registry also.
metrics: Registry.globalRegistry.getMetricsAsJSON()
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the last remaining TODO. The worker processes need to know which registry to send metrics from for aggregation, so using the globalRegistry was easy.

Options I can think of:

  1. We send metrics from all registries. This seems like it would create a mess.
  2. We provide a way for users to specify which registry to use. Something like Registry.aggregationSource = new Registry(), then the above line becomes Registry.aggregationSource .getMetricsAsJSON().

I don't know what the use cases are for multiple registries, so I'm not sure what makes sense to do.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. We have no way of knowing about every registry
  2. I think we should do this. Unsure about the API, though. Maybe a AggregatorRegistry.setRegistries?

@zbjornson
Copy link
Collaborator Author

Alright, I think I've addressed all the comments, including adding a Promise interface in addition to the callback interface. All the ugly code around storing and looking up metric aggregator functions is much nicer now.

Thanks everyone for the continuing feedback!

Copy link
Collaborator

@SimenB SimenB left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really starting to come together! Just missing supporting custom registries (as you've noted, #147 (comment)) and I think it's good to land 🙂

lib/cluster.js Outdated
});
});

// Aggregate gathered metrics. Default to summation.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is that default needed if all the metric types has a default?

@@ -89,3 +89,21 @@ exports.printDeprecationCollectDefaultMetricsNumber = timeout => {
`prom-client - A number to defaultMetrics is deprecated, please use \`collectDefaultMetrics({ timeout: ${timeout} })\`.`
);
};

class Grouper extends Map {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@@ -0,0 +1,91 @@
'use strict';

xdescribe('aggregators', () => {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove the x

@siimon
Copy link
Owner

siimon commented Aug 29, 2017

I agree, looks really good! Another small detail, update the changelog :)

Copy link
Collaborator

@SimenB SimenB left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(wrong button lasts time...)

@zbjornson
Copy link
Collaborator Author

All remaining bits addressed.

Note that the test output says "Warning: Possible EventEmitter memory leak detected. 11 message listeners added. Use emitter.setMaxListeners() to increase limit". This is not a real leak (doesn't happen outside of jest); see jestjs/jest#4413. If you want I can add a jest config file and a setup file containing emitter.setMaxListeners(0).

Thank you!

Copy link
Collaborator

@SimenB SimenB left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This LGTM, great PR!

I'd like for @siimon to take a new look as well before merging 🙂

@siimon
Copy link
Owner

siimon commented Sep 4, 2017

I agree! Great work and it's highly appreciated! I'll merge and publish a new minor!

And thanks @SimenB et al for great reviews!

@siimon siimon merged commit 4c5b6c7 into siimon:master Sep 4, 2017
@zbjornson zbjornson deleted the cluster branch September 4, 2017 15:43
@zbjornson
Copy link
Collaborator Author

Thanks everyone!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants