-
Notifications
You must be signed in to change notification settings - Fork 378
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Cluster support #147
Cluster support #147
Conversation
We actually implemented aggregated metrics via https://github.com/prometheus/statsd_exporter and "push" model statsd driver for node, but it'll be nice to eliminate an additional entity. Thank you. |
8a9893e
to
add8f4b
Compare
Thanks for your PR :) Im not against adding cluster support, the thing I am worried about is supporting it in the long term since I lack the experience and time to get up to speed with clustering node processes. However, but I understand that it's something that probably should be in prom-client since a lot of people apparently cluster out the node processes, so I don't see any reason to not merge it when it looks ready. A thought, would it be interesting to apply some kind of label to the metric to indicate which node in the cluster the metric is from? |
This is definetely a good idea, we used similar tag in all our setups. |
Thanks for the feedback. I'll work on cleaning it up. I'm happy to help with maintenance of this; I'm more familiar with the cluster module than I ever wanted to be, and I think most prod node.js servers use it. As far as metric labels: This acts as an aggregator, so all worker processes appear as one. This seemed to be the preferred approach for the Python (gunicorn) and Rails clients; the concern discussed there was that exposing all workers separately would massively increase the number of time series, and percentage metrics like CPU and memory usage would be fractions of an unknown denominator. Nonetheless, the example file this PR shows setting a metric label to a worker's ID, and the result is that it doesn't get aggregated, so you can keep a metric separate if you want to. Does that seem reasonable? @tommiv satisfy your use case? |
I thought we're talking about a physical node address (or VM), not the cluster node id. But now I see I misread the previous message. We don't split up metrics to the cluster nodes, it makes no sense indeed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing tests, but I think it's a great start, and belongs in this lib 😄
example/cluster.js
Outdated
metricsServer.get('/cluster_metrics', (req, res) => { | ||
promCluster.clusterMetrics((err, metrics) => { | ||
//eslint-disable-next-line no-console | ||
if (err) console.log(err); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Throw it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Meh, it's an example
lib/cluster.js
Outdated
process.send({ | ||
type: GET_METRICS_RES, | ||
requestId: message.requestId, | ||
metrics: Registry.globalRegistry.getMetricsAsJSON() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Registry should probably not be hard-coded
lib/cluster.js
Outdated
* `registry.getMetricsAsJSON()`. | ||
* @return {Registry} aggregated registry. | ||
*/ | ||
Registry.aggregate = function(metricsArr) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have a merge
, but that just includes different metrics and doesn't handle dupes?
If so, this function should be added next to merge, not live here
lib/cluster.js
Outdated
* @param {Function} callback (err, metrics) => any | ||
* @return {undefined} undefined | ||
*/ | ||
Registry.prototype.clusterMetrics = function(callback) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't modify the prototype. Either add directly or use inheritance
lib/cluster.js
Outdated
// define an `aggregate` method, or will be summed by default. | ||
// TODO These are here for the sake of isolation, but obviously would be | ||
// nicer if they were defined in the metrics/ files. | ||
// TODO Might be nice to have multiple aggregates for some, like the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We already export them, right?
lib/cluster.js
Outdated
} | ||
}); | ||
}); | ||
byLabels.forEach(values => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
result.values = byLabels.map
etc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
He's using a Map, I don't thing you can call .map directly,
he could convert it to an Array using Array.from an go from there.
lib/cluster.js
Outdated
* @param {Function} aggregatorFn function to apply to values. | ||
* @return {Function} aggregator function | ||
*/ | ||
const AggregatorFactory = function(aggregatorFn) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this factory have to be within the if
? Reviewing on mobile, so not too easy seeing if you close over anything
README.md
Outdated
|
||
To solve this, you can aggregate all of the workers' metrics in the master process. See example folder for a sample usage. By default, custom metrics are summed across workers, and default metrics use sensible defaults. To use a different aggregation method, define an `aggregate()` function for the metric's name in the global registry in the *master* process. (See the `AggregatorFactory` in `lib/cluster.js` for details.) | ||
|
||
Note: You must use the default global registry in cluster mode. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the current arrangement, the cluster master is the one that fulfills the metrics request by querying all of the workers and aggregating their local metrics. The workers need to know which registry to send their local metrics from, so the global registry is easy. I'll try to figure out a nice way to support other registries.
example/cluster.js
Outdated
}); | ||
|
||
metricsServer.listen(3001); | ||
//eslint-disable-next-line no-console |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can just add a eslintrc in examples/ which disables this rule
@@ -2,7 +2,8 @@ | |||
"plugins": ["prettier"], | |||
"extends": "eslint:recommended", | |||
"env": { | |||
"node": true | |||
"node": true, | |||
"es6": true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For Map
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep! Based on the other new language features you're already using, I don't think this changes the oldest supported version of node.js.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We support node 4 and up, and Map
has been in node since 0.12, so we're good 🙂
Thanks very much for the review. Will work this weekend on making it nice and removing the parts that were awkward as a result of trying to keep it isolated. |
Most comments resolved and most tests added. Need a few more tests, and there are still some parts of this that I don't love and that I want to think about better ways of doing, so please hang on another day or so. (Namely: (1) The global registry still has to be used. (2) The way aggregators are looked up is sort of ugly.) |
README.md
Outdated
@@ -278,6 +288,10 @@ If you need to get a reference to a previously registered metric, you can use `r | |||
You can remove all metrics by calling `register.clear()`. You can also remove a single metric by calling | |||
`register.removeSingleMetric(*name of metric*)`. | |||
|
|||
##### Cluster metrics | |||
|
|||
You can get aggregated metrics for all worker processes in a cluster by running `register.clusterMetrics((err, metrics) => {...})`, which will be called with an error, or with the aggregated metrics string for prometheus to consume. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@siimon do we want callback or promise api?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since cluster api is very much callback based, keeping the cb might be natural
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Easy enough to do both: return a promise and accept a callback.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, haven't had the time to look at this in detail. Having callbacks is a must have imo, but I'm fine with both aswell.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer Promise these days because of async-await
, which is in node 8 which will be LTS in a couple of months. Makes for cleaner code.
That said, I don't feel strongly about it
lib/cluster.js
Outdated
} | ||
} | ||
|
||
module.exports = AggregatorRegistry; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you move the export to the bottom?
lib/metricAggregators.js
Outdated
if (values[0].metricName) { | ||
valObj.metricName = values[0].metricName; | ||
} | ||
// TODO how do we aggregate timestamps? Average? Omit? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
either keep the max or omit, I think. I'm not sure what makes the most sense...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah not sure either, I'm more leaning towards omitting, since I rather provide no data at all than misguiding data
lib/histogram.js
Outdated
@@ -64,6 +65,7 @@ class Histogram { | |||
|
|||
this.name = config.name; | |||
this.help = config.help; | |||
this.aggregate = config.aggregator || AGGREGATORS.SUM; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couldn't you set this in the object assign step instead? Will be easier to see all the defaults if they are set on the same place
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Originally I had it there, but then it has to be repeated where the 2nd constructor form (formal arguments) is handled:
if (isObject(name)) {
// once here
} else {
config = {
// again here
}
}
lib/metricAggregators.js
Outdated
/** | ||
* Functions that can be used to aggregate metrics from multiple registries. | ||
*/ | ||
exports.AGGREGATORS = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Im thinking these might belong in a separate file? Not sure about the uppercase function names either
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think they fit fine in here. I also find uppercase odd, though 🙂
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AGGREGATORS.sum
or AGGREGATORS.Sum
or aggregators.sum
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is starting to look really good! 🙂
@@ -55,6 +56,12 @@ | |||
"no-shadow": "off", | |||
"no-unused-expressions": "off" | |||
} | |||
}, | |||
{ | |||
"files": ["example/**/*.js"], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
README.md
Outdated
@@ -278,6 +288,10 @@ If you need to get a reference to a previously registered metric, you can use `r | |||
You can remove all metrics by calling `register.clear()`. You can also remove a single metric by calling | |||
`register.removeSingleMetric(*name of metric*)`. | |||
|
|||
##### Cluster metrics | |||
|
|||
You can get aggregated metrics for all worker processes in a cluster by running `register.clusterMetrics((err, metrics) => {...})`, which will be called with an error, or with the aggregated metrics string for prometheus to consume. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer Promise these days because of async-await
, which is in node 8 which will be LTS in a couple of months. Makes for cleaner code.
That said, I don't feel strongly about it
lib/cluster.js
Outdated
const DEFAULT_METRIC_AGGREGATORS = { | ||
nodejs_version_info: AGGREGATORS.FIRST, | ||
process_start_time_seconds: AGGREGATORS.OMIT, | ||
nodejs_eventloop_lag_seconds: AGGREGATORS.AVERAGE |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
feels sorta icky to hard-code these. what about custom metrics? Can you filter out all non-sum metrics by checking the metric's type at aggregation-time?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
after doing Registry.globalRegistry.getSingleMetric(metricName);
further down, figure out strategy. If type
is not available, either do an instanceof
or add a type
function to the metrics
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Re: first comment) Yeah, this is my highest priority to clean up. These aggregation methods definitely shouldn't be defined twice; right now they're here and in lib/metrics/xxx.js
. The factory functions exported from lib/metrics/
don't make it easy to retrieve the metric's aggregate
/aggregator
value, so I had been playing with calling those factory functions with a mocked registry, but chaos ensued. To be continued...
lib/cluster.js
Outdated
const Registry = require('./registry'); | ||
const AGGREGATORS = require('./metricAggregators').AGGREGATORS; | ||
|
||
const GET_METRICS_REQ = 'prom-client:getMetricsReq'; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this be a Symbol
instead of a string? Not sure how serialization works here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not as far as I know, without using Symbol.for(string)
(which I don't think is any better than just using the string).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, just the string is fine then 🙂
lib/metricAggregators.js
Outdated
}; | ||
// Gather metrics by labels. | ||
const byLabels = new Map(); | ||
metrics.forEach(metric => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this part seems duplicated from aggregate
. Can it be put in utils
? (I realize the diff is .name
vs .labels
)
lib/metricAggregators.js
Outdated
/** | ||
* Functions that can be used to aggregate metrics from multiple registries. | ||
*/ | ||
exports.AGGREGATORS = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think they fit fine in here. I also find uppercase odd, though 🙂
lib/metrics/eventLoopLag.js
Outdated
@@ -16,7 +17,8 @@ module.exports = registry => { | |||
const gauge = new Gauge({ | |||
name: NODEJS_EVENTLOOP_LAG, | |||
help: 'Lag of event loop in seconds.', | |||
registers: registry ? [registry] : undefined | |||
registers: registry ? [registry] : undefined, | |||
aggregator: AGGREGATORS.AVERAGE |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should the Gauge
class itself set this? So all get it for free? Or is there some use-case where it makes sense to not do so?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose there are some special-cases (like nodejs_version_info
), but can a sensible default be set?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Everything defaults to sum
via code in the Gauge
/Histogram
/Summary
/Counter
classes (example), but there are three metrics with non-default methods:
nodejs_version_info
uses the first value (all values will be the same)process_starttime_seconds
is omitted (all values will be slightly different -- open to suggestions)nodejs_eventloop_lag
averages
lib/cluster.js
Outdated
|
||
if (request.failed) return; // Callback already run with Error. | ||
|
||
const registry = Registry.aggregate(request.responses); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this needs to refer to AggregatorRegistry
.
example/cluster.js
Outdated
} | ||
|
||
metricsServer.get('/cluster_metrics', (req, res) => { | ||
AggregatorRegistry.clusterMetrics((err, metrics) => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
clusterMetrics
is not a static method and thus can only be called on an Instance - of which there is none in your example code
Can't wait to see this working, good job guys ! ❤️ |
Sorry for the delay folks. Will be pushing an update this weekend. |
lib/cluster.js
Outdated
type: GET_METRICS_RES, | ||
requestId: message.requestId, | ||
// TODO see if we can support the non-global registry also. | ||
metrics: Registry.globalRegistry.getMetricsAsJSON() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the last remaining TODO. The worker processes need to know which registry to send metrics from for aggregation, so using the globalRegistry
was easy.
Options I can think of:
- We send metrics from all registries. This seems like it would create a mess.
- We provide a way for users to specify which registry to use. Something like
Registry.aggregationSource = new Registry()
, then the above line becomesRegistry.aggregationSource .getMetricsAsJSON()
.
I don't know what the use cases are for multiple registries, so I'm not sure what makes sense to do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- We have no way of knowing about every registry
- I think we should do this. Unsure about the API, though. Maybe a
AggregatorRegistry.setRegistries
?
Alright, I think I've addressed all the comments, including adding a Promise interface in addition to the callback interface. All the ugly code around storing and looking up metric aggregator functions is much nicer now. Thanks everyone for the continuing feedback! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is really starting to come together! Just missing supporting custom registries (as you've noted, #147 (comment)) and I think it's good to land 🙂
lib/cluster.js
Outdated
}); | ||
}); | ||
|
||
// Aggregate gathered metrics. Default to summation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is that default needed if all the metric types has a default?
@@ -89,3 +89,21 @@ exports.printDeprecationCollectDefaultMetricsNumber = timeout => { | |||
`prom-client - A number to defaultMetrics is deprecated, please use \`collectDefaultMetrics({ timeout: ${timeout} })\`.` | |||
); | |||
}; | |||
|
|||
class Grouper extends Map { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
test/aggregatorsTest.js
Outdated
@@ -0,0 +1,91 @@ | |||
'use strict'; | |||
|
|||
xdescribe('aggregators', () => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove the x
I agree, looks really good! Another small detail, update the changelog :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(wrong button lasts time...)
All remaining bits addressed. Note that the test output says "Warning: Possible EventEmitter memory leak detected. 11 message listeners added. Use emitter.setMaxListeners() to increase limit". This is not a real leak (doesn't happen outside of jest); see jestjs/jest#4413. If you want I can add a jest config file and a setup file containing Thank you! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This LGTM, great PR!
I'd like for @siimon to take a new look as well before merging 🙂
I agree! Great work and it's highly appreciated! I'll merge and publish a new minor! And thanks @SimenB et al for great reviews! |
Thanks everyone! |
Hello! This is an opening gambit for supporting aggregated metrics for use with node's
cluster
module. It's not ready for merge; looking for feedback first as to whether or not this would go in prom-client or a separate module that wraps prom-client.It's only lightly tested, and there are no unit tests yet.
I made the changes completely isolated in a separate file that is not loaded by default. From reading issues #82 and #80, I wasn't sure if a PR that provides this functionality would be accepted, so I wanted the changes to work if the functionality was provided as a separate Node.js module. However, that means that the
metric.aggregate
stuff is a bit ugly. If this is something you want in prom-client, I could definitely add the aggregation mode to the metrics classes.Ref #82
Ref #80
/cc @goofballLogic and @tommiv in case you're interested