-
Notifications
You must be signed in to change notification settings - Fork 808
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Convert client_java to OpenMetrics #615
Conversation
Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>
Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>
Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>
Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>
Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>
Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>
Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>
Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>
Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>
Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>
List<Sample> mungedSamples = samples; | ||
// Deal with _total from pre-OM automatically. | ||
if (type == Type.COUNTER) { | ||
if (name.endsWith("_total")) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Worth making _total
a const? Then can name.endsWith(counterSuffix)
and name = name.substring(0, name.length() - counterSuffix.length())
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't like indirection like that, it only makes the code harder to understand for me as I have to bounce around the file for something that could be understood one its own line.
if (name.endsWith("_total")) { | ||
name = name.substring(0, name.length() - 6); | ||
} | ||
String withTotal = name + "_total"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: This could also use the const.
names.add(family.name + "_created"); | ||
names.add(family.name); | ||
break; | ||
case GAUGE_HISTOGRAM: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Believe gauge histogram should also have _created
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, gauges don't have a created.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If so, should we change the proto then? HistogramValue can be used with GAUGE_HISTOGRAM:
https://github.com/OpenObservability/OpenMetrics/blob/master/proto/openmetrics_data_model.proto#L136-L138
I suppose implementors could enforce that's not combined together?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should clarify that then.
if (name.endsWith("_total")) { | ||
name = name.substring(0, name.length() - 6); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: These could reference the _total
const?
/** | ||
* Get the created time of the counter in milliseconds. | ||
*/ | ||
public long created() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this maybe be a float in seconds? That way callers don't need to divide by 1000.0 to get the value they should expose, also since this accessor doesn't have the unit (seconds/milliseconds/etc) maybe it's best to return the unit that's expected to be exposed for exposition?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Java's current time function returns milliseconds, so that's the effective standard inside Java - and thus also what the client uses everywhere already.
@@ -209,6 +215,13 @@ public B namespace(String namespace) { | |||
this.namespace = namespace; | |||
return (B)this; | |||
} | |||
/** | |||
* Set the uit of the metric. Required. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Believe uit
should be unit
?
Gauge g = Gauge.build().name("a").unit("seconds").help("h").create(); | ||
assertEquals("a_seconds", g.fullname); | ||
|
||
Gauge g2 = Gauge.build().name("a_seconds").unit("seconds").help("h").create(); | ||
assertEquals("a_seconds", g2.fullname); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Worth adding a test to observe the unit being added if the name doesn't correctly have the unit on it already when built?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's what the test right above this is doing.
@@ -70,6 +70,7 @@ public void loadingCacheExposesMetricsForLoadsAndExceptions() throws Exception { | |||
} | |||
cache.get("user3"); | |||
|
|||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Probably could be removed this line change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
} | ||
|
||
for (String accepts : acceptHeader.split(",")) { | ||
if ("application/openmetrics-text".equals(accepts.split(";")[0].trim())) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Worth making this a const? It's used both in the content-type + version string const as well as here when detecting the accept type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it'd aid readability to have a constant based on other constants, this is a well known string - I find it easier to have the raw text rather than digging through layers of indirection to see what this code is actually doing.
if (metricFamilySamples.type == Collector.Type.COUNTER) { | ||
writer.write("_total"); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm is Prometheus text version output also meant to have the suffixes appended? I thought only OpenMetrics format would get that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, no matter which format is negotiated the exact same samples should end up inside Prometheus.
@@ -53,6 +124,10 @@ public static void write004(Writer writer, Enumeration<Collector.MetricFamilySam | |||
writer.write('\n'); | |||
} | |||
} | |||
// Write out any OM-specific samples. | |||
if (!omFamilies.isEmpty()) { | |||
write004(writer, Collections.enumeration(omFamilies.values())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, could this not be recursive since this is in write004
method itself? i.e. could it not each time collect into omFamilies
and recurse but not actually writing it in the child recursive call but simply recursing again infinitely?
Haven't looked to see if there's a test or not for this just yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The if
is there for a reason. Java lacks inner functions, so this is the easy way to not repeat the rendering logic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yes, I see - and these will be gauges in the recurse function. Makes sense.
@@ -100,8 +175,101 @@ private static String typeString(Collector.Type t) { | |||
return "summary"; | |||
case HISTOGRAM: | |||
return "histogram"; | |||
case GAUGE_HISTOGRAM: | |||
return "histogram"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, isn't this written out as a gauge? As per detecting if gsum/gcount is present?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Histogram is the closest structurally, even if it's technically a gauge. The Python client does the same.
Since the Prometheus server cannot be configured to not negotiate OpenMetrics, is it perhaps worth it to make OpenMetrics opt-in here, at least for a transition period? I assume this instrumentation library has loads of users, and many will naively update and then be taken by surprise by the changes in naming and the potential of name collisions. The opt-in has worked quite well in client_golang. People still run into surprising (for them) issues, but at least they have very explicitly activated OpenMetrics and can easily switch it off again if needed. |
This PR is at its core a fundamental data model change, the addition of additional output formats and new metric types are incidental to that. No matter what format is negotiated, Prometheus will ultimately ingest the same samples. Having which internal data model to use would be non-trivial, and thus it does not make sense to make this configurable. I'm taking the same approach here as with client_python, clearing calling out the change in the release notes and announcements which is fine as this library is not 1.0. If users wish to continue to use the old code, they can not upgrade. client_golang still uses the Prometheus data model internally, so it's not a good comparison. |
My argument is purely from the users' perspective. They don't care about the internal data model. I'm not disputing that we are technically entitled to any breaking changes in a pre-1.0 release. However, I think we should still try to be nice to our users if we can. And even if we are technically not breaking any promises, many users will still see it as "our fault" if things break unexpectedly (from their perspective). The Python client already caused some trouble with the many added If at least Prometheus could have a configurable priority for content negotiation, we could at least offer a quick mitigation strategy. |
Would cutting a 0.10 (or 0.9.1) before merging this PR be a compromise? |
It's technically a compromise, but it doesn't address the problem. Users will suffer the same fate with 0.11. Please also take into account that instrumentation happens in code that is not your own, where you are not in control of which version to use. (See the recent Caddy issues as an example from the Go world. At least, Caddy could fix it by adding a configuration option for their users. And zero blame was put on the Prometheus project.) I think the following migration path would be the least bad:
|
The problem with that suggested path is that it requires making a data model configurable, which is not a reasonable request as that's such a fundamental aspect of the code and would be a major amount of work. This PR was tricky enough as-is without having to duplicate all the relevant code and make it all cleanly and correctly interoperate somehow.
Once again this is primarily a data model change, which metric format(s) can be negotiated is irrelevant as we'll always end up with the exact same samples being ingested into Prometheus. The way you can opt in or opt out is by choosing the version of client_java to use. If you don't like the change, it's easy to downgrade again.
That's not a problem, as that's not how Java dependency resolution works. Whoever builds the final "binary" gets to choose which version is used, not the individual library developers. An end-user can also readily produce such a "binary" with a different version. You can't somehow end up with two different versions of the relevant code inside the binary claiming to both be the simpleclient. There's also no backwards incompatible ABI changes here (I'm always careful with that), so users can roll forwards/backwards as long as they don't write code that uses any of the new features. Ultimately this is a change which has been carefully considered and publicly communicated for well over a year, including how users can migrate in this grace period. This is the PR that ends that ends the grace period Java, so if users choose to upgrade then they get the OM data model and new metric names with that. If they want to stay as-is, they're free to do that too. |
Without having reviewed the code or knowing Java, I'd tend to agree with @brian-brazil here if it's true that the binary author is in control of all dependency versions in Java, and if it's indeed a lot of extra effort to make the behavior configurable within the same version. But again, this is just my 2c without having taken a deep look at things. |
Thanks for this pull request. By looking at the code, it seems that we still support prometheus text format. So it seems reasonable to add a feature to not check for _total, _unit and keep the original provided name. And that if you opt-in for OpenMetrics, we serve OM when negociated and run the checks about unit and total. It also seems that the pre-om functions are not marked as |
Thanks for the discussion everyone, rollouts like these are always a bit tricky but all things considered I think this is the best path. |
Prometheus is changing the naming scheme of some of their metrics in order to be compatible with the OpenMetrics standard. Part of this means that all counters will have `_total` or `_created` as a suffix. There is no way to create a counter without a `_total` suffix. See: prometheus/client_java#615
@@ -209,6 +215,13 @@ public B namespace(String namespace) { | |||
this.namespace = namespace; | |||
return (B)this; | |||
} | |||
/** | |||
* Set the unit of the metric. Required. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't a unit optional, not required? From the OpenMetrics spec:
If non-empty, it MUST be a suffix of the MetricFamily name separated by an underscore.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed this is a copy&paste error, would you like to send a PR to fix this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. I've opened #619
It was incorrectly documented as required when initially introduced. See prometheus#615 (comment) Signed-off-by: Tommy Ludwig <8924140+shakuzen@users.noreply.github.com>
It was incorrectly documented as required when initially introduced. See #615 (comment) Signed-off-by: Tommy Ludwig <8924140+shakuzen@users.noreply.github.com>
Just hopping in to mention that this broke a few services in my company due to the backwards incompatible change. :( |
yup this broke stuff. |
Prometheus is changing the naming scheme of some of their metrics in order to be compatible with the OpenMetrics standard. Part of this means that all counters will have `_total` or `_created` as a suffix. There is no way to create a counter without a `_total` suffix. See: prometheus/client_java#615
We are blocked from upgrading to a newer version of Spring Boot because of this 🤷 Now our backlog involves fixing dozens, if not hundreds, of alerts and dashboards. |
We're also blocked from upgrading at Nubank. Fixing all of our alerts, dashboards, and integrations that consume prometheus data is going to be challenging. Would it be possible to provide a better migration path where the new behavior is configurable? |
Ack, thanks for the reports (feel free to drop more of them or +1 the issue, it lets us know how important this issue is). 🤗 We are aware of this and we are actively discussing the way to unblock you and the way forward with OM transitions here and with other Prometheus clients. |
hi folks! any updates on this? |
This switches the core data model to OM, adds support for negotiating and exposing OpenMetrics format, adds units, created time, and adds new direct instrumentation for Info and Enum.
This is done as gracefully as practical, but this will break users - in particular if a) they have counters lacking a
_total
suffix in which case the sample names will change or b) if they happen to be usingUNTYPED
which is now calledUNKNOWN
in which case they'll need up update their code.