[SPARK-22343][core] Add support for publishing Spark metrics into Prometheus #19775
Conversation
@matyix thanks for re-submitting!
Do we have to put this in Spark? Is it a necessary part of K8s? I think if we pull in that PR (#11994), then this can stay out of Spark as a package. Even without #11994, I believe users can still add their own metrics source/sink via the exposed SparkEnv/MetricsSystem. My concern is that this unnecessarily increases the code base of Spark core.
@jerryshao this PR is not Kubernetes specific; it's an extension of the Spark metrics system, which is already part of core. We could externalize it if PR #11994 ever gets merged; refactoring and externalizing it afterwards (as for the other metrics subsystems) is not a big job. Although I submitted this PR first on the K8s fork, this feature might actually benefit everyone using the (becoming) de-facto monitoring solution, Prometheus.
I agree this is useful to have, but @jerryshao is probably right: it is likely better to add extensibility into the metrics system. @jerryshao I'll review your PR; we can then come back to this one as needed.
Hello @felixcheung @jerryshao. PR #11994 generally looks good for adding extensibility to the metrics system. This PR (Prometheus) works with the changes proposed in #11994 out of the box, so no modification is needed on our side. As Prometheus is becoming a widely used monitoring system, I personally see a lot of value in including this in the Spark code base, same as the current metrics. I see PR #11994 rather as a solution for those who build custom, specialized metrics sinks.
Can one of the admins verify this patch?
Hello @erikerlandson @felixcheung @jerryshao, any feedback on this PR? Shall I close it and not worry about it being merged upstream anymore? We've been using this in production for the last 3 months, and it's a bit awkward that our CI/CD system needs to
 * This uses the PUT HTTP method.
 * @deprecated use {@link #push(CollectorRegistry, String, Map)}
 */
@Deprecated
This is a new class, why should we include these deprecated methods?
Removed deprecated methods.
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class TextFormatWithTimestamp {
No doc.
Added doc.
delete(job, Collections.singletonMap("instance", instance));
}

void doRequest(CollectorRegistry registry, String job, Map<String,
It should be private.
Changed it to private.
private static StringBuilder jsonMessageLogBuilder = new StringBuilder();

public static void write004(Writer writer,
No doc
Added doc.
/* See http://prometheus.io/docs/instrumenting/exposition_formats/
 * for the output format specification. */
while (mfs.hasMoreElements()) {
    Collector.MetricFamilySamples metricFamilySamples = mfs.nextElement();
I think `for (Collector.MetricFamilySamples s : Collections.list(mfs)) {` would be nicer.
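The suggested rewrite relies on `java.util.Collections.list`, which drains an `Enumeration` into an `ArrayList` so it can be used in an enhanced for-loop. A minimal stdlib-only sketch (using strings as a stand-in for the real `Enumeration<Collector.MetricFamilySamples>`):

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Enumeration;
import java.util.Vector;

public class EnumerationLoop {
    public static void main(String[] args) {
        // Stand-in for registry.metricFamilySamples(), which returns an Enumeration
        Enumeration<String> mfs =
                new Vector<>(Arrays.asList("jvm_memory_used", "jvm_gc_time")).elements();

        // Enhanced for-loop over the drained Enumeration, as suggested in review
        for (String sample : Collections.list(mfs)) {
            System.out.println(sample);
        }
    }
}
```

Note that `Collections.list` consumes the enumeration eagerly, which is fine for a metrics snapshot of bounded size.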
Also, method body is not indented well.
This class has been refactored.
case _ => Map("role" -> role)
}
Nit: extra line
Empty line removed.
override def start(): Unit = {
  sparkMetricExports.register(pushRegistry)
Nit: extra line
Empty line removed.
private[spark] class PrometheusSink(
    val property: Properties,
    val registry: MetricRegistry,
    securityMgr: SecurityManager)
`securityMgr` is never used.
The parameter list for a `Sink` is imposed by `MetricsSystem`, which instantiates the configured sinks (see https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/metrics/MetricsSystem.scala#L199). `PrometheusSink` doesn't need `SecurityManager`, which is why `securityMgr` is not used (similar to `CsvSink` and `ConsoleSink`).
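To illustrate why every sink must expose the same constructor shape: a metrics system that looks sinks up by class name from configuration must invoke one fixed constructor signature reflectively, so even unused parameters have to be accepted. The names below are simplified stand-ins, not Spark's actual `Sink`/`MetricsSystem` API:

```java
import java.util.Properties;

public class SinkReflection {
    // Simplified stand-in for org.apache.spark.metrics.sink.Sink
    public interface Sink { String start(); }

    // Like PrometheusSink: accepts the third argument only to satisfy the signature
    public static class DemoSink implements Sink {
        private final Properties props;
        public DemoSink(Properties props, Object registry, Object securityMgr) {
            this.props = props; // securityMgr is deliberately ignored
        }
        public String start() { return "started with " + props.size() + " props"; }
    }

    public static void main(String[] args) throws Exception {
        // MetricsSystem-style instantiation: the class name would come from
        // metrics.properties; the constructor shape is fixed by the framework.
        Sink sink = (Sink) Class.forName("SinkReflection$DemoSink")
                .getConstructor(Properties.class, Object.class, Object.class)
                .newInstance(new Properties(), new Object(), new Object());
        System.out.println(sink.start());
    }
}
```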
} catch (Exception ex) {
  logger.error("Sending metrics failed due to: ", ex);
}
Nit: extra line
Removed extra line.
}

finally {
  connection.disconnect();
`connection` can be null if `new URL(url).openConnection()` at line 272 threw an exception.
The `new URL(url).openConnection()` call is outside of the try-catch block, so if it throws an exception, `doRequest()` exits before ever reaching `finally { connection.disconnect(); }`.
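For the defensive variant discussed here (opening the connection inside the try), the `finally` block must guard against `connection` still being null. A self-contained sketch of that pattern, with a fake connection class standing in for `HttpURLConnection` (names are hypothetical, not the PR's actual `doRequest`):

```java
public class GuardedDisconnect {
    // Minimal stand-in for HttpURLConnection; only disconnect() matters here.
    static class FakeConnection {
        void disconnect() { /* release resources */ }
    }

    // Sketch of the doRequest() shape under discussion: if opening the
    // connection can throw inside the try, finally must null-check.
    static String doRequest(boolean failOnOpen) {
        FakeConnection connection = null;
        try {
            if (failOnOpen) {
                throw new java.io.IOException("openConnection failed");
            }
            connection = new FakeConnection();
            return "pushed";
        } catch (java.io.IOException ex) {
            return "Sending metrics failed due to: " + ex.getMessage();
        } finally {
            if (connection != null) {   // open may have thrown before assignment
                connection.disconnect();
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(doRequest(true));   // exercises the null guard
        System.out.println(doRequest(false));  // connection is released normally
    }
}
```

Without the null check, a failure in `openConnection()` would be masked by a `NullPointerException` thrown from `finally`.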
My original intention was to expose a MetricsSystem-related interface in #11994, so that users could leverage it to build their own metrics sink/source outside of Spark. Unfortunately I'm stuck on #11994, but I still think it is better to keep this as a package outside of Spark; pulling in too many dependencies for non-core functionality does not seem reasonable (just my thoughts).
Regardless of the discussion about whether this should live in Spark, does it need to live in core? Could it be kept in a separate module like the Ganglia one (even though that one is separate for licensing reasons)?
@smurakozi thank you for reviewing. The PR has been updated based on your comments.
I agree w/ @jerryshao that adding new deps to core isn't ideal. (Also that having #11994 would be really nice.)
Although this is not kube-specific, Kubernetes deployment is a major Prometheus use case. Has it been tested in a kube environment?
…nfigurable as not all versions of Pushgateway support metrics timestamps
@erikerlandson we tested this on Kubernetes using https://github.com/prometheus/pushgateway/tree/v0.3.1 and https://github.com/prometheus/pushgateway/tree/v0.4.0
So, where did we land on the discussion for this PR? Is this change getting in, or is it out?
@GaalDornick @erikerlandson @jerryshao @felixcheung et al. We gave up on this. We have made the requested changes several times, and I am not willing to put more time into this and get in the middle of a debate that is not my concern. The Spark monitoring architecture is what it is, and we wrote the PR to align with the current architecture of the existing sinks and metrics subsystem. What happened is that the debate is no longer about whether this is good or needed, but about whether it should be part of Spark core, be pluggable, whether the whole metrics subsystem should be refactored, etc. Most likely that will still be the case later; once those questions are nailed down or agreed by all parties, I can rework and resend the PR. Anyway, we (and our customers) have been using this in production for months. We have externalized it into a separate jar which we put on the classpath, so it does not need to be part of Spark (though I believe it should be, as Prometheus is one of the best open source monitoring frameworks). Should anybody need help using this sink, drop me a mail at janos@banzaicloud.com; I'm happy to help if you are interested in using Prometheus with Spark. We run some advanced scenarios with this sink and the code is all open sourced; you can read more about Monitoring Spark with Prometheus and Federated monitoring of multiple Spark clusters. Thanks for all the support.
For those who are still interested in using Prometheus, you can get the standalone package and source code from here: https://github.com/banzaicloud/spark-metrics. Happy monitoring; catch issues early and avoid those PagerDuty notifications :).
You can also try using https://github.com/andrusha/dropwizard-prometheus, which implements pull instead of push.
@andrusha Do you have a tutorial on how to set this up? From my understanding, if I have multiple executors, pulling is harder, as Prometheus has to know all the hostnames. Am I right or wrong?
You can use Kubernetes service discovery in conjunction with Prometheus to find all the available nodes for scraping. Then you just add the host and port in your annotations.
https://prometheus.io/docs/prometheus/latest/configuration/configuration/#kubernetes_sd_config is a good place to start.
I think it warrants a blog post, but I'm not sure I will ever write it.
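The approach described above can be sketched as a scrape configuration. This is a minimal example, assuming pods are annotated with the common `prometheus.io/scrape` and `prometheus.io/port` convention (the job name and annotation keys are illustrative, not from this PR):

```yaml
scrape_configs:
  - job_name: 'spark-executors'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # keep only pods annotated with prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # rewrite the scrape address to use the port from prometheus.io/port
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
```

With this in place, Prometheus discovers executor pods dynamically, so no hostnames need to be listed statically.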
Did we really miss out on Prometheus metric functionality because people couldn't just be cool about it?
What changes were proposed in this pull request?
Originally this PR was submitted to the Spark on K8s fork, but @erikerlandson and @foxish advised resending it upstream. K8s-specific items were removed from the PR, and it was reworked for the Apache version.
This PR publishes Spark metrics into Prometheus, as highlighted in the JIRA. It implements a metrics sink that publishes Spark metrics into Prometheus via the Prometheus Pushgateway. Metrics data published by Spark is based on Dropwizard; the format of Spark metrics is not supported natively by Prometheus, so they are converted using DropwizardExports prior to pushing them to the Pushgateway.
Also, the default Prometheus Pushgateway client API implementation does not support metrics timestamps, so the client API has been enhanced to enrich metrics data with timestamps.
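The timestamp enrichment mentioned above amounts to optionally appending a millisecond timestamp to each line of the Prometheus text exposition format (`metric_name{labels} value [timestamp]`). A stdlib-only sketch of that line formatting (a simplified illustration, not the PR's actual `TextFormatWithTimestamp` code):

```java
public class ExpositionLine {
    // Format one sample in the Prometheus text exposition format,
    // optionally suffixed with a millisecond timestamp.
    static String sampleLine(String name, String labels, double value, Long timestampMs) {
        String line = name + "{" + labels + "} " + value;
        return timestampMs == null ? line : line + " " + timestampMs;
    }

    public static void main(String[] args) {
        // Without a timestamp (what the stock client emits)
        System.out.println(sampleLine("jvm_heap_used", "instance=\"driver\"", 123.0, null));
        // With a timestamp (what the enhanced client emits)
        System.out.println(sampleLine("jvm_heap_used", "instance=\"driver\"", 123.0, 1510000000000L));
    }
}
```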
How was this patch tested?
This PR does not affect the existing code base or alter existing functionality. Nevertheless, I have executed all unit and integration tests. This setup has also been deployed and monitored via Prometheus (Prometheus 1.7.1 + Pushgateway 0.3.1): manual testing through deploying a Spark cluster, Prometheus server, and Pushgateway, and running SparkPi.