
[SPARK-22343][core] Add support for publishing Spark metrics into Prometheus #19775

Closed
wants to merge 6 commits into from

Conversation

Member

@matyix matyix commented Nov 17, 2017

What changes were proposed in this pull request?

Originally this PR was submitted to the Spark on K8S fork here, but @erikerlandson and @foxish advised resubmitting it upstream. K8S-specific items were removed and the PR was reworked for the Apache version.

Publishing Spark metrics into Prometheus, as highlighted in the JIRA: this PR implements a metrics sink that publishes Spark metrics into Prometheus via the Prometheus Pushgateway. The metrics data published by Spark is based on Dropwizard; this format is not supported natively by Prometheus, so the metrics are converted using DropwizardExports prior to pushing them to the Pushgateway.
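For context, Spark sinks are enabled through a metrics.properties file. A Pushgateway-backed sink would be configured along these lines; the class and property names below are illustrative assumptions modeled on the shape of Spark's existing sinks, not confirmed by this diff:

```properties
# Hypothetical metrics.properties fragment: enable the sink for all instances.
*.sink.prometheus.class=org.apache.spark.metrics.sink.PrometheusSink
# Address of the Pushgateway the sink pushes to (illustrative key name).
*.sink.prometheus.pushgateway-address=pushgateway.example.com:9091
# Reporting period and unit, as with other Spark sinks.
*.sink.prometheus.period=10
*.sink.prometheus.unit=seconds
```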

Also, the default Prometheus Pushgateway client API implementation does not support metric timestamps, so the client API has been enhanced to enrich the metrics data with a timestamp.
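To make the timestamp enrichment concrete, here is a minimal sketch (my own illustration, not the PR's TextFormatWithTimestamp class) of rendering a single sample in the Prometheus 0.0.4 text exposition format, where an optional millisecond timestamp follows the value:

```scala
// Illustrative sketch only: render one metric sample in the Prometheus 0.0.4
// text exposition format, with a millisecond timestamp appended after the
// value, which is the enrichment described above.
object TimestampedTextFormat {
  def formatSample(
      name: String,
      labels: Map[String, String],
      value: Double,
      timestampMs: Long): String = {
    // Labels render as {k1="v1",k2="v2"}, or nothing when there are none.
    val labelPart =
      if (labels.isEmpty) ""
      else labels.map { case (k, v) => s"""$k="$v"""" }.mkString("{", ",", "}")
    s"$name$labelPart $value $timestampMs"
  }
}
```

For example, `formatSample("spark_jvm_heap_used", Map("instance" -> "driver"), 1234.0, 1510900000000L)` renders as `spark_jvm_heap_used{instance="driver"} 1234.0 1510900000000`.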

How was this patch tested?

This PR does not affect the existing code base or alter its functionality. Nevertheless, I have executed all unit and integration tests. This setup has also been deployed and monitored via Prometheus (Prometheus 1.7.1 + Pushgateway 0.3.1).

Manual testing was done by deploying a Spark cluster, a Prometheus server and a Pushgateway, then running SparkPi.

@matyix matyix changed the title Add support for publishing Spark metrics into Prometheus [SPARK-22343][core] Add support for publishing Spark metrics into Prometheus Nov 17, 2017
@erikerlandson
Contributor

@matyix thanks for re-submitting!

@jerryshao
Contributor

Do we have to put this in Spark? Is it a necessary part of k8s? I think if we pull in that PR (#11994), then this can stay out of Spark as a package. Even without #11994, I believe users can still add their own metrics sources/sinks via the exposed SparkEnv/MetricsSystem. My concern is that this unnecessarily increases the code base of Spark core.

@matyix
Member Author

matyix commented Nov 24, 2017

@jerryshao this PR is not Kubernetes specific; it's an extension of the Spark metrics system, which is already part of core. We could externalize it if PR #11994 above ever gets merged; refactoring and externalizing it afterwards (for all the other metrics subsystems as well) is not a big job. Although I submitted this PR first on the K8S fork, this feature might be beneficial for everyone using Prometheus, which is becoming the de-facto monitoring solution.

@felixcheung
Member

I agree this is useful to have; @jerryshao is probably right, though, that it is likely better to add extensibility to the Metrics system.

@jerryshao I'll review your PR
@matyix could you also review #11994 to see if it suits the need to build everything you have here for connecting to Prometheus, but external to Spark and on top of #11994? I think your feedback will be very valuable.

We can then come back to this PR as needed.

@matyix
Member Author

matyix commented Nov 27, 2017

Hello @felixcheung @jerryshao

The PR #11994 generally looks good for adding extensibility to the Metrics system. This PR (Prometheus) works with the changes proposed in PR #11994 out of the box, so no modification is needed on our side.

As Prometheus is becoming a widely used monitoring system, I personally see a lot of value in including this in the Spark code base, the same as the current metrics. I see PR #11994 rather as a solution for those who build custom, specialised metrics sinks.

@AmplabJenkins

Can one of the admins verify this patch?

@matyix
Member Author

matyix commented Feb 8, 2018

Hello @erikerlandson @felixcheung @jerryshao - any feedback on this PR? Shall I close it and stop worrying about it being merged upstream? We've been using this in production for the last 3 months, and it's a bit awkward that our CI/CD system needs to patch the upstream version all the time, but we can live with that (since it's automated). Please advise; I'm happy to help get it merged or to eventually just close it.

* This uses the PUT HTTP method.
* @deprecated use {@link #push(CollectorRegistry, String, Map)}
*/
@Deprecated
Contributor


This is a new class; why should we include these deprecated methods?


Removed deprecated methods.

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class TextFormatWithTimestamp {
Contributor


No doc.


Added doc.

delete(job, Collections.singletonMap("instance", instance));
}

void doRequest(CollectorRegistry registry, String job, Map<String,
Contributor


It should be private.


Changed it to private.


private static StringBuilder jsonMessageLogBuilder = new StringBuilder();

public static void write004(Writer writer,
Contributor


No doc


Added doc.

/* See http://prometheus.io/docs/instrumenting/exposition_formats/
* for the output format specification. */
while(mfs.hasMoreElements()) {
Collector.MetricFamilySamples metricFamilySamples = mfs.nextElement();
Contributor


I think for(Collector.MetricFamilySamples s: Collections.list(mfs)) { would be nicer.

Contributor


Also, method body is not indented well.


This class has been refactored.

case _ => Map("role" -> role)
}


Contributor


Nit: extra line


Empty line removed.


override def start(): Unit = {
sparkMetricExports.register(pushRegistry)

Contributor


Nit: extra line


Empty line removed.

private[spark] class PrometheusSink(
val property: Properties,
val registry: MetricRegistry,
securityMgr: SecurityManager)
Contributor


securityMgr is never used


@stoader stoader Feb 9, 2018


The parameter list for a Sink is imposed by MetricsSystem, which instantiates the configured sinks (see https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/metrics/MetricsSystem.scala#L199). PrometheusSink doesn't need the SecurityManager, which is why securityMgr is unused (similar to CsvSink and ConsoleSink).
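To illustrate the constructor contract being described, here is a self-contained sketch; the trait and classes below are hypothetical stand-ins for Spark's private Sink trait, Dropwizard's MetricRegistry, and Spark's SecurityManager, named after the real ones for illustration only:

```scala
import java.util.Properties

// Hypothetical stand-ins for the real Spark/Dropwizard types, so this
// sketch compiles on its own.
trait Sink {
  def start(): Unit
  def stop(): Unit
  def report(): Unit
}
class MetricRegistry
class SecurityManager

// MetricsSystem reflectively invokes a three-argument constructor of this
// shape, so a sink must accept a SecurityManager even if it never uses it
// (as is the case for CsvSink and ConsoleSink too).
class ExamplePrometheusSink(
    val property: Properties,
    val registry: MetricRegistry,
    securityMgr: SecurityManager) extends Sink {
  override def start(): Unit = ()
  override def stop(): Unit = ()
  override def report(): Unit = ()
}
```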

} catch (Exception ex) {
logger.error("Sending metrics failed due to: ", ex);
}

Contributor


Nit: extra line


Removed extra line.

}

finally {
connection.disconnect();
Contributor


connection can be null if new URL(url).openConnection() at line 272 threw an exception.


The new URL(url).openConnection() call is outside of the try block, so if it throws an exception it exits doRequest() before reaching finally { connection.disconnect(); }.
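A small sketch of the control flow being described, with a hypothetical Resource standing in for HttpURLConnection: because acquisition happens before the try, a failed open propagates out before the try/finally is ever entered, so disconnect() never runs on a null reference:

```scala
// Hypothetical Resource in place of HttpURLConnection, to show the ordering.
class Resource {
  var closed = false
  def disconnect(): Unit = closed = true
}

// Mirrors the shape of doRequest(): acquire outside the try, release in finally.
def doRequest(open: () => Resource)(use: Resource => Unit): Resource = {
  val connection = open() // if this throws, we exit before try/finally is entered
  try {
    use(connection)
  } finally {
    connection.disconnect() // connection is guaranteed non-null here
  }
  connection
}
```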

@jerryshao
Contributor

My original intention is to expose a MetricsSystem-related interface in #11994, so that users can leverage it to build their own metrics sinks/sources outside of Spark. Unfortunately I'm stuck on #11994, but I still think it is better to keep this as a package outside of Spark; pulling in too many dependencies for non-core functionality doesn't seem reasonable (just my thoughts).

@vanzin
Contributor

vanzin commented Feb 9, 2018

Regardless of the discussion about whether this should live in Spark, does it need to live in core?

Can it be kept in a separate module like the Ganglia stuff (even though that one is separate for licensing reasons)?

@stoader

stoader commented Feb 12, 2018

@smurakozi thank you for reviewing. The PR has been updated based on your comments.

@erikerlandson
Contributor

I agree w/ @jerryshao that adding new deps to core isn't ideal. (Also that having #11994 would be really nice.)
New deps in a sub-project seem more palatable, but I'm interested in what other Sparkers think.

@erikerlandson
Contributor

Although this is not kube-specific, kubernetes deployment is a major prometheus use case. Has it been tested in a kube environment?

@stoader

stoader commented Feb 17, 2018

@GaalDornick

So, where did we land on the discussion for this PR? Is this change getting in or is it out?

@matyix
Member Author

matyix commented Mar 8, 2018

@GaalDornick @erikerlandson @jerryshao @felixcheung et al.

We gave up on this. We have made the requested changes several times, and I am not willing to put more time into this and get caught in the middle of a debate that is not my concern. The Spark monitoring architecture is what it is, and we made this PR align with the current architecture of the existing sinks and the metrics subsystem. What happened is that the debate is no longer about whether this is good or needed, but about whether it should be part of Spark core, whether it should be pluggable, whether the whole metrics subsystem should be refactored, etc. Most likely this will still be the case later; once these questions are nailed down and agreed by all parties, I can rework and resend the PR.

Anyway, we (and our customers) have been using this in production for months. We have externalized it into a separate jar that we put on the classpath, so it does not need to be part of Spark (though I believe it should be, as Prometheus is one of the best open source monitoring frameworks).

Should anybody need help using this sink, drop me a mail at janos@banzaicloud.com; I'm happy to help if you are interested in using Prometheus with Spark. We do some advanced scenarios with this sink and the code is all open sourced; you can read more about Monitoring Spark with Prometheus and Federated monitoring of multiple Spark clusters.

Thanks for all the support.
Janos

@matyix matyix closed this Mar 8, 2018
@matyix
Member Author

matyix commented Mar 10, 2018

For those who are still interested in using Prometheus, you can get the standalone package and source code here: https://github.com/banzaicloud/spark-metrics. Happy monitoring; try to catch issues early and avoid those PagerDuty notifications :).

@andrusha
Contributor

You can also try using https://github.com/andrusha/dropwizard-prometheus, which implements pull instead of push.

@lony

lony commented May 16, 2018

@andrusha Do you have a tutorial on how to set this up? From my understanding, if I have multiple executors, pulling is harder, as Prometheus has to know all the hostnames. Am I right or wrong?

@andrusha
Contributor

andrusha commented May 24, 2018 via email

@mageru

mageru commented Aug 3, 2018

Did we really miss out on Prometheus metric functionality because people couldn't just be cool about it?
