Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: expose JMX metric that classifies an error state #5374

Merged
merged 2 commits into from
May 18, 2020

Conversation

agavra
Copy link
Contributor

@agavra agavra commented May 15, 2020

Description

This PR introduces JMX metrics that drill query errors down into more specific categories: UNKNOWN, USER and SYSTEM (open to better names) when an uncaught error is raised from KafkaStreams.

Review Guide

There's a lot in this PR, I can split it up if it helps the review but it's not that many lines of code so I figured it's digestible with a little guide:

  • QueryError a throwable paired with a classification of that error (see above)
  • ErrorStateListener extends StateListener to add an additional method that reacts to QueryError when its raised
  • QueryErrorClassifier classifies exceptions; it also supports chaining so we can easily introduce other ones
  • MissingTopicClassifier will check if any required topics are missing, and if so classifies the error as USER
  • I moved registering the uncaught exceptions into the QueryMetadata where I have more information than in the KafkaStreamsBuilder
  • The rest is wiring

Testing done

Caused a failure by deleting a necessary topic (foo) and then deleting the kafka streams local dir to trigger a faster error. In the logs:

[2020-05-15 10:47:35,201] ERROR Unhandled exception caught in streams thread _confluent-ksql-def
ault_query_CTAS_COUNTS_0-8d6e554e-a19b-4c68-abe7-eca385fc9630-StreamThread-3. (io.confluent.ksql
.engine.KsqlEngine:43)
org.apache.kafka.streams.errors.ProcessorStateException: task directory [/tmp/kafka-streams/_con
fluent-ksql-default_query_CTAS_COUNTS_0/0_0] doesn't exist and couldn't be created
...
[2020-05-15 10:47:35,880] WARN Query _confluent-ksql-default_query_CTAS_COUNTS_0 requires topic Aggregate-GroupBy-repartition which cannot be found. (io.confluent.ksql.query.MissingTopicClassifier:54)

I used jconsole to make sure that the JMX MBean metrics were properly set. Here's a picture of it below:

image

Reviewer checklist

  • Ensure docs are updated if necessary. (eg. if a user visible feature is being added or changed).
  • Ensure relevant issues are linked (description should include text like "Fixes #")

@agavra agavra force-pushed the query_error branch 2 times, most recently from e494636 to f5b48da Compare May 18, 2020 17:47
@agavra agavra marked this pull request as ready for review May 18, 2020 18:32
@agavra agavra requested a review from a team as a code owner May 18, 2020 18:32
Copy link
Contributor

@rodesai rodesai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This LGTM. How would we implement more complex classifiers? Is the idea to add more ways of combining classifiers (for example, add an or method)?

@@ -56,7 +56,7 @@ ksql.logging.processing.stream.auto.create=true
#------ External service config -------

# The set of Kafka brokers to bootstrap Kafka cluster information from:
bootstrap.servers=localhost:9092
bootstrap.servers=localhost:29092
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

meant to omit this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oops I thought I removed that! thanks

* @param other the other classifier to chain
* @return a {@code QueryErrorClassifier} that chains both
*/
default QueryErrorClassifier and(QueryErrorClassifier other) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is currently not used right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah it's not used yet - I'm planning on introducing a regex classifier so that you can chain them (e.g. use MissingTopic.and(SomeRegexClassifier).and(AnotherRegexClassifier)) this isn't really a boolean "and" but rather a "andThen" - I can rename it

@agavra agavra merged commit 52271bf into confluentinc:master May 18, 2020
@agavra agavra deleted the query_error branch May 18, 2020 22:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants