feat(parsers.avro): Add Apache Avro parser plugin #11816

Merged (103 commits) on Mar 2, 2023

Contributor

@athornton athornton commented Sep 15, 2022

Required for all PRs

resolves #1630

This is a replacement for #7732, since the original author (@emanuele-falzone) has gone silent.

This builds on Emanuele Falzone's work to allow ingestion from Avro serialized format. It can either connect to a schema registry or a schema can be specified in the parser.

@telegraf-tiger telegraf-tiger bot added the feat Improvement on an existing feature such as adding a new setting/mode to an existing plugin label Sep 15, 2022
@athornton athornton mentioned this pull request Sep 15, 2022
3 tasks
@athornton athornton force-pushed the features/avro branch 14 times, most recently from b6222b0 to f6948dc Compare September 18, 2022 19:26
@athornton athornton changed the title feat(plugins/parser): add Apache Avro parsing feat(plugins.parser): add Apache Avro parsing Sep 19, 2022
@athornton athornton force-pushed the features/avro branch 4 times, most recently from 5ebe86e to abb157b Compare September 20, 2022 15:56
Member

@srebhan srebhan left a comment

Thanks @athornton for reviving this parser! I have some comments, nothing too big. The only part that concerns me is the addition of time-formats. Please avoid adding those and rather add a round_timestamps_to option in your parser, avoiding the combinatorics of formats and rounding.

@srebhan srebhan self-assigned this Sep 20, 2022
@srebhan srebhan added the plugin/parser 1. Request for new parser plugins 2. Issues/PRs that are related to parser plugins label Sep 20, 2022
@srebhan srebhan changed the title feat(plugins.parser): add Apache Avro parsing feat(parsers.avro): Add Apache Avro parser plugin Sep 20, 2022
@srebhan
Member

srebhan commented Sep 20, 2022

@athornton please also rebase to latest master to get CircleCI back functional.

@athornton
Contributor Author

Thank you for the detailed review. I'll get to work on it. I have no objection to a round_timestamps_to config item rather than my initial implementation (the reason to round at all basically comes down to https://docs.influxdata.com/influxdb/v2.4/reference/faq/#does-the-precision-of-the-timestamp-matter , and I at least find it much easier to eyeball the data if all digits past the precision we care about are zero rather than the deterministic-but-basically-random stuff that the conversion gives us).

@athornton
Contributor Author

@srebhan :

OK, I think I see conceptually what you're saying: all the convenience tools where I take the JSON representations of the schema and the message, and then call jsonToAvroMessage to generate the Avro format input, should be replaced by a binary input message (the output of jsonToAvroMessage) and a simple test of whether that works? Although I could put the schema or even both the schema and the message into telegraf.conf, in actual use the schema will be externally-given (almost always, it will come from a schema registry), and obviously the message is coming in over the wire.

So it feels like we want a much simpler test, of "Avro format binary data" as the input...but if we want more test cases at some point, I don't want to throw away the tooling to create those messages, because in practice, generating the test data will be done by matching a schema and message and generating the Avro data from them, rather than generating the wire protocol by hand. Where should that tooling go?

@srebhan
Member

srebhan commented Feb 28, 2023

> OK, I think I see conceptually what you're saying: all the convenience tools where I take the JSON representations of the schema and the message, and then call jsonToAvroMessage to generate the Avro format input, should be replaced by a binary input message (the output of jsonToAvroMessage) and a simple test of whether that works? Although I could put the schema or even both the schema and the message into telegraf.conf, in actual use the schema will be externally-given (almost always, it will come from a schema registry), and obviously the message is coming in over the wire.

Exactly. Put the binary messages there and let the file input read them.

> So it feels like we want a much simpler test, of "Avro format binary data" as the input...but if we want more test cases at some point, I don't want to throw away the tooling to create those messages, because in practice, generating the test data will be done by matching a schema and message and generating the Avro data from them, rather than generating the wire protocol by hand. Where should that tooling go?

I don't think that tool should be in Telegraf. It's not Telegraf's task to create those messages. If we add further test cases, they will likely be based on bug reports, so it would be nice if the parser could print the binary message it receives, and maybe even the schema, as debug messages on error. We have added this for a few other plugins, e.g. the gNMI one, to be able to reproduce problems in tests...

@athornton
Contributor Author

OK. That's the approach I'll take, then. I'll make my own little tools repository to assemble the messages into binary format and put those in testcases. Something else I thought of and will probably add: since I allow the user, in telegraf.conf, to specify the schema either directly as a string or via a schema registry endpoint, it's probably worth documenting that there's no reason the endpoint can't be a file:/// URL if the user has an external schema file rather than an Avro schema registry.

@athornton
Contributor Author

Hmm. It's not quite that simple: messages may arrive as raw Avro binary data, as Avro single-object-encoding data, or in Confluent wire format. So the parser will work on binary data, and if a schema registry is specified it will expect Confluent format. So no explanatory comment yet.

However, I think I have the test suite rewritten now.
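
For context on the Confluent case above: the Confluent wire format prefixes the Avro payload with a zero magic byte followed by a 4-byte big-endian schema ID, which is what makes the registry lookup possible. A minimal Go sketch of splitting such a message follows; the function name `splitConfluentMessage` is hypothetical and this is not taken from the plugin.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// splitConfluentMessage is a hypothetical helper that separates the 5-byte
// Confluent header (magic byte 0x00 plus big-endian schema ID) from the
// Avro-encoded payload that follows it.
func splitConfluentMessage(msg []byte) (schemaID uint32, payload []byte, err error) {
	if len(msg) < 5 {
		return 0, nil, errors.New("message too short for Confluent wire format")
	}
	if msg[0] != 0x00 {
		return 0, nil, fmt.Errorf("unexpected magic byte 0x%02x", msg[0])
	}
	return binary.BigEndian.Uint32(msg[1:5]), msg[5:], nil
}

func main() {
	// Header for schema ID 42 followed by two made-up payload bytes.
	msg := []byte{0x00, 0x00, 0x00, 0x00, 0x2a, 0x02, 0x06}

	id, payload, err := splitConfluentMessage(msg)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("schema ID %d, %d payload bytes\n", id, len(payload))
}
```

Raw Avro binary data carries no such header, which is why the schema has to be supplied in the configuration in that case.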

Member

@srebhan srebhan left a comment

Awesome update @athornton! Just a few very small comments and then we are good to go I think...

Co-authored-by: Sven Rebhan <36194019+srebhan@users.noreply.github.com>

Apply review suggestions

Update plugins/parsers/avro/parser_test.go

Fail immediately if config or Init() error.

Co-authored-by: Sven Rebhan <36194019+srebhan@users.noreply.github.com>
@athornton
Contributor Author

@srebhan I think we're ready.

Member

@srebhan srebhan left a comment

Thank you very much for driving this PR @athornton! Good job!

@srebhan srebhan added the ready for final review This pull request has been reviewed and/or tested by multiple users and is ready for a final review. label Mar 1, 2023
@srebhan srebhan assigned powersj and unassigned srebhan Mar 1, 2023
Contributor

@powersj powersj left a comment

@athornton - huge thank you for your persistence on this PR. I have some questions in line.

@powersj
Contributor

powersj commented Mar 2, 2023

@athornton thanks for the updates, I think we are down to two open questions:

  1. The purpose of DefaultTags versus using the built-in method for defining tags for an input.
  2. Whether getSchemaAndCodec should be run on every Parse.

Thanks!

@athornton
Contributor Author

athornton commented Mar 2, 2023

So, I think we may be done? The schema lookup (when you have a schema registry) has to be done at each Parse(), but after the initial retrieval it's just a map lookup, so shouldn't be too costly. Using toml:"tags" for the default tags looks like it should work.
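
A minimal Go sketch of why the per-Parse lookup stays cheap, assuming a cache keyed by schema ID. The `schemaCache` type and its `fetch` callback are hypothetical stand-ins for the registry client, not the plugin's actual types.

```go
package main

import (
	"fmt"
	"sync"
)

// schemaCache is a hypothetical cache: the registry is contacted only the
// first time a schema ID is seen; later calls are plain map lookups.
type schemaCache struct {
	mu      sync.Mutex
	schemas map[uint32]string
	fetch   func(id uint32) (string, error) // stand-in for a registry request
}

func (c *schemaCache) get(id uint32) (string, error) {
	c.mu.Lock()
	defer c.mu.Unlock()

	if s, ok := c.schemas[id]; ok {
		return s, nil
	}
	s, err := c.fetch(id)
	if err != nil {
		return "", err
	}
	c.schemas[id] = s
	return s, nil
}

func main() {
	calls := 0
	cache := &schemaCache{
		schemas: make(map[uint32]string),
		fetch: func(id uint32) (string, error) {
			calls++
			return fmt.Sprintf(`{"type":"record","name":"schema%d","fields":[]}`, id), nil
		},
	}

	// Three messages with the same schema ID trigger a single fetch.
	for i := 0; i < 3; i++ {
		if _, err := cache.get(42); err != nil {
			fmt.Println("error:", err)
			return
		}
	}
	fmt.Println("registry fetches:", calls) // prints "registry fetches: 1"
}
```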

@athornton athornton requested a review from powersj March 2, 2023 17:47
Contributor

@powersj powersj left a comment

Thanks for driving and contributing the new parser!

@powersj powersj merged commit acd1500 into influxdata:master Mar 2, 2023
@athornton athornton deleted the features/avro branch March 2, 2023 18:26
@srebhan srebhan added this to the v1.26.0 milestone Jun 21, 2023
Successfully merging this pull request may close these issues.

Support Avro format for kafka producer and consumer?
7 participants