
Support for JSONConverter in sink connector #71

Merged · 15 commits into master from 62-kafka-connect-data-format · May 10, 2022

Conversation

@tomblench tomblench (Contributor) commented Apr 13, 2022

Checklist

  • Tick to sign-off your agreement to the Developer Certificate of Origin (DCO) 1.1
  • Added tests for code changes or test/build only changes
  • Updated the change log file (CHANGES.md|CHANGELOG.md) or test/build only changes
  • Completed the PR template below:

Description

See #62

Approach

Support JSONConverter by expecting values from Kafka sink tasks to be either a Java `Map` or a Kafka `Struct`:

  • If `Map`, pass through to `batchWrite` directly
  • If `Struct`, convert to a `Map` using the new `StructToMapConverter` (sketched below)

Add documentation in README.
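
A minimal sketch of that branching, for illustration only; the `structToMap` method here is a simplified stand-in for the new `StructToMapConverter` (it recurses into nested Structs but ignores arrays and logical types):

```java
import java.util.HashMap;
import java.util.Map;

import org.apache.kafka.connect.data.Field;
import org.apache.kafka.connect.data.Struct;
import org.apache.kafka.connect.errors.ConnectException;
import org.apache.kafka.connect.sink.SinkRecord;

public class SinkValueSketch {

    /** Normalise a sink record value into a Map suitable for batchWrite. */
    @SuppressWarnings("unchecked")
    static Map<String, Object> toMap(SinkRecord record) {
        Object value = record.value();
        if (value instanceof Map) {
            // JsonConverter without a schema envelope delivers a plain java.util.Map: pass through
            return (Map<String, Object>) value;
        } else if (value instanceof Struct) {
            // schema-backed converters deliver a Struct: convert it field by field
            return structToMap((Struct) value);
        }
        throw new ConnectException("Unsupported record value type: "
                + (value == null ? "null" : value.getClass().getName()));
    }

    /** Simplified stand-in for StructToMapConverter: copies each schema field into a Map. */
    static Map<String, Object> structToMap(Struct struct) {
        Map<String, Object> map = new HashMap<>();
        for (Field field : struct.schema().fields()) {
            Object fieldValue = struct.get(field);
            map.put(field.name(),
                    fieldValue instanceof Struct ? structToMap((Struct) fieldValue) : fieldValue);
        }
        return map;
    }
}
```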

Schema & API Changes

  • "No change"

Security and Privacy

  • "No change"

Testing

See added StructToMapConverterTests
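
As an illustration of the kind of assertion such tests make (not the actual test code; this reuses the hypothetical `structToMap` sketch from the Approach section above):

```java
import static org.junit.Assert.assertEquals;

import java.util.Map;

import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.data.SchemaBuilder;
import org.apache.kafka.connect.data.Struct;
import org.junit.Test;

public class StructToMapSketchTest {

    @Test
    public void convertsSimpleStruct() {
        // build a small schema-backed Struct and check it round-trips into a Map
        Schema schema = SchemaBuilder.struct()
                .field("_id", Schema.STRING_SCHEMA)
                .field("count", Schema.INT32_SCHEMA)
                .build();
        Struct struct = new Struct(schema)
                .put("_id", "doc1")
                .put("count", 42);

        Map<String, Object> map = SinkValueSketch.structToMap(struct);

        assertEquals("doc1", map.get("_id"));
        assertEquals(42, map.get("count"));
    }
}
```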

Monitoring and Logging

Added slf4j-simple to view log output when running tests

@tomblench tomblench force-pushed the 62-kafka-connect-data-format branch from 28305f7 to f03cd16 on April 26, 2022 07:35
@@ -116,6 +112,7 @@ public static JSONArray batchWrite(Map<String, String> props, JSONArray data)
result.put(jsonResult);
}
} catch (Exception e) {
LOG.error("Exception caught in batchWrite()", e);
Contributor Author (tomblench):

This may need to be revisited in another PR - the worrying thing is that we were just swallowing exceptions from the Cloudant client, which I had managed to trigger with a misconfigured test.

Member (ricellis):

Agreed, there needs to be a separate look at error handling to conform to the behaviour of the built-in `errors.tolerance=all` and `none` flags (`all` implies silently ignoring bad messages, so I guess that's all we have right now!).

Contributor (emlaver):

Should we open a ticket specifically for investigating and improving error handling?

Member (ricellis):

I made a note in my error handling epic. Strictly speaking we should iterate the result and push each failed document/message to the DLQ (or whatever error handling is configured), but I'm OK with us improving that later.
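
For reference, a sketch of the built-in Kafka Connect sink error-handling settings being discussed; the DLQ topic name is only an example:

```
errors.tolerance=all
errors.deadletterqueue.topic.name=cloudant-sink-errors
errors.deadletterqueue.context.headers.enable=true
errors.log.enable=true
```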

```
value.converter.schemas.enable=true
```

#### Converter configuration: sink connector
Contributor Author (tomblench):

The source connector converter needs covering when we do the PR for that work. My intention is that we support JsonConverter on both source and sink, which simplifies things (as mentioned above, it's the default anyway, so there's no need to explicitly set it in config).
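
For illustration only, explicit sink-side JsonConverter settings would look something like the following (since JsonConverter is the Kafka Connect default, this doesn't strictly need to be set):

```
value.converter=org.apache.kafka.connect.json.JsonConverter
# true: values with a schema envelope arrive at the sink task as Structs
# false: plain JSON values arrive as java.util.Maps
value.converter.schemas.enable=true
```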

@@ -88,13 +92,17 @@ public void testReplicateAll() throws Exception {
// - no offset
for (SourceRecord record : records) {

// source task returns strings but sink task expects structs or maps
// in a real kafka instance this would be fixed by using appropriate converters
Map recordValue = gson.fromJson((String)record.value(), Map.class);
Contributor Author (tomblench):

This is a bit awkward, but it's needed until we do the PR to support JsonConverter in the source connector, at which point this line can be removed since the source will return us a Map.

Contributor Author (tomblench):

@ricellis your idea of enabling schemas here (to return a struct/map instead of a string) didn't work because it caused various limits to be exceeded (memory, HTTP request size): the test payloads are complex, resulting in a huge inline schema per doc.

Member (ricellis):

> the test payloads are complex resulting in a huge inline schema per doc.

I'm fine with doing something other than the schemas.enable approach, but honestly this is a little concerning to me; yes, they do have 100 or so properties, but they don't look that complex. Maybe this is something we need to cover in QA to get a better handle on what is stressing it.

@@ -60,7 +63,8 @@ protected void setUp() throws Exception {
data = new JSONArray(tokener);

// Load data into the source database (create if it does not exist)
JavaCloudantUtil.batchWrite(sourceProperties, data);
JavaCloudantUtil.batchWrite(sourceProperties,
Contributor Author (tomblench):

There are a few instances of this awkward mapping from org.json in the test code. I've tried to make the changes as minimal as possible, but would love to get rid of that library altogether at a later date.

Member (ricellis):

Agreed, I've already got this noted down for later work. There are 3 different JSON libs hanging around in various places and we should narrow that down to use only the one brought by Kafka itself, or the cloudant-java-sdk, or maybe both (as they'll both be there anyway), but we definitely shouldn't rely on an extra third one.

@tomblench tomblench changed the title from "WIP - support for JSONConverter" to "Support for JSONConverter in sink connector" on Apr 26, 2022
README.md Outdated

Assume these settings in a file `connect-standalone.properties` or `connect-distributed.properties`.
Usually the kafka distribution defaults (`connect-(standalone|distributed).properties`) are as follows:
Contributor (emlaver):

Do we need to expand on this if we say that the values below are usually the defaults? Would it be any better if we said:
"The Kafka distribution defaults are typically as follows:"

Contributor Author (tomblench):

Done in ac8b55d

@ricellis ricellis (Member) left a comment:

Looks good on the whole, just a few minor suggestions.

@ricellis ricellis (Member) left a comment:

I am +1 now, but a couple of minor things

@ricellis ricellis added this to the 0.100.next milestone May 6, 2022
@emlaver emlaver (Contributor) left a comment:

Looks good!

@tomblench tomblench merged commit 4964afc into master May 10, 2022
@ricellis ricellis deleted the 62-kafka-connect-data-format branch June 13, 2022 16:41