Document the use of SMTs to customize _id field #83

emlaver · 2022-07-19T01:14:13Z

Checklist

Tick to sign-off your agreement to the Developer Certificate of Origin (DCO) 1.1
Added tests for code changes or test/build only changes
Updated the change log file (CHANGES.md|CHANGELOG.md) or test/build only changes
Completed the PR template below:

Description

fixes #65

Approach

Modify the ConnectRecordMapper to check for the presence of a custom header on the record and use the value of that header as the ID.
Use the HeaderFrom transform to convert a key to the header, avoiding the need for us to have any custom SMT.
Add README section for using SMTs to rename, replace, or filter out _id field. Also document how to remove tombstone records.
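The header check in the first point can be sketched roughly as below. This is a minimal stand-in that uses plain `Map` objects instead of Kafka Connect's `ConnectRecord` and `Headers` types so it runs without the Kafka libraries. The class and method names are hypothetical, the header name `cloudant_doc_id` is the one settled on later in this review, and the choice to overwrite an existing `_id` is an assumption rather than confirmed connector behaviour.

```java
import java.util.HashMap;
import java.util.Map;

public class DocIdFromHeader {
    // Header name as discussed in this PR; the real constant lives in ConnectRecordMapper.
    static final String HEADER_DOC_ID_KEY = "cloudant_doc_id";

    /** Returns the header value if it is a String, otherwise null. */
    static String headerDocId(Map<String, Object> headers) {
        Object value = headers.get(HEADER_DOC_ID_KEY);
        return (value instanceof String) ? (String) value : null;
    }

    /** Copies the document and sets _id from the header when a usable value is present. */
    static Map<String, Object> applyDocId(Map<String, Object> doc, Map<String, Object> headers) {
        Map<String, Object> result = new HashMap<>(doc);
        String id = headerDocId(headers);
        if (id != null) {
            // Assumption: the header value replaces any _id already in the document.
            result.put("_id", id);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Object> doc = new HashMap<>();
        doc.put("hello", "Message_1");

        Map<String, Object> headers = new HashMap<>();
        headers.put(HEADER_DOC_ID_KEY, "value1");

        // The resulting document carries both the original field and the _id.
        System.out.println(applyDocId(doc, headers));
        // Without the header the document is returned unchanged.
        System.out.println(applyDocId(doc, new HashMap<>()));
    }
}
```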

Schema & API Changes

"No change"

Security and Privacy

"No change"

Testing

Monitoring and Logging

"No change"

For reviewers:

This example requires a conditional SMT to drop the field if _id == null. I need to re-review the SMT docs to see if we can filter on the value of fields.
I've yet to successfully test the header to _id changes in a local Kafka environment. Kafka 3.2.0 added the option to set headers when using kafka-console-producer.sh. If you don't have 3.2.0 or later then you'll have to install and use kcat.
I'm using the example:

echo 'header:value\t{"test":"1", "try": 0, "time": true}'  | ./bin/kafka-console-producer.sh --topic kafka_test2 --bootstrap-server localhost:9092 --property "parse.headers=true" -

And this throws the error:

Caused by: org.apache.kafka.connect.errors.DataException: Only Map objects supported in absence of schema for [header move], found: null
	at org.apache.kafka.connect.transforms.util.Requirements.requireMap(Requirements.java:38)
	at org.apache.kafka.connect.transforms.HeaderFrom.applySchemaless(HeaderFrom.java:161)
	at org.apache.kafka.connect.transforms.HeaderFrom.apply(HeaderFrom.java:113)
	at org.apache.kafka.connect.runtime.TransformationChain.lambda$apply$0(TransformationChain.java:50)
	at org.apache.kafka.connect.runtime.TransformationChain$$Lambda$613/0x0000000000000000.call(Unknown Source)
	at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execAndRetry(RetryWithToleranceOperator.java:156)
	at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execAndHandleError(RetryWithToleranceOperator.java:190)
	... 14 more


ricellis

For the error

Caused by: org.apache.kafka.connect.errors.DataException: Only Map objects supported in absence of schema for [header move], found: null
	at org.apache.kafka.connect.transforms.util.Requirements.requireMap(Requirements.java:38)
	at org.apache.kafka.connect.transforms.HeaderFrom.applySchemaless(HeaderFrom.java:161)
	at org.apache.kafka.connect.transforms.HeaderFrom.apply(HeaderFrom.java:113)

It looks like only Map is supported for the message key when there is no schema, so you either need to make your test message have a map key or add a schema to it.

README.md

src/main/java/com/ibm/cloudant/kafka/schema/ConnectRecordMapper.java


emlaver · 2022-07-20T03:03:07Z

> For the error
>
> Caused by: org.apache.kafka.connect.errors.DataException: Only Map objects supported in absence of schema for [header move], found: null
> 	at org.apache.kafka.connect.transforms.util.Requirements.requireMap(Requirements.java:38)
> 	at org.apache.kafka.connect.transforms.HeaderFrom.applySchemaless(HeaderFrom.java:161)
> 	at org.apache.kafka.connect.transforms.HeaderFrom.apply(HeaderFrom.java:113)
>
> It looks like only Map is supported for the message key when there is no schema, so you either need to make your test message have a map key or add a schema to it.

Right, I realized my mistake. I was producing a record that had headers and a value but no key.
Using the Java producer API with a Kafka "map" key of "{\"docid\":\"value1\"}", I successfully created the bulk doc JSON body {"docs":[{"hello":"Message_1","_id":"value1"}]}. This was with key.converter=org.apache.kafka.connect.json.JsonConverter set in the config.
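To illustrate why the keyless record failed while the map key worked, here is a simplified model of a schemaless "move a key field to a header" step. It is not Kafka's `HeaderFrom` implementation: the class and method names are hypothetical, it uses plain maps, and skipping a missing field is an assumption of this sketch. It only mirrors the requirement, seen in the stack trace, that the key be a `Map` when there is no schema.

```java
import java.util.HashMap;
import java.util.Map;

public class SchemalessHeaderMove {
    /**
     * Moves one field from a schemaless key into the headers map and
     * returns the remaining key. Rejects keys that are not Maps.
     */
    static Map<String, Object> moveKeyFieldToHeader(Object key, String field, String header,
                                                    Map<String, Object> headers) {
        if (!(key instanceof Map)) {
            throw new IllegalArgumentException(
                "Only Map objects supported in absence of schema, found: "
                    + (key == null ? "null" : key.getClass().getName()));
        }
        Map<String, Object> remainingKey = new HashMap<>();
        for (Map.Entry<?, ?> e : ((Map<?, ?>) key).entrySet()) {
            remainingKey.put(String.valueOf(e.getKey()), e.getValue());
        }
        // Assumption of this sketch: a missing field leaves key and headers untouched.
        if (remainingKey.containsKey(field)) {
            headers.put(header, remainingKey.remove(field));
        }
        return remainingKey;
    }

    public static void main(String[] args) {
        Map<String, Object> headers = new HashMap<>();

        // A map key such as {"docid":"value1"} can be moved to the header.
        Map<String, Object> key = new HashMap<>();
        key.put("docid", "value1");
        moveKeyFieldToHeader(key, "docid", "cloudant_doc_id", headers);
        System.out.println(headers);

        // A record produced without a key has a null key and is rejected.
        try {
            moveKeyFieldToHeader(null, "docid", "cloudant_doc_id", headers);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```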

ricellis

+1 though I'd advocate for adding the empty check you proposed, plus a couple of nits.

src/main/java/com/ibm/cloudant/kafka/schema/ConnectRecordMapper.java

src/test/java/com/ibm/cloudant/kafka/schema/ConnectRecordMapperTests.java

emlaver · 2022-07-20T14:43:09Z

README.md

+3. If you have messages where the `_id` field is absent or `null` then Cloudant will generate
+a document ID. If you don't want this to happen then set an `_id` (see earlier examples).
+Alternatively filter out those documents. For example if you have messages where the `_id`
+field is `null` then you'll need to use a transform and predicate to filter out and remove this
+field:
+    ```
+    TODO 
+    ```
+


I did some more digging and the current built-in SMT predicates and filter won't handle either a) dropping the field or b) filtering out the document if _id == null.
At this point, I wouldn't want to delay this PR by trying to create a custom SMT. What if we update this to:

Suggested change

-3. If you have messages where the `_id` field is absent or `null` then Cloudant will generate
-a document ID. If you don't want this to happen then set an `_id` (see earlier examples).
-Alternatively filter out those documents. For example if you have messages where the `_id`
-field is `null` then you'll need to use a transform and predicate to filter out and remove this
-field:
-    ```
-    TODO
-    ```
+3. If you have messages where the `_id` field is absent or `null` then Cloudant will generate
+a document ID. If you don't want this to happen then set an `_id` (see earlier examples).
+If you need to filter out those documents or drop `_id` fields when the value is `null` then you'll need to create a custom SMT.

I don't think this needs to be a numbered point. This could be a note at the end of this section.

> I wouldn't want to delay this PR by trying to create a custom SMT

Agreed, I don't think it is a use case that warrants us delivering a custom SMT for it anyway.

> I don't think this needs to be a numbered point. This could be a note at the end of this section.

I think it can stay a numbered point, but I might move it to the bottom of the list.

emlaver · 2022-07-20T15:05:04Z

README.md

+
+   **Note**: The `header.converter` is required to be set to `StringConverter` since the document ID field only supports strings.
+
+**Note**: For any of the SMTs above, if the field does not exist it will skip over that message and continue processing the next message.


wdyt:

Suggested change

-**Note**: For any of the SMTs above, if the field does not exist it will skip over that message and continue processing the next message.
+**Note**: For any of the SMTs above, if the field does not exist it will leave the message unmodified and continue processing the next message.


tomblench

Looks good, just a few minor issues especially around the README.

README.md

tomblench · 2022-07-22T09:16:44Z

src/test/java/com/ibm/cloudant/kafka/schema/ConnectRecordMapperTests.java

@@ -45,6 +59,22 @@ public void testConvertToMapNoSchema() {
        assertEquals("world", converted.get("hello"));
    }

+    @Test


What do you think about having a negative test, e.g. where the header value is not a string, to show that conversion doesn't blow up?

I've added two tests in 097f306:

- Convert to map with no schema, an existing _id field, and an invalid map header. Assert that the _id field is unchanged.
- Convert to struct with an invalid boolean header. Assert that the _id field is null.

tomblench · 2022-07-22T09:17:08Z

src/main/java/com/ibm/cloudant/kafka/schema/ConnectRecordMapper.java

+    private String getHeaderForDocId(ConnectRecord<R> record) {
+        Header value = record.headers().lastWithName(HEADER_DOC_ID_KEY);
+        if (value != null && value.value() instanceof String) {
+            return value.value().toString();


Could also be a cast to String but it's a style issue as they both achieve the same thing.

I'd prefer to keep this as-is
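For reference, the two points in this thread can be checked with a small stand-alone sketch. It takes the header value as a plain `Object` rather than a Kafka Connect `Header`, and the class and method names are hypothetical. It shows that, behind the `instanceof String` guard, `toString()` and a cast return the same value, and that non-string header values (such as the map and boolean used in the negative tests) are ignored instead of throwing.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class HeaderValueToString {
    /** Mirrors the style in the diff: guard with instanceof, then toString(). */
    static String viaToString(Object headerValue) {
        if (headerValue != null && headerValue instanceof String) {
            return headerValue.toString();
        }
        return null;
    }

    /** The alternative raised in review: guard with instanceof, then cast. */
    static String viaCast(Object headerValue) {
        if (headerValue instanceof String) {
            return (String) headerValue;
        }
        return null;
    }

    public static void main(String[] args) {
        // A valid string, a boolean, a map, and a missing value.
        List<Object> samples = Arrays.asList(
            "value1", Boolean.TRUE, Collections.singletonMap("a", "b"), null);
        for (Object sample : samples) {
            String a = viaToString(sample);
            String b = viaCast(sample);
            // Both styles agree for every input; only the String yields a doc ID.
            System.out.println(sample + " -> " + a + " / " + b);
        }
    }
}
```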


README.md

emlaver added 4 commits June 17, 2022 11:45

Add README section with SMT examples for customizing _id field

17c961a

Updated statement and example around handling _id fields with null value

690494d

Update SMT doc section to include what transform to use to remove tom…

fa99c35

…bstone records WIP decide if we want custom conditional SMT that drops _id field if _id == null

Modify the ConnectRecordMapper to check for the presence of a custom …

64875ff

…header on the record and use the value of that header as the ID. - Document the existing HeaderFrom transform to convert a key to the header

emlaver self-assigned this Jul 19, 2022

emlaver mentioned this pull request Jul 19, 2022

Document the use of SMTs to customize _id field #82

Closed

4 tasks

ricellis reviewed Jul 19, 2022


Move duplicate if/else to end of function

49037e0

- Rename header to "cloudant_doc_id" - Add README priority order section, rename header

ricellis approved these changes Jul 20, 2022


Address Rich's comments

3f6bb3b

emlaver commented Jul 20, 2022


emlaver requested a review from tomblench July 20, 2022 14:45

Updated access level for header constant

02b8e74

emlaver commented Jul 20, 2022


Update README

e9137b9

- Add numbered point at the end of section about handling _id fields that are null - Update numbering - Fix final note

ricellis approved these changes Jul 20, 2022


tomblench reviewed Jul 22, 2022


emlaver added 3 commits July 22, 2022 09:47

Addressed Tom's README updates

6c628ad

Update record -> event

fe82937

Add two tests to assert that invalid headers will not cause an except…

097f306

…ion and the event is left unmodified

tomblench reviewed Jul 22, 2022


tomblench approved these changes Jul 22, 2022


emlaver merged commit 947c8d6 into master Jul 22, 2022

emlaver deleted the 65-sink-id-transform branch July 22, 2022 14:14

ricellis added this to the 0.200.0 milestone Sep 20, 2022
