Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

S3 Dest emits V2 fields and captures failures #42409

Merged
merged 13 commits into from
Aug 14, 2024

Conversation

johnny-schmidt
Copy link
Contributor

@johnny-schmidt johnny-schmidt commented Jul 22, 2024

What

Accomplishes the field-renaming part of making S3 V2 compliant.

  • S3 Syncs now include all V2 metadata fields
  • _airbyte_meta.changes includes conversion failures for avro and parquet

How

  • fields are added separately by each conversion format when the converter is so configured (useV2Field[Name]s: bool is passed to each converter
  • v2 json-to-avro converter uses the FieldConversionFailureListener (added to the json-avro-converter artifact here), which
    • nulls the field (by returning null for a new value)
    • stages a post-processing change to insert a NULLED change object into the record (after it's completely processed)
  • !!!FOR THREAD SAFETY!!! the json-to-avro converter is no longer a constant (b/c the listener keeps state!). It is initialized by AvroRecordFactory, which is indirectly initialized on buffer creation (ie, at the start of each flush)
  • json-to-avrò schema converter also takes useV2Fields
  • record creation methods now take an optional generationId
  • setup code builds a map of StreamDescriptor -> generationId and passes it to the s3 flush worker, which in turn passes generationId to the converters
  • Tests:
    • the existing retrieveRecords method stripped the meta fields, I changed this and added a shim so most test cases would still get the stripped rows
    • added one test case that validates the new field names and some change capture scenarios

Review Guide

In addition to validating the above, please look out for implications for other destinations that use s3, avro/parquet, and/or the test framework

User Impact

This is a breaking change. However it will be bundled with two more PRs that also contain breaking changes

Can this PR be safely reverted and rolled back?

  • YES 💚
  • NO ❌

Copy link

vercel bot commented Jul 22, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

1 Skipped Deployment
Name Status Preview Comments Updated (UTC)
airbyte-docs ⬜️ Ignored (Inspect) Visit Preview Aug 14, 2024 3:54pm

@octavia-squidington-iii octavia-squidington-iii added area/connectors Connector related issues CDK Connector Development Kit labels Jul 22, 2024
Copy link
Contributor Author

johnny-schmidt commented Jul 22, 2024

This stack of pull requests is managed by Graphite. Learn more about stacking.

Join @johnny-schmidt and the rest of your teammates on Graphite Graphite

@johnny-schmidt johnny-schmidt marked this pull request as ready for review July 22, 2024 20:00
@johnny-schmidt johnny-schmidt requested review from a team and edgao July 22, 2024 20:00
@@ -19,6 +32,8 @@ class NoFlatteningSheetGenerator : BaseSheetGenerator(), CsvSheetGenerator {

/** When no flattening is needed, the record column is just one json blob. */
override fun getRecordColumns(json: JsonNode): List<String> {
return listOf(Jsons.serialize(json))
val tmp = Jsons.serialize(json)
println("tmp: $tmp")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll remove this.

@johnny-schmidt johnny-schmidt force-pushed the epic-8349-certify-s3/issue-8554-s3-async-gt branch 2 times, most recently from 470eee5 to f5414ea Compare July 23, 2024 00:40
@johnny-schmidt johnny-schmidt force-pushed the epic-8349-certify-s3/issue-8556-s3-v2-fields-gt branch 2 times, most recently from 480dc23 to 285767f Compare July 23, 2024 00:41
@johnny-schmidt johnny-schmidt force-pushed the epic-8349-certify-s3/issue-8554-s3-async-gt branch from f5414ea to 5c9ea03 Compare July 23, 2024 00:50
@johnny-schmidt johnny-schmidt force-pushed the epic-8349-certify-s3/issue-8556-s3-v2-fields-gt branch from 285767f to 92cca0f Compare July 23, 2024 00:52
@@ -22,6 +25,29 @@ class AvroRecordFactory(private val schema: Schema?, private val converter: Json
companion object {
private val MAPPER: ObjectMapper = MoreMappers.initMapper()
private val WRITER: ObjectWriter = MAPPER.writer()

fun createV1JsonToAvroConverter(): JsonAvroConverter {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we mark this as deprecated?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Possibly. It's still in use by GCS at least, but should go away completely when that's done.


return converter!!.convertToGenericDataRecord(WRITER.writeValueAsBytes(jsonRecord), schema)
}

@Throws(JsonProcessingException::class)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

deprecate?

@@ -84,6 +91,7 @@ protected constructor(protected val outputFormat: FileUploadFormat) :
DateTime.now(DateTimeZone.UTC),
s3DestinationConfig.pathFormat!!,
)
println("outputPrefix: $outputPrefix")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do

}

run {
standardInput = System.in
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand this addition.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also for testing, will remove. (It's necessary to make the running gradle run task accept input from stdin.)

Copy link
Contributor

@stephane-airbyte stephane-airbyte left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with minor comments

Copy link
Contributor

@edgao edgao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems reasonable, had some comments/questions. The runtime code changes looked like mostly plumbing so I didn't read them in much detail; had some more questions in the test code. LGTM once everything is resolved

@@ -42,7 +42,7 @@ protected constructor(private val bufferStorage: BufferStorage) : SerializableBu
*/
@Deprecated("")
@Throws(IOException::class)
protected abstract fun writeRecord(record: AirbyteRecordMessage)
protected abstract fun writeRecord(record: AirbyteRecordMessage, generationId: Long = 0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

weird thought: should we take generation ID via the constructor? since it's constant across the entire sync

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I considered that, as well as changes necessary to make my code use the new methods.

I think that's the correct design. Threading things through is an anti pattern.

But the even more correct pattern is for this code to know nothing about airbyte fields, which I imagine is how it will work in a new CDK. So I punted on doing the work to rewrite anything.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, is it constant per sync or per stream?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah yeah, it's per stream. Isn't this object instanced per stream? (totally fair that this would be too much work for no payoff though)

)
AvroSerializedBuffer(createStorageFunction.call(), codecFactory, schema)
println("schema: $schema")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove this println / use a logger instance

Copy link
Contributor Author

@johnny-schmidt johnny-schmidt Jul 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a stray debug statement, which I'll remove

I thought we should not log client schemas?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

client data is bad to log, but we treat the schemas as non-sensitive (e.g. a lot of DB sources log out every column in every table for debugging purposes)

.name(JavaBaseConstants.COLUMN_NAME_AB_GENERATION_ID)
.type(Schema.create(Schema.Type.LONG))
.noDefault()
val changeSchema: Schema =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this looks identical to the schema in AvroFieldConversionFailureListener, probably better to reuse that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll try to migrate it to AvroConstants

json.put(JavaBaseConstants.COLUMN_NAME_AB_ID, UUID.randomUUID().toString())
json.put(JavaBaseConstants.COLUMN_NAME_EMITTED_AT, record.emittedAt)
}

@Deprecated("Deprecated in Java")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.... I know you didn't write this, but I assume this should just be deprecated everywhere? I'm assuming this method was deprecated before we introduced kotlin >.>

)

val jsons = MoreMappers.initMapper().createObjectNode()
// Iterate over every key value pair in the json node
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

checking my understanding: given this node:

{
  "_airbyte_raw_id": "foo",
  "_airbyte_data": {
    "id": 42
  }
}

we would return this jsons:

{
  "id": 42,
  "_airbyte_data": {
    "id": 42
  }
}

(assuming yes: are we intentionally promoting all the data fields, and also keeping the airbyte_data sub-object?)

... also, why not just this? is the idea to preserve non-airbyte top-level fields?

val newObjectNode = node["_airbyte_data"].deepCopy()
newObject.put("_airbyte_data", node["_airbyte_data"])

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're correct, but that's a mistake. I'm missing an else near the end.

The intention is to remove airbyte fields AND promote anything under _airbyte_data to the top. I was trying to replicate the effective behavior of all the client implementations of the former retrieveRecords method.

Your example should produce.

{
  "id": 42
}

@gisripa gisripa force-pushed the epic-8349-certify-s3/issue-8554-s3-async-gt branch from 5c9ea03 to 4d4e5db Compare July 24, 2024 17:53
@@ -42,7 +42,7 @@ protected constructor(private val bufferStorage: BufferStorage) : SerializableBu
*/
@Deprecated("")
@Throws(IOException::class)
protected abstract fun writeRecord(record: AirbyteRecordMessage)
protected abstract fun writeRecord(record: AirbyteRecordMessage, generationId: Long = 0)

/**
* TODO: (ryankfu) move destination to use serialized record string instead of passing entire
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Re: line 62]

Other consumers using the Async rely on this method. We were unable to delete the method on L45 because of S3 not moving to Async and remnants of old calls. Any chance we can achieve that in this PR ?

See this comment inline on Graphite.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as above.

@johnny-schmidt johnny-schmidt force-pushed the epic-8349-certify-s3/issue-8554-s3-async-gt branch from 4d4e5db to b7615bb Compare July 25, 2024 17:16
@johnny-schmidt johnny-schmidt force-pushed the epic-8349-certify-s3/issue-8556-s3-v2-fields-gt branch from 2ed6b8e to 32a40ca Compare July 25, 2024 17:16
@gisripa gisripa force-pushed the epic-8349-certify-s3/issue-8554-s3-async-gt branch from b7615bb to 4d4e5db Compare July 25, 2024 18:43
@johnny-schmidt johnny-schmidt force-pushed the epic-8349-certify-s3/issue-8556-s3-v2-fields-gt branch from f97e80b to 80ef98f Compare August 12, 2024 18:16
@gisripa gisripa force-pushed the epic-8349-certify-s3/issue-8556-s3-v2-fields-gt branch 2 times, most recently from 40277b3 to 51c23f8 Compare August 12, 2024 23:58
@johnny-schmidt johnny-schmidt force-pushed the epic-8349-certify-s3/issue-8556-s3-v2-fields-gt branch from 51c23f8 to 1844494 Compare August 13, 2024 15:34
@gisripa gisripa force-pushed the epic-8349-certify-s3/issue-8556-s3-v2-fields-gt branch from 1844494 to 33e3e0d Compare August 13, 2024 21:26
@johnny-schmidt johnny-schmidt force-pushed the epic-8349-certify-s3/issue-8556-s3-v2-fields-gt branch 2 times, most recently from 1844494 to 194365d Compare August 13, 2024 22:16
@octavia-squidington-iii octavia-squidington-iii added the area/documentation Improvements or additions to documentation label Aug 13, 2024
@johnny-schmidt johnny-schmidt force-pushed the epic-8349-certify-s3/issue-8556-s3-v2-fields-gt branch from 39aa845 to 1585419 Compare August 13, 2024 22:53
@johnny-schmidt johnny-schmidt force-pushed the epic-8349-certify-s3/issue-8556-s3-v2-fields-gt branch from 3acf1ff to 491ad7c Compare August 14, 2024 01:06
@johnny-schmidt johnny-schmidt force-pushed the epic-8349-certify-s3/issue-8556-s3-v2-fields-gt branch from 491ad7c to a626bae Compare August 14, 2024 15:54
@johnny-schmidt johnny-schmidt merged commit e9a37d4 into master Aug 14, 2024
31 of 32 checks passed
@johnny-schmidt johnny-schmidt deleted the epic-8349-certify-s3/issue-8556-s3-v2-fields-gt branch August 14, 2024 16:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/connectors Connector related issues area/documentation Improvements or additions to documentation CDK Connector Development Kit connectors/destination/s3
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants