Support FlintTable batch write #1653

penghuo · 2023-05-24T15:01:26Z

Description

Add FlintWriter in Flint core.
Support Batch Write in Spark.

Implementation Detail

FlintTable supports batch writes in overwrite mode. The write path goes through the FlintOpenSearchClient in the FlintCore package and uses the CREATE action in the OpenSearch bulk request. Users can specify an ID field in their options; if no ID is provided, OpenSearch generates one automatically. When writing to FlintCore, each document is handled as follows:

- If a document with the same ID already exists, the system skips this entry and does nothing.
- If no document with the same ID is found, the system indexes the new document.
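
The bulk request body described above can be sketched roughly as follows. This is only an illustration of the newline-delimited bulk format with the CREATE action, not the code in this PR: `BulkCreateSketch`, `actionLine`, `bulkBody`, and `quote` are hypothetical names, and each document is assumed to be already serialized to a JSON string.

```scala
object BulkCreateSketch {
  // Hypothetical helper: escapes backslashes and quotes so the ID is a valid JSON string.
  private def quote(s: String): String =
    "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\""

  // Builds the action line for one document. With an ID, an existing document
  // with the same ID is left untouched; without one, OpenSearch assigns the ID.
  def actionLine(id: Option[String]): String = id match {
    case Some(value) => s"""{"create":{"_id":${quote(value)}}}"""
    case None        => """{"create":{}}"""
  }

  // Each document contributes two newline-terminated lines: action, then source.
  // Assumption: `source` is a single-line JSON document.
  def bulkBody(docs: Seq[(Option[String], String)]): String =
    docs.map { case (id, source) => actionLine(id) + "\n" + source + "\n" }.mkString
}
```

For example, `BulkCreateSketch.bulkBody(Seq((Some("1"), """{"aInt":1}""")))` yields one action line carrying `_id` 1 followed by the document source.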

Why not use INDEX action

The INDEX action deletes any existing document with the same ID and then indexes the new document. Because Lucene may not physically remove the old document right away, the storage size can double.
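
The difference can be illustrated with a toy in-memory model. This is a deliberately simplified sketch, not how OpenSearch or Lucene is implemented: `CreateVsIndexSketch`, `Stored`, `create`, and `index` are hypothetical names, and the append-only vector with a tombstone flag only loosely stands in for deferred deletion.

```scala
object CreateVsIndexSketch {
  // Toy store: every write appends, and a delete only flips a tombstone flag.
  final case class Stored(id: String, body: String, deleted: Boolean)

  // CREATE: skip the write when a live document with this ID already exists.
  def create(store: Vector[Stored], id: String, body: String): Vector[Stored] =
    if (store.exists(d => d.id == id && !d.deleted)) store
    else store :+ Stored(id, body, deleted = false)

  // INDEX: tombstone the live copy (if any), then append the new document.
  def index(store: Vector[Stored], id: String, body: String): Vector[Stored] =
    store.map(d => if (d.id == id && !d.deleted) d.copy(deleted = true) else d) :+
      Stored(id, body, deleted = false)
}
```

In this model, rewriting an existing ID with `create` leaves the store at one entry, while `index` leaves two entries (one tombstoned, one live) until some later cleanup reclaims the space.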

Usage Example

import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

val schema = StructType(Seq(StructField("aInt", IntegerType)))
// "spark.flint.write.id.name" names the column used as the document ID.
val openSearchOptions = Map("host" -> "localhost", "port" -> "9200", "spark.flint.write.id.name" -> "aInt")
val df = spark.range(15).toDF("aInt")
df.coalesce(1).write.format("flint").options(openSearchOptions).mode("overwrite").save("t002")

Check List

New functionality includes testing.
- All tests pass, including unit test, integration test and doctest
New functionality has been documented.
- New functionality has javadoc added
- New functionality has user manual doc added
Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Peng Huo <penghuo@gmail.com>

codecov · 2023-05-24T15:21:00Z

Codecov Report

Merging #1653 (8c9db09) into feature/flint (7268b5e) will not change coverage.
The diff coverage is n/a.

@@               Coverage Diff                @@
##             feature/flint    #1653   +/-   ##
================================================
  Coverage            97.19%   97.19%           
  Complexity            4107     4107           
================================================
  Files                  371      371           
  Lines                10464    10464           
  Branches               706      706           
================================================
  Hits                 10170    10170           
  Misses                 287      287           
  Partials                 7        7

Flag	Coverage Δ
sql-engine	`97.19% <ø> (ø)`

Flags with carried forward coverage won't be shown. Click here to find out more.

penghuo · 2023-05-24T15:21:21Z

...spark-integration/src/main/scala/org/apache/spark/sql/flint/json/FlintJacksonGenerator.scala

+/**
+ * copy from spark {@link JacksonGenerator}.
+ */
+case class FlintJacksonGenerator(dataType: DataType, writer: Writer, options: JSONOptions) {


To reviewer:
This class is copied from Spark's JacksonGenerator; I did not find an easy way to use it directly. You only need to review the function I defined:
def writeAction(action: String, idOrdinal: Option[Int], row: InternalRow): Unit = {}

flint/flint-core/src/main/scala/org/opensearch/flint/core/storage/OpenSearchWriter.java

Signed-off-by: Peng Huo <penghuo@gmail.com>

dai-chen

Thanks for the changes!

penghuo added 2 commits May 23, 2023 18:07

add write support

183ada5

Signed-off-by: Peng Huo <penghuo@gmail.com>

add debug log

6dab0ab

Signed-off-by: Peng Huo <penghuo@gmail.com>

penghuo added the Flint label May 24, 2023

penghuo self-assigned this May 24, 2023

penghuo changed the title ~~Flint batch write pr~~ Flint - Support Batch Write May 24, 2023

penghuo mentioned this pull request Jul 11, 2023

[Feature] OpenSearch and Apache Spark Integration opensearch-project/opensearch-spark#3

Closed

penghuo commented May 24, 2023

penghuo marked this pull request as ready for review May 24, 2023 15:22

penghuo added the enhancement New feature or request label May 24, 2023

penghuo changed the title ~~Flint - Support Batch Write~~ Support FlintTable batch write May 24, 2023

dai-chen reviewed May 24, 2023

flint/flint-core/src/main/scala/org/opensearch/flint/core/storage/OpenSearchWriter.java

add IT for FlintWriter

8c9db09

Signed-off-by: Peng Huo <penghuo@gmail.com>

dai-chen approved these changes May 24, 2023

penghuo merged commit b720b84 into opensearch-project:feature/flint May 24, 2023
