
feat(java): support Lance Spark Batch Write #2500

Merged: 4 commits into lancedb:main on Sep 12, 2024
Conversation


@LuQQiu LuQQiu commented Jun 21, 2024

Add Lance Spark batch write support.
Add a simple Lance catalog to support creating tables.
Batch write writes chunks to an ArrowReader, which are passed zero-copy to Lance core.

TODO: add test cases for InternalRowWriterArrowReader
TODO: larger-scale manual testing
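For orientation, end-to-end usage of this feature would look roughly like the sketch below. This is hedged: the `"lance"` format short name and the `path` option key are assumptions for illustration, not something this PR excerpt pins down.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class LanceBatchWriteExample {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("lance-batch-write")
        .master("local[*]")
        .getOrCreate();

    // Any DataFrame works; a trivial one keeps the example short.
    Dataset<Row> df = spark.range(1000).toDF("id");

    // Batch write through the DataSource V2 path added in this PR.
    df.write()
        .format("lance")                       // assumed format short name
        .option("path", "/tmp/example.lance")  // assumed option key
        .save();

    spark.stop();
  }
}
```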

The github-actions bot added the enhancement (New feature or request) and java labels on Jun 21, 2024.

import java.util.Map;

public class LanceCatalog implements TableCatalog {
LuQQiu (author):
Please review
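For readers new to Spark's DataSource V2 catalog API, the surface a LanceCatalog has to cover looks like the skeleton below. This is a minimal, hedged sketch with placeholder behavior only, not this PR's implementation; `SketchLanceCatalog` and `SketchTable` are made-up names.

```java
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;
import org.apache.spark.sql.connector.catalog.Identifier;
import org.apache.spark.sql.connector.catalog.Table;
import org.apache.spark.sql.connector.catalog.TableCapability;
import org.apache.spark.sql.connector.catalog.TableCatalog;
import org.apache.spark.sql.connector.catalog.TableChange;
import org.apache.spark.sql.connector.expressions.Transform;
import org.apache.spark.sql.types.StructType;
import org.apache.spark.sql.util.CaseInsensitiveStringMap;

public class SketchLanceCatalog implements TableCatalog {
  @Override public void initialize(String name, CaseInsensitiveStringMap options) {}
  @Override public String name() { return "lance"; }

  @Override public Identifier[] listTables(String[] namespace) {
    throw new UnsupportedOperationException("listing is out of scope for this sketch");
  }

  @Override public Table loadTable(Identifier ident) {
    // Placeholder: a real catalog would open the dataset behind ident and read its schema.
    return new SketchTable(ident.name(), new StructType());
  }

  @Override public Table createTable(Identifier ident, StructType schema,
      Transform[] partitions, Map<String, String> properties) {
    // Placeholder: a real catalog would create an empty Lance dataset here.
    return new SketchTable(ident.name(), schema);
  }

  @Override public Table alterTable(Identifier ident, TableChange... changes) {
    throw new UnsupportedOperationException();
  }
  @Override public boolean dropTable(Identifier ident) { return false; }
  @Override public void renameTable(Identifier from, Identifier to) {
    throw new UnsupportedOperationException();
  }

  /** Minimal table stub advertising batch-write capability. */
  static class SketchTable implements Table {
    private final String name;
    private final StructType schema;
    SketchTable(String name, StructType schema) { this.name = name; this.schema = schema; }
    @Override public String name() { return name; }
    @Override public StructType schema() { return schema; }
    @Override public Set<TableCapability> capabilities() {
      return EnumSet.of(TableCapability.BATCH_WRITE);
    }
  }
}
```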

/**
 * A custom Arrow reader that accepts writes of Spark internal rows while the data is read back in batches.
 */
public class InternalRowWriterArrowReader extends ArrowReader {
LuQQiu (author):
Please review

Reviewer:
It's not clear what this class does. RowWriter vs. ArrowReader?

Can we have a better name?
Is this used by spark.write or spark.read?

Reviewer:
Also, can we make this class package-private?
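To the naming question above: the class appears to be a bridge where Spark's write path is the producer and Lance's Fragment.create is the consumer. A hedged sketch of that shape, with illustrative names and internals rather than the PR's actual code:

```java
import java.io.IOException;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.vector.ipc.ArrowReader;
import org.apache.arrow.vector.ipc.message.ArrowRecordBatch;
import org.apache.arrow.vector.types.pojo.Schema;

/** Producer/consumer bridge: Spark writes batches in, Fragment.create reads them out. */
public class SketchRowWriterArrowReader extends ArrowReader {
  private final BlockingQueue<ArrowRecordBatch> queue = new LinkedBlockingQueue<>();
  private final Schema schema;
  private volatile boolean finished = false;

  public SketchRowWriterArrowReader(BufferAllocator allocator, Schema schema) {
    super(allocator);
    this.schema = schema;
  }

  /** Producer side: the Spark DataWriter hands over a completed batch. */
  public void writeBatch(ArrowRecordBatch batch) {
    queue.add(batch);
  }

  /** Producer side: commit calls this to signal that no more data will arrive. */
  public void setFinished() {
    finished = true;
  }

  /** Consumer side: Fragment.create pulls batches until the producer finishes. */
  @Override
  public boolean loadNextBatch() throws IOException {
    prepareLoadNextBatch();
    try {
      while (true) {
        ArrowRecordBatch batch = queue.poll(10, TimeUnit.MILLISECONDS);
        if (batch != null) {
          loadRecordBatch(batch); // fills getVectorSchemaRoot() and releases the batch
          return true;
        }
        if (finished && queue.isEmpty()) {
          return false;
        }
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IOException("interrupted while waiting for the write side", e);
    }
  }

  @Override
  public long bytesRead() {
    return 0; // see the discussion further down on how (or whether) to track this
  }

  @Override
  protected void closeReadSource() {
    // Nothing to close; batches are owned by the queue until loaded.
  }

  @Override
  protected Schema readSchema() {
    return schema;
  }
}
```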

import java.util.concurrent.ExecutionException;
import java.util.concurrent.FutureTask;

public class LanceDataWriter implements DataWriter<InternalRow> {
LuQQiu (author):
Please review
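For context on where this class sits, the Spark DataWriter contract it fulfills is shown below; the bodies are placeholder comments describing the PR's flow as discussed in this thread, not its actual code.

```java
import java.io.IOException;
import org.apache.spark.sql.catalyst.InternalRow;
import org.apache.spark.sql.connector.write.DataWriter;
import org.apache.spark.sql.connector.write.WriterCommitMessage;

public class SketchLanceDataWriter implements DataWriter<InternalRow> {
  @Override
  public void write(InternalRow record) throws IOException {
    // Real code appends the row's fields to the current Arrow batch,
    // which the reader thread consumes through the arrow reader above.
  }

  @Override
  public WriterCommitMessage commit() throws IOException {
    // Real code signals end-of-data on the arrow reader, waits for
    // Fragment.create to finish, and reports the resulting fragment(s).
    return new WriterCommitMessage() {};
  }

  @Override
  public void abort() throws IOException {
    // Real code stops the reader thread and discards partial output.
  }

  @Override
  public void close() throws IOException {
    // Real code releases Arrow buffers and the allocator.
  }
}
```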

import static org.junit.jupiter.api.Assertions.assertEquals;
import static org.junit.jupiter.api.Assertions.assertThrows;

public class SparkWriteTest {
LuQQiu (author):
Please review

    }
    arrowReader.setFinished();
  } catch (Exception e) {
    e.printStackTrace();
Reviewer:
This one swallows the exception?
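One way to avoid swallowing it, given the FutureTask import visible in this diff: let the background task's exception surface through get() on the calling side. A self-contained illustration of the mechanism (not the PR's code):

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.FutureTask;

public class SurfaceReaderThreadFailure {
  public static void main(String[] args) throws InterruptedException {
    FutureTask<Void> task = new FutureTask<>(() -> {
      throw new IllegalStateException("failure on the reader thread");
    });
    new Thread(task, "reader-thread").start();
    try {
      task.get(); // rethrows the task's failure wrapped in ExecutionException
    } catch (ExecutionException e) {
      // Instead of printStackTrace() on the background thread, the caller
      // sees the failure and can abort the Spark write task.
      System.out.println("caught: " + e.getCause());
    }
  }
}
```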

// If the following code impacts performance, it can be removed
VectorSchemaRoot root = this.getVectorSchemaRoot();
VectorUnloader unloader = new VectorUnloader(root);
try (ArrowRecordBatch recordBatch = unloader.getRecordBatch()) {
Reviewer:
A RecordBatch is just consumed to update the totalBytesRead.

LuQQiu (author):
Yeah, there is no better way to compute bytes read. An alternative approach is to have bytesRead() throw UnsupportedOperationException; in my testing, bytesRead is never called.
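The alternative mentioned here, as a method-level sketch that would drop into the reader class (wording hypothetical):

```java
@Override
public long bytesRead() {
  // Nothing in the write path called this in testing; fail loudly if that
  // changes rather than paying for a VectorUnloader pass on every batch.
  throw new UnsupportedOperationException(
      "bytesRead is not tracked by InternalRowWriterArrowReader");
}
```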

}

@Override
public WriterCommitMessage commit() throws IOException {
Reviewer:
Is this committed at the end of the full write? If so, we will accumulate too much data in the arrowReader.

LuQQiu (author):
The current sequence is (sketched below):

  • create the arrow reader
  • start the reader thread and execute Fragment.create(reader)
  • write to the arrow reader
  • the reader thread (Fragment.create) consumes the data in batches
  • commit, which signals that there is no more new data and waits for the reader thread (Fragment.create) to finish
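That sequence, condensed into a runnable shape with FutureTask (which the imports in this diff suggest is roughly the mechanism). The spinning consumer stands in for Fragment.create draining the reader; all names here are illustrative.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.FutureTask;
import java.util.concurrent.atomic.AtomicBoolean;

public class WriteLifecycleSketch {
  public static void main(String[] args) throws InterruptedException {
    AtomicBoolean finished = new AtomicBoolean(false);

    // Steps 1-2: create the reader state and start the consumer thread.
    // The task body stands in for Fragment.create(reader), which keeps
    // draining the reader until it observes setFinished().
    FutureTask<String> fragmentTask = new FutureTask<>(() -> {
      while (!finished.get()) {
        Thread.onSpinWait(); // real code blocks inside loadNextBatch()
      }
      return "fragment-metadata";
    });
    new Thread(fragmentTask, "fragment-create").start();

    // Steps 3-4: the Spark DataWriter writes rows while the thread consumes
    // them in batches. (Elided here.)

    // Step 5: commit signals end-of-data, then waits for the consumer.
    finished.set(true);
    try {
      System.out.println("committed " + fragmentTask.get());
    } catch (ExecutionException e) {
      // A failure in Fragment.create surfaces here instead of being lost.
      throw new RuntimeException("Fragment.create failed", e.getCause());
    }
  }
}
```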

return getSchema(config.getTablePath());
}

public static Optional<StructType> getSchema(String tablePath) {
Reviewer:
This does not look right. We have both a table concept and a dataset concept here; we need to be consistent.

Also, can we just call this class LanceDataset?

LuQQiu (author):
This class exists to provide a central place for the allocator and to keep all methods that require the allocator in one spot (see the sketch below).

LanceTable has been renamed to LanceDataset, which lets users read or write a single Lance dataset.
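The pattern being described, sketched; class and method bodies here are illustrative, and only getSchema appears in the diff above.

```java
import java.util.Optional;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.spark.sql.types.StructType;

/** One class owns the allocator; everything that needs it lives alongside. */
public final class SketchLanceDatasetAdapter {
  private static final BufferAllocator ALLOCATOR = new RootAllocator();

  private SketchLanceDatasetAdapter() {}

  public static Optional<StructType> getSchema(String datasetUri) {
    // Real code opens the dataset with ALLOCATOR and converts its Arrow
    // schema to a Spark StructType; empty when the dataset does not exist.
    return Optional.empty();
  }

  public static void createDataset(String datasetUri, StructType sparkSchema) {
    // Real code converts sparkSchema to Arrow and creates an empty dataset,
    // again through the shared ALLOCATOR.
  }
}
```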

}
}

public static void createTable(String datasetUri, StructType sparkSchema) {
Reviewer:
Be consistent about dataset vs. table.

LuQQiu (author):
Changed to use dataset


LuQQiu commented Jul 2, 2024

@eddyxu PTAL, thanks


LuQQiu commented Sep 12, 2024

Merging this one for now.
Spark write has the following main TODOs (a sketch of the first item follows below):

  1. Commit fragments once for all executors in a Spark write job, instead of committing separately.
  2. Support FixedSizeList for Lance vector columns.
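A hedged sketch of what TODO 1 could look like on the driver side: each task returns its fragments in a WriterCommitMessage, and the driver commits them all in one operation. The FragmentCommitMessage type and the commitToLance call are hypothetical stand-ins, not APIs from this PR.

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.spark.sql.connector.write.BatchWrite;
import org.apache.spark.sql.connector.write.DataWriterFactory;
import org.apache.spark.sql.connector.write.PhysicalWriteInfo;
import org.apache.spark.sql.connector.write.WriterCommitMessage;

public class SketchLanceBatchWrite implements BatchWrite {
  /** Hypothetical per-task message carrying the fragment(s) a task produced. */
  static class FragmentCommitMessage implements WriterCommitMessage {
    final List<String> fragmentIds = new ArrayList<>();
  }

  @Override
  public DataWriterFactory createBatchWriterFactory(PhysicalWriteInfo info) {
    throw new UnsupportedOperationException("out of scope for this sketch");
  }

  @Override
  public void commit(WriterCommitMessage[] messages) {
    // Gather every fragment written by every executor...
    List<String> allFragments = new ArrayList<>();
    for (WriterCommitMessage message : messages) {
      allFragments.addAll(((FragmentCommitMessage) message).fragmentIds);
    }
    // ...and append them to the dataset in a single commit (hypothetical call).
    commitToLance(allFragments);
  }

  @Override
  public void abort(WriterCommitMessage[] messages) {
    // Drop any fragments the tasks managed to write before the failure.
  }

  private void commitToLance(List<String> fragments) {
    // Placeholder for a single dataset commit with an append operation.
  }
}
```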

@LuQQiu LuQQiu merged commit ce1e61c into lancedb:main Sep 12, 2024
6 of 7 checks passed