
feat(java): support Lance Spark Batch Write #2500

Merged: 4 commits into lancedb:main on Sep 12, 2024
Conversation


@LuQQiu LuQQiu commented Jun 21, 2024

Add Lance Spark batch write support.
Add a simple Lance catalog to support creating tables.
Batch write writes chunks to an ArrowReader, which are passed zero-copy to Lance core.

TODO: add test cases for InternalRowWriterArrowReader
TODO: larger-scale manual testing
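For orientation, end-to-end usage of this feature would look roughly like the sketch below. This is hedged: the `"lance"` format short name and the `path` option key are assumptions for illustration, not something this PR excerpt pins down.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class LanceBatchWriteExample {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("lance-batch-write")
        .master("local[*]")
        .getOrCreate();

    // Any DataFrame works; a trivial one keeps the example short.
    Dataset<Row> df = spark.range(1000).toDF("id");

    // Batch write through the DataSource V2 path added in this PR.
    df.write()
        .format("lance")                       // assumed format short name
        .option("path", "/tmp/example.lance")  // assumed option key
        .save();

    spark.stop();
  }
}
```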

The github-actions bot added the enhancement (New feature or request) and java labels on Jun 21, 2024.

import java.util.Map;

public class LanceCatalog implements TableCatalog {
LuQQiu (author):
Please review
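For readers new to Spark's DataSource V2 catalog API, the surface a LanceCatalog has to cover looks like the skeleton below. This is a minimal, hedged sketch with placeholder behavior only, not this PR's implementation; `SketchLanceCatalog` and `SketchTable` are made-up names.

```java
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;
import org.apache.spark.sql.connector.catalog.Identifier;
import org.apache.spark.sql.connector.catalog.Table;
import org.apache.spark.sql.connector.catalog.TableCapability;
import org.apache.spark.sql.connector.catalog.TableCatalog;
import org.apache.spark.sql.connector.catalog.TableChange;
import org.apache.spark.sql.connector.expressions.Transform;
import org.apache.spark.sql.types.StructType;
import org.apache.spark.sql.util.CaseInsensitiveStringMap;

public class SketchLanceCatalog implements TableCatalog {
  @Override public void initialize(String name, CaseInsensitiveStringMap options) {}
  @Override public String name() { return "lance"; }

  @Override public Identifier[] listTables(String[] namespace) {
    throw new UnsupportedOperationException("listing is out of scope for this sketch");
  }

  @Override public Table loadTable(Identifier ident) {
    // Placeholder: a real catalog would open the dataset behind ident and read its schema.
    return new SketchTable(ident.name(), new StructType());
  }

  @Override public Table createTable(Identifier ident, StructType schema,
      Transform[] partitions, Map<String, String> properties) {
    // Placeholder: a real catalog would create an empty Lance dataset here.
    return new SketchTable(ident.name(), schema);
  }

  @Override public Table alterTable(Identifier ident, TableChange... changes) {
    throw new UnsupportedOperationException();
  }
  @Override public boolean dropTable(Identifier ident) { return false; }
  @Override public void renameTable(Identifier from, Identifier to) {
    throw new UnsupportedOperationException();
  }

  /** Minimal table stub advertising batch-write capability. */
  static class SketchTable implements Table {
    private final String name;
    private final StructType schema;
    SketchTable(String name, StructType schema) { this.name = name; this.schema = schema; }
    @Override public String name() { return name; }
    @Override public StructType schema() { return schema; }
    @Override public Set<TableCapability> capabilities() {
      return EnumSet.of(TableCapability.BATCH_WRITE);
    }
  }
}
```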

/**
 * A custom Arrow reader that accepts writes of Spark internal rows while the data is read back in batches.
 */
public class InternalRowWriterArrowReader extends ArrowReader {
LuQQiu (author):
Please review

Reviewer:
It's not clear what this class does. RowWriter vs. ArrowReader?

Can we have a better name?
Is this used by spark.write or spark.read?

Reviewer:
Also, can we make this class package-private?
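To the naming question above: the class appears to be a bridge where Spark's write path is the producer and Lance's Fragment.create is the consumer. A hedged sketch of that shape, with illustrative names and internals rather than the PR's actual code:

```java
import java.io.IOException;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.vector.ipc.ArrowReader;
import org.apache.arrow.vector.ipc.message.ArrowRecordBatch;
import org.apache.arrow.vector.types.pojo.Schema;

/** Producer/consumer bridge: Spark writes batches in, Fragment.create reads them out. */
public class SketchRowWriterArrowReader extends ArrowReader {
  private final BlockingQueue<ArrowRecordBatch> queue = new LinkedBlockingQueue<>();
  private final Schema schema;
  private volatile boolean finished = false;

  public SketchRowWriterArrowReader(BufferAllocator allocator, Schema schema) {
    super(allocator);
    this.schema = schema;
  }

  /** Producer side: the Spark DataWriter hands over a completed batch. */
  public void writeBatch(ArrowRecordBatch batch) {
    queue.add(batch);
  }

  /** Producer side: commit calls this to signal that no more data will arrive. */
  public void setFinished() {
    finished = true;
  }

  /** Consumer side: Fragment.create pulls batches until the producer finishes. */
  @Override
  public boolean loadNextBatch() throws IOException {
    prepareLoadNextBatch();
    try {
      while (true) {
        ArrowRecordBatch batch = queue.poll(10, TimeUnit.MILLISECONDS);
        if (batch != null) {
          loadRecordBatch(batch); // fills getVectorSchemaRoot() and releases the batch
          return true;
        }
        if (finished && queue.isEmpty()) {
          return false;
        }
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IOException("interrupted while waiting for the write side", e);
    }
  }

  @Override
  public long bytesRead() {
    return 0; // see the discussion further down on how (or whether) to track this
  }

  @Override
  protected void closeReadSource() {
    // Nothing to close; batches are owned by the queue until loaded.
  }

  @Override
  protected Schema readSchema() {
    return schema;
  }
}
```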

import java.util.concurrent.ExecutionException;
import java.util.concurrent.FutureTask;

public class LanceDataWriter implements DataWriter<InternalRow> {
LuQQiu (author):
Please review
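For context on where this class sits, the Spark DataWriter contract it fulfills is shown below; the bodies are placeholder comments describing the PR's flow as discussed in this thread, not its actual code.

```java
import java.io.IOException;
import org.apache.spark.sql.catalyst.InternalRow;
import org.apache.spark.sql.connector.write.DataWriter;
import org.apache.spark.sql.connector.write.WriterCommitMessage;

public class SketchLanceDataWriter implements DataWriter<InternalRow> {
  @Override
  public void write(InternalRow record) throws IOException {
    // Real code appends the row's fields to the current Arrow batch,
    // which the reader thread consumes through the arrow reader above.
  }

  @Override
  public WriterCommitMessage commit() throws IOException {
    // Real code signals end-of-data on the arrow reader, waits for
    // Fragment.create to finish, and reports the resulting fragment(s).
    return new WriterCommitMessage() {};
  }

  @Override
  public void abort() throws IOException {
    // Real code stops the reader thread and discards partial output.
  }

  @Override
  public void close() throws IOException {
    // Real code releases Arrow buffers and the allocator.
  }
}
```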

import static org.junit.jupiter.api.Assertions.assertEquals;
import static org.junit.jupiter.api.Assertions.assertThrows;

public class SparkWriteTest {
LuQQiu (author):
Please review

    }
    arrowReader.setFinished();
  } catch (Exception e) {
    e.printStackTrace();
Reviewer:
This one swallows the exception?
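One way to avoid swallowing it, given the FutureTask import visible in this diff: let the background task's exception surface through get() on the calling side. A self-contained illustration of the mechanism (not the PR's code):

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.FutureTask;

public class SurfaceReaderThreadFailure {
  public static void main(String[] args) throws InterruptedException {
    FutureTask<Void> task = new FutureTask<>(() -> {
      throw new IllegalStateException("failure on the reader thread");
    });
    new Thread(task, "reader-thread").start();
    try {
      task.get(); // rethrows the task's failure wrapped in ExecutionException
    } catch (ExecutionException e) {
      // Instead of printStackTrace() on the background thread, the caller
      // sees the failure and can abort the Spark write task.
      System.out.println("caught: " + e.getCause());
    }
  }
}
```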

// If the following code impacts performance, it can be removed
VectorSchemaRoot root = this.getVectorSchemaRoot();
VectorUnloader unloader = new VectorUnloader(root);
try (ArrowRecordBatch recordBatch = unloader.getRecordBatch()) {
Reviewer:
A RecordBatch is just consumed to update the totalBytesRead.

LuQQiu (author):
Yeah, there is no better way to compute bytes read. An alternative approach is to have bytesRead() throw UnsupportedOperationException; in my testing, bytesRead is never called.
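The alternative mentioned here, as a method-level sketch that would drop into the reader class (wording hypothetical):

```java
@Override
public long bytesRead() {
  // Nothing in the write path called this in testing; fail loudly if that
  // changes rather than paying for a VectorUnloader pass on every batch.
  throw new UnsupportedOperationException(
      "bytesRead is not tracked by InternalRowWriterArrowReader");
}
```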

}

@Override
public WriterCommitMessage commit() throws IOException {
Reviewer:
Is this committed at the end of the full write? If so, we will accumulate too much data in the arrowReader.

LuQQiu (author):
The current sequence is (sketched below):

  • create the arrow reader
  • start the reader thread and execute Fragment.create(reader)
  • write to the arrow reader
  • the reader thread (Fragment.create) consumes the data in batches
  • commit, which signals that there is no more new data and waits for the reader thread (Fragment.create) to finish
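That sequence, condensed into a runnable shape with FutureTask (which the imports in this diff suggest is roughly the mechanism). The spinning consumer stands in for Fragment.create draining the reader; all names here are illustrative.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.FutureTask;
import java.util.concurrent.atomic.AtomicBoolean;

public class WriteLifecycleSketch {
  public static void main(String[] args) throws InterruptedException {
    AtomicBoolean finished = new AtomicBoolean(false);

    // Steps 1-2: create the reader state and start the consumer thread.
    // The task body stands in for Fragment.create(reader), which keeps
    // draining the reader until it observes setFinished().
    FutureTask<String> fragmentTask = new FutureTask<>(() -> {
      while (!finished.get()) {
        Thread.onSpinWait(); // real code blocks inside loadNextBatch()
      }
      return "fragment-metadata";
    });
    new Thread(fragmentTask, "fragment-create").start();

    // Steps 3-4: the Spark DataWriter writes rows while the thread consumes
    // them in batches. (Elided here.)

    // Step 5: commit signals end-of-data, then waits for the consumer.
    finished.set(true);
    try {
      System.out.println("committed " + fragmentTask.get());
    } catch (ExecutionException e) {
      // A failure in Fragment.create surfaces here instead of being lost.
      throw new RuntimeException("Fragment.create failed", e.getCause());
    }
  }
}
```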

return getSchema(config.getTablePath());
}

public static Optional<StructType> getSchema(String tablePath) {
Reviewer:
This does not look right. We have both a table concept and a dataset concept here; we need to be consistent.

Also, can we just call this class LanceDataset?

LuQQiu (author):
This class exists to provide a central place for the allocator and to keep all methods that require the allocator in one spot (see the sketch below).

LanceTable has been renamed to LanceDataset, which lets users read or write a single Lance dataset.
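The pattern being described, sketched; class and method bodies here are illustrative, and only getSchema appears in the diff above.

```java
import java.util.Optional;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.spark.sql.types.StructType;

/** One class owns the allocator; everything that needs it lives alongside. */
public final class SketchLanceDatasetAdapter {
  private static final BufferAllocator ALLOCATOR = new RootAllocator();

  private SketchLanceDatasetAdapter() {}

  public static Optional<StructType> getSchema(String datasetUri) {
    // Real code opens the dataset with ALLOCATOR and converts its Arrow
    // schema to a Spark StructType; empty when the dataset does not exist.
    return Optional.empty();
  }

  public static void createDataset(String datasetUri, StructType sparkSchema) {
    // Real code converts sparkSchema to Arrow and creates an empty dataset,
    // again through the shared ALLOCATOR.
  }
}
```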

}
}

public static void createTable(String datasetUri, StructType sparkSchema) {
Reviewer:
Be consistent about dataset vs. table.

LuQQiu (author):
Changed to use dataset


LuQQiu commented Jul 2, 2024

@eddyxu PTAL, thanks


LuQQiu commented Sep 12, 2024

Merging this one for now.
Spark write has the following main TODOs (a sketch of the first item follows below):

  1. Commit fragments once for all executors in a Spark write job, instead of committing separately.
  2. Support FixedSizeList for Lance vector columns.
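A hedged sketch of what TODO 1 could look like on the driver side: each task returns its fragments in a WriterCommitMessage, and the driver commits them all in one operation. The FragmentCommitMessage type and the commitToLance call are hypothetical stand-ins, not APIs from this PR.

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.spark.sql.connector.write.BatchWrite;
import org.apache.spark.sql.connector.write.DataWriterFactory;
import org.apache.spark.sql.connector.write.PhysicalWriteInfo;
import org.apache.spark.sql.connector.write.WriterCommitMessage;

public class SketchLanceBatchWrite implements BatchWrite {
  /** Hypothetical per-task message carrying the fragment(s) a task produced. */
  static class FragmentCommitMessage implements WriterCommitMessage {
    final List<String> fragmentIds = new ArrayList<>();
  }

  @Override
  public DataWriterFactory createBatchWriterFactory(PhysicalWriteInfo info) {
    throw new UnsupportedOperationException("out of scope for this sketch");
  }

  @Override
  public void commit(WriterCommitMessage[] messages) {
    // Gather every fragment written by every executor...
    List<String> allFragments = new ArrayList<>();
    for (WriterCommitMessage message : messages) {
      allFragments.addAll(((FragmentCommitMessage) message).fragmentIds);
    }
    // ...and append them to the dataset in a single commit (hypothetical call).
    commitToLance(allFragments);
  }

  @Override
  public void abort(WriterCommitMessage[] messages) {
    // Drop any fragments the tasks managed to write before the failure.
  }

  private void commitToLance(List<String> fragments) {
    // Placeholder for a single dataset commit with an append operation.
  }
}
```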

@LuQQiu LuQQiu merged commit ce1e61c into lancedb:main Sep 12, 2024
6 of 7 checks passed