# Java bindings and SDK for Lance Data Format

> :warning: **Under heavy development**

<div align="center">
<p align="center">

<img width="257" alt="Lance Logo" src="https://user-images.githubusercontent.com/917119/199353423-d3e202f7-0269-411d-8ff2-e747e419e492.png">

Lance is a new columnar data format for data science and machine learning
</p></div>

Why you should use Lance:
1. It is an order of magnitude faster than Parquet for point queries and the nested data structures common in DS/ML
2. It comes with a fast vector index that delivers sub-millisecond nearest-neighbor search performance
3. It is automatically versioned and supports lineage and time travel for full reproducibility
4. It is already integrated with DuckDB, pandas, and Polars; converting from or to Parquet takes two lines of code

## Quick start

Add the Lance Java SDK Maven dependency (the latest version is recommended):

```xml
<dependency>
  <groupId>com.lancedb</groupId>
  <artifactId>lance-core</artifactId>
  <version>0.18.0</version>
</dependency>
```
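
If you build with Gradle instead, the equivalent declaration (same coordinates as the Maven snippet above) would be:

```groovy
dependencies {
    implementation 'com.lancedb:lance-core:0.18.0'
}
```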

### Basic I/O

* create an empty dataset

```java
void createDataset() throws IOException {
  String datasetPath = ""; // specify a path for the new dataset
  Schema schema =
      new Schema(
          Arrays.asList(
              Field.nullable("id", new ArrowType.Int(32, true)),
              Field.nullable("name", new ArrowType.Utf8())),
          null);
  try (BufferAllocator allocator = new RootAllocator();
      Dataset dataset =
          Dataset.create(allocator, datasetPath, schema, new WriteParams.Builder().build())) {
    dataset.version();
    dataset.latestVersion();
  }
}
```

* create and write a Lance dataset

```java
void createAndWriteDataset() throws IOException {
  Path path = Paths.get(""); // path of the Arrow IPC file to read from
  String datasetPath = ""; // specify a path for the new dataset
  try (BufferAllocator allocator = new RootAllocator();
      ArrowFileReader reader =
          new ArrowFileReader(
              new SeekableReadChannel(
                  new ByteArrayReadableSeekableByteChannel(Files.readAllBytes(path))), allocator);
      ArrowArrayStream arrowStream = ArrowArrayStream.allocateNew(allocator)) {
    Data.exportArrayStream(allocator, reader, arrowStream);
    try (Dataset dataset =
        Dataset.create(
            allocator,
            arrowStream,
            datasetPath,
            new WriteParams.Builder()
                .withMaxRowsPerFile(10)
                .withMaxRowsPerGroup(20)
                .withMode(WriteParams.WriteMode.CREATE)
                .withStorageOptions(new HashMap<>())
                .build())) {
      // access the dataset
    }
  }
}
```

* read a dataset

```java
void readDataset() {
  String datasetPath = ""; // specify a path pointing to a dataset
  try (BufferAllocator allocator = new RootAllocator()) {
    try (Dataset dataset = Dataset.open(datasetPath, allocator)) {
      dataset.countRows();
      dataset.getSchema();
      dataset.version();
      dataset.latestVersion();
      // access more information
    }
  }
}
```

* drop a dataset

```java
void dropDataset() {
  String datasetPath = ""; // specify a path pointing to the dataset to drop
  Dataset.drop(datasetPath, new HashMap<>());
}
```

### Random Access

```java
void randomAccess() {
  String datasetPath = ""; // specify a path pointing to a dataset
  try (BufferAllocator allocator = new RootAllocator()) {
    try (Dataset dataset = Dataset.open(datasetPath, allocator)) {
      List<Long> indices = Arrays.asList(1L, 4L);
      List<String> columns = Arrays.asList("id", "name");
      try (ArrowReader reader = dataset.take(indices, columns)) {
        while (reader.loadNextBatch()) {
          VectorSchemaRoot result = reader.getVectorSchemaRoot();
          result.getRowCount();

          for (int i = 0; i < indices.size(); i++) {
            result.getVector("id").getObject(i);
            result.getVector("name").getObject(i);
          }
        }
      }
    }
  }
}
```

### Schema evolution

* add columns

```java
void addColumns() {
  String datasetPath = ""; // specify a path pointing to a dataset
  try (BufferAllocator allocator = new RootAllocator()) {
    try (Dataset dataset = Dataset.open(datasetPath, allocator)) {
      SqlExpressions sqlExpressions =
          new SqlExpressions.Builder().withExpression("double_id", "id * 2").build();
      dataset.addColumns(sqlExpressions, Optional.empty());
    }
  }
}
```

* alter columns

```java
void alterColumns() {
  String datasetPath = ""; // specify a path pointing to a dataset
  try (BufferAllocator allocator = new RootAllocator()) {
    try (Dataset dataset = Dataset.open(datasetPath, allocator)) {
      ColumnAlteration nameColumnAlteration =
          new ColumnAlteration.Builder("name")
              .rename("new_name")
              .nullable(true)
              .castTo(new ArrowType.Utf8())
              .build();

      dataset.alterColumns(Collections.singletonList(nameColumnAlteration));
    }
  }
}
```

* drop columns

```java
void dropColumns() {
  String datasetPath = ""; // specify a path pointing to a dataset
  try (BufferAllocator allocator = new RootAllocator()) {
    try (Dataset dataset = Dataset.open(datasetPath, allocator)) {
      dataset.dropColumns(Collections.singletonList("name"));
    }
  }
}
```

## Integrations

This section describes ecosystem integrations with the Lance format, which let users access Lance datasets from other tools and engines.

### Spark connector

The [spark](https://github.com/lancedb/lance/tree/main/java/spark) module is a standard Maven module.
It implements the Spark-Lance connector, which allows Apache Spark to efficiently access datasets stored in Lance format.
For more details, see the [README](https://github.com/lancedb/lance/blob/main/java/spark/README.md) file.
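
As a quick illustration, reading a Lance dataset from Spark might look like the sketch below. The `"lance"` format name and the dataset path are assumptions for illustration; consult the connector README linked above for the authoritative API.

```java
// Hypothetical sketch: assumes the connector registers under the "lance" format name
SparkSession spark = SparkSession.builder().appName("lance-demo").getOrCreate();
Dataset<Row> df = spark.read().format("lance").load("/path/to/my_dataset.lance");
df.show();
```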

## Contributing

The Lance project is a multi-language codebase. All Java-related code is located in the `java` directory,
which is a standard Maven project (named `lance-parent`) that can be imported into any IDE with Java support.

It contains two Maven sub-modules:

* lance-core: the core module of the Lance Java binding, including `lance-jni`.
* lance-spark: the Spark connector module.

To build the project, run:

```shell
mvn clean package
```

If you only want to build the Rust code (`lance-jni`), run:

```shell
cargo build
```

The Java module uses the `spotless` Maven plugin to format the code and check license headers;
it runs automatically in the `validate` phase.
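
If the build fails on formatting or license-header checks, you can invoke the standard Spotless goals directly to inspect and fix the violations:

```shell
# report formatting violations without modifying files
mvn spotless:check

# rewrite sources to conform to the configured style
mvn spotless:apply
```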

### Environment (IDE) setup

First, clone the repository to your local machine:

```shell
git clone https://github.com/lancedb/lance.git
```

Then import the `java` directory into your favorite IDE, such as IntelliJ IDEA or Eclipse.

Because the Java module depends on features provided by the Rust module, you also need Rust installed locally.
To install Rust, please refer to the [official documentation](https://www.rust-lang.org/tools/install).
You may also want to install a Rust plugin for your IDE.

Then you can build the whole Java module:

```shell
mvn clean package
```

This builds the Rust JNI bindings automatically.