Commit 7f91eb0 (parent f73398a)

docs: add README.md for java module (#3302)

1 file changed: java/README.md (+241 −0)

# Java bindings and SDK for Lance Data Format

> :warning: **Under heavy development**

<div align="center">
<p align="center">

<img width="257" alt="Lance Logo" src="https://user-images.githubusercontent.com/917119/199353423-d3e202f7-0269-411d-8ff2-e747e419e492.png">

Lance is a new columnar data format for data science and machine learning
</p></div>

Why you should use Lance:

1. It is an order of magnitude faster than Parquet for point queries and nested data structures common to DS/ML
2. It comes with a fast vector index that delivers sub-millisecond nearest-neighbor search performance
3. It is automatically versioned and supports lineage and time travel for full reproducibility
4. It is already integrated with DuckDB, pandas, and Polars. Easily convert from/to Parquet in 2 lines of code

## Quick start

Add the Lance Java SDK Maven dependency (the latest version is recommended):

```xml
<dependency>
  <groupId>com.lancedb</groupId>
  <artifactId>lance-core</artifactId>
  <version>0.18.0</version>
</dependency>
```
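
If you build with Gradle instead, the same artifact can be declared as follows (Groovy DSL; coordinates taken from the Maven snippet above):

```gradle
dependencies {
    implementation 'com.lancedb:lance-core:0.18.0'
}
```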

### Basic I/O

* create empty dataset

```java
void createDataset() throws IOException, URISyntaxException {
  String datasetPath = tempDir.resolve("write_stream").toString();
  Schema schema =
      new Schema(
          Arrays.asList(
              Field.nullable("id", new ArrowType.Int(32, true)),
              Field.nullable("name", new ArrowType.Utf8())),
          null);
  try (BufferAllocator allocator = new RootAllocator();
      Dataset dataset =
          Dataset.create(allocator, datasetPath, schema, new WriteParams.Builder().build())) {
    dataset.version();
    dataset.latestVersion();
  }
}
```

* create and write a Lance dataset

```java
void createAndWriteDataset() throws IOException, URISyntaxException {
  Path path = Paths.get(""); // path of the source Arrow file
  String datasetPath = ""; // path of the target dataset
  try (BufferAllocator allocator = new RootAllocator();
      ArrowFileReader reader =
          new ArrowFileReader(
              new SeekableReadChannel(
                  new ByteArrayReadableSeekableByteChannel(Files.readAllBytes(path))),
              allocator);
      ArrowArrayStream arrowStream = ArrowArrayStream.allocateNew(allocator)) {
    Data.exportArrayStream(allocator, reader, arrowStream);
    try (Dataset dataset =
        Dataset.create(
            allocator,
            arrowStream,
            datasetPath,
            new WriteParams.Builder()
                .withMaxRowsPerFile(10)
                .withMaxRowsPerGroup(20)
                .withMode(WriteParams.WriteMode.CREATE)
                .withStorageOptions(new HashMap<>())
                .build())) {
      // access the dataset
    }
  }
}
```

* read dataset

```java
void readDataset() {
  String datasetPath = ""; // path of the target dataset
  try (BufferAllocator allocator = new RootAllocator()) {
    try (Dataset dataset = Dataset.open(datasetPath, allocator)) {
      dataset.countRows();
      dataset.getSchema();
      dataset.version();
      dataset.latestVersion();
      // access more information
    }
  }
}
```

* drop dataset

```java
void dropDataset() {
  String datasetPath = tempDir.resolve("drop_stream").toString();
  Dataset.drop(datasetPath, new HashMap<>());
}
```

### Random Access

```java
void randomAccess() {
  String datasetPath = ""; // path of the target dataset
  try (BufferAllocator allocator = new RootAllocator()) {
    try (Dataset dataset = Dataset.open(datasetPath, allocator)) {
      List<Long> indices = Arrays.asList(1L, 4L);
      List<String> columns = Arrays.asList("id", "name");
      try (ArrowReader reader = dataset.take(indices, columns)) {
        while (reader.loadNextBatch()) {
          VectorSchemaRoot result = reader.getVectorSchemaRoot();
          result.getRowCount();

          for (int i = 0; i < indices.size(); i++) {
            result.getVector("id").getObject(i);
            result.getVector("name").getObject(i);
          }
        }
      }
    }
  }
}
```

### Schema evolution

* add columns

```java
void addColumns() {
  String datasetPath = ""; // path of the target dataset
  try (BufferAllocator allocator = new RootAllocator()) {
    try (Dataset dataset = Dataset.open(datasetPath, allocator)) {
      SqlExpressions sqlExpressions =
          new SqlExpressions.Builder().withExpression("double_id", "id * 2").build();
      dataset.addColumns(sqlExpressions, Optional.empty());
    }
  }
}
```

* alter columns

```java
void alterColumns() {
  String datasetPath = ""; // path of the target dataset
  try (BufferAllocator allocator = new RootAllocator()) {
    try (Dataset dataset = Dataset.open(datasetPath, allocator)) {
      ColumnAlteration nameColumnAlteration =
          new ColumnAlteration.Builder("name")
              .rename("new_name")
              .nullable(true)
              .castTo(new ArrowType.Utf8())
              .build();

      dataset.alterColumns(Collections.singletonList(nameColumnAlteration));
    }
  }
}
```

* drop columns

```java
void dropColumns() {
  String datasetPath = ""; // path of the target dataset
  try (BufferAllocator allocator = new RootAllocator()) {
    try (Dataset dataset = Dataset.open(datasetPath, allocator)) {
      dataset.dropColumns(Collections.singletonList("name"));
    }
  }
}
```

## Integrations

This section introduces ecosystem integrations with the Lance format.
With these integrations, users can access Lance datasets from other technologies and tools.

### Spark connector

The [spark](https://github.com/lancedb/lance/tree/main/java/spark) module is a standard Maven module.
It implements the Spark-Lance connector, which allows Apache Spark to efficiently access datasets stored in Lance format.
For more details, please see the [README](https://github.com/lancedb/lance/blob/main/java/spark/README.md) file.
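
As a sketch, depending on the connector from Maven might look like the following. The `lance-spark` artifactId is assumed from the module name listed under Contributing; verify the actual coordinates and version in the Spark module's README:

```xml
<!-- Hypothetical coordinates: verify against the spark module's README -->
<dependency>
  <groupId>com.lancedb</groupId>
  <artifactId>lance-spark</artifactId>
  <version>0.18.0</version>
</dependency>
```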

## Contributing

The Lance project is a multi-language codebase. All Java-related code is located in the `java` directory,
and the whole `java` directory is a standard Maven project (named `lance-parent`) that can be imported into any IDE that supports Java projects.

Overall, it contains two Maven sub-modules:

* lance-core: the core module of the Lance Java binding, including `lance-jni`.
* lance-spark: the Spark connector module.

To build the project, run the following command:

```shell
mvn clean package
```

If you only want to build the Rust code (`lance-jni`), run the following command:

```shell
cargo build
```

The Java module uses the `spotless` Maven plugin to format the code and check license headers.
It is applied automatically in the `validate` phase.

### Environment (IDE) setup

First, clone the repository to your local machine:

```shell
git clone https://github.com/lancedb/lance.git
```

Then, import the `java` directory into your favorite IDE, such as IntelliJ IDEA or Eclipse.

Because the Java module depends on features provided by the Rust module, you also need Rust installed locally.
To install Rust, please refer to the [official documentation](https://www.rust-lang.org/tools/install).
You may also want to install a Rust plugin for your IDE.

Then, you can build the whole Java module:

```shell
mvn clean package
```

This command builds the Rust JNI binding code automatically.
