Commit 25d3923

Merge branch 'main' into AddFileWriterOptions
2 parents 878c4f7 + 8643409 commit 25d3923

24 files changed, +2442 -1330 lines changed

Cargo.lock

+39-25
Some generated files are not rendered by default.

README.md

+13-8
Original file line number | Diff line number | Diff line change
@@ -3,13 +3,13 @@
33

44
<img width="257" alt="Lance Logo" src="https://user-images.githubusercontent.com/917119/199353423-d3e202f7-0269-411d-8ff2-e747e419e492.png">
55

6-
**Modern columnar data format for ML. Convert from Parquet in 2-lines of code for 100x faster random access, a vector index, data versioning, and more.<br/>**
7-
**Compatible with pandas, DuckDB, Polars, and pyarrow with more integrations on the way.**
6+
**Modern columnar data format for ML. Convert from Parquet in 2-lines of code for 100x faster random access, zero-cost schema evolution, rich secondary indices, versioning, and more.<br/>**
7+
**Compatible with Pandas, DuckDB, Polars, Pyarrow, and Ray with more integrations on the way.**
88

99
<a href="https://lancedb.github.io/lance/">Documentation</a> •
1010
<a href="https://blog.lancedb.com/">Blog</a> •
1111
<a href="https://discord.gg/zMM32dvNtd">Discord</a> •
12-
<a href="https://twitter.com/lancedb">Twitter</a>
12+
<a href="https://x.com/lancedb">X</a>
1313

1414
[CI]: https://github.com/lancedb/lance/actions/workflows/rust.yml
1515
[CI Badge]: https://github.com/lancedb/lance/actions/workflows/rust.yml/badge.svg
@@ -44,7 +44,7 @@ The key features of Lance include:
4444

4545
* **Zero-copy, automatic versioning:** manage versions of your data without needing extra infrastructure.
4646

47-
* **Ecosystem integrations:** Apache Arrow, Pandas, Polars, DuckDB and more on the way.
47+
* **Ecosystem integrations:** Apache Arrow, Pandas, Polars, DuckDB, Ray, Spark and more on the way.
4848

4949
> [!TIP]
5050
> Lance is in active development and we welcome contributions. Please see our [contributing guide](docs/contributing.rst) for more information.
@@ -66,7 +66,7 @@ pip install --pre --extra-index-url https://pypi.fury.io/lancedb/ pylance
6666
> [!TIP]
6767
> Preview releases are released more often than full releases and contain the
6868
> latest features and bug fixes. They receive the same level of testing as full releases.
69-
> We guarantee they will remain published and available for download for at
69+
> We guarantee they will remain published and available for download for at
7070
> least 6 months. When you want to pin to a specific version, prefer a stable release.
7171
7272
**Converting to Lance**
@@ -186,8 +186,8 @@ Support both CPUs (``x86_64`` and ``arm``) and GPU (``Nvidia (cuda)`` and ``Appl
186186

187187
**Fast updates** (ROADMAP): Updates will be supported via write-ahead logs.
188188

189-
**Rich secondary indices** (ROADMAP):
190-
- Inverted index for fuzzy search over many label / annotation fields.
189+
**Rich secondary indices**: Support `BTree`, `Bitmap`, `Full text search`, `Label list`,
190+
`NGrams`, and more.
191191

192192
## Benchmarks
193193

@@ -253,11 +253,16 @@ A comparison of different data formats in each stage of ML development cycle.
253253

254254
Lance is currently used in production by:
255255
* [LanceDB](https://github.com/lancedb/lancedb), a serverless, low-latency vector database for ML applications
256+
* [LanceDB Enterprise](https://docs.lancedb.com/enterprise/introduction), hyperscale LanceDB with enterprise SLA.
257+
* Leading multimodal Gen AI companies for training over petabyte-scale multimodal data.
256258
* Self-driving car company for large-scale storage, retrieval and processing of multi-modal data.
257259
* E-commerce company for billion-scale+ vector personalized search.
258260
* and more.
259261

260-
## Presentations and Talks
262+
## Presentations, Blogs and Talks
261263

264+
* [Designing a Table Format for ML Workloads](https://blog.lancedb.com/designing-a-table-format-for-ml-workloads/), Feb 2025.
265+
* [Transforming Multimodal Data Management with LanceDB, Ray Summit](https://www.youtube.com/watch?v=xmTFEzAh8ho), Oct 2024.
266+
* [Lance v2: A columnar container format for modern data](https://blog.lancedb.com/lance-v2/), Apr 2024.
262267
* [Lance Deep Dive](https://drive.google.com/file/d/1Orh9rK0Mpj9zN_gnQF1eJJFpAc6lStGm/view?usp=drive_link). July 2023.
263268
* [Lance: A New Columnar Data Format](https://docs.google.com/presentation/d/1a4nAiQAkPDBtOfXFpPg7lbeDAxcNDVKgoUkw3cUs2rE/edit#slide=id.p), [Scipy 2022, Austin, TX](https://www.scipy2022.scipy.org/posters). July, 2022.

deny.toml

+1
@@ -83,6 +83,7 @@ ignore = [
8383
{ id = "RUSTSEC-2021-0153", reason = "`encoding` is used by lindera" },
8484
{ id = "RUSTSEC-2024-0384", reason = "`instant` is used by tantivy" },
8585
{ id = "RUSTSEC-2024-0436", reason = "`paste` is used by datafusion" },
86+
{ id = "RUSTSEC-2025-0014", reason = "`humantime` is used by object_store" },
8687
]
8788
# If this is true, then cargo deny will use the git executable to fetch advisory database.
8889
# If this is false, then it uses a built-in git library.

docs/api/api.rst

+3-2
@@ -2,6 +2,7 @@ APIs
22
----
33

44
.. toctree::
5+
:maxdepth: 1
56

6-
Rust <https://docs.rs/crate/lance/latest>
7-
Python <./python.rst>
7+
Rust <https://docs.rs/crate/lance/latest>
8+
Python <./python.rst>

docs/conf.py

+16-14
@@ -1,18 +1,5 @@
11
# Configuration file for the Sphinx documentation builder.
22

3-
import shutil
4-
5-
6-
def run_apidoc(_):
7-
from sphinx.ext.apidoc import main
8-
9-
shutil.rmtree("api/python", ignore_errors=True)
10-
main(["-f", "-o", "api/python", "../python/python/lance"])
11-
12-
13-
def setup(app):
14-
app.connect("builder-inited", run_apidoc)
15-
163

174
# -- Project information -----------------------------------------------------
185

@@ -29,6 +16,7 @@ def setup(app):
2916
extensions = [
3017
"breathe",
3118
"sphinx_immaterial",
19+
"sphinx_immaterial.apidoc.python.apigen",
3220
"sphinx.ext.autodoc",
3321
"sphinx.ext.doctest",
3422
"sphinx.ext.githubpages",
@@ -55,8 +43,22 @@ def setup(app):
5543
"numpy": ("https://numpy.org/doc/stable/", None),
5644
"pyarrow": ("https://arrow.apache.org/docs/", None),
5745
"pandas": ("https://pandas.pydata.org/pandas-docs/stable/", None),
46+
"ray": ("https://docs.ray.io/en/latest/", None),
5847
}
5948

49+
python_apigen_modules = {
50+
"lance": "api/python/",
51+
}
52+
object_description_options = [
53+
(
54+
"py:.*",
55+
dict(
56+
include_object_type_in_xref_tooltip=False,
57+
include_in_toc=False,
58+
include_fields_in_toc=False,
59+
),
60+
),
61+
]
6062

6163
# -- Options for HTML output -------------------------------------------------
6264

@@ -95,7 +97,7 @@ def setup(app):
9597
},
9698
],
9799
}
98-
include_in_toc = False
100+
99101

100102
# -- doctest configuration ---------------------------------------------------
101103
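
Review note: the new `object_description_options` entry keys its options on the pattern `"py:.*"`. As far as I understand the sphinx-immaterial documentation, each pattern is a regular expression matched against strings of the form `domain:objtype`, and options from every matching entry are merged, with later entries taking precedence. The sketch below only illustrates that lookup under those assumptions; it is not the theme's implementation, and `resolve_options` is a hypothetical helper.

```python
import re

# Same structure as the list added to docs/conf.py in this commit.
object_description_options = [
    (
        "py:.*",
        dict(
            include_object_type_in_xref_tooltip=False,
            include_in_toc=False,
            include_fields_in_toc=False,
        ),
    ),
]


def resolve_options(domain: str, objtype: str, rules=object_description_options) -> dict:
    """Merge the options of every rule whose pattern matches 'domain:objtype'.

    Assumption: the whole 'domain:objtype' string must match the pattern,
    and later rules override earlier ones.
    """
    key = f"{domain}:{objtype}"
    merged: dict = {}
    for pattern, options in rules:
        if re.fullmatch(pattern, key):
            merged.update(options)
    return merged


print(resolve_options("py", "class"))   # all three options, each set to False
print(resolve_options("cpp", "class"))  # no rule matches, so an empty dict
```

Under this reading, the single `py:.*` rule covers every Python object type, which would explain why the commit can drop the old top-level `include_in_toc = False` assignment.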

docs/index.rst

+3-1
@@ -43,14 +43,16 @@ Preview releases receive the same level of testing as regular releases.
4343
:maxdepth: 2
4444

4545
Quickstart <./notebooks/quickstart>
46-
./read_and_write
46+
./introduction/read_and_write
47+
./introduction/schema_evolution
4748

4849
.. toctree::
4950
:caption: Advanced Usage
5051
:maxdepth: 1
5152

5253
Lance Format Spec <./format>
5354
Blob API <./blob>
55+
Object Store Configuration <./object_store>
5456
Performance Guide <./performance>
5557
Tokenizer <./tokenizer>
5658
Extension Arrays <./arrays>

docs/integrations/ray.rst

+21-13
@@ -1,27 +1,35 @@
11
Lance ❤️ Ray
22
--------------------
33

4-
Ray effortlessly scale up ML workload to large distributed compute environment.
4+
`Ray <https://www.anyscale.com/product/open-source/ray>`_ effortlessly scales ML workloads up to large distributed
5+
compute environments.
56

6-
`Ray Data <https://docs.ray.io/en/latest/data/data.html>`_ can be directly written in Lance format by using the
7-
:class:`lance.ray.sink.LanceDatasink` class. For example:
7+
Lance format is one of the official `Ray data sources <https://docs.ray.io/en/latest/data/api/input_output.html#lance>`_:
88

9-
.. code-block:: bash
9+
* Lance Data Source :py:meth:`ray.data.read_lance`
10+
* Lance Data Sink :py:meth:`ray.data.Dataset.write_lance`
1011

11-
pip install pylance[ray]
12+
.. testsetup::
1213

14+
shutil.rmtree("./alice_bob_and_charlie.lance", ignore_errors=True)
1315

14-
``Ray Data Dataset`` can be written to Lance format using the following code:
15-
16-
.. code-block:: python
16+
.. testcode::
1717

1818
import ray
19-
from lance.ray.sink import LanceDatasink
2019

2120
ray.init()
2221

23-
sink = LanceDatasink("s3://bucket/to/data.lance")
24-
ray.data.range(10).map(
25-
lambda x: {"id": x["id"], "str": f"str-{x['id']}"}
26-
).write_datasink(sink)
22+
data = [
23+
{"id": 1, "name": "alice"},
24+
{"id": 2, "name": "bob"},
25+
{"id": 3, "name": "charlie"}
26+
]
27+
ray.data.from_items(data).write_lance("./alice_bob_and_charlie.lance")
28+
29+
# It can be read via lance directly
30+
tbl = lance.dataset("./alice_bob_and_charlie.lance").to_table()
31+
assert tbl == pa.Table.from_pylist(data)
2732

33+
# Or via ray.data.read_lance
34+
pd_df = ray.data.read_lance("./alice_bob_and_charlie.lance").to_pandas()
35+
assert tbl == pa.Table.from_pandas(pd_df)
