Commit 0487ff5

docs: fix read_and_write example (#3521)

1 parent 9175ff7 commit 0487ff5

File tree

8 files changed (+211 −175 lines)

.github/workflows/docs-check.yml (+9 −4)

@@ -6,6 +6,7 @@ on:
   pull_request:
     paths:
       - docs/**
+      - python/python/**
      - .github/workflows/docs-check.yml

 env:
@@ -26,7 +27,7 @@ jobs:
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
-          python-version: "3.11"
+          python-version: "3.12"
          cache: 'pip'
          cache-dependency-path: "docs/requirements.txt"
      - name: Install dependencies
@@ -35,10 +36,14 @@
      - name: Build python wheel
        uses: ./.github/workflows/build_linux_wheel
      - name: Build Python
-        working-directory: python
+        working-directory: docs
+        run: |
+          python -m pip install $(ls ../python/target/wheels/*.whl)
+          python -m pip install -r requirements.txt
+      - name: Run test
+        working-directory: docs
        run: |
-          python -m pip install $(ls target/wheels/*.whl)
-          python -m pip install -r ../docs/requirements.txt
+          make doctest
      - name: Build docs
        working-directory: docs
        run: |
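The new "Run test" step assumes a `doctest` target in `docs/Makefile`, which this commit does not show. In a stock sphinx-quickstart Makefile, `make doctest` simply forwards to the Sphinx doctest builder; a rough Python equivalent, under that assumption:

    # Rough equivalent of `make doctest` (a sketch; the actual Makefile target
    # in this repo is an assumption, not shown in the commit).
    from sphinx.cmd.build import main as sphinx_build

    # Run the doctest builder over the docs source directory; a non-zero exit
    # code means at least one documented example failed.
    exit_code = sphinx_build(["-b", "doctest", ".", "_build/doctest"])
    raise SystemExit(exit_code)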

.gitignore (+2 −1)

@@ -92,4 +92,5 @@ target
 python/venv
 test_data/venv

-**/*.profraw
+**/*.profraw
+*.lance

docs/arrays.rst (+45 −60)

@@ -21,54 +21,51 @@ bfloat16 NumPy extension array.
 If you are using Pandas, you can use the `lance.bfloat16` dtype string to create
 the array:

-.. testcode::
+.. doctest::

-    import pandas as pd
-    import lance.arrow
-
-    series = pd.Series([1.1, 2.1, 3.4], dtype="lance.bfloat16")
-    series
-
-.. testoutput::
+    >>> import lance.arrow

+    >>> pd.Series([1.1, 2.1, 3.4], dtype="lance.bfloat16")
     0    1.1015625
     1      2.09375
     2      3.40625
     dtype: lance.bfloat16

 To create an an arrow array, use the :func:`lance.arrow.bfloat16_array` function:

-.. testcode::
+.. code-block:: python

-    from lance.arrow import bfloat16_array
+    >>> from lance.arrow import bfloat16_array

-    array = bfloat16_array([1.1, 2.1, 3.4])
-    array
-
-.. testoutput::
+    >>> bfloat16_array([1.1, 2.1, 3.4])
+    <lance.arrow.BFloat16Array object at 0x000000016feb94e0>
+    [
+      1.1015625,
+      2.09375,
+      3.40625
+    ]

-    <lance.arrow.BFloat16Array object at 0x.+>
-    [1.1015625, 2.09375, 3.40625]

 Finally, if you have a pre-existing NumPy array, you can convert it into either:

-.. testcode::
-
-    import numpy as np
-    from ml_dtypes import bfloat16
-    from lance.arrow import PandasBFloat16Array, BFloat16Array
+.. doctest::

-    np_array = np.array([1.1, 2.1, 3.4], dtype=bfloat16)
-    PandasBFloat16Array.from_numpy(np_array)
-    BFloat16Array.from_numpy(np_array)
+    >>> import numpy as np
+    >>> from ml_dtypes import bfloat16
+    >>> from lance.arrow import PandasBFloat16Array, BFloat16Array

-.. testoutput::
-
+    >>> np_array = np.array([1.1, 2.1, 3.4], dtype=bfloat16)
+    >>> PandasBFloat16Array.from_numpy(np_array)
     <PandasBFloat16Array>
     [1.1015625, 2.09375, 3.40625]
     Length: 3, dtype: lance.bfloat16
-    <lance.arrow.BFloat16Array object at 0x.+>
-    [1.1015625, 2.09375, 3.40625]
+    >>> BFloat16Array.from_numpy(np_array)
+    <lance.arrow.BFloat16Array object at 0x...>
+    [
+      1.1015625,
+      2.09375,
+      3.40625
+    ]

 When reading, these can be converted back to to the NumPy bfloat16 dtype using
 each array class's ``to_numpy`` method.
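As the context above notes, both array classes expose a ``to_numpy`` method, so the conversion is a full round trip. A minimal sketch (not part of the commit; assumes the ml_dtypes and pylance packages are installed):

    import numpy as np
    from ml_dtypes import bfloat16
    from lance.arrow import BFloat16Array

    # NumPy bfloat16 -> Arrow extension array (from_numpy is shown in the diff)
    np_array = np.array([1.1, 2.1, 3.4], dtype=bfloat16)
    arrow_array = BFloat16Array.from_numpy(np_array)

    # ... and back again via to_numpy
    round_tripped = arrow_array.to_numpy()
    assert round_tripped.dtype == bfloat16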
@@ -86,25 +83,23 @@ with a list of URIs represented by either :py:class:`pyarrow.StringArray` or an
 iterable that yields strings. Note that the URIs are not strongly validated and images
 are not read into memory automatically.

-.. testcode::
-
-    from lance.arrow import ImageURIArray
+.. doctest::

-    ImageURIArray.from_uris([
-        "/tmp/image1.jpg",
-        "file:///tmp/image2.jpg",
-        "s3://example/image3.jpg"
-    ])
+    >>> from lance.arrow import ImageURIArray

-.. testoutput::
+    >>> ImageURIArray.from_uris([
+    ...     "/tmp/image1.jpg",
+    ...     "file:///tmp/image2.jpg",
+    ...     "s3://example/image3.jpg"
+    ... ])
+    <lance.arrow.ImageURIArray object at 0x...>
+    ['/tmp/image1.jpg', 'file:///tmp/image2.jpg', 's3://example/image3.jpg']

-    <lance.arrow.ImageURIArray object at 0x.+>
-    ['/tmp/image1.jpg', 'file:///tmp/image2.jpg', 's3://example/image2.jpg']

 :func:`lance.arrow.ImageURIArray.read_uris` will read images into memory and return
 them as a new :class:`lance.arrow.EncodedImageArray` object.

-.. testcode::
+.. code-block:: python

     from lance.arrow import ImageURIArray

@@ -139,7 +134,7 @@ function parameter. If decoder is not provided it will attempt to use
 `Pillow`_ and `tensorflow`_ in that
 order. If neither library or custom decoder is available an exception will be raised.

-.. testcode::
+.. code-block:: python

     from lance.arrow import ImageURIArray

@@ -185,30 +180,20 @@ If encoder is not provided it will attempt to use
 `tensorflow`_ and `Pillow`_ in that order. Default encoders will
 encode to PNG. If neither library is available it will raise an exception.

-.. testcode::
-
-    from lance.arrow import ImageURIArray
-
-    def jpeg_encoder(images):
-        import tensorflow as tf
+.. testsetup::

-        encoded_images = (
-            tf.io.encode_jpeg(x).numpy() for x in tf.convert_to_tensor(images)
-        )
-        return pa.array(encoded_images, type=pa.binary())
+    image_uri = os.path.abspath(os.path.join(os.path.dirname(__name__), "_static", "icon.png"))

-    uris = [os.path.join(os.path.dirname(__file__), "images/1.png")]
-    tensor_images = ImageURIArray.from_uris(uris).read_uris().to_tensor()
-    print(tensor_images.to_encoded())
-    print(tensor_images.to_encoded(jpeg_encoder))
+.. doctest::

-.. testoutput::
+    >>> from lance.arrow import ImageURIArray

+    >>> uris = [image_uri]
+    >>> tensor_images = ImageURIArray.from_uris(uris).read_uris().to_tensor()
+    >>> tensor_images.to_encoded()
     <lance.arrow.EncodedImageArray object at 0x...>
-    [b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR\x00\x00\x00...']
-    <lance.arrow.EncodedImageArray object at 0x00007f8d90b91b40>
-    [b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x01\x01\x01...']
-
+    [...
+    b'\x89PNG\r\n\x1a...'

 .. _tensorflow: https://www.tensorflow.org/api_docs/python/tf/io/encode_png
 .. _Pillow: https://pillow.readthedocs.io/en/stable/
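The last hunk drops the custom-encoder demonstration from the published docs. For reference, the removed example showed that ``to_encoded`` accepts an encoder callable returning a binary ``pa.Array``; reproduced here as a sketch (it assumes tensorflow and pyarrow are installed and is no longer doctested):

    import pyarrow as pa

    def jpeg_encoder(images):
        # Encode each image tensor to JPEG bytes and wrap them in a binary array.
        import tensorflow as tf

        encoded_images = (
            tf.io.encode_jpeg(x).numpy() for x in tf.convert_to_tensor(images)
        )
        return pa.array(encoded_images, type=pa.binary())

    # Usage, with tensor_images as produced in the doctest above:
    #     tensor_images.to_encoded(jpeg_encoder)  # JPEG instead of the PNG default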

docs/conf.py (+26 −2)

@@ -1,6 +1,7 @@
 # Configuration file for the Sphinx documentation builder.

 import shutil
+from datetime import datetime


 def run_apidoc(_):
@@ -17,7 +18,7 @@ def setup(app):
 # -- Project information -----------------------------------------------------

 project = "Lance"
-copyright = "2024, Lance Developer"
+copyright = f"{datetime.today().year}, Lance Developer"
 author = "Lance Developer"


@@ -27,12 +28,13 @@ def setup(app):
 # extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
 # ones.
 extensions = [
-    "sphinx.ext.napoleon",
     "breathe",
+    "sphinx_copybutton",
     "sphinx.ext.autodoc",
     "sphinx.ext.doctest",
     "sphinx.ext.githubpages",
     "sphinx.ext.intersphinx",
+    "sphinx.ext.napoleon",
 ]

 napoleon_google_docstring = False
@@ -50,6 +52,12 @@ def setup(app):
 # This pattern also affects html_static_path and html_extra_path.
 exclude_patterns = ["_build", "Thumbs.db", ".DS_Store"]

+intersphinx_mapping = {
+    "numpy": ("https://numpy.org/doc/stable/", None),
+    "pyarrow": ("https://arrow.apache.org/docs/", None),
+    "pandas": ("https://pandas.pydata.org/pandas-docs/stable/", None),
+}
+

 # -- Options for HTML output -------------------------------------------------

@@ -67,3 +75,19 @@ def setup(app):
     "source_icon": "github",
 }
 html_css_files = ["custom.css"]
+
+# -- doctest configuration ---------------------------------------------------
+
+doctest_global_setup = """
+import os
+import shutil
+from typing import Iterator
+
+import lance
+import pyarrow as pa
+import numpy as np
+import pandas as pd
+"""
+
+# Only test code examples in rst files
+doctest_test_doctest_blocks = ""
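With this configuration, every ``.. doctest::`` block in the rst pages runs with the names above preinjected, which is why the reworked examples in docs/arrays.rst can use ``pd`` and ``pa`` without importing them; setting ``doctest_test_doctest_blocks`` to an empty string stops Sphinx from also collecting unmarked ``>>>`` paragraphs, so only explicit directives are tested. A hypothetical page-level example (not from this commit) showing the injected names in use:

    .. doctest::

       >>> pa.array([1, 2, 3]).type   # `pa` is provided by doctest_global_setup
       DataType(int64)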

docs/format.rst (+9 −9)

@@ -1,7 +1,7 @@
 Lance Formats
 =============

-The Lance project includes both a table format and a file format. Lance typically refers
+The Lance format is both a table format and a file format. Lance typically refers
 to tables as "datasets". A Lance dataset is designed to efficiently handle secondary indices,
 fast ingestion and modification of data, and a rich set of schema evolution features.

@@ -31,7 +31,7 @@ Fragments
 ~~~~~~~~~

 ``DataFragment`` represents a chunk of data in the dataset. Itself includes one or more ``DataFile``,
-where each ``DataFile`` can contain several columns in the chunk of data. It also may include a 
+where each ``DataFile`` can contain several columns in the chunk of data. It also may include a
 ``DeletionFile``, which is explained in a later section.

 .. literalinclude:: ../protos/table.proto
@@ -86,7 +86,7 @@ and/or performance. However, older software versions may not be able to read ne

 In addition, the latest version of the file format (next) is unstable and should not be
 used for production use cases. Breaking changes could be made to unstable encodings and
-that would mean that files written with these encodings are no longer readable by any 
+that would mean that files written with these encodings are no longer readable by any
 newer versions of Lance. The ``next`` version should only be used for experimentation
 and benchmarking upcoming features.

@@ -95,7 +95,7 @@ The following values are supported:
 .. list-table:: File Versions
    :widths: 20 20 20 40
    :header-rows: 1
-   
+
    * - Version
      - Minimal Lance Version
      - Maximum Lance Version
@@ -206,7 +206,7 @@ Feature Flags
 As the file format and dataset evolve, new feature flags are added to the
 format. There are two separate fields for checking for feature flags, depending
 on whether you are trying to read or write the table. Readers should check the
-``reader_feature_flags`` to see if there are any flag it is not aware of. Writers 
+``reader_feature_flags`` to see if there are any flag it is not aware of. Writers
 should check ``writer_feature_flags``. If either sees a flag they don't know, they
 should return an "unsupported" error on any read or write operation.

@@ -286,7 +286,7 @@ deleted for some fragment. For a given version of the dataset, each fragment can
 have up to one deletion file. Fragments that have no deleted rows have no deletion
 file.

-Readers should filter out row ids contained in these deletion files during a 
+Readers should filter out row ids contained in these deletion files during a
 scan or ANN search.

 Deletion files come in two flavors:
@@ -319,7 +319,7 @@ collisions. The suffix is determined by the file type (``.arrow`` for Arrow file
    :start-at: // Deletion File
    :end-at: } // DeletionFile

-Deletes can be materialized by re-writing data files with the deleted rows 
+Deletes can be materialized by re-writing data files with the deleted rows
 removed. However, this invalidates row indices and thus the ANN indices, which
 can be expensive to recompute.

@@ -388,7 +388,7 @@ The commit process is as follows:
    fails because another writer has already committed, go back to step 3.

 When checking whether two transactions conflict, be conservative. If the
-transaction file is missing, assume it conflicts. If the transaction file 
+transaction file is missing, assume it conflicts. If the transaction file
 has an unknown operation, assume it conflicts.

 .. _external-manifest-store:
@@ -555,7 +555,7 @@ The row id values for a fragment are stored in a ``RowIdSequence`` protobuf
 message. This is described in the `protos/rowids.proto`_ file. Row id sequences
 are just arrays of u64 values, which have representations optimized for the
 common case where they are sorted and possibly contiguous. For example, a new
-fragment will have a row id sequence that is just a simple range, so it is 
+fragment will have a row id sequence that is just a simple range, so it is
 stored as a ``start`` and ``end`` value.

 These sequence messages are either stored inline in the fragment metadata, or

docs/index.rst (+5 −5)

@@ -2,20 +2,20 @@
 .. image:: _static/lance_logo.png
   :width: 400

-Lance: modern columnar data format for ML
-======================================================================================
+Lance: modern columnar format for ML workloads
+==============================================


-`Lance` is a columnar data format that is easy and fast to version, query and train on.
+`Lance` is a columnar format that is easy and fast to version, query and train on.
 It’s designed to be used with images, videos, 3D point clouds, audio and of course tabular data.
 It supports any POSIX file systems, and cloud storage like AWS S3 and Google Cloud Storage.
 The key features of Lance include:

 * **High-performance random access:** 100x faster than Parquet.

-* **Vector search:** find nearest neighbors in under 1 millisecond and combine OLAP-queries with vector search.
+* **Zero-copy schema evolution:** add and drop columns without copying the entire dataset.

-* **Zero-copy, automatic versioning:** manage versions of your data automatically, and reduce redundancy with zero-copy logic built-in.
+* **Vector search:** find nearest neighbors in under 1 millisecond and combine OLAP-queries with vector search.

 * **Ecosystem integrations:** Apache-Arrow, DuckDB and more on the way.
