Commit 0487ff5

docs: fix read_and_write example (#3521)

1 parent 9175ff7 commit 0487ff5

File tree

8 files changed (+211 −175 lines)

.github/workflows/docs-check.yml (+9 −4)

@@ -6,6 +6,7 @@ on:
   pull_request:
     paths:
       - docs/**
+      - python/python/**
      - .github/workflows/docs-check.yml

 env:
@@ -26,7 +27,7 @@ jobs:
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
-          python-version: "3.11"
+          python-version: "3.12"
          cache: 'pip'
          cache-dependency-path: "docs/requirements.txt"
      - name: Install dependencies
@@ -35,10 +36,14 @@
      - name: Build python wheel
        uses: ./.github/workflows/build_linux_wheel
      - name: Build Python
-        working-directory: python
+        working-directory: docs
+        run: |
+          python -m pip install $(ls ../python/target/wheels/*.whl)
+          python -m pip install -r requirements.txt
+      - name: Run test
+        working-directory: docs
        run: |
-          python -m pip install $(ls target/wheels/*.whl)
-          python -m pip install -r ../docs/requirements.txt
+          make doctest
      - name: Build docs
        working-directory: docs
        run: |
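The new "Run test" step assumes a `doctest` target in `docs/Makefile`, which this commit does not show. In a stock sphinx-quickstart Makefile, `make doctest` simply forwards to the Sphinx doctest builder; a rough Python equivalent, under that assumption:

    # Rough equivalent of `make doctest` (a sketch; the actual Makefile target
    # in this repo is an assumption, not shown in the commit).
    from sphinx.cmd.build import main as sphinx_build

    # Run the doctest builder over the docs source directory; a non-zero exit
    # code means at least one documented example failed.
    exit_code = sphinx_build(["-b", "doctest", ".", "_build/doctest"])
    raise SystemExit(exit_code)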

.gitignore (+2 −1)

@@ -92,4 +92,5 @@ target
 python/venv
 test_data/venv

-**/*.profraw
+**/*.profraw
+*.lance

docs/arrays.rst (+45 −60)

@@ -21,54 +21,51 @@ bfloat16 NumPy extension array.
 If you are using Pandas, you can use the `lance.bfloat16` dtype string to create
 the array:

-.. testcode::
+.. doctest::

-    import pandas as pd
-    import lance.arrow
-
-    series = pd.Series([1.1, 2.1, 3.4], dtype="lance.bfloat16")
-    series
-
-.. testoutput::
+    >>> import lance.arrow

+    >>> pd.Series([1.1, 2.1, 3.4], dtype="lance.bfloat16")
     0    1.1015625
     1      2.09375
     2      3.40625
     dtype: lance.bfloat16

 To create an an arrow array, use the :func:`lance.arrow.bfloat16_array` function:

-.. testcode::
+.. code-block:: python

-    from lance.arrow import bfloat16_array
+    >>> from lance.arrow import bfloat16_array

-    array = bfloat16_array([1.1, 2.1, 3.4])
-    array
-
-.. testoutput::
+    >>> bfloat16_array([1.1, 2.1, 3.4])
+    <lance.arrow.BFloat16Array object at 0x000000016feb94e0>
+    [
+      1.1015625,
+      2.09375,
+      3.40625
+    ]

-    <lance.arrow.BFloat16Array object at 0x.+>
-    [1.1015625, 2.09375, 3.40625]

 Finally, if you have a pre-existing NumPy array, you can convert it into either:

-.. testcode::
-
-    import numpy as np
-    from ml_dtypes import bfloat16
-    from lance.arrow import PandasBFloat16Array, BFloat16Array
+.. doctest::

-    np_array = np.array([1.1, 2.1, 3.4], dtype=bfloat16)
-    PandasBFloat16Array.from_numpy(np_array)
-    BFloat16Array.from_numpy(np_array)
+    >>> import numpy as np
+    >>> from ml_dtypes import bfloat16
+    >>> from lance.arrow import PandasBFloat16Array, BFloat16Array

-.. testoutput::
-
+    >>> np_array = np.array([1.1, 2.1, 3.4], dtype=bfloat16)
+    >>> PandasBFloat16Array.from_numpy(np_array)
     <PandasBFloat16Array>
     [1.1015625, 2.09375, 3.40625]
     Length: 3, dtype: lance.bfloat16
-    <lance.arrow.BFloat16Array object at 0x.+>
-    [1.1015625, 2.09375, 3.40625]
+    >>> BFloat16Array.from_numpy(np_array)
+    <lance.arrow.BFloat16Array object at 0x...>
+    [
+      1.1015625,
+      2.09375,
+      3.40625
+    ]

 When reading, these can be converted back to to the NumPy bfloat16 dtype using
 each array class's ``to_numpy`` method.
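As the context above notes, both array classes expose a ``to_numpy`` method, so the conversion is a full round trip. A minimal sketch (not part of the commit; assumes the ml_dtypes and pylance packages are installed):

    import numpy as np
    from ml_dtypes import bfloat16
    from lance.arrow import BFloat16Array

    # NumPy bfloat16 -> Arrow extension array (from_numpy is shown in the diff)
    np_array = np.array([1.1, 2.1, 3.4], dtype=bfloat16)
    arrow_array = BFloat16Array.from_numpy(np_array)

    # ... and back again via to_numpy
    round_tripped = arrow_array.to_numpy()
    assert round_tripped.dtype == bfloat16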
@@ -86,25 +83,23 @@ with a list of URIs represented by either :py:class:`pyarrow.StringArray` or an
 iterable that yields strings. Note that the URIs are not strongly validated and images
 are not read into memory automatically.

-.. testcode::
-
-    from lance.arrow import ImageURIArray
+.. doctest::

-    ImageURIArray.from_uris([
-        "/tmp/image1.jpg",
-        "file:///tmp/image2.jpg",
-        "s3://example/image3.jpg"
-    ])
+    >>> from lance.arrow import ImageURIArray

-.. testoutput::
+    >>> ImageURIArray.from_uris([
+    ...     "/tmp/image1.jpg",
+    ...     "file:///tmp/image2.jpg",
+    ...     "s3://example/image3.jpg"
+    ... ])
+    <lance.arrow.ImageURIArray object at 0x...>
+    ['/tmp/image1.jpg', 'file:///tmp/image2.jpg', 's3://example/image3.jpg']

-    <lance.arrow.ImageURIArray object at 0x.+>
-    ['/tmp/image1.jpg', 'file:///tmp/image2.jpg', 's3://example/image2.jpg']

 :func:`lance.arrow.ImageURIArray.read_uris` will read images into memory and return
 them as a new :class:`lance.arrow.EncodedImageArray` object.

-.. testcode::
+.. code-block:: python

     from lance.arrow import ImageURIArray

@@ -139,7 +134,7 @@ function parameter. If decoder is not provided it will attempt to use
 `Pillow`_ and `tensorflow`_ in that
 order. If neither library or custom decoder is available an exception will be raised.

-.. testcode::
+.. code-block:: python

     from lance.arrow import ImageURIArray

@@ -185,30 +180,20 @@ If encoder is not provided it will attempt to use
 `tensorflow`_ and `Pillow`_ in that order. Default encoders will
 encode to PNG. If neither library is available it will raise an exception.

-.. testcode::
-
-    from lance.arrow import ImageURIArray
-
-    def jpeg_encoder(images):
-        import tensorflow as tf
+.. testsetup::

-        encoded_images = (
-            tf.io.encode_jpeg(x).numpy() for x in tf.convert_to_tensor(images)
-        )
-        return pa.array(encoded_images, type=pa.binary())
+    image_uri = os.path.abspath(os.path.join(os.path.dirname(__name__), "_static", "icon.png"))

-    uris = [os.path.join(os.path.dirname(__file__), "images/1.png")]
-    tensor_images = ImageURIArray.from_uris(uris).read_uris().to_tensor()
-    print(tensor_images.to_encoded())
-    print(tensor_images.to_encoded(jpeg_encoder))
+.. doctest::

-.. testoutput::
+    >>> from lance.arrow import ImageURIArray

+    >>> uris = [image_uri]
+    >>> tensor_images = ImageURIArray.from_uris(uris).read_uris().to_tensor()
+    >>> tensor_images.to_encoded()
     <lance.arrow.EncodedImageArray object at 0x...>
-    [b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR\x00\x00\x00...']
-    <lance.arrow.EncodedImageArray object at 0x00007f8d90b91b40>
-    [b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x01\x01\x01...']
-
+    [...
+    b'\x89PNG\r\n\x1a...'

 .. _tensorflow: https://www.tensorflow.org/api_docs/python/tf/io/encode_png
 .. _Pillow: https://pillow.readthedocs.io/en/stable/
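The last hunk drops the custom-encoder demonstration from the published docs. For reference, the removed example showed that ``to_encoded`` accepts an encoder callable returning a binary ``pa.Array``; reproduced here as a sketch (it assumes tensorflow and pyarrow are installed and is no longer doctested):

    import pyarrow as pa

    def jpeg_encoder(images):
        # Encode each image tensor to JPEG bytes and wrap them in a binary array.
        import tensorflow as tf

        encoded_images = (
            tf.io.encode_jpeg(x).numpy() for x in tf.convert_to_tensor(images)
        )
        return pa.array(encoded_images, type=pa.binary())

    # Usage, with tensor_images as produced in the doctest above:
    #     tensor_images.to_encoded(jpeg_encoder)  # JPEG instead of the PNG default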

docs/conf.py (+26 −2)

@@ -1,6 +1,7 @@
 # Configuration file for the Sphinx documentation builder.

 import shutil
+from datetime import datetime


 def run_apidoc(_):
@@ -17,7 +18,7 @@ def setup(app):
 # -- Project information -----------------------------------------------------

 project = "Lance"
-copyright = "2024, Lance Developer"
+copyright = f"{datetime.today().year}, Lance Developer"
 author = "Lance Developer"


@@ -27,12 +28,13 @@ def setup(app):
 # extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
 # ones.
 extensions = [
-    "sphinx.ext.napoleon",
     "breathe",
+    "sphinx_copybutton",
     "sphinx.ext.autodoc",
     "sphinx.ext.doctest",
     "sphinx.ext.githubpages",
     "sphinx.ext.intersphinx",
+    "sphinx.ext.napoleon",
 ]

 napoleon_google_docstring = False
@@ -50,6 +52,12 @@ def setup(app):
 # This pattern also affects html_static_path and html_extra_path.
 exclude_patterns = ["_build", "Thumbs.db", ".DS_Store"]

+intersphinx_mapping = {
+    "numpy": ("https://numpy.org/doc/stable/", None),
+    "pyarrow": ("https://arrow.apache.org/docs/", None),
+    "pandas": ("https://pandas.pydata.org/pandas-docs/stable/", None),
+}
+

 # -- Options for HTML output -------------------------------------------------

@@ -67,3 +75,19 @@ def setup(app):
     "source_icon": "github",
 }
 html_css_files = ["custom.css"]
+
+# -- doctest configuration ---------------------------------------------------
+
+doctest_global_setup = """
+import os
+import shutil
+from typing import Iterator
+
+import lance
+import pyarrow as pa
+import numpy as np
+import pandas as pd
+"""
+
+# Only test code examples in rst files
+doctest_test_doctest_blocks = ""
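With this configuration, every ``.. doctest::`` block in the rst pages runs with the names above preinjected, which is why the reworked examples in docs/arrays.rst can use ``pd`` and ``pa`` without importing them; setting ``doctest_test_doctest_blocks`` to an empty string stops Sphinx from also collecting unmarked ``>>>`` paragraphs, so only explicit directives are tested. A hypothetical page-level example (not from this commit) showing the injected names in use:

    .. doctest::

       >>> pa.array([1, 2, 3]).type   # `pa` is provided by doctest_global_setup
       DataType(int64)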

docs/format.rst (+9 −9)

@@ -1,7 +1,7 @@
 Lance Formats
 =============

-The Lance project includes both a table format and a file format. Lance typically refers
+The Lance format is both a table format and a file format. Lance typically refers
 to tables as "datasets". A Lance dataset is designed to efficiently handle secondary indices,
 fast ingestion and modification of data, and a rich set of schema evolution features.

@@ -31,7 +31,7 @@ Fragments
 ~~~~~~~~~

 ``DataFragment`` represents a chunk of data in the dataset. Itself includes one or more ``DataFile``,
-where each ``DataFile`` can contain several columns in the chunk of data. It also may include a 
+where each ``DataFile`` can contain several columns in the chunk of data. It also may include a
 ``DeletionFile``, which is explained in a later section.

 .. literalinclude:: ../protos/table.proto
@@ -86,7 +86,7 @@ and/or performance. However, older software versions may not be able to read ne

 In addition, the latest version of the file format (next) is unstable and should not be
 used for production use cases. Breaking changes could be made to unstable encodings and
-that would mean that files written with these encodings are no longer readable by any 
+that would mean that files written with these encodings are no longer readable by any
 newer versions of Lance. The ``next`` version should only be used for experimentation
 and benchmarking upcoming features.

@@ -95,7 +95,7 @@ The following values are supported:
 .. list-table:: File Versions
    :widths: 20 20 20 40
    :header-rows: 1
-   
+
    * - Version
      - Minimal Lance Version
      - Maximum Lance Version
@@ -206,7 +206,7 @@ Feature Flags
 As the file format and dataset evolve, new feature flags are added to the
 format. There are two separate fields for checking for feature flags, depending
 on whether you are trying to read or write the table. Readers should check the
-``reader_feature_flags`` to see if there are any flag it is not aware of. Writers 
+``reader_feature_flags`` to see if there are any flag it is not aware of. Writers
 should check ``writer_feature_flags``. If either sees a flag they don't know, they
 should return an "unsupported" error on any read or write operation.

@@ -286,7 +286,7 @@ deleted for some fragment. For a given version of the dataset, each fragment can
 have up to one deletion file. Fragments that have no deleted rows have no deletion
 file.

-Readers should filter out row ids contained in these deletion files during a 
+Readers should filter out row ids contained in these deletion files during a
 scan or ANN search.

 Deletion files come in two flavors:
@@ -319,7 +319,7 @@ collisions. The suffix is determined by the file type (``.arrow`` for Arrow file
    :start-at: // Deletion File
    :end-at: } // DeletionFile

-Deletes can be materialized by re-writing data files with the deleted rows 
+Deletes can be materialized by re-writing data files with the deleted rows
 removed. However, this invalidates row indices and thus the ANN indices, which
 can be expensive to recompute.

@@ -388,7 +388,7 @@ The commit process is as follows:
    fails because another writer has already committed, go back to step 3.

 When checking whether two transactions conflict, be conservative. If the
-transaction file is missing, assume it conflicts. If the transaction file 
+transaction file is missing, assume it conflicts. If the transaction file
 has an unknown operation, assume it conflicts.

 .. _external-manifest-store:
@@ -555,7 +555,7 @@ The row id values for a fragment are stored in a ``RowIdSequence`` protobuf
 message. This is described in the `protos/rowids.proto`_ file. Row id sequences
 are just arrays of u64 values, which have representations optimized for the
 common case where they are sorted and possibly contiguous. For example, a new
-fragment will have a row id sequence that is just a simple range, so it is 
+fragment will have a row id sequence that is just a simple range, so it is
 stored as a ``start`` and ``end`` value.

 These sequence messages are either stored inline in the fragment metadata, or

docs/index.rst (+5 −5)

@@ -2,20 +2,20 @@
 .. image:: _static/lance_logo.png
   :width: 400

-Lance: modern columnar data format for ML
-======================================================================================
+Lance: modern columnar format for ML workloads
+==============================================


-`Lance` is a columnar data format that is easy and fast to version, query and train on.
+`Lance` is a columnar format that is easy and fast to version, query and train on.
 It’s designed to be used with images, videos, 3D point clouds, audio and of course tabular data.
 It supports any POSIX file systems, and cloud storage like AWS S3 and Google Cloud Storage.
 The key features of Lance include:

 * **High-performance random access:** 100x faster than Parquet.

-* **Vector search:** find nearest neighbors in under 1 millisecond and combine OLAP-queries with vector search.
+* **Zero-copy schema evolution:** add and drop columns without copying the entire dataset.

-* **Zero-copy, automatic versioning:** manage versions of your data automatically, and reduce redundancy with zero-copy logic built-in.
+* **Vector search:** find nearest neighbors in under 1 millisecond and combine OLAP-queries with vector search.

 * **Ecosystem integrations:** Apache-Arrow, DuckDB and more on the way.
