Commit a9be7af

docs: object store configuration (#1849)
Closes #1844

* Documents how to configure S3 and GCS
* This includes S3 Express
* Added a CI job to check for broken links
* Fixed various existing issues in docs that cause this CI job to fail
1 parent 936c60a commit a9be7af

File tree

7 files changed (+231 -78 lines)


.github/workflows/docs-check.yml

+50
@@ -0,0 +1,50 @@
+name: Check docs
+
+on:
+  push:
+    branches: ["main"]
+  pull_request:
+    paths:
+      - docs/**
+
+env:
+  # Disable full debug symbol generation to speed up CI build and keep memory down
+  # "1" means line tables only, which is useful for panic tracebacks.
+  RUSTFLAGS: "-C debuginfo=1"
+  # according to: https://matklad.github.io/2021/09/04/fast-rust-builds.html
+  # CI builds are faster with incremental disabled.
+  CARGO_INCREMENTAL: "0"
+
+jobs:
+  # Single deploy job since we're just deploying
+  check-docs:
+    runs-on: ubuntu-22.04-4core
+    steps:
+      - name: Checkout
+        uses: actions/checkout@v3
+      - name: Set up Python
+        uses: actions/setup-python@v4
+        with:
+          python-version: "3.11"
+          cache: 'pip'
+          cache-dependency-path: "docs/requirements.txt"
+      - name: Install dependencies
+        run: |
+          sudo apt install -y -qq doxygen pandoc
+      - name: Build python wheel
+        uses: ./.github/workflows/build_linux_wheel
+      - name: Build Python
+        working-directory: python
+        run: |
+          python -m pip install $(ls target/wheels/*.whl)
+          python -m pip install -r ../docs/requirements.txt
+      - name: Build docs
+        working-directory: docs
+        run: |
+          make nbconvert
+          make html
+      - name: Check links
+        working-directory: docs
+        run: |
+          make linkcheck

.github/workflows/docs.yml → .github/workflows/docs-deploy.yml (renamed)

+1-1
@@ -65,4 +65,4 @@ jobs:
           path: 'docs/_build/html'
       - name: Deploy to GitHub Pages
         id: deployment
-        uses: actions/deploy-pages@v1
+        uses: actions/deploy-pages@v1

docs/Makefile

+7
@@ -7,6 +7,7 @@ SPHINXOPTS ?=
 SPHINXBUILD   ?= sphinx-build
 SOURCEDIR     = .
 BUILDDIR      = _build
+LINKCHECKDIR  = build/linkcheck

 # Put it first so that "make" without argument is like "make help".
 help:
@@ -21,3 +22,9 @@ nbconvert:
 # "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
 %: Makefile
 	@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
+
+checklinks:
+	$(SPHINXBUILD) -b linkcheck "$(SOURCEDIR)" $(LINKCHECKDIR)
+	@echo
+	@echo "Check finished. Report is in $(LINKCHECKDIR)."
+.PHONY: checklinks

docs/arrays.rst

+6-4
@@ -136,8 +136,7 @@ encoded images and return them as :class:`lance.arrow.FixedShapeImageTensorArray
 which they can be converted to numpy arrays or TensorFlow tensors.
 For decoding images, it will first attempt to use a decoder provided via the optional
 function parameter. If decoder is not provided it will attempt to use
-`Pillow <https://pillow.readthedocs.io/en/stable/>`_ and
-`tensorflow <https://www.tensorflow.org/api_docs/python/tf/io/decode_image>`_ in that
+`Pillow`_ and `tensorflow`_ in that
 order. If neither library or custom decoder is available an exception will be raised.

 .. testcode::
@@ -183,8 +182,7 @@ created by calling :func:`lance.arrow.ImageArray.from_array` and passing in a
 It can be encoded into to :class:`lance.arrow.EncodedImageArray` by calling
 :func:`lance.arrow.FixedShapeImageTensorArray.to_encoded` and passing custom encoder
 If encoder is not provided it will attempt to use
-`tensorflow <https://www.tensorflow.org/api_docs/python/tf/io/encode_png>`_ and
-`Pillow <https://pillow.readthedocs.io/en/stable/>`_ in that order. Default encoders will
+`tensorflow`_ and `Pillow`_ in that order. Default encoders will
 encode to PNG. If neither library is available it will raise an exception.

 .. testcode::
@@ -210,3 +208,7 @@ encode to PNG. If neither library is available it will raise an exception.
     [b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR\x00\x00\x00...']
     <lance.arrow.EncodedImageArray object at 0x00007f8d90b91b40>
     [b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x01\x01\x01...']
+
+
+.. _tensorflow: https://www.tensorflow.org/api_docs/python/tf/io/encode_png
+.. _Pillow: https://pillow.readthedocs.io/en/stable/

docs/integrations/tensorflow.rst

+1-1
@@ -55,7 +55,7 @@ By default, Lance will infer the Tensor spec from the projected columns. You can
 Distributed Training and Shuffling
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-Since `a Lance Dataset is a set of Fragments <../format>`_, we can distribute and shuffle Fragments to different
+Since `a Lance Dataset is a set of Fragments <../format.rst>`_, we can distribute and shuffle Fragments to different
 workers.

 .. code-block:: python

docs/read_and_write.rst

+165-71
@@ -139,77 +139,7 @@ of Alice and Bob in the same example, we could write:
 .. for updating single rows in a loop, and users should instead do bulk updates
 .. using MERGE.

-Committing mechanisms for S3
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-Most supported storage systems (e.g. local file system, Google Cloud Storage,
-Azure Blob Store) natively support atomic commits, which prevent concurrent
-writers from corrupting the dataset. However, S3 does not support this natively.
-To work around this, you may provide a locking mechanism that Lance can use to
-lock the table while providing a write. To do so, you should implement a
-context manager that acquires and releases a lock and then pass that to the
-``commit_lock`` parameter of :py:meth:`lance.write_dataset`.

-.. note::
-
-    In order for the locking mechanism to work, all writers must use the same exact
-    mechanism. Otherwise, Lance will not be able to detect conflicts.
-
-On entering, the context manager should acquire the lock on the table. The table
-version being committed is passed in as an argument, which may be used if the
-locking service wishes to keep track of the current version of the table, but
-this is not required. If the table is already locked by another transaction,
-it should wait until it is unlocked, since the other transaction may fail. Once
-unlocked, it should either lock the table or, if the lock keeps track of the
-current version of the table, return a :class:`CommitConflictError` if the
-requested version has already been committed.
-
-To prevent poisoned locks, it's recommended to set a timeout on the locks. That
-way, if a process crashes while holding the lock, the lock will be released
-eventually. The timeout should be no less than 30 seconds.
-
-.. code-block:: python
-
-    from contextlib import contextmanager
-
-    @contextmanager
-    def commit_lock(version: int);
-        # Acquire the lock
-        my_lock.acquire()
-        try:
-            yield
-        except:
-            failed = True
-        finally:
-            my_lock.release()
-
-    lance.write_dataset(data, "s3://bucket/path/", commit_lock=commit_lock)
-
-When the context manager is exited, it will raise an exception if the commit
-failed. This might be because of a network error or if the version has already
-been written. Either way, the context manager should release the lock. Use a
-try/finally block to ensure that the lock is released.
-
-Concurrent Writer on S3 using DynamoDB
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-.. warning::
-
-    This feature is experimental at the moment
-
-Lance has native support for concurrent writers on S3 using DynamoDB instead of locking.
-User may pass in a DynamoDB table name alone with the S3 URI to their dataset to enable this feature.
-
-.. code-block:: python
-
-    import lance
-    # s3+ddb:// URL scheme let's lance know that you want to use DynamoDB for writing to S3 concurrently
-    ds = lance.dataset("s3+ddb://my-bucket/mydataset.lance?ddbTableName=mytable")
-
-The DynamoDB table is expected to have a primary hash key of ``base_uri`` and a range key ``version``.
-The key ``base_uri`` should be string type, and the key ``version`` should be number type.
-
-For details on how this feature works, please see :ref:`external-manifest-store`.


 Reading Lance Dataset
@@ -227,7 +157,7 @@ To open a Lance dataset, use the :py:meth:`lance.dataset` function:
 .. note::

     Lance supports local file system, AWS ``s3`` and Google Cloud Storage(``gs``) as storage backends
-    at the moment.
+    at the moment. Read more in `Object Store Configuration`_.

 The most straightforward approach for reading a Lance dataset is to utilize the :py:meth:`lance.LanceDataset.to_table`
 method in order to load the entire dataset into memory.
@@ -424,3 +354,167 @@ rows don't have to be skipped during the scan.
 When files are rewritten, the original row ids are invalidated. This means the
 affected files are no longer part of any ANN index if they were before. Because
 of this, it's recommended to rewrite files before re-building indices.
+
+
+Object Store Configuration
+--------------------------
+
+Lance supports object stores such as AWS S3 (and compatible stores), Azure Blob Store,
+and Google Cloud Storage. Which object store to use is determined by the URI scheme of
+the dataset path. For example, ``s3://bucket/path`` will use S3, ``az://bucket/path``
+will use Azure, and ``gs://bucket/path`` will use GCS.
+
+Lance uses the `object-store`_ Rust crate for object store access. There are general
+environment variables that can be used to configure the object store, such as the
+request timeout and proxy configuration. See the `object_store ClientConfigKey`__ docs
+for available configuration options. (The environment variables that can be set
+are the snake-cased versions of these variable names. For example, to set ``ProxyUrl``
+use the environment variable ``PROXY_URL``.)
+
+.. _object-store: https://docs.rs/object_store/0.9.0/object_store/
+.. __: https://docs.rs/object_store/latest/object_store/enum.ClientConfigKey.html
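
For illustration, a minimal sketch of setting a couple of these options from Python before
opening a dataset (the timeout, proxy URL, and bucket path below are placeholders, and which
options matter depends on your store):

.. code-block:: python

    import os
    import lance

    # TIMEOUT and PROXY_URL are the snake-cased forms of the ClientConfigKey
    # entries ``Timeout`` and ``ProxyUrl`` mentioned above; values are examples only.
    os.environ["TIMEOUT"] = "60s"
    os.environ["PROXY_URL"] = "http://proxy.example.com:8080"

    ds = lance.dataset("s3://bucket/path")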
+
+
+S3 Configuration
+~~~~~~~~~~~~~~~~
+
+To configure credentials for AWS S3, you can use the ``AWS_ACCESS_KEY_ID``,
+``AWS_SECRET_ACCESS_KEY``, and ``AWS_SESSION_TOKEN`` environment variables.
+
+Alternatively, if you are using AWS SSO, you can use the ``AWS_PROFILE`` and
+``AWS_DEFAULT_REGION`` environment variables.
+
+You can see a full list of environment variables `here`__.
+
+.. __: https://docs.rs/object_store/latest/object_store/aws/struct.AmazonS3Builder.html#method.from_env
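
For illustration, a minimal sketch of supplying static credentials from Python (the key values
and bucket are placeholders; in practice these variables are usually set in the shell or by your
credential provider rather than in code):

.. code-block:: python

    import os
    import lance

    # Placeholder credentials -- never hard-code real keys.
    os.environ["AWS_ACCESS_KEY_ID"] = "my-access-key"
    os.environ["AWS_SECRET_ACCESS_KEY"] = "my-secret-key"
    # Or, when using AWS SSO:
    # os.environ["AWS_PROFILE"] = "my-profile"
    # os.environ["AWS_DEFAULT_REGION"] = "us-east-1"

    ds = lance.dataset("s3://my-bucket/my-dataset.lance")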
+
+S3-compatible stores
+^^^^^^^^^^^^^^^^^^^^
+
+Lance can also connect to S3-compatible stores, such as MinIO. To do so, you must
+specify two environment variables: ``AWS_ENDPOINT`` and ``AWS_DEFAULT_REGION``.
+``AWS_ENDPOINT`` should be the URL of the S3-compatible store, and
+``AWS_DEFAULT_REGION`` should be the region to use.
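
As a sketch, connecting to a MinIO deployment might look like the following (the endpoint,
region, bucket, and credentials are assumptions for the example):

.. code-block:: python

    import os
    import lance

    os.environ["AWS_ENDPOINT"] = "https://minio.example.com"
    os.environ["AWS_DEFAULT_REGION"] = "us-east-1"
    os.environ["AWS_ACCESS_KEY_ID"] = "minioadmin"
    os.environ["AWS_SECRET_ACCESS_KEY"] = "minioadmin"

    ds = lance.dataset("s3://my-bucket/my-dataset.lance")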
+
+S3 Express
+^^^^^^^^^^
+
+.. versionadded:: 0.9.7
+
+Lance supports `S3 Express One Zone`_ endpoints, but requires additional configuration. Also,
+S3 Express endpoints only support connecting from an EC2 instance within the same
+region.
+
+.. _S3 Express One Zone: https://aws.amazon.com/s3/storage-classes/express-one-zone/
+
+To configure Lance to use an S3 Express endpoint, you must set the environment
+variable ``S3_EXPRESS``:
+
+.. code-block:: bash
+
+    export S3_EXPRESS=true
+
+You can then pass the bucket name **including the suffix** as you would normally:
+
+.. code-block:: python
+
+    import lance
+    ds = lance.dataset("s3://my-bucket--use1-az4--x-s3/path/imagenet.lance")
+
+
+Committing mechanisms for S3
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Most supported storage systems (e.g. local file system, Google Cloud Storage,
+Azure Blob Store) natively support atomic commits, which prevent concurrent
+writers from corrupting the dataset. However, S3 does not support this natively.
+To work around this, you may provide a locking mechanism that Lance can use to
+lock the table while performing a write. To do so, you should implement a
+context manager that acquires and releases a lock and then pass that to the
+``commit_lock`` parameter of :py:meth:`lance.write_dataset`.
+
+.. note::
+
+    In order for the locking mechanism to work, all writers must use the same exact
+    mechanism. Otherwise, Lance will not be able to detect conflicts.
+
+On entering, the context manager should acquire the lock on the table. The table
+version being committed is passed in as an argument, which may be used if the
+locking service wishes to keep track of the current version of the table, but
+this is not required. If the table is already locked by another transaction,
+it should wait until it is unlocked, since the other transaction may fail. Once
+unlocked, it should either lock the table or, if the lock keeps track of the
+current version of the table, return a :class:`CommitConflictError` if the
+requested version has already been committed.
+
+To prevent poisoned locks, it's recommended to set a timeout on the locks. That
+way, if a process crashes while holding the lock, the lock will be released
+eventually. The timeout should be no less than 30 seconds.
+
+.. code-block:: python
+
+    from contextlib import contextmanager
+
+    @contextmanager
+    def commit_lock(version: int):
+        # Acquire the lock
+        my_lock.acquire()
+        try:
+            yield
+        finally:
+            my_lock.release()
+
+    lance.write_dataset(data, "s3://bucket/path/", commit_lock=commit_lock)
+
+When the context manager is exited, it will raise an exception if the commit
+failed. This might be because of a network error or if the version has already
+been written. Either way, the context manager should release the lock. Use a
+try/finally block to ensure that the lock is released.
+
+Concurrent Writer on S3 using DynamoDB
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+.. warning::
+
+    This feature is experimental at the moment.
+
+Lance has native support for concurrent writers on S3 using DynamoDB instead of locking.
+Users may pass in a DynamoDB table name along with the S3 URI of their dataset to enable this feature.
+
+.. code-block:: python
+
+    import lance
+    # The s3+ddb:// URL scheme lets Lance know that you want to use DynamoDB for writing to S3 concurrently
+    ds = lance.dataset("s3+ddb://my-bucket/mydataset.lance?ddbTableName=mytable")
+
+The DynamoDB table is expected to have a primary hash key of ``base_uri`` and a range key ``version``.
+The key ``base_uri`` should be string type, and the key ``version`` should be number type.
+
+For details on how this feature works, please see :ref:`external-manifest-store`.
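
For reference, a hedged sketch of creating such a table with ``boto3`` (the table name, region,
and billing mode are illustrative; any tool that produces the same key schema works):

.. code-block:: python

    import boto3

    dynamodb = boto3.client("dynamodb", region_name="us-east-1")
    dynamodb.create_table(
        TableName="mytable",
        KeySchema=[
            {"AttributeName": "base_uri", "KeyType": "HASH"},   # primary hash key
            {"AttributeName": "version", "KeyType": "RANGE"},   # range key
        ],
        AttributeDefinitions=[
            {"AttributeName": "base_uri", "AttributeType": "S"},  # string
            {"AttributeName": "version", "AttributeType": "N"},   # number
        ],
        BillingMode="PAY_PER_REQUEST",
    )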
+
+
+Google Cloud Storage Configuration
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+GCS credentials are configured by setting the ``GOOGLE_SERVICE_ACCOUNT`` environment
+variable to the path of a JSON file containing the service account credentials.
+There are several aliases for this environment variable, documented `here`__.
+
+.. __: https://docs.rs/object_store/latest/object_store/gcp/struct.GoogleCloudStorageBuilder.html#method.from_env
+
+.. note::
+
+    By default, GCS uses HTTP/1 for communication, as opposed to HTTP/2. This improves
+    maximum throughput significantly. However, if you wish to use HTTP/2 for some reason,
+    you can set the environment variable ``HTTP1_ONLY`` to ``false``.
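
As a sketch, pointing Lance at a service account key file and opening a dataset (the file path
and bucket are placeholders):

.. code-block:: python

    import os
    import lance

    os.environ["GOOGLE_SERVICE_ACCOUNT"] = "/path/to/service-account.json"

    ds = lance.dataset("gs://my-bucket/my-dataset.lance")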
512+
513+
Azure Blob Storage Configuration
514+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
515+
516+
Azure Blob Storage credentials can be configured by setting the ``AZURE_STORAGE_ACCOUNT_NAME``
517+
and ``AZURE_STORAGE_ACCOUNT_KEY`` environment variables. The full list of environment
518+
variables that can be set are documented `here`__.
519+
520+
.. __: https://docs.rs/object_store/latest/object_store/azure/struct.MicrosoftAzureBuilder.html#method.from_env
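
A minimal sketch, with placeholder account name, key, and container:

.. code-block:: python

    import os
    import lance

    os.environ["AZURE_STORAGE_ACCOUNT_NAME"] = "myaccount"
    os.environ["AZURE_STORAGE_ACCOUNT_KEY"] = "my-account-key"

    ds = lance.dataset("az://my-container/my-dataset.lance")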

python/python/lance/torch/distance.py

+1-1
@@ -97,7 +97,7 @@ def cosine_distance(
 ) -> Tuple[torch.Tensor, torch.Tensor]:
     """Cosine pair-wise distances between two 2-D Tensors.

-    Cosine distance = 1 - |xy| / ||x|| * ||y||
+    Cosine distance = ``1 - |xy| / ||x|| * ||y||``

     Parameters
     ----------
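
Read with the usual precedence, the docstring formula above means one minus the cosine
similarity, i.e. ``1 - (x . y) / (||x|| * ||y||)``. A small NumPy sketch of that pair-wise
computation (not the Lance API; the array shapes and values are illustrative):

.. code-block:: python

    import numpy as np

    # Pair-wise cosine distance between rows of x (n, d) and rows of y (m, d).
    def cosine_distance(x: np.ndarray, y: np.ndarray) -> np.ndarray:
        x_unit = x / np.linalg.norm(x, axis=1, keepdims=True)
        y_unit = y / np.linalg.norm(y, axis=1, keepdims=True)
        return 1.0 - x_unit @ y_unit.T

    x = np.array([[1.0, 0.0], [0.0, 1.0]])
    y = np.array([[1.0, 0.0]])
    print(cosine_distance(x, y))  # [[0.], [1.]]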
