@@ -139,77 +139,7 @@ of Alice and Bob in the same example, we could write:
.. for updating single rows in a loop, and users should instead do bulk updates
.. using MERGE.

- Committing mechanisms for S3
- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
- Most supported storage systems (e.g. local file system, Google Cloud Storage,
- Azure Blob Store) natively support atomic commits, which prevent concurrent
- writers from corrupting the dataset. However, S3 does not support this natively.
- To work around this, you may provide a locking mechanism that Lance can use to
- lock the table while performing a write. To do so, you should implement a
- context manager that acquires and releases a lock and then pass that to the
- ``commit_lock`` parameter of :py:meth:`lance.write_dataset`.
-
- .. note::
-
-    In order for the locking mechanism to work, all writers must use the exact
-    same mechanism. Otherwise, Lance will not be able to detect conflicts.
-
- On entering, the context manager should acquire the lock on the table. The table
- version being committed is passed in as an argument, which may be used if the
- locking service wishes to keep track of the current version of the table, but
- this is not required. If the table is already locked by another transaction,
- it should wait until it is unlocked, since the other transaction may fail. Once
- unlocked, it should either lock the table or, if the lock keeps track of the
- current version of the table, raise a :class:`CommitConflictError` if the
- requested version has already been committed.
-
- To prevent poisoned locks, it's recommended to set a timeout on the locks. That
- way, if a process crashes while holding the lock, the lock will be released
- eventually. The timeout should be no less than 30 seconds.
-
- .. code-block:: python
-
-     from contextlib import contextmanager
-
-     @contextmanager
-     def commit_lock(version: int):
-         # Acquire the lock; my_lock is assumed to be a lock shared by all writers
-         my_lock.acquire()
-         try:
-             yield
-         finally:
-             # Release the lock even if the commit raised an exception
-             my_lock.release()
-
-     lance.write_dataset(data, "s3://bucket/path/", commit_lock=commit_lock)
-
- When the context manager is exited, it will raise an exception if the commit
- failed. This might be because of a network error or because the version has already
- been written. Either way, the context manager should release the lock. Use a
- try/finally block to ensure that the lock is released.
-
- Concurrent Writers on S3 using DynamoDB
- ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
- .. warning::
-
-    This feature is experimental at the moment.
-
- Lance has native support for concurrent writers on S3 using DynamoDB instead of locking.
- Users may pass in a DynamoDB table name along with the S3 URI to their dataset to enable this feature.
-
- .. code-block:: python
-
-     import lance
-     # The s3+ddb:// URL scheme lets Lance know that you want to use DynamoDB for writing to S3 concurrently
-     ds = lance.dataset("s3+ddb://my-bucket/mydataset.lance?ddbTableName=mytable")
-
- The DynamoDB table is expected to have a primary hash key of ``base_uri`` and a range key ``version``.
- The key ``base_uri`` should be string type, and the key ``version`` should be number type.
-
- For details on how this feature works, please see :ref:`external-manifest-store`.

Reading Lance Dataset
@@ -227,7 +157,7 @@ To open a Lance dataset, use the :py:meth:`lance.dataset` function:
.. note::

   Lance supports local file system, AWS ``s3`` and Google Cloud Storage (``gs``) as storage backends
-    at the moment.
+    at the moment. Read more in `Object Store Configuration`_.

The most straightforward approach for reading a Lance dataset is to use the
:py:meth:`lance.LanceDataset.to_table` method to load the entire dataset into memory.
@@ -424,3 +354,167 @@ rows don't have to be skipped during the scan.
When files are rewritten, the original row ids are invalidated. This means the
affected files are no longer part of any ANN index if they were before. Because
of this, it's recommended to rewrite files before re-building indices.
+
+
+ Object Store Configuration
+ --------------------------
+
+ Lance supports object stores such as AWS S3 (and compatible stores), Azure Blob Store,
+ and Google Cloud Storage. Which object store to use is determined by the URI scheme of
+ the dataset path. For example, ``s3://bucket/path`` will use S3, ``az://bucket/path``
+ will use Azure, and ``gs://bucket/path`` will use GCS.
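+
+ For example, opening a (hypothetical) dataset on each store; the bucket and container
+ names are placeholders, and credentials must be configured as described below:
+
+ .. code-block:: python
+
+     import lance
+
+     ds_s3 = lance.dataset("s3://my-bucket/my-dataset.lance")     # AWS S3
+     ds_az = lance.dataset("az://my-container/my-dataset.lance")  # Azure Blob Store
+     ds_gs = lance.dataset("gs://my-bucket/my-dataset.lance")     # Google Cloud Storage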
+
+ Lance uses the `object-store`_ Rust crate for object store access. There are general
+ environment variables that can be used to configure the object store, such as the
+ request timeout and proxy configuration. See the `object_store ClientConfigKey`__ docs
+ for available configuration options. (The environment variables that can be set
+ are the snake-cased versions of these variable names. For example, to set ``ProxyUrl``,
+ use the environment variable ``PROXY_URL``.)
+
+ .. _object-store: https://docs.rs/object_store/0.9.0/object_store/
+ .. __: https://docs.rs/object_store/latest/object_store/enum.ClientConfigKey.html
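+
+ For instance, a sketch of setting two of these options (the values are illustrative,
+ not defaults): ``TIMEOUT`` maps to ``ClientConfigKey::Timeout`` and ``PROXY_URL``
+ to ``ClientConfigKey::ProxyUrl``.
+
+ .. code-block:: bash
+
+     # Overall request timeout, as a human-readable duration
+     export TIMEOUT=60s
+     # Route object store requests through a proxy assumed to run at localhost:3128
+     export PROXY_URL=http://localhost:3128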
+
+
+ S3 Configuration
+ ~~~~~~~~~~~~~~~~
+
+ To configure credentials for AWS S3, you can use the ``AWS_ACCESS_KEY_ID``,
+ ``AWS_SECRET_ACCESS_KEY``, and ``AWS_SESSION_TOKEN`` environment variables.
+
+ Alternatively, if you are using AWS SSO, you can use the ``AWS_PROFILE`` and
+ ``AWS_DEFAULT_REGION`` environment variables.
+
+ You can see a full list of environment variables `here`__.
+
+ .. __: https://docs.rs/object_store/latest/object_store/aws/struct.AmazonS3Builder.html#method.from_env
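+
+ A minimal sketch, using AWS's documentation example values rather than real credentials:
+
+ .. code-block:: bash
+
+     export AWS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE
+     export AWS_SECRET_ACCESS_KEY=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
+     # Only needed for temporary credentials:
+     # export AWS_SESSION_TOKEN=...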
+
+ S3-compatible stores
+ ^^^^^^^^^^^^^^^^^^^^
+
+ Lance can also connect to S3-compatible stores, such as MinIO. To do so, you must
+ specify two environment variables: ``AWS_ENDPOINT`` and ``AWS_DEFAULT_REGION``.
+ ``AWS_ENDPOINT`` should be the URL of the S3-compatible store, and
+ ``AWS_DEFAULT_REGION`` should be the region to use.
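+
+ For example, for a MinIO server assumed to be running locally on its default port:
+
+ .. code-block:: bash
+
+     export AWS_ENDPOINT=http://localhost:9000
+     export AWS_DEFAULT_REGION=us-east-1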
+
+ S3 Express
+ ^^^^^^^^^^
+
+ .. versionadded:: 0.9.7
+
+ Lance supports `S3 Express One Zone`_ endpoints, but requires additional configuration.
+ Also, S3 Express endpoints only support connecting from an EC2 instance within the
+ same region.
+
+ .. _S3 Express One Zone: https://aws.amazon.com/s3/storage-classes/express-one-zone/
+
+ To configure Lance to use an S3 Express endpoint, you must set the environment
+ variable ``S3_EXPRESS``:
+
+ .. code-block:: bash
+
+     export S3_EXPRESS=true
+
+ You can then pass the bucket name **including the suffix** as you would normally:
+
+ .. code-block:: python
+
+     import lance
+
+     ds = lance.dataset("s3://my-bucket--use1-az4--x-s3/path/imagenet.lance")
+
+
+ Committing mechanisms for S3
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+ Most supported storage systems (e.g. local file system, Google Cloud Storage,
+ Azure Blob Store) natively support atomic commits, which prevent concurrent
+ writers from corrupting the dataset. However, S3 does not support this natively.
+ To work around this, you may provide a locking mechanism that Lance can use to
+ lock the table while performing a write. To do so, you should implement a
+ context manager that acquires and releases a lock and then pass that to the
+ ``commit_lock`` parameter of :py:meth:`lance.write_dataset`.
+
+ .. note::
+
+    In order for the locking mechanism to work, all writers must use the exact
+    same mechanism. Otherwise, Lance will not be able to detect conflicts.
+
+ On entering, the context manager should acquire the lock on the table. The table
+ version being committed is passed in as an argument, which may be used if the
+ locking service wishes to keep track of the current version of the table, but
+ this is not required. If the table is already locked by another transaction,
+ it should wait until it is unlocked, since the other transaction may fail. Once
+ unlocked, it should either lock the table or, if the lock keeps track of the
+ current version of the table, raise a :class:`CommitConflictError` if the
+ requested version has already been committed.
+
+ To prevent poisoned locks, it's recommended to set a timeout on the locks. That
+ way, if a process crashes while holding the lock, the lock will be released
+ eventually. The timeout should be no less than 30 seconds.
+
+ .. code-block:: python
+
+     from contextlib import contextmanager
+
+     @contextmanager
+     def commit_lock(version: int):
+         # Acquire the lock; my_lock is assumed to be a lock shared by all writers
+         my_lock.acquire()
+         try:
+             yield
+         finally:
+             # Release the lock even if the commit raised an exception
+             my_lock.release()
+
+     lance.write_dataset(data, "s3://bucket/path/", commit_lock=commit_lock)
+
+ When the context manager is exited, it will raise an exception if the commit
+ failed. This might be because of a network error or because the version has already
+ been written. Either way, the context manager should release the lock. Use a
+ try/finally block to ensure that the lock is released.
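+
+ As an illustration, here is a sketch of such a lock built on the ``redis`` Python
+ package (a hypothetical setup; any shared lock service works, provided every
+ writer uses the same one):
+
+ .. code-block:: python
+
+     from contextlib import contextmanager
+
+     import lance
+     import redis
+
+     # All writers must point at the same Redis instance (hypothetical host)
+     client = redis.Redis(host="my-redis-host")
+
+     @contextmanager
+     def commit_lock(version: int):
+         # The 30-second timeout releases the lock if a writer crashes,
+         # preventing poisoned locks
+         lock = client.lock("my-dataset-commit-lock", timeout=30)
+         lock.acquire()
+         try:
+             yield
+         finally:
+             lock.release()
+
+     # "data" is the table being written, as in the example above
+     lance.write_dataset(data, "s3://bucket/path/", commit_lock=commit_lock)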
+
+ Concurrent Writers on S3 using DynamoDB
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+ .. warning::
+
+    This feature is experimental at the moment.
+
+ Lance has native support for concurrent writers on S3 using DynamoDB instead of locking.
+ Users may pass in a DynamoDB table name along with the S3 URI to their dataset to enable this feature.
+
+ .. code-block:: python
+
+     import lance
+     # The s3+ddb:// URL scheme lets Lance know that you want to use DynamoDB for writing to S3 concurrently
+     ds = lance.dataset("s3+ddb://my-bucket/mydataset.lance?ddbTableName=mytable")
+
+ The DynamoDB table is expected to have a primary hash key of ``base_uri`` and a range key ``version``.
+ The key ``base_uri`` should be string type, and the key ``version`` should be number type.
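+
+ For example, such a table could be created with the AWS CLI (the table name
+ ``mytable`` matches the example above; the billing mode is an arbitrary choice):
+
+ .. code-block:: bash
+
+     aws dynamodb create-table \
+         --table-name mytable \
+         --attribute-definitions \
+             AttributeName=base_uri,AttributeType=S \
+             AttributeName=version,AttributeType=N \
+         --key-schema \
+             AttributeName=base_uri,KeyType=HASH \
+             AttributeName=version,KeyType=RANGE \
+         --billing-mode PAY_PER_REQUEST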
+
+ For details on how this feature works, please see :ref:`external-manifest-store`.
+
+
+ Google Cloud Storage Configuration
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+ GCS credentials are configured by setting the ``GOOGLE_SERVICE_ACCOUNT`` environment
+ variable to the path of a JSON file containing the service account credentials.
+ There are several aliases for this environment variable, documented `here`__.
+
+ .. __: https://docs.rs/object_store/latest/object_store/gcp/struct.GoogleCloudStorageBuilder.html#method.from_env
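+
+ For example (the path is a placeholder for your own service account key file):
+
+ .. code-block:: bash
+
+     export GOOGLE_SERVICE_ACCOUNT=/path/to/service-account.json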
+
+ .. note::
+
+    By default, GCS uses HTTP/1 for communication, as opposed to HTTP/2. This improves
+    maximum throughput significantly. However, if you wish to use HTTP/2 for some reason,
+    you can set the environment variable ``HTTP1_ONLY`` to ``false``.
+
+ Azure Blob Storage Configuration
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+ Azure Blob Storage credentials can be configured by setting the ``AZURE_STORAGE_ACCOUNT_NAME``
+ and ``AZURE_STORAGE_ACCOUNT_KEY`` environment variables. The full list of environment
+ variables that can be set is documented `here`__.
+
+ .. __: https://docs.rs/object_store/latest/object_store/azure/struct.MicrosoftAzureBuilder.html#method.from_env
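+
+ For example (both values are placeholders):
+
+ .. code-block:: bash
+
+     export AZURE_STORAGE_ACCOUNT_NAME=mystorageaccount
+     export AZURE_STORAGE_ACCOUNT_KEY=<base64-encoded-access-key>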