chore: update documentation for S3 / DynamoDb log store configuration #2041

Merged 3 commits on Jan 8, 2024

Changes from all commits
5 changes: 1 addition & 4 deletions crates/deltalake-aws/src/storage.rs
@@ -539,7 +539,6 @@ mod tests {
s3_constants::AWS_ACCESS_KEY_ID.to_string() => "test_id_mixed".to_string(),
s3_constants::AWS_SECRET_ACCESS_KEY.to_string() => "test_secret_mixed".to_string(),
s3_constants::AWS_REGION.to_string() => "us-west-2".to_string(),
"DYNAMO_LOCK_PARTITION_KEY_VALUE".to_string() => "my_lock".to_string(),
"AWS_S3_GET_INTERNAL_SERVER_ERROR_RETRIES".to_string() => "3".to_string(),
});

@@ -562,9 +561,7 @@
s3_pool_idle_timeout: Duration::from_secs(1),
sts_pool_idle_timeout: Duration::from_secs(2),
s3_get_internal_server_error_retries: 3,
- extra_opts: hashmap! {
-     "DYNAMO_LOCK_PARTITION_KEY_VALUE".to_string() => "my_lock".to_string(),
- },
+ extra_opts: hashmap! {},
allow_unsafe_rename: false,
},
options
4 changes: 0 additions & 4 deletions crates/deltalake-aws/tests/common.rs
@@ -23,10 +23,6 @@ impl Default for S3Integration {
impl StorageIntegration for S3Integration {
/// Create a new bucket
fn create_bucket(&self) -> std::io::Result<ExitStatus> {
- set_env_if_not_set(
-     "DYNAMO_LOCK_PARTITION_KEY_VALUE",
-     format!("s3://{}", self.bucket_name()),
- );
Self::create_lock_table()?;
let mut child = Command::new("aws")
.args(["s3", "mb", &self.root_uri()])
2 changes: 1 addition & 1 deletion python/deltalake/writer.py
@@ -185,7 +185,7 @@ def write_deltalake(

Additionally, you must create a DynamoDB table with the name 'delta_rs_lock_table'
so that it can be automatically discovered by delta-rs. Alternatively, you can
- use a table name of your choice, but you must set the `DYNAMO_LOCK_TABLE_NAME`
+ use a table name of your choice, but you must set the `DELTA_DYNAMO_TABLE_NAME`
variable to match your chosen table name. The required schema for the DynamoDB
table is as follows:

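A minimal sketch of how the renamed variable might be used, assuming it is read from the environment as the docstring describes; the bucket, path, and table name below are hypothetical:

    import os

    import pandas as pd
    from deltalake import write_deltalake

    # Enable the DynamoDB locking provider and point delta-rs at a custom lock table.
    os.environ["AWS_S3_LOCKING_PROVIDER"] = "dynamodb"
    os.environ["DELTA_DYNAMO_TABLE_NAME"] = "my_delta_log"  # hypothetical table name

    df = pd.DataFrame({"x": [1, 2, 3]})
    write_deltalake("s3a://my-bucket/my-table", df)  # hypothetical bucket and path
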
56 changes: 43 additions & 13 deletions python/docs/source/usage.rst
@@ -492,47 +492,77 @@ locking provider at the moment in delta-rs. To enable DynamoDB as the
locking provider, you need to set the **AWS_S3_LOCKING_PROVIDER** to 'dynamodb'
as a ``storage_options`` or as an environment variable.

- Additionally, you must create a DynamoDB table with the name ``delta_rs_lock_table``
+ Additionally, you must create a DynamoDB table with the name ``delta_log``
so that it can be automatically recognized by delta-rs. Alternatively, you can
- use a table name of your choice, but you must set the **DYNAMO_LOCK_TABLE_NAME**
+ use a table name of your choice, but you must set the **DELTA_DYNAMO_TABLE_NAME**
Review comment (Member):
I don't recall from the pull request reviews previously, but why did this name change @dispanser?

Reply (Contributor, Author):
In the initial implementation path, both locking solutions were applied at the same time, so the table name had to be different (due to incompatible schemas). After throwing out the old locking logic, I simply did not think about changing it back. Given that a user switching to the new library has to tear down the old lock table and set up the new one anyway, it is probably no more breaking than the logic change already is, and I do think having DELTA in the name is actually better.

I now realize this is unfortunate, but with the release already made (?), changing it back might do more harm than good.

variable to match your chosen table name. The required schema for the DynamoDB
table is as follows:

.. code-block:: json


  {
      "Table": {
          "AttributeDefinitions": [
              {
-                 "AttributeName": "key",
+                 "AttributeName": "fileName",
                  "AttributeType": "S"
-             }
+             },
+             {
+                 "AttributeName": "tablePath",
+                 "AttributeType": "S"
+             }
          ],
-         "TableName": "delta_rs_lock_table",
+         "TableName": "delta_log",
          "KeySchema": [
              {
-                 "AttributeName": "key",
+                 "AttributeName": "tablePath",
                  "KeyType": "HASH"
-             }
+             },
+             {
+                 "AttributeName": "fileName",
+                 "KeyType": "RANGE"
+             }
          ]
      }
  }
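
For provisioning from Python rather than the console, a sketch of creating a table with this schema via boto3; the throughput values are illustrative, and an AWS CLI equivalent appears further below:

    import boto3

    dynamodb = boto3.client("dynamodb")

    # Mirrors the schema above: tablePath is the partition key, fileName the sort key.
    dynamodb.create_table(
        TableName="delta_log",
        AttributeDefinitions=[
            {"AttributeName": "tablePath", "AttributeType": "S"},
            {"AttributeName": "fileName", "AttributeType": "S"},
        ],
        KeySchema=[
            {"AttributeName": "tablePath", "KeyType": "HASH"},
            {"AttributeName": "fileName", "KeyType": "RANGE"},
        ],
        ProvisionedThroughput={"ReadCapacityUnits": 5, "WriteCapacityUnits": 5},
    )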

Here is an example writing to S3 using this mechanism:

.. code-block:: python

  >>> from deltalake import write_deltalake
  >>> df = pd.DataFrame({'x': [1, 2, 3]})
- >>> storage_options = {'AWS_S3_LOCKING_PROVIDER': 'dynamodb', 'DYNAMO_LOCK_TABLE_NAME': 'custom_table_name'}
- >>> write_deltalake('s3://path/to/table', df, storage_options=storage_options)
+ >>> storage_options = {'AWS_S3_LOCKING_PROVIDER': 'dynamodb', 'DELTA_DYNAMO_TABLE_NAME': 'custom_table_name'}
+ >>> write_deltalake('s3a://path/to/table', df, storage_options=storage_options)
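
As a quick sanity check, the table written above can be read back with the same options; the path is the placeholder from the example:

    from deltalake import DeltaTable

    # Reuse the same storage_options so reads resolve the table consistently.
    dt = DeltaTable("s3a://path/to/table", storage_options=storage_options)
    print(dt.to_pandas())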

.. note::
This locking mechanism is compatible with the one used by Apache Spark. The `tablePath` property,
denoting the root URL of the Delta table itself, is part of the primary key, and all writers
intending to write to the same table must match this property precisely. In Spark, S3 URLs
are prefixed with `s3a://`, and a table in delta-rs must be configured accordingly.
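
For reference, a hedged sketch of the matching Spark-side setup, assuming the S3DynamoDBLogStore configuration keys documented by the Delta project; the table name and region are illustrative, and the Delta Lake jars (including the S3 DynamoDB log store module) are assumed to be on the classpath:

    from pyspark.sql import SparkSession

    # Point Spark's log store at the same DynamoDB table that delta-rs uses,
    # so both writers coordinate commits through the same entries.
    spark = (
        SparkSession.builder
        .config("spark.delta.logStore.s3a.impl", "io.delta.storage.S3DynamoDBLogStore")
        .config("spark.io.delta.storage.S3DynamoDBLogStore.ddb.tableName", "delta_log")
        .config("spark.io.delta.storage.S3DynamoDBLogStore.ddb.region", "us-west-2")
        .getOrCreate()
    )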

The following command creates the necessary table using the AWS CLI:

.. code-block:: sh

aws dynamodb create-table \
--table-name delta_log \
--attribute-definitions AttributeName=tablePath,AttributeType=S AttributeName=fileName,AttributeType=S \
--key-schema AttributeName=tablePath,KeyType=HASH AttributeName=fileName,KeyType=RANGE \
--provisioned-throughput ReadCapacityUnits=5,WriteCapacityUnits=5

You can find additional information in the `delta-rs documentation
<https://docs.delta.io/latest/delta-storage.html#production-configuration-s3-multi-cluster>`_, which
also includes recommendations on configuring a time-to-live (TTL) for the table to
avoid growing the table indefinitely.
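
A sketch of enabling such a TTL with boto3; the expireTime attribute name follows the linked Delta documentation and is an assumption here:

    import boto3

    dynamodb = boto3.client("dynamodb")

    # Expire old commit entries automatically instead of letting the table grow forever.
    dynamodb.update_time_to_live(
        TableName="delta_log",
        TimeToLiveSpecification={
            "Enabled": True,
            "AttributeName": "expireTime",  # assumed attribute name, per the Delta docs
        },
    )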

.. note::
    If for some reason you don't want to use DynamoDB as your locking mechanism, you can
    choose to set the `AWS_S3_ALLOW_UNSAFE_RENAME` variable to ``true`` in order to enable
    unsafe S3 writes.
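
A minimal sketch of that escape hatch, assuming a single writer; without a lock, concurrent writers can corrupt the log:

    import pandas as pd
    from deltalake import write_deltalake

    # Only safe when exactly one writer ever touches this table.
    df = pd.DataFrame({"x": [1, 2, 3]})
    storage_options = {"AWS_S3_ALLOW_UNSAFE_RENAME": "true"}
    write_deltalake("s3a://my-bucket/my-table", df, storage_options=storage_options)  # hypothetical path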

- Please note that this locking mechanism is not compatible with any other
- locking mechanisms, including the one used by Spark.

Updating Delta Tables
---------------------
@@ -561,7 +591,7 @@ Update all the rows for the column "processed" to the value True.
:meth:`DeltaTable.update` predicates and updates are all in string format. The predicates and expressions
are parsed into Apache Datafusion expressions.

Apply a soft deletion based on a predicate, so update all the rows for the column "deleted" to the value
True where x = 3

.. code-block:: python
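
    # The example body is elided in this view; a hedged sketch of the call the text
    # describes, where dt is the DeltaTable opened in the earlier examples:
    >>> dt.update(updates={"deleted": "True"}, predicate="x = 3")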