Add Delta table support for filesystem destination (#1382)
* add delta table support for filesystem destination
* Merge branch 'refs/heads/devel' into 978-filesystem-delta-table
* remove duplicate method definition
* make property robust
* exclude high-precision decimal columns
* make delta imports conditional
* include pyarrow in deltalake dependency
* install extra deltalake dependency
* disable high precision decimal arrow test columns by default
* include arrow max precision decimal column
* introduce directory job and refactor delta table code
* refactor delta table load
* revert import changes
* add delta table format child table handling
* make table_format key lookups robust
* write remote path to reference file
* add supported table formats and file format adapter to destination capabilities
* remove jsonl and parquet from table formats
* add object_store rust crate credentials handling
* add deltalake_storage_options to filesystem config
* move function to top level to prevent multiprocessing pickle error
* add new deltalake_storage_options filesystem config key to tests
* replace secrets with dummy values in test
* reorganize object_store rust crate credentials tests
* add delta table format docs
* move delta table logical delete logic to filesystem client
* rename pyarrow lib method names
* rename utils to delta_utils
* import pyarrow from dlt common libs
* move delta lake utilities to module in dlt common libs
* import delta lake utils early to assert dependencies availability
* handle file format adaptation at table level
* initialize file format variables
* split delta table format tests
* handle table schema is None case
* add test for dynamic dispatching of delta tables
* mark core delta table test as essential
* simplify item normalizer dict key
* make list copy to prevent in place mutations
* add extra deltalake dependency
* only test deltalake lib on local filesystem
* properly evaluates lazy annotations
* uses base FilesystemConfiguration from common in libs
* solves union type reordering due to caching and clash with delta-rs DeltaTable method signature
* creates a table with just root name to cache item normalizers properly

---------

Co-authored-by: Jorrit Sandbrink <sandbj01@heiway.net>
Co-authored-by: Marcin Rudolf <rudolfix@rudolfix.org>
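For orientation, the sketch below shows how the feature added here is meant to be used from the pipeline side. It is a minimal, hedged example, not taken from this commit: the table_format="delta" resource hint and the filesystem destination match the bullets above, but the resource name, dataset name, and sample rows are invented, and bucket_url plus any deltalake_storage_options are assumed to come from the destination configuration.

import dlt

# The table_format hint routes this resource through the new Delta code path
# in the filesystem destination; bucket_url is read from config/env.
@dlt.resource(table_format="delta")
def events():
    yield [{"id": 1, "value": "a"}, {"id": 2, "value": "b"}]

pipeline = dlt.pipeline(
    pipeline_name="delta_demo",
    destination="filesystem",
    dataset_name="events_data",
)
pipeline.run(events())

Extra options for the delta-rs object store go into the new deltalake_storage_options key of the filesystem destination configuration; as the helper in the diff below shows, keys that overlap with dlt's own credentials are resolved in favor of the explicitly configured options.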
1 parent e78a3c1 · commit 1c1ce7e
Showing 41 changed files with 1,203 additions and 207 deletions.
New file (Delta Lake helpers module in dlt common libs):

@@ -0,0 +1,90 @@
from typing import Optional, Dict, Union

from dlt import version
from dlt.common import logger
from dlt.common.libs.pyarrow import pyarrow as pa
from dlt.common.libs.pyarrow import dataset_to_table, cast_arrow_schema_types
from dlt.common.schema.typing import TWriteDisposition
from dlt.common.exceptions import MissingDependencyException
from dlt.common.storages import FilesystemConfiguration

try:
    from deltalake import write_deltalake
except ModuleNotFoundError:
    raise MissingDependencyException(
        "dlt deltalake helpers",
        [f"{version.DLT_PKG_NAME}[deltalake]"],
        "Install `deltalake` so dlt can create Delta tables in the `filesystem` destination.",
    )


def ensure_delta_compatible_arrow_table(table: pa.Table) -> pa.Table:
    """Returns Arrow table compatible with Delta table format.

    Casts table schema to replace data types not supported by Delta.
    """
    ARROW_TO_DELTA_COMPATIBLE_ARROW_TYPE_MAP = {
        # maps type check function to the target Arrow type
        pa.types.is_null: pa.string(),
        pa.types.is_time: pa.string(),
        pa.types.is_decimal256: pa.string(),  # pyarrow does not allow downcasting to decimal128
    }
    adjusted_schema = cast_arrow_schema_types(
        table.schema, ARROW_TO_DELTA_COMPATIBLE_ARROW_TYPE_MAP
    )
    return table.cast(adjusted_schema)


def get_delta_write_mode(write_disposition: TWriteDisposition) -> str:
    """Translates dlt write disposition to Delta write mode."""
    if write_disposition in ("append", "merge"):  # `merge` disposition resolves to `append`
        return "append"
    elif write_disposition == "replace":
        return "overwrite"
    else:
        raise ValueError(
            "`write_disposition` must be `append`, `replace`, or `merge`,"
            f" but `{write_disposition}` was provided."
        )


def write_delta_table(
    path: str,
    data: Union[pa.Table, pa.dataset.Dataset],
    write_disposition: TWriteDisposition,
    storage_options: Optional[Dict[str, str]] = None,
) -> None:
    """Writes in-memory Arrow data to an on-disk Delta table."""

    table = dataset_to_table(data)

    # throws warning for `s3` protocol: https://github.com/delta-io/delta-rs/issues/2460
    # TODO: upgrade `deltalake` lib after https://github.com/delta-io/delta-rs/pull/2500
    # is released
    write_deltalake(  # type: ignore[call-overload]
        table_or_uri=path,
        data=ensure_delta_compatible_arrow_table(table),
        mode=get_delta_write_mode(write_disposition),
        schema_mode="merge",  # enable schema evolution (adding new columns)
        storage_options=storage_options,
        engine="rust",  # `merge` schema mode requires `rust` engine
    )


def _deltalake_storage_options(config: FilesystemConfiguration) -> Dict[str, str]:
    """Returns dict that can be passed as `storage_options` in the `deltalake` library."""
    creds = {}
    extra_options = {}
    if config.protocol in ("az", "gs", "s3"):
        creds = config.credentials.to_object_store_rs_credentials()
    if config.deltalake_storage_options is not None:
        extra_options = config.deltalake_storage_options
    shared_keys = creds.keys() & extra_options.keys()
    if len(shared_keys) > 0:
        logger.warning(
            "The `deltalake_storage_options` configuration dictionary contains "
            "keys also provided by dlt's credential system: "
            + ", ".join([f"`{key}`" for key in shared_keys])
            + ". dlt will use the values in `deltalake_storage_options`."
        )
    return {**creds, **extra_options}
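To illustrate how the helpers above compose, here is a small, hypothetical usage sketch. The import path is assumed from the commit description ("module in dlt common libs"), and the local path and sample table are invented; on az, gs, or s3 buckets the storage_options argument would be built via _deltalake_storage_options(config) rather than left as None.

import pyarrow as pa

from dlt.common.libs.deltalake import write_delta_table  # assumed module path

# Hypothetical in-memory data; in the filesystem destination this Arrow table
# comes from the parquet files produced by the load jobs.
data = pa.table({"id": [1, 2, 3], "value": ["a", "b", "c"]})

# Creates the Delta table on the first write and appends on later runs;
# "replace" would translate to the "overwrite" mode via get_delta_write_mode.
write_delta_table(
    path="/tmp/delta/my_table",  # hypothetical local path
    data=data,
    write_disposition="append",
    storage_options=None,
)

Because schema_mode="merge" is passed to write_deltalake, adding a new column to the source data on a later run evolves the Delta table schema instead of failing the load.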