Idempotent parquet writes with "overwrite" mode #3112

MisterKloudy · 2024-10-24T05:26:50Z

Is your feature request related to a problem?

Currently, every write creates new files and returns the paths of the newly written parquet files.
This is equivalent to an "append" operation, which is inconvenient for building pipelines that should be idempotent.
An "overwrite" mode would allow idempotent reads/writes that can be safely rerun (for example, pipelines that are refreshed daily or need to be rerun after errors).

Describe the solution you'd like

Add a parameter/mode to write_parquet that overwrites the existing contents of the folder, replacing them with the new parquet files.

Describe alternatives you've considered

Deleting the data beforehand using another engine, but that is less convenient.

Additional Context

Similar to mode("overwrite") in PySpark.

Would you like to implement a fix?

No

colin-ho · 2024-10-24T20:04:31Z

On it

Addresses: #3112 and #1768

Implements overwrite mode for write_parquet and write_csv.

Upon finishing the write, we are left with a manifest of written file paths. We can use this to perform a `delete all files not in manifest`, by:

1. Do an `ls` to figure out all the current files in the root dir.
2. Use daft's built in `is_in` expression to get the file paths to delete.
3. Delete them.

Notes:
- Relies on fsspec for `ls` and `rm` functionalities. This is favored over pyarrow filesystem because `rm` is a **bulk** delete method, aka we can do the delete in a single API call. Pyarrow filesystem does not have bulk deletes.

---------

Co-authored-by: Colin Ho <colinho@Colins-MacBook-Pro.local>
Co-authored-by: Colin Ho <colinho@Colins-MBP.localdomain>
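For readers who want to see the "delete all files not in manifest" idea in isolation, here is a minimal sketch. It is not the PR's code: it assumes a local directory and swaps in the Python standard library (`pathlib`) for fsspec's `ls`/`rm`, and a plain set difference for daft's `is_in` expression. The helper names `write_files` and `overwrite` are hypothetical, and the "parquet" files are just placeholder bytes.

```python
import tempfile
from pathlib import Path


def write_files(root: Path, run_id: str, n_files: int = 2) -> list[Path]:
    """Hypothetical stand-in for the real write: creates new files, returns the manifest."""
    root.mkdir(parents=True, exist_ok=True)
    manifest = []
    for i in range(n_files):
        path = root / f"{run_id}-{i}.parquet"
        path.write_bytes(b"placeholder")  # not a real parquet file
        manifest.append(path)
    return manifest


def overwrite(root: Path, run_id: str) -> list[Path]:
    """Write first, then delete every file under root that is not in the manifest."""
    manifest = write_files(root, run_id)
    # Step 1: "ls" the root dir to find all current files.
    existing = {p for p in root.rglob("*") if p.is_file()}
    # Step 2: files to delete = current files not in the manifest.
    stale = existing - set(manifest)
    # Step 3: delete them (one by one here; the PR relies on a bulk delete instead).
    for p in stale:
        p.unlink()
    return manifest


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "table"
    overwrite(root, "run1")
    overwrite(root, "run2")  # rerun: files from run1 are removed
    remaining = sorted(p.name for p in root.iterdir())
    print(remaining)
```

Because the delete happens only after the new files are written, the directory ends up holding exactly the output of the latest run, no matter how many times the pipeline is rerun.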

colin-ho · 2024-11-06T22:43:18Z

This should be ready in the latest release!

MisterKloudy added the "enhancement" (New feature or request) and "needs triage" labels Oct 24, 2024

colin-ho mentioned this issue Oct 24, 2024

[FEAT] Overwrite mode for write parquet/csv #3108

Merged

colin-ho closed this as completed Nov 6, 2024
