Idempotent parquet writes with "overwrite" mode #3112

Closed
MisterKloudy opened this issue Oct 24, 2024 · 2 comments

Labels
enhancement New feature or request needs triage

Comments

@MisterKloudy

Is your feature request related to a problem?

Currently, every write only adds new files and returns the paths of the newly written Parquet files.
This is equivalent to an "append" operation, which is inconvenient for building pipelines that should be idempotent.
An "overwrite" mode would allow idempotent reads and writes that can easily be rerun (for example, pipelines that are refreshed daily or that need to be rerun after errors).

Describe the solution you'd like

Add a parameter/mode to write_parquet that overwrites the existing contents of the folder, replacing them with the newly written Parquet files.

Describe alternatives you've considered

Deleting the data beforehand using another engine, but that is less convenient.

Additional Context

Similar to mode("overwrite") in pyspark

Would you like to implement a fix?

No

@colin-ho
Contributor

On it

colin-ho added a commit that referenced this issue Nov 6, 2024
Addresses: #3112 and #1768

Implements overwrite mode for write_parquet and write_csv.

Upon finishing the write, we are left with a manifest of written file
paths. We can use this to perform a `delete all files not in manifest`,
by:
1. Do an `ls` to figure out all the current files in the root dir.
2. Use Daft's built-in `is_in` expression to get the file paths to
delete.
3. Delete them.
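
The three steps above can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: the function name `delete_files_not_in_manifest` is hypothetical, `fs` is assumed to be an fsspec-style filesystem object exposing `find(path)` and `rm(paths)`, a recursive `find` is used in place of a plain `ls` so that nested files are covered, and a plain Python set lookup stands in for the `is_in` expression.

```python
from typing import Iterable, List


def delete_files_not_in_manifest(fs, root_dir: str, manifest: Iterable[str]) -> List[str]:
    """Delete every file under root_dir that the latest write did not produce.

    Hypothetical helper. `fs` is assumed to be an fsspec-style filesystem:
    any object with `find(path)` (recursive file listing) and `rm(paths)`
    (accepts a list of paths).
    """
    keep = set(manifest)

    # Step 1: list the files currently under the root directory.
    existing = fs.find(root_dir)

    # Step 2: anything not in the manifest is stale. A set lookup stands in
    # here for the `is_in` expression mentioned in the commit message.
    stale = sorted(path for path in existing if path not in keep)

    # Step 3: delete the stale files with one `rm` call; skip it if none exist.
    if stale:
        fs.rm(stale)
    return stale
```

With a real fsspec filesystem (for example `fsspec.filesystem("file")`), the manifest paths would need to be in the same normalized form that `fs.find` returns, otherwise freshly written files could be mistaken for stale ones. Because the manifest comes from the write that just finished, rerunning the same pipeline leaves the directory in the same state, which is the idempotency the issue asks for.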

Notes:
- Relies on fsspec for `ls` and `rm` functionalities. This is favored
over the pyarrow filesystem because fsspec's `rm` is a **bulk** delete
method, i.e. we can do the delete in a single API call. The pyarrow
filesystem does not have bulk deletes.

---------

Co-authored-by: Colin Ho <colinho@Colins-MacBook-Pro.local>
Co-authored-by: Colin Ho <colinho@Colins-MBP.localdomain>
@colin-ho
Contributor

colin-ho commented Nov 6, 2024

This should be ready in the latest release!

colin-ho closed this as completed Nov 6, 2024