Is your feature request related to a problem?
Currently, all writes only perform new writes and return the paths of the newly written Parquet files.
This is equivalent to an "append" operation, but it is inconvenient for creating pipelines that should be idempotent.
An "overwrite" mode would allow for idempotent reads/writes that can be easily rerun (such as jobs that are refreshed daily, or that need to be rerun due to errors).
Describe the solution you'd like
Have a parameter/mode for `write_parquet` that can overwrite the existing contents of the folder and replace them with the new Parquet files.
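A minimal sketch of what this could look like (the `write_mode` parameter name is an assumption for illustration, not an existing API):

```python
import daft

df = daft.from_pydict({"a": [1, 2, 3]})

# Hypothetical: "overwrite" would replace the folder's existing contents,
# while today's behavior corresponds to "append".
df.write_parquet("s3://bucket/output/", write_mode="overwrite")
```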
Describe alternatives you've considered
Deleting the data beforehand using another engine, but this is less convenient.
Additional Context
Similar to `mode("overwrite")` in PySpark.
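For reference, the PySpark equivalent:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1,), (2,), (3,)], ["a"])

# PySpark's overwrite mode replaces any existing data at the target path.
df.write.mode("overwrite").parquet("s3://bucket/output/")
```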
Would you like to implement a fix?
No
Addresses #3112 and #1768.
Implements overwrite mode for `write_parquet` and `write_csv`.
Upon finishing the write, we are left with a manifest of written file paths. We can use this to perform a "delete all files not in the manifest" step (see the sketch after this list):
1. Do an `ls` to find all the current files in the root directory.
2. Use Daft's built-in `is_in` expression to compute the file paths to delete.
3. Delete them.
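A minimal sketch of this cleanup, assuming an fsspec-compatible filesystem and a `manifest_paths` list returned by the write (names are illustrative, not the PR's exact internals):

```python
import daft
import fsspec

def overwrite_cleanup(root_dir: str, manifest_paths: list[str]) -> None:
    """Delete every file under root_dir that is not in the write manifest."""
    fs, _, _ = fsspec.get_fs_token_paths(root_dir)

    # 1. List all files currently under the root directory (recursive ls).
    #    NOTE: depending on the filesystem, listed paths may need protocol
    #    normalization to match the manifest paths.
    all_paths = fs.find(root_dir)

    # 2. Use Daft's `is_in` expression to select paths not in the manifest.
    df = daft.from_pydict({"path": all_paths})
    to_delete = df.where(~daft.col("path").is_in(manifest_paths))
    paths_to_delete = to_delete.to_pydict()["path"]

    # 3. Bulk-delete: fsspec's `rm` accepts a list of paths.
    if paths_to_delete:
        fs.rm(paths_to_delete)
```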
Notes:
- Relies on fsspec for the `ls` and `rm` functionality. This is favored over the pyarrow filesystem because fsspec's `rm` is a **bulk** delete method, i.e. we can perform the delete in a single API call. The pyarrow filesystem does not have bulk deletes.
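To illustrate the difference (a sketch; the `paths` list is hypothetical):

```python
import fsspec
import pyarrow.fs as pafs

# Hypothetical list of stale files left over from a previous write.
paths = ["s3://bucket/output/old-1.parquet", "s3://bucket/output/old-2.parquet"]

# fsspec: `rm` accepts a list, so the delete can go out as one bulk call.
fs, _, _ = fsspec.get_fs_token_paths(paths[0])
fs.rm(paths)

# pyarrow: no bulk delete; each file needs its own `delete_file` call
# (pyarrow paths exclude the scheme, e.g. "bucket/output/old-1.parquet").
pa_fs, _ = pafs.FileSystem.from_uri(paths[0])
for p in paths:
    pa_fs.delete_file(p.removeprefix("s3://"))
```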