[feat] add missing metadata tables #1053
@kevinjqliu I would like to work on this one.
Hey @soumya-ghosh - if you want to split the workload between us, I would love to also give this a try.
Sure @amitgilad3, most likely there will be separate PRs for each of the above metadata tables.
Thanks for volunteering to contribute! I was thinking we could do something similar to #511, where each metadata table is assigned one at a time. And feel free to work on another after the first is done!
@kevinjqliu we can group the tasks in the following way:

What do you think?
That makes sense to me, thanks @soumya-ghosh
@kevinjqliu added PR #1066 for
Hey @kevinjqliu, any thoughts on how to implement
What is the difference between your implementation's output and Spark's? From the Spark docs: "To show all files, data files and delete files across all tracked snapshots, query prod.db.table.all_files"
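If it helps to compare the two outputs side by side, the Spark side of that check might look like the sketch below (prod.db.table is the placeholder name from the Spark docs, and a live SparkSession named spark is assumed):

```python
# Count and list rows from Spark's all_files metadata table, for
# comparison against the PyIceberg implementation's output.
# "prod.db.table" is the placeholder table name from the Spark docs.
spark.sql("SELECT count(*) FROM prod.db.table.all_files").show()
spark.sql("SELECT content, file_path FROM prod.db.table.all_files").show(truncate=False)
```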
This sounds right to me. Maybe Spark gets rid of duplicate rows?
From the Spark docs,
So, here's my approach (pseudo-code):

```
metadata = load_table_metadata()
for snapshot in metadata["snapshots"]:
    manifest_list = read manifest list from snapshot
    for manifest_file in manifest_list:
        manifest = read manifest file
        for file in manifest:
            process file (data_file or delete_file)
```

With this approach, the number of files in the output is much higher than the corresponding output of
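For what it's worth, a runnable version of that walk against PyIceberg's public API might look like this; a sketch assuming Snapshot.manifests and ManifestFile.fetch_manifest_entry behave as in recent PyIceberg releases, with hypothetical catalog and table names:

```python
from pyiceberg.catalog import load_catalog

# Hypothetical catalog/table names, for illustration only.
table = load_catalog("default").load_table("db.table")
io = table.io

for snapshot in table.metadata.snapshots:
    # Each snapshot references one manifest list, which enumerates manifest files.
    for manifest_file in snapshot.manifests(io):
        # Each manifest file enumerates entries describing data or delete files.
        for entry in manifest_file.fetch_manifest_entry(io, discard_deleted=False):
            data_file = entry.data_file  # a data file or a delete file
            print(data_file.content, data_file.file_path)
```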
I see. So if I have a new table and append to it 5 times, I expect 5 snapshots and 5 manifest list files. I think each manifest list file will repeatedly refer to the same underlying manifest file, which will be read over and over, causing duplicates. What if you just return all unique (data+delete) files?
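De-duplicating by physical file path, along the lines suggested, could be a small pass over that walk; a minimal sketch, where entries stands in for the iterator from the snapshot walk above:

```python
# Keep only the first occurrence of each physical file across all snapshots.
seen: set[str] = set()
unique_files = []
for entry in entries:  # 'entries' is the hypothetical iterator from the walk above
    path = entry.data_file.file_path
    if path not in seen:
        seen.add(path)
        unique_files.append(entry.data_file)
```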
In this case, the output will not match Spark's. Will that be okay? Also found this PR from Iceberg:
@soumya-ghosh I wonder if that's still the case today; that PR is from 2020.
@kevinjqliu added PR #1241 for
Will get on with
Thanks for your contribution here @soumya-ghosh. I just merged #1241 for
Yes, I will start working on that soon; I have been busy the last few weeks so couldn't make any progress.
Hey @soumya-ghosh & @kevinjqliu, would love to contribute. I don't want to step on your work, so I was wondering what I can take from this list: positional_deletes, all_files, all_data_files, and all_delete_files?
Sure @amitgilad3. You can work on
If you want to work on
@soumya-ghosh, I'll start with positional_deletes and see how fast I can finish it; once I'm done we can see about the rest.
@kevinjqliu added PR #1626 for
Awesome work!! @soumya-ghosh - if all goes well, the next release will have all metadata tables accessible from pyiceberg 🚀
@amitgilad3 Right back at you!
Thanks for the contribution!! Appreciate it.
Other than those, I think we're good to include this in the next release! 🥳
For point 1 - will raise a separate PR covering documentation updates for these metadata tables. For points 2, 3, 4 - as per the current Iceberg code, such operations are not supported on
What I mean is that for
I guess this can also occur for the rest of the metadata tables too. For example, there's a bug in the partitions metadata table right now for partition evolution (#1120). I just want to double-check these things before calling this done :)
I understand that
I did a test to see the behavior in Spark; observations are in the gist. It appears that Spark constructs the
Thoughts @kevinjqliu?
Hey @soumya-ghosh @kevinjqliu - just so I understand: since I already implemented support for a specific snapshot in all_entries and in position_deletes, do we want to support this or not?
@amitgilad3 were you able to test
@soumya-ghosh - when I run
I get the following error -
So I guess we should not support it for all_entries, but for position_deletes it works with Spark, so I'll keep it.
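For reference, the time-travel form under discussion is Spark SQL's VERSION AS OF clause on a metadata table; a sketch with a hypothetical snapshot id (per the test above, it works for position_deletes but errors for all_entries):

```python
# Time travel on a metadata table; the snapshot id is hypothetical.
spark.sql(
    "SELECT * FROM db.table.position_deletes VERSION AS OF 123456789"
).show()
```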
@kevinjqliu awaiting your thoughts on the above 4 comments.
Hey folks, sorry for the late response. I think there are a couple of different things here.
@soumya-ghosh @amitgilad3 does that make sense? Please let me know if I misinterpreted anything.
Feature Request / Improvement
Looks like there are a few more metadata tables currently missing in PyIceberg.
Source of truth for metadata tables: https://iceberg.apache.org/javadoc/latest/org/apache/iceberg/MetadataTableType.html
Done: https://py.iceberg.apache.org/api/#inspecting-tables
Missing:
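For context, the tables listed as done above are already reachable through PyIceberg's inspect API; a minimal usage sketch (catalog and table names are hypothetical):

```python
from pyiceberg.catalog import load_catalog

table = load_catalog("default").load_table("db.table")

# Each inspect method returns a pyarrow.Table mirroring the Spark metadata table.
print(table.inspect.snapshots())
print(table.inspect.files())
```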