[feat] add missing metadata tables #1053

Open · 11 of 16 tasks
kevinjqliu opened this issue Aug 13, 2024 · 31 comments
Comments

@kevinjqliu
Contributor

kevinjqliu commented Aug 13, 2024

Feature Request / Improvement

Looks like there are a few more metadata tables currently missing in PyIceberg.

Source of truth for metadata tables: https://iceberg.apache.org/javadoc/latest/org/apache/iceberg/MetadataTableType.html

Done: https://py.iceberg.apache.org/api/#inspecting-tables

  • SNAPSHOTS
  • PARTITIONS
  • ENTRIES
  • REFS
  • MANIFESTS
  • METADATA_LOG_ENTRIES
  • HISTORY
  • FILES

Missing:

  • ALL_DATA_FILES
  • ALL_DELETE_FILES
  • ALL_ENTRIES
  • ALL_FILES
  • ALL_MANIFESTS
  • DATA_FILES
  • DELETE_FILES
  • POSITION_DELETES
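
For reference, the tables in the Done list are exposed through the table.inspect API documented above; a minimal usage sketch, with placeholder catalog and table names:

    from pyiceberg.catalog import load_catalog

    # Placeholder catalog/table names; each inspect method returns a pyarrow.Table.
    catalog = load_catalog("default")
    tbl = catalog.load_table("examples.nyc_taxis")

    snapshots = tbl.inspect.snapshots()
    partitions = tbl.inspect.partitions()
    files = tbl.inspect.files()
    print(snapshots.column_names)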
@soumya-ghosh
Contributor

@kevinjqliu I would like to work on this one.

@amitgilad3
Contributor

Hey @soumya-ghosh - if you want to split the workload between us, I would love to also give this a try.

@soumya-ghosh
Contributor

Sure @amitgilad3, most likely there will be separate PRs for each of the above metadata tables.
I can work on data_files, all_data_files and all_manifests.

@kevinjqliu
Contributor Author

Thanks for volunteering to contribute! I was thinking we could do something similar to #511 where each metadata table can be assigned at a time. And feel free to work on another after the first is done!

@soumya-ghosh
Contributor

@kevinjqliu we can group the tasks in the following way:

  • data_files and delete_files - these are subsets of files, just a filter condition on the content field, hence can be addressed in the same PR (see the sketch below)
  • all_files, all_data_files and all_delete_files - once all_files is implemented, the other tables are again subsets of all_files and can be addressed in a single PR
  • all_entries
  • all_manifests
  • position_deletes

What do you think?
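
A minimal sketch of the first grouping's idea, assuming tbl is a PyIceberg table loaded as in the earlier sketch: per the Iceberg spec, content 0 marks data files while 1 and 2 mark position/equality delete files, so data_files and delete_files are just filters over the files table.

    import pyarrow.compute as pc

    # `tbl` as loaded in the earlier sketch; `content` == 0 marks data files,
    # 1 and 2 mark position/equality delete files (Iceberg spec).
    files = tbl.inspect.files()
    data_files = files.filter(pc.equal(files["content"], 0))
    delete_files = files.filter(pc.not_equal(files["content"], 0))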

@kevinjqliu
Contributor Author

That makes sense to me, thanks @soumya-ghosh

@soumya-ghosh
Contributor

@kevinjqliu added PR #1066 for data_files and delete_files.

@soumya-ghosh
Contributor

Hey @kevinjqliu, any thoughts on how to implement the all_files table?
I initially thought that all_files returns files from all snapshots referenced in the current table metadata, hence the repetitions in the output.
I tested this logic and compared the output against the all_files metadata table through Spark.
I observed that although there were duplicates for several file_path values, the number of files returned by Spark is much lower than this hypothesis would predict.

@kevinjqliu
Contributor Author

What is the difference between your implementation's output vs Spark's?

From the spark docs, "To show all files, data files and delete files across all tracked snapshots, query prod.db.table.all_files"

I initially thought that all_files returns files from all snapshots referenced in the current table metadata, hence the repetitions in the output.

This sounds right to me. Maybe Spark gets rid of duplicate rows?

@soumya-ghosh
Contributor

From spark docs,

These tables are unions of the metadata tables specific to the current snapshot, and return metadata across all snapshots.
The "all" metadata tables may produce more than one row per data file or manifest file because metadata files may be part of more than one table snapshot.

So, here's my approach (pseudo-code):

metadata = load_table_metadata()
for snapshot in metadata["snapshots"]:
    manifest_list = read manifest list from snapshot
    for manifest_file in manifest_list:
        manifest = read manifest file
        for file in manifest:
            process file (data_file or delete_file)

With this approach, the number of files in the output is much higher than the corresponding output of the all_files table in Spark.
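
For reference, a rough runnable rendering of that pseudo-code, assuming Snapshot.manifests() and ManifestFile.fetch_manifest_entry() behave as PyIceberg's internal helpers do today (internal APIs, so subject to change) and a loaded table tbl. It intentionally keeps duplicates, matching the pseudo-code:

    # `tbl` is a loaded PyIceberg table; Snapshot.manifests() and
    # ManifestFile.fetch_manifest_entry() are internal helpers and may change.
    io = tbl.io
    file_paths = []
    for snapshot in tbl.metadata.snapshots:
        for manifest_file in snapshot.manifests(io):               # read the manifest list
            for entry in manifest_file.fetch_manifest_entry(io):   # read each manifest
                file_paths.append(entry.data_file.file_path)       # data or delete file
    print(len(file_paths), "file entries, duplicates included")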

@kevinjqliu
Contributor Author

I see. So if I have a new table and append to it 5 times, I expect 5 snapshots and 5 manifest list files. I think each manifest list file will repeatedly refer to the same underlying manifest file, which will be read over and over causing duplicates.

What if you just return all unique (data+delete) files?
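
A sketch of that dedup idea, reusing tbl and io from the sketch above and keeping the first entry seen per file_path (illustrative only):

    # Deduplicate by file_path, keeping the first entry seen per path.
    unique_entries = {}
    for snapshot in tbl.metadata.snapshots:
        for manifest_file in snapshot.manifests(io):
            for entry in manifest_file.fetch_manifest_entry(io):
                unique_entries.setdefault(entry.data_file.file_path, entry)
    print(len(unique_entries), "unique data/delete files")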

@soumya-ghosh
Contributor

What if you just return all unique (data+delete) files?

In this case, the output will not match Spark's. Will that be okay?

Also found this in a PR from Iceberg:

These tables may contain duplicate rows. Deduplication can't be done through the current scan interface unless all of the work is done during scan planning on a single node. Duplicates are the trade-off for being able to process the metadata in parallel for large tables.

@kevinjqliu
Contributor Author

@soumya-ghosh I wonder if that's still the case today; that PR is from 2020.
Do you have a WIP PR I can take a look at? We can also bring this to the devlist to double-check the correct behavior.

@soumya-ghosh
Contributor

@kevinjqliu added PR - #1241 for all_manifests.

Will get on with all_files, all_data_files and all_delete_files next.

@kevinjqliu
Contributor Author

Thanks for your contribution here @soumya-ghosh. I just merged #1241 for all_manifests. Are you still interested in adding all_files, all_data_files and all_delete_files?

@kevinjqliu kevinjqliu added this to the PyIceberg 0.9.0 release milestone Jan 11, 2025
@soumya-ghosh
Contributor

Yes, I will start working on that soon; I've been busy the last few weeks so couldn't make any progress.

@amitgilad3
Contributor

Hey @soumya-ghosh & @kevinjqliu, I would love to contribute. I don't want to step on your work, so I was wondering what I can take from this list: position_deletes, all_files, all_data_files and all_delete_files?

@soumya-ghosh
Contributor

soumya-ghosh commented Jan 11, 2025

Sure @amitgilad3. You can work on position_deletes and all_entries.
all_files, all_data_files and all_delete_files will use the same base implementation, and I have an approach in mind, so let me give it a shot. If I'm unable to make progress, I will let you know.

If you want to work on all_files, I can swap it with you.

@amitgilad3
Contributor

@soumya-ghosh, I'll start with position_deletes and see how fast I can finish it; once I'm done we can see about the rest.

@soumya-ghosh
Contributor

@kevinjqliu added PR - #1626 for all_files, all_data_files and all_delete_files.
Implemented them in a single PR since data and delete files are subsets of all_files.

@amitgilad3
Contributor

Awesome work!! @soumya-ghosh - if all goes well, the next release will have all metadata tables accessible from PyIceberg 🚀

@soumya-ghosh
Contributor

@amitgilad3 Right back at you!
I see you've raised PRs for the remaining ones, will take a look.

@kevinjqliu
Contributor Author

kevinjqliu commented Feb 8, 2025

Thanks for the contribution!! Appreciate it.
Before we close out this issue, I want to double-check a few things:

  1. Documentation: all tables are documented at https://py.iceberg.apache.org/api/#inspecting-tables
  2. Optional time travel: for any non-all_* metadata table, let's expose an optional snapshot_id parameter to provide the ability to time travel. This is already available in some metadata tables; let's make sure it's consistent across all metadata tables (see the usage sketch below).
  3. When time traveling, we should take into account the state of the table at that particular snapshot_id - things like schema and partition evolution. In some places we just use tbl.metadata.schema(), which is the current table schema and might be incorrect when time traveling.
  4. Similar to the above, for all_* metadata tables, when querying other snapshots, make sure we're using the correct table state.
  5. (edit) Double-check all metadata tables. For example, the partitions metadata table does not currently respect partition evolution: [bug] table.inspect.partitions() does not respect partition evolution #1120

Other than those, I think we're good to include this in the next release! 🥳
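
For point 2, a usage sketch of what the consistent surface could look like; the snapshot_id parameter already exists on some inspect methods, and the assumption here is that the remaining non-all_* methods would accept the same argument:

    # Hypothetical usage once point 2 is done: every non-all_* inspect table
    # accepts an optional snapshot_id and reflects the table state at that snapshot.
    snap_id = tbl.current_snapshot().parent_snapshot_id  # pick some older snapshot id

    entries_then = tbl.inspect.entries(snapshot_id=snap_id)
    files_then = tbl.inspect.files(snapshot_id=snap_id)
    partitions_then = tbl.inspect.partitions(snapshot_id=snap_id)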

@soumya-ghosh
Contributor

soumya-ghosh commented Feb 8, 2025

For point 1 - will raise a separate PR covering documentation updates for these metadata tables.

For points 2, 3, 4 -
Is time travel through snapshot_id or timestamp supported for all_* metadata tables?
I tried and got the below error in Spark:
Query - spark.sql(f"SELECT count(1) FROM {identifier}.all_files for version as of {snapshot_id}").show()
Error - pyspark.errors.exceptions.captured.UnsupportedOperationException: Cannot select snapshot in table: ALL_FILES

As per the current Iceberg code, such operations are not supported on all_* metadata tables.

@kevinjqliu
Contributor Author

Is time travel through snapshot_id or timestamp supported for all_* metadata tables?

What I mean is that for all_* metadata tables, we're essentially doing something like [inspect.files(snapshot.snapshot_id) for snapshot in all_snapshots], and we should make sure that we're not just referring to the current schema, for example.

I guess this can also occur for the rest of the metadata tables too. For example, there's a bug in the partitions metadata table right now for partition evolution: #1120

I just want to double-check these things before calling this done :)

@soumya-ghosh
Contributor

I understand that the files table by snapshot, and all_files (and its derivatives), should respect schema evolution.
The keys in the readable_metrics column are derived from the schema, which is the source of the inconsistency.

I did a test to see the behavior in Spark; observations are in a gist. It appears that Spark constructs the readable_metrics column using the current schema (which may be a bug).

Thoughts @kevinjqliu ?
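
A made-up illustration of the inconsistency: readable_metrics is keyed by column name, so building it from the current schema yields keys for columns that did not exist when an older file was written. Values below are invented for a schema that evolved from column a to columns a, b:

    # Invented readable_metrics rows for the same old data file, depending on which
    # schema is used to build the struct (schema evolved from column `a` to `a, b`).
    metrics_from_current_schema = {
        "a": {"value_count": 100, "null_value_count": 0},
        "b": {"value_count": None, "null_value_count": None},  # `b` added after the file was written
    }
    metrics_from_snapshot_schema = {
        "a": {"value_count": 100, "null_value_count": 0},
    }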

@amitgilad3
Contributor

Hey @soumya-ghosh @kevinjqliu - just so I understand: since I already implemented support for a specific snapshot in all_entries and in position_deletes, do we want to support this or not?

@soumya-ghosh
Contributor

@amitgilad3 were you able to test all_entries against Spark in the integration tests?
As per the Iceberg code, it should throw an exception if one tries to query an all_* table for a specific snapshot.
Will check the all_entries PR.

@amitgilad3
Contributor

amitgilad3 commented Feb 10, 2025

@soumya-ghosh - when I run

    spark.sql(f"SELECT count(1) FROM {identifier}.all_entries for version as of {snapshot_id}").show()

I get the following error:

pyspark.errors.exceptions.captured.UnsupportedOperationException: Cannot select snapshot in table: ALL_ENTRIES

So I guess we should not support it for all_entries, but for position_deletes it works with Spark, so I'll keep it.

@soumya-ghosh
Contributor

@kevinjqliu awaiting your thoughts on the 4 comments above.

@kevinjqliu
Contributor Author

Hey folks, sorry for the late response here.

I think there are a couple of different things here.

  1. Time travel for metadata tables. This is a feature for metadata tables other than the all_* metadata tables. In Python, this is analogous to calling tbl.inspect.entries(snapshot_id=...) (in Java it's FROM tbl.entries AS OF snapshot_id). Time traveling is not available for all_* metadata tables since we're already getting it for all snapshots.
  2. Metadata tables (except the all_* metadata tables) should support time travel (by optionally accepting a snapshot_id argument) and then use that snapshot id to resolve table metadata such as the schema and partition spec (see the sketch below). We should double-check all usage of self.tbl.metadata.* (self.tbl.metadata.schema(), self.tbl.metadata.spec_struct()), since this uses the current table metadata instead of the metadata at the time of the snapshot.
  3. all_* metadata tables implicitly use the time travel feature. For example, all_entries is essentially calling [tbl.inspect.entries(snapshot_id) for snapshot_id in tbl.snapshots()]. Because of this, we should ensure the same time travel behavior as mentioned in (2).

@soumya-ghosh @amitgilad3 does that make sense? Please let me know if I misinterpreted anything.
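
For points 2 and 3, a minimal sketch of the kind of lookup implied, using only fields the thread already references (tbl.metadata.schema(), tbl.metadata.snapshots, tbl.metadata.schemas); Snapshot.schema_id can be absent in old metadata, hence the fallback:

    def schema_for_snapshot(tbl, snapshot_id):
        # Resolve the schema that was current when `snapshot_id` was committed,
        # falling back to the current schema if the snapshot does not record one.
        snapshot = next((s for s in tbl.metadata.snapshots if s.snapshot_id == snapshot_id), None)
        if snapshot is not None and snapshot.schema_id is not None:
            return next(s for s in tbl.metadata.schemas if s.schema_id == snapshot.schema_id)
        return tbl.metadata.schema()  # current schema as a fallback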

@kevinjqliu kevinjqliu removed this from the PyIceberg 0.9.0 release milestone Feb 16, 2025