Add test data for RLE with bit_width == 0 #57

Merged: 4 commits, Aug 23, 2024
Changes from 2 commits
Binary file added bad_data/ARROW-GH-43605.parquet
Binary file not shown.
2 changes: 2 additions & 0 deletions bad_data/README.md
@@ -22,3 +22,5 @@ These are files used for reproducing various bugs that have been reported.

* PARQUET-1481.parquet: tests a case where a schema Thrift value has been
corrupted

* ARROW-GH-43605.parquet: In Go, file reader goroutine crashed.
mapleFU (Member), Aug 14, 2024

Personally, I think we should describe in more detail why this file causes a crash: for example a corrupt footer, bad page header metadata, or invalid content such as a page smaller than the size the metadata describes, or fewer column chunks than the metadata lists. We have lots of "corrupt" files, but the reasons differ.

Contributor Author

Can you help me figure out which part of the file is corrupted? I am not expert enough to know why the file is corrupt. I only know it was generated with a specific version of polars, as described in this PR: apache/arrow#43607 (comment)
The latest version of polars no longer causes the issue.

Member

Give me a minute, I'll test this on my local PC.

Member

Hmm, I don't know. arrow-c++ can read this file...

Contributor Author

If arrow-cpp manages to open it, that is good news. arrow-go crashed on this file before the related PR on arrow. Should I move it into the valid data folder, then?

pitrou (Member), Aug 16, 2024

The criterion should be: is it a valid Parquet file or not?

Member

If it's a valid Parquet file and the issue was a bug in Arrow Go that prevented reading it, then it should go in data. If it's an invalid Parquet file and the issue was that Arrow Go crashed instead of reporting a regular error, then it should go in bad_data.

Member

I'll check this later

mapleFU (Member), Aug 23, 2024

See: #57 (comment)

Dictionary RLE page where the RLE encoding bit width is 0.
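
To make the bit-width-0 case concrete, here is a minimal sketch of decoding dictionary indices from a page body laid out as the Parquet format describes: one bit-width byte followed by RLE/bit-packed hybrid runs. This is not the Arrow Go implementation; the names `decodeDictIndices` and `unpackBits` are hypothetical, and whether the page in ARROW-GH-43605.parquet has exactly this layout is an assumption. The point it illustrates is that with a bit width of 0 every run carries zero payload bytes and every index decodes to 0, so a decoder has to handle zero-length payloads without reading past the buffer.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// unpackBits reads the idx-th bitWidth-wide value from buf, least
// significant bit first. With bitWidth == 0 it never touches buf.
func unpackBits(buf []byte, idx uint64, bitWidth int) uint32 {
	var v uint32
	start := idx * uint64(bitWidth)
	for b := 0; b < bitWidth; b++ {
		pos := start + uint64(b)
		bit := (buf[pos/8] >> (pos % 8)) & 1
		v |= uint32(bit) << uint(b)
	}
	return v
}

// decodeDictIndices (hypothetical helper) decodes n indices from a page
// body: one bit-width byte, then RLE/bit-packed hybrid runs.
func decodeDictIndices(page []byte, n int) ([]uint32, error) {
	if n < 0 {
		return nil, errors.New("negative value count")
	}
	if len(page) == 0 {
		return nil, errors.New("empty page: missing bit-width byte")
	}
	bitWidth := int(page[0])
	if bitWidth > 32 {
		return nil, fmt.Errorf("invalid bit width %d", bitWidth)
	}
	buf := page[1:]
	out := make([]uint32, 0, n)
	for len(out) < n {
		header, k := binary.Uvarint(buf)
		if k <= 0 {
			return nil, errors.New("truncated or malformed run header")
		}
		buf = buf[k:]
		if header&1 == 0 {
			// RLE run: the repeated value occupies ceil(bitWidth/8) bytes,
			// which is zero bytes when bitWidth == 0 (the value is then 0).
			width := (bitWidth + 7) / 8
			if len(buf) < width {
				return nil, errors.New("truncated RLE run value")
			}
			var v uint32
			for i := 0; i < width; i++ {
				v |= uint32(buf[i]) << (8 * uint(i))
			}
			buf = buf[width:]
			take := header >> 1
			if rem := uint64(n - len(out)); take > rem {
				take = rem
			}
			for i := uint64(0); i < take; i++ {
				out = append(out, v)
			}
			continue
		}
		// Bit-packed run: groups of 8 values, bitWidth bytes per group.
		groups := header >> 1
		if bitWidth > 0 && groups > uint64(len(buf))/uint64(bitWidth) {
			return nil, errors.New("truncated bit-packed run")
		}
		nbytes := 0
		if bitWidth > 0 {
			nbytes = int(groups) * bitWidth
		}
		for g := uint64(0); g < groups && len(out) < n; g++ {
			for i := uint64(0); i < 8 && len(out) < n; i++ {
				out = append(out, unpackBits(buf, g*8+i, bitWidth))
			}
		}
		buf = buf[nbytes:]
	}
	return out, nil
}

func main() {
	// Bit width 0, then one RLE run header: 5 repetitions (5<<1 = 0x0A).
	idx, err := decodeDictIndices([]byte{0x00, 0x0A}, 5)
	fmt.Println(idx, err) // prints: [0 0 0 0 0] <nil>
}
```

In this sketch a bit width of 0 is treated as decodable, which matches the observation in the thread that arrow-c++ reads the file; truncated input is reported as a regular error rather than a crash.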