[C++] [Parquet] Crash / heap-buffer-overflow in TableBatchReader::ReadNext() on a corrupted Parquet file #41317
Comments
I found the root cause is:
That's because less data is read into the table. I may suggest another way to fix this.
rouault added a commit to rouault/arrow that referenced this issue on Apr 24, 2024
Lead-authored-by: mwish <maplewish117@gmail.com>
rouault added a commit to rouault/arrow that referenced this issue on Apr 24, 2024
rouault added a commit to rouault/arrow that referenced this issue on Apr 27, 2024
rouault added a commit to rouault/arrow that referenced this issue on Apr 28, 2024
mapleFU pushed a commit that referenced this issue on Apr 30, 2024
### Rationale for this change

Fixes the crash detailed in #41317 in TableBatchReader::ReadNext() on a corrupted Parquet file.

### What changes are included in this PR?

Add a validation that all read columns have the same size.

### Are these changes tested?

I've tested on the reproducer I provided in #41317 that it now triggers a clean error:

```
Traceback (most recent call last):
  File "test.py", line 3, in <module>
    [_ for _ in parquet_file.iter_batches()]
  File "test.py", line 3, in <listcomp>
    [_ for _ in parquet_file.iter_batches()]
  File "pyarrow/_parquet.pyx", line 1587, in iter_batches
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: columns do not have the same size
```

I'm not sure if/how unit tests for corrupted datasets should be added.

### Are there any user-facing changes?

No

**This PR contains a "Critical Fix".**

* GitHub Issue: #41317

Authored-by: Even Rouault <even.rouault@spatialys.com>
Signed-off-by: mwish <maplewish117@gmail.com>
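To make the fix concrete, here is a minimal sketch of the invariant the validation enforces. This is illustrative Python only, not the actual C++ change to TableBatchReader::ReadNext(); the helper name and the way the error is raised are assumptions.

```python
import pyarrow as pa

def check_columns_same_size(columns):
    # Illustrative invariant: every column read for a batch must have the
    # same length. A corrupted file can leave one column shorter than the
    # others, which previously led to out-of-bounds reads downstream.
    lengths = {len(col) for col in columns}
    if len(lengths) > 1:
        raise pa.lib.ArrowInvalid("columns do not have the same size")

# Example: the second column is shorter, so the check raises ArrowInvalid.
check_columns_same_size([pa.array([1, 2, 3]), pa.array([4, 5])])
```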
Issue resolved by pull request #41366.
tolleybot pushed a commit to tmct/arrow that referenced this issue on May 2, 2024
vibhatha pushed a commit to vibhatha/arrow that referenced this issue on May 25, 2024
### Describe the bug, including details regarding any error messages, version, and platform.
While fuzzing the GDAL Parquet reader with a local run of OSS-Fuzz, I got the following crash in TableBatchReader::ReadNext() on the attached fuzzed Parquet file (to be unzipped first): crash-581a7ec06da982291398aa3f63361ecb69fe20fc.zip
The bug isn't specific to the GDAL integration and can be reproduced with this simple pyarrow.parquet based script:
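The script itself isn't preserved in this capture; below is a minimal sketch reconstructed from the traceback quoted above, assuming the unzipped crash file sits in the current directory under its hash name (the exact file name is an assumption):

```python
import pyarrow.parquet as pq

# Assumed name of the unzipped fuzzed file from the attached archive.
parquet_file = pq.ParquetFile("crash-581a7ec06da982291398aa3f63361ecb69fe20fc")
[_ for _ in parquet_file.iter_batches()]  # crashed before the fix; now raises ArrowInvalid
```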
Reproducible with v15.0.0 and latest master at the time of writing (16e20b7).
I'm not sure if it is related, but trying to read with ParquetReader.read_all() triggers a clean exception:
and another one when using ParquetReader.scan_contents():
### Component(s)
C++, Parquet