[C++][Parquet] Crash / heap-use-after-free in ByteArrayChunkedRecordReader::ReadValuesSpaced() on a corrupted Parquet file #41321
Comments
What version of the code are you using? When I read this file I got "Invalid or corrupted bit_width". Did you select some columns during the read?
Ah nice, that's probably an exception-safety problem. I'll dive into it.
I've checked the C++ path using a sanitizer: this raises the error but doesn't cause the bad memory access. I guess the problem is in the PyArrow wrapper.
No, it is not PyArrow-specific. It can also be reproduced using the plain C++ Parquet Arrow API. Pseudo-code:

```cpp
#include "arrow/io/file.h"
#include "arrow/memory_pool.h"
#include "parquet/arrow/reader.h"
#include "parquet/exception.h"

// Declaration added to make the pseudo-code self-contained; the file name
// is a placeholder for the unzipped attachment.
PARQUET_ASSIGN_OR_THROW(auto infile,
                        arrow::io::ReadableFile::Open("corrupt.parquet"));

std::unique_ptr<parquet::arrow::FileReader> arrow_reader;
auto poMemoryPool = std::shared_ptr<arrow::MemoryPool>(
    arrow::MemoryPool::CreateDefault().release());
PARQUET_THROW_NOT_OK(parquet::arrow::OpenFile(std::move(infile),
                                              poMemoryPool.get(), &arrow_reader));

std::vector<int> anRowGroups;  // was used undeclared in the original snippet
const int nNumGroups = arrow_reader->num_row_groups();
for (int i = 0; i < nNumGroups; ++i)
    anRowGroups.push_back(i);

std::shared_ptr<arrow::RecordBatchReader> poRecordBatchReader;
PARQUET_THROW_NOT_OK(
    arrow_reader->GetRecordBatchReader(anRowGroups, &poRecordBatchReader));
std::shared_ptr<arrow::RecordBatch> poBatch;
while (true)
{
    poRecordBatchReader->ReadNext(&poBatch);  // status intentionally ignored
    if (!poBatch)
        break;
}
```
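When this loop is built against a sanitizer-enabled Arrow (an ASAN build; Arrow's CMake exposes an `ARROW_USE_ASAN` option, though the exact flag to use is an assumption here) and run on the attached file, the heap-use-after-free is reported inside ByteArrayChunkedRecordReader::ReadValuesSpaced(), matching this issue's title.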
Emmm, would you mind checking
By chance, can you share the source code of the standalone .cpp you use, so I can start from that? That would make it easier for me to tune it into a full reproducer.
This is not a leak but a heap-use-after-free, and from my understanding the error happens in an auxiliary reading thread. But I'm not familiar enough with the libarrow/libparquet internals.
Do you mean the code here: https://github.com/OSGeo/gdal/blob/27b5611353ae0cfe6d4e0244ef3272723845bd14/ogr/ogrsf_frmts/parquet/ogrparquetlayer.cpp ? Let me try it.
Oh, I've reproduced the problem, let me fix it.
The root cause of this bad memory access is clear: it doesn't happen when reading a "valid" Parquet file. It occurs while decompressing the "corrupt" file, which has two row-groups:

When decoding the first row, the counts go wrong. Fix: when reading levels, check that the number of levels read equals the count declared in the row-group metadata (this check is already applied when reading values [1], but reading levels doesn't perform it).

[1] https://github.com/apache/arrow/blob/main/cpp/src/parquet/column_reader.cc#L794
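A minimal sketch of the invariant described above, with illustrative names only (`LevelDecoderStub`, `CheckedReadLevels`, and `num_remaining` are not the actual parquet-cpp identifiers; the real level decoding lives in the file referenced in [1]):

```cpp
#include <cstdint>
#include <stdexcept>

// Illustrative stand-in for the level decoder; the real class is
// parquet::LevelDecoder in cpp/src/parquet/column_reader.cc.
struct LevelDecoderStub {
  int64_t Decode(int64_t batch_size, int16_t* levels);  // returns levels decoded
};

// Sketch of the missing check: the number of levels decoded must never
// exceed what the row-group metadata declares is left. Value decoding
// already guards against this; level decoding previously trusted the
// decoder blindly, which is how the corrupt file led to bad reads.
int64_t CheckedReadLevels(LevelDecoderStub& decoder, int64_t batch_size,
                          int16_t* def_levels, int64_t num_remaining) {
  const int64_t levels_read = decoder.Decode(batch_size, def_levels);
  if (levels_read > num_remaining) {
    throw std::runtime_error(
        "Malformed Parquet file: decoded more levels than metadata declared");
  }
  return levels_read;
}
```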
@rouault I've found out the reason, see #41321 (comment). I'm a bit tired today and will fix it tomorrow. This will not happen when the file is not corrupt.
take
@rouault I've verified my fix works on this file. Besides, you could also upload the corrupt file to https://github.com/apache/parquet-testing/tree/master/bad_data; this can help other subprojects test against the same problem.
submitted per apache/parquet-testing#48
### Rationale for this change

In #41321, a user reports a crash when reading from a corrupt Parquet file. This happens because some checks are missing: the current code works when reading a normal Parquet file, but it needs to be stricter when reading a corrupt one. **Currently this patch only strengthens the checks on the Parquet level decoding; the corresponding value checks will be added in later patches.**

### What changes are included in this PR?

Stricter Parquet checks on levels.

### Are these changes tested?

Tests already exist; maybe we can introduce the Parquet file as a test file.

### Are there any user-facing changes?

Stricter checks.

* GitHub Issue: #41321

Lead-authored-by: mwish <maplewish117@gmail.com>
Co-authored-by: mwish <anmmscs_maple@qq.com>
Signed-off-by: mwish <maplewish117@gmail.com>
Issue resolved by pull request 41346
Describe the bug, including details regarding any error messages, version, and platform.
While fuzzing the GDAL Parquet reader with a local run of OSS-Fuzz, I got the following crash in ByteArrayChunkedRecordReader::ReadValuesSpaced() on the attached fuzzed Parquet file (to be unzipped first): crash-34fd88d625cc5fef893bcba62aad402883d98f47.zip
The bug isn't specific to the GDAL integration and can be reproduced with this simple pyarrow.parquet-based script:
ParquetReader.scan_contents() detects an error, so there's likely a missing validation in the code path followed by DecodeRowGroups() (the fix I proposed in #41320 (comment) doesn't help):
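The original script and its traceback aren't captured in this dump. For reference, a rough C++ analogue of the scan_contents() check, as a hedged sketch: parquet::ScanFileContents() is the low-level scan helper declared in parquet/file_reader.h, and the file name and batch size below are assumptions.

```cpp
#include <memory>
#include <vector>

#include "parquet/exception.h"
#include "parquet/file_reader.h"

// Sketch: decode every value in the file, the way scan_contents() does.
// This path detects the corruption, which suggests the record-batch
// reading path is missing an equivalent validation.
int main() {
  // File name is an assumption; use the unzipped attachment's path.
  std::unique_ptr<parquet::ParquetFileReader> reader =
      parquet::ParquetFileReader::OpenFile("corrupt.parquet");
  std::vector<int> columns;  // empty vector => scan all columns
  try {
    parquet::ScanFileContents(columns, /*column_batch_size=*/256, reader.get());
  } catch (const parquet::ParquetException&) {
    // Expect an error such as "Invalid or corrupted bit_width" here.
  }
  return 0;
}
```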
Component(s)
C++, Parquet