GH-41317: [C++] Fix crash on invalid Parquet file #41320
Conversation
Digging further, seeing that FileReaderImpl::DecodeRowGroups() already calls Table::Validate(), but that GetRecordBatchReader() didn't, I've also tested successfully the following alternative patch:
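The patch body itself wasn't captured in this thread. A minimal sketch of what such an alternative plausibly looks like, assuming it validates the assembled Table in GetRecordBatchReader() the same way DecodeRowGroups() already does (hypothetical reconstruction, not the verbatim patch):

// In FileReaderImpl::GetRecordBatchReader(), after the columns are assembled:
auto table = ::arrow::Table::Make(batch_schema, std::move(columns));
// Table::Validate() cross-checks each column's length against the table's
// row count, so a corrupted file yields a clean Status instead of a crash.
RETURN_NOT_OK(table->Validate());
auto table_reader = std::make_shared<::arrow::TableBatchReader>(*table);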
With that patch, the error reported is "Column 18 named timestamp_us_no_tz expected length 5 but got length 2". I'm not sure which approach is preferred.
cc @mapleFU
I'll take a careful look tomorrow. The fix is OK, but I'm not sure that's the root cause.
See #41317 (comment)
diff --git a/cpp/src/parquet/arrow/reader.cc b/cpp/src/parquet/arrow/reader.cc
index d6ad7c25b..e05e243e3 100644
--- a/cpp/src/parquet/arrow/reader.cc
+++ b/cpp/src/parquet/arrow/reader.cc
@@ -1043,6 +1043,17 @@ Status FileReaderImpl::GetRecordBatchReader(const std::vector<int>& row_groups,
     }
   }
+  // Check that all columns have the same row size
+  if (!columns.empty()) {
+    int64_t row_size = columns[0]->length();
+    for (size_t i = 1; i < columns.size(); ++i) {
+      if (columns[i]->length() != row_size) {
+        return ::arrow::Status::Invalid("read columns have unequal lengths");
+      }
+    }
+  }
+
+
   auto table = ::arrow::Table::Make(batch_schema, std::move(columns));
   auto table_reader = std::make_shared<::arrow::TableBatchReader>(*table);
Would you mind changing it like this?
Also cc @felipecrv: do you think TableReader should check that its input is valid? I think the producing side should check it, and the consumer had better just DCHECK it?
Unless the inputs come from an external source (e.g. IPC), we shouldn't be validating them in TableBatchReader. Is the Parquet reader producing an invalid Table?
Yes. The user is running fuzzing on Parquet files; when parsing a corrupt Parquet file, we currently don't apply enough checks, so I need to add stricter checks here. Just curious: would adding a DCHECK in TableReader help debugging here?
Add a DCHECK, then.
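For reference, a sketch of what such a debug-only assertion could look like, assuming Arrow's DCHECK_OK macro from arrow/util/logging.h (illustrative helper, not the actual Arrow source):

#include "arrow/table.h"
#include "arrow/util/logging.h"  // for DCHECK_OK

// Hypothetical helper a consumer such as TableBatchReader could call at
// construction time: debug builds abort loudly if the producer handed over
// a table whose columns disagree on length; release builds skip the check.
void DebugCheckTable(const ::arrow::Table& table) {
  DCHECK_OK(table.Validate());
}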
Thanks! @rouault, would you mind editing this? Or should I handle it with a new patch?
Not sure in which direction: your above proposal ("// Check that all columns have the same row size") or the alternative proposal I made in #41320 (comment) using Validate()? I'm happy if you follow up with another PR in the direction you prefer. I would argue that libarrow/libparquet should be robust against hostile/corrupted datasets even in Release builds, as those kinds of crashes are undesirable and may potentially have security implications.
I mean we can check it here: #41320 (review). After this, an Error would be raised if the column lengths mismatch. The others can be debug checks.
Yeah. I think only the malformed table is the root cause of this memory access (see #41317 (comment)); the others can just be debug checks.
OK, closing that PR and opening #41366 with that fix.
I don't disagree, but we can't and don't want to validate everywhere to ensure safety; some classes need to assume pre-conditions. These pre-conditions should be documented. For instance, every array type has a complex Validate() implementation.
Rationale for this change
Fixes the crash detailed in #41317 in TableBatchReader::ReadNext() on a corrupted Parquet file
What changes are included in this PR?
Add validation of the chunk index requested in column_data_[i]->chunk() and return an error if it is out of bounds.
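A hedged sketch of that kind of bounds check, written as a standalone helper rather than the in-place change inside TableBatchReader::ReadNext() (the helper name GetChunkChecked is illustrative):

#include "arrow/array.h"
#include "arrow/chunked_array.h"
#include "arrow/result.h"
#include "arrow/status.h"

// Illustrative helper: fetch a chunk only after bounds-checking the index,
// so a corrupted file produces a clean error instead of an invalid read.
::arrow::Result<std::shared_ptr<::arrow::Array>> GetChunkChecked(
    const ::arrow::ChunkedArray& column, int chunk_index) {
  if (chunk_index < 0 || chunk_index >= column.num_chunks()) {
    return ::arrow::Status::IndexError("Chunk index ", chunk_index,
                                       " out of bounds for column with ",
                                       column.num_chunks(), " chunk(s)");
  }
  return column.chunk(chunk_index);
}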
Are these changes tested?
I've tested on the reproducer I provided in #41317 that it now triggers a clean error.
I'm not sure if/how unit tests for corrupted datasets should be added.
Are there any user-facing changes?
No
This PR contains a "Critical Fix".