
GH-41317: [C++] Fix crash on invalid Parquet file #41320

Closed

Conversation

rouault
Contributor

rouault commented Apr 21, 2024

Rationale for this change

Fixes the crash detailed in #41317 in TableBatchReader::ReadNext() on a corrupted Parquet file

What changes are included in this PR?

Add validation of the chunk index requested in column_data_[i]->chunk() and return an error if it is out of bounds.
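
For illustration, here is a minimal, self-contained sketch of the kind of guard this adds; the helper name is hypothetical, the real change lives inside TableBatchReader::ReadNext(), and the error text matches the one shown in the traceback below.

#include <memory>

#include <arrow/api.h>

// Illustrative sketch only: before dereferencing a chunk, verify that the
// requested chunk number actually exists in the column.
arrow::Result<std::shared_ptr<arrow::Array>> GetChunkChecked(
    const arrow::ChunkedArray& column, int chunk_number, int column_index) {
  if (chunk_number >= column.num_chunks()) {
    return arrow::Status::Invalid("Requesting too large chunk number ", chunk_number,
                                  " for column ", column_index);
  }
  return column.chunk(chunk_number);
}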

Are these changes tested?

I've tested on the reproducer I provided in #41317 that it now triggers a clean error:

Traceback (most recent call last):
  File "test.py", line 3, in <module>
    [_ for _ in parquet_file.iter_batches()]
  File "test.py", line 3, in <listcomp>
    [_ for _ in parquet_file.iter_batches()]
  File "pyarrow/_parquet.pyx", line 1587, in iter_batches
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Requesting too large chunk number 1 for column 18

I'm not sure if or how unit tests for corrupted datasets should be added.
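
For reference, below is a rough sketch of one possible shape for such a test. It assumes gtest and a hypothetical helper WriteSmallParquetBuffer() that writes a small valid Parquet file into an in-memory buffer; the corruption strategy (truncating the buffer) is illustrative, not a reproduction of the exact file from #41317.

#include <memory>

#include <gtest/gtest.h>

#include <arrow/api.h>
#include <arrow/io/api.h>
#include <parquet/arrow/reader.h>

TEST(CorruptedParquet, NoCrashOnTruncatedFile) {
  // Hypothetical helper producing a small, valid Parquet file in memory.
  std::shared_ptr<arrow::Buffer> buffer = WriteSmallParquetBuffer();
  // Truncate the buffer so the file is no longer self-consistent.
  std::shared_ptr<arrow::Buffer> corrupted =
      arrow::SliceBuffer(buffer, 0, buffer->size() / 2);
  std::unique_ptr<parquet::arrow::FileReader> reader;
  auto status = parquet::arrow::OpenFile(
      std::make_shared<arrow::io::BufferReader>(corrupted),
      arrow::default_memory_pool(), &reader);
  // Opening or reading must fail cleanly; it must never crash.
  if (status.ok()) {
    std::shared_ptr<arrow::Table> table;
    ASSERT_FALSE(reader->ReadTable(&table).ok());
  }
}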

Are there any user-facing changes?

No

This PR contains a "Critical Fix".


⚠️ GitHub issue #41317 has been automatically assigned in GitHub to PR creator.

@rouault
Contributor Author

rouault commented Apr 21, 2024

Digging further, I noticed that FileReaderImpl::DecodeRowGroups() already calls Table::Validate() but GetRecordBatchReader() doesn't, so I also successfully tested the following alternative patch:

diff --git a/cpp/src/parquet/arrow/reader.cc b/cpp/src/parquet/arrow/reader.cc
index d6ad7c25b..9adb6e2c0 100644
--- a/cpp/src/parquet/arrow/reader.cc
+++ b/cpp/src/parquet/arrow/reader.cc
@@ -1044,6 +1044,7 @@ Status FileReaderImpl::GetRecordBatchReader(const std::vector<int>& row_groups,
         }
 
         auto table = ::arrow::Table::Make(batch_schema, std::move(columns));
+        RETURN_NOT_OK(table->Validate());
         auto table_reader = std::make_shared<::arrow::TableBatchReader>(*table);
 
         // NB: explicitly preserve table so that table_reader doesn't outlive it

With that patch, the error reported is "Column 18 named timestamp_us_no_tz expected length 5 but got length 2"

I'm not sure which approach is preferred.

@wgtmac
Member

wgtmac commented Apr 22, 2024

cc @mapleFU

@mapleFU
Member

mapleFU commented Apr 23, 2024

I'll take a careful look tomorrow. The fix is OK, but I'm not sure that's the root cause.

@mapleFU
Member

mapleFU commented Apr 24, 2024

See #41317 (comment)
The root cause is that we didn't check the length. This can be detected by adding ARROW_DCHECK_OK(table_.ValidateFull()); when debug checks are enabled. Maybe we can solve this with stricter checks in RecordReader. Besides, I've pasted a check below:

TableBatchReader::TableBatchReader(const Table& table)
    : owned_table_(nullptr),
      table_(table),
      column_data_(table.num_columns()),
      chunk_numbers_(table.num_columns(), 0),
      chunk_offsets_(table.num_columns(), 0),
      absolute_row_position_(0),
      max_chunksize_(std::numeric_limits<int64_t>::max()) {
  for (int i = 0; i < table.num_columns(); ++i) {
    column_data_[i] = table.column(i).get();
  }
  ARROW_DCHECK_OK(table_.ValidateFull());
}

TableBatchReader::TableBatchReader(std::shared_ptr<Table> table)
    : owned_table_(std::move(table)),
      table_(*owned_table_),
      column_data_(owned_table_->num_columns()),
      chunk_numbers_(owned_table_->num_columns(), 0),
      chunk_offsets_(owned_table_->num_columns(), 0),
      absolute_row_position_(0),
      max_chunksize_(std::numeric_limits<int64_t>::max()) {
  for (int i = 0; i < owned_table_->num_columns(); ++i) {
    column_data_[i] = owned_table_->column(i).get();
  }
  ARROW_DCHECK_OK(table_.ValidateFull());
}

@mapleFU
Member

mapleFU left a comment

diff --git a/cpp/src/parquet/arrow/reader.cc b/cpp/src/parquet/arrow/reader.cc
index d6ad7c25b..e05e243e3 100644
--- a/cpp/src/parquet/arrow/reader.cc
+++ b/cpp/src/parquet/arrow/reader.cc
@@ -1043,6 +1043,17 @@ Status FileReaderImpl::GetRecordBatchReader(const std::vector<int>& row_groups,
           }
         }
 
+        // Check all columns has same row-size
+        if (!columns.empty()) {
+          int64_t row_size = columns[0]->length();
+          for (size_t i = 1; i < columns.size(); ++i) {
+            if (columns[i]->length() != row_size) {
+              return ::arrow::Status::Invalid("read column size inequal");
+            }
+          }
+        }
+
+
         auto table = ::arrow::Table::Make(batch_schema, std::move(columns));
         auto table_reader = std::make_shared<::arrow::TableBatchReader>(*table);

Would you mind changing it like this?

@mapleFU
Member

mapleFU commented Apr 24, 2024

Also cc @felipecrv, do you think TableReader should check that its input is valid? I think the producing side should check it, and the consumer should just DCHECK it?

@felipecrv
Contributor

Also cc @felipecrv, do you think TableReader should check that its input is valid? I think the producing side should check it, and the consumer should just DCHECK it?

Unless the inputs come from an external source (eg IPC), we shouldn't be validating them in Release builds — validation with DCHECK is OK. We should fix the producer of the invalid Table in this case.

Is the Parquet reader producing an invalid Table instance and passing it to TableReader?

@mapleFU
Member

mapleFU commented Apr 24, 2024

Is the Parquet reader producing an invalid Table instance and passing it to TableReader?

Yes. The user is running fuzzing on Parquet files; when parsing a corrupt Parquet file, we currently don't apply enough checks, so I need to add stricter checks here. Just curious: would adding a DCHECK in TableReader help debugging here?

@felipecrv
Contributor

Just curious: would adding a DCHECK in TableReader help debugging here?

Add a DCHECK in TableBatchReader, document the precondition (with /// \pre ... in the constructor), and try to make the Parquet reader correct by construction; if that's not possible, put a Validate() call right before the Table is returned.
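
A minimal sketch of what that documented precondition could look like on the constructor declaration (the \pre wording is illustrative, not the final doc text):

/// \brief Construct a TableBatchReader that reads batches out of the given table
///
/// \pre table must be valid, e.g. Table::ValidateFull() succeeds: every column
/// has the expected length and its chunks are internally consistent.
explicit TableBatchReader(const Table& table);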

@mapleFU
Member

mapleFU commented Apr 24, 2024

Thanks! @rouault Would you mind editing this? Or should I handle it with a new patch?

@rouault
Contributor Author

rouault commented Apr 24, 2024

Would you mind editing this?

I'm not sure in which direction: your proposal above ("// Check all columns has same row-size") or the alternative proposal I made in #41320 (comment) using Validate()? I'm happy if you follow up with another PR in whichever direction you prefer.

I would argue that libarrow/libparquet should be robust against hostile/corrupted datasets even in Release builds, as those kinds of crashes are undesirable and may have security implications.

@mapleFU
Member

mapleFU commented Apr 24, 2024

I'm not sure in which direction: your proposal above ("// Check all columns has same row-size") or the alternative proposal I made in #41320 (comment) using Validate()? I'm happy if you follow up with another PR in whichever direction you prefer.

I mean we can check it here: #41320 (review). After this, an error would be raised if the column lengths don't match. The other checks can be ARROW_DCHECKs.

I would argue that libarrow/libparquet should be robust against hostile/corrupted datasets even in Release builds, as those kinds of crashes are undesirable and may have security implications.

Yeah. Only the malformed table is the root cause of this memory access (see #41317 (comment)); the rest can just be debug checks. ::arrow::Table::Validate is also OK, but it might be a bit heavier than required. It's also OK for me. I think just checking the length is better because parquet::arrow::LeafReader already does Validate on the column.

@rouault
Contributor Author

rouault commented Apr 24, 2024

I mean we can check it here: #41320 (review).

OK, closing this PR and opening #41366 with that fix.

rouault closed this Apr 24, 2024
@felipecrv
Contributor

I would argue that libarrow/libparquet should be robust against hostile/corrupted datasets even in Release builds, as those kinds of crashes are undesirable and may have security implications.

I don't disagree, but we can't and don't want to validate everywhere to ensure safety; some classes need to assume pre-conditions. These pre-conditions should be documented.

For instance, every array type has a complex Validate function that we can't afford to call in every compute kernel, but that doesn't mean we trust files or IPC streams to contain valid arrays. Validations should happen as close as possible to the point of communication with the untrusted external world (e.g. reading Parquet files).
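
As a concrete illustration of "validate at the boundary", here is a sketch using the public parquet::arrow reader API (the function and its error handling are assumptions for the example, not part of this PR):

#include <memory>
#include <string>

#include <arrow/api.h>
#include <arrow/io/api.h>
#include <parquet/arrow/reader.h>

// Illustrative only: validate once, right after ingesting untrusted bytes,
// so downstream consumers can assume the Table is valid.
arrow::Result<std::shared_ptr<arrow::Table>> ReadUntrustedParquet(
    const std::string& path) {
  ARROW_ASSIGN_OR_RAISE(auto input, arrow::io::ReadableFile::Open(path));
  std::unique_ptr<parquet::arrow::FileReader> reader;
  ARROW_RETURN_NOT_OK(parquet::arrow::OpenFile(input, arrow::default_memory_pool(),
                                               &reader));
  std::shared_ptr<arrow::Table> table;
  ARROW_RETURN_NOT_OK(reader->ReadTable(&table));
  ARROW_RETURN_NOT_OK(table->ValidateFull());  // full check at the trust boundary
  return table;
}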
