
GH-41317: [C++] Fix crash on invalid Parquet file #41320

Closed

Conversation

rouault
Contributor

rouault commented Apr 21, 2024

Rationale for this change

Fixes the crash detailed in #41317 in TableBatchReader::ReadNext() on a corrupted Parquet file

What changes are included in this PR?

Add validation of the chunk index requested in column_data_[i]->chunk() and return an error if it is out of bounds.
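
For illustration, here is a minimal, self-contained sketch of the kind of guard this adds; the helper name is hypothetical, the real change lives inside TableBatchReader::ReadNext(), and the error text matches the one shown in the traceback below.

#include <memory>

#include <arrow/api.h>

// Illustrative sketch only: before dereferencing a chunk, verify that the
// requested chunk number actually exists in the column.
arrow::Result<std::shared_ptr<arrow::Array>> GetChunkChecked(
    const arrow::ChunkedArray& column, int chunk_number, int column_index) {
  if (chunk_number >= column.num_chunks()) {
    return arrow::Status::Invalid("Requesting too large chunk number ", chunk_number,
                                  " for column ", column_index);
  }
  return column.chunk(chunk_number);
}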

Are these changes tested?

I've tested on the reproducer I provided in #41317 that it now triggers a clean error:

Traceback (most recent call last):
  File "test.py", line 3, in <module>
    [_ for _ in parquet_file.iter_batches()]
  File "test.py", line 3, in <listcomp>
    [_ for _ in parquet_file.iter_batches()]
  File "pyarrow/_parquet.pyx", line 1587, in iter_batches
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Requesting too large chunk number 1 for column 18

I'm not sure if or how unit tests for corrupted datasets should be added.
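
For reference, below is a rough sketch of one possible shape for such a test. It assumes gtest and a hypothetical helper WriteSmallParquetBuffer() that writes a small valid Parquet file into an in-memory buffer; the corruption strategy (truncating the buffer) is illustrative, not a reproduction of the exact file from #41317.

#include <memory>

#include <gtest/gtest.h>

#include <arrow/api.h>
#include <arrow/io/api.h>
#include <parquet/arrow/reader.h>

TEST(CorruptedParquet, NoCrashOnTruncatedFile) {
  // Hypothetical helper producing a small, valid Parquet file in memory.
  std::shared_ptr<arrow::Buffer> buffer = WriteSmallParquetBuffer();
  // Truncate the buffer so the file is no longer self-consistent.
  std::shared_ptr<arrow::Buffer> corrupted =
      arrow::SliceBuffer(buffer, 0, buffer->size() / 2);
  std::unique_ptr<parquet::arrow::FileReader> reader;
  auto status = parquet::arrow::OpenFile(
      std::make_shared<arrow::io::BufferReader>(corrupted),
      arrow::default_memory_pool(), &reader);
  // Opening or reading must fail cleanly; it must never crash.
  if (status.ok()) {
    std::shared_ptr<arrow::Table> table;
    ASSERT_FALSE(reader->ReadTable(&table).ok());
  }
}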

Are there any user-facing changes?

No

This PR contains a "Critical Fix".


⚠️ GitHub issue #41317 has been automatically assigned in GitHub to PR creator.

@rouault
Contributor Author

rouault commented Apr 21, 2024

Digging further, I noticed that FileReaderImpl::DecodeRowGroups() already calls Table::Validate() but GetRecordBatchReader() doesn't, so I also successfully tested the following alternative patch:

diff --git a/cpp/src/parquet/arrow/reader.cc b/cpp/src/parquet/arrow/reader.cc
index d6ad7c25b..9adb6e2c0 100644
--- a/cpp/src/parquet/arrow/reader.cc
+++ b/cpp/src/parquet/arrow/reader.cc
@@ -1044,6 +1044,7 @@ Status FileReaderImpl::GetRecordBatchReader(const std::vector<int>& row_groups,
         }
 
         auto table = ::arrow::Table::Make(batch_schema, std::move(columns));
+        RETURN_NOT_OK(table->Validate());
         auto table_reader = std::make_shared<::arrow::TableBatchReader>(*table);
 
         // NB: explicitly preserve table so that table_reader doesn't outlive it

With that patch, the error reported is "Column 18 named timestamp_us_no_tz expected length 5 but got length 2"

I'm not sure which approach is preferred.

@wgtmac
Member

wgtmac commented Apr 22, 2024

cc @mapleFU

@mapleFU
Member

mapleFU commented Apr 23, 2024

I'll take a careful look tomorrow. The fix is OK, but I'm not sure that's the root cause.

@mapleFU
Member

mapleFU commented Apr 24, 2024

See #41317 (comment)
The root cause is that we didn't check the length. This can be detected by adding ARROW_DCHECK_OK(table_.ValidateFull()); when debug checks are enabled. Maybe we can solve this with stricter checks in RecordReader. Besides, I've pasted a check below:

TableBatchReader::TableBatchReader(const Table& table)
    : owned_table_(nullptr),
      table_(table),
      column_data_(table.num_columns()),
      chunk_numbers_(table.num_columns(), 0),
      chunk_offsets_(table.num_columns(), 0),
      absolute_row_position_(0),
      max_chunksize_(std::numeric_limits<int64_t>::max()) {
  for (int i = 0; i < table.num_columns(); ++i) {
    column_data_[i] = table.column(i).get();
  }
  ARROW_DCHECK_OK(table_.ValidateFull());
}

TableBatchReader::TableBatchReader(std::shared_ptr<Table> table)
    : owned_table_(std::move(table)),
      table_(*owned_table_),
      column_data_(owned_table_->num_columns()),
      chunk_numbers_(owned_table_->num_columns(), 0),
      chunk_offsets_(owned_table_->num_columns(), 0),
      absolute_row_position_(0),
      max_chunksize_(std::numeric_limits<int64_t>::max()) {
  for (int i = 0; i < owned_table_->num_columns(); ++i) {
    column_data_[i] = owned_table_->column(i).get();
  }
  ARROW_DCHECK_OK(table_.ValidateFull());
}

@mapleFU
Member

mapleFU left a comment

diff --git a/cpp/src/parquet/arrow/reader.cc b/cpp/src/parquet/arrow/reader.cc
index d6ad7c25b..e05e243e3 100644
--- a/cpp/src/parquet/arrow/reader.cc
+++ b/cpp/src/parquet/arrow/reader.cc
@@ -1043,6 +1043,17 @@ Status FileReaderImpl::GetRecordBatchReader(const std::vector<int>& row_groups,
           }
         }
 
+        // Check all columns has same row-size
+        if (!columns.empty()) {
+          int64_t row_size = columns[0]->length();
+          for (size_t i = 1; i < columns.size(); ++i) {
+            if (columns[i]->length() != row_size) {
+              return ::arrow::Status::Invalid("read column size inequal");
+            }
+          }
+        }
+
+
         auto table = ::arrow::Table::Make(batch_schema, std::move(columns));
         auto table_reader = std::make_shared<::arrow::TableBatchReader>(*table);

Would you mind changing it like this?

@mapleFU
Member

mapleFU commented Apr 24, 2024

Also cc @felipecrv, do you think TableReader should check that its input is valid? I think the producing side should check it, and the consumer should just DCHECK it?

@felipecrv
Contributor

Also cc @felipecrv, do you think TableReader should check that its input is valid? I think the producing side should check it, and the consumer should just DCHECK it?

Unless the inputs come from an external source (eg IPC), we shouldn't be validating them in Release builds — validation with DCHECK is OK. We should fix the producer of the invalid Table in this case.

Is the Parquet reader producing an invalid Table instance and passing it to TableReader?

@mapleFU
Member

mapleFU commented Apr 24, 2024

Is the Parquet reader producing an invalid Table instance and passing it to TableReader?

Yes. The user is running fuzzing on Parquet files; when parsing a corrupt Parquet file, we currently don't apply enough checks, so I need to add stricter checks here. Just curious: would adding a DCHECK in TableReader help debugging here?

@felipecrv
Contributor

Just curious: would adding a DCHECK in TableReader help debugging here?

Add a DCHECK in TableBatchReader, document the precondition (with /// \pre ... in the constructor), and try to make the Parquet reader correct by construction; if that's not possible, put a Validate() call right before the Table is returned.
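
A minimal sketch of what that documented precondition could look like on the constructor declaration (the \pre wording is illustrative, not the final doc text):

/// \brief Construct a TableBatchReader that reads batches out of the given table
///
/// \pre table must be valid, e.g. Table::ValidateFull() succeeds: every column
/// has the expected length and its chunks are internally consistent.
explicit TableBatchReader(const Table& table);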

@mapleFU
Member

mapleFU commented Apr 24, 2024

Thanks! @rouault Would you mind editing this? Or should I handle it with a new patch?

@rouault
Contributor Author

rouault commented Apr 24, 2024

Would you mind editing this?

I'm not sure in which direction: your proposal above ("// Check all columns has same row-size") or the alternative proposal I made in #41320 (comment) using Validate()? I'm happy if you follow up with another PR in whichever direction you prefer.

I would argue that libarrow/libparquet should be robust against hostile/corrupted datasets even in Release builds, as those kinds of crashes are undesirable and may have security implications.

@mapleFU
Member

mapleFU commented Apr 24, 2024

I'm not sure in which direction: your proposal above ("// Check all columns has same row-size") or the alternative proposal I made in #41320 (comment) using Validate()? I'm happy if you follow up with another PR in whichever direction you prefer.

I mean we can check it here: #41320 (review). After this, an error would be raised if the column lengths don't match. The other checks can be ARROW_DCHECKs.

I would argue that libarrow/libparquet should be robust against hostile/corrupted datasets even in Release builds, as those kinds of crashes are undesirable and may have security implications.

Yeah. Only the malformed table is the root cause of this memory access (see #41317 (comment)); the rest can just be debug checks. ::arrow::Table::Validate is also OK, but it might be a bit heavier than required. It's also OK for me. I think just checking the length is better because parquet::arrow::LeafReader already does Validate on the column.

@rouault
Contributor Author

rouault commented Apr 24, 2024

I mean we can check it here: #41320 (review).

OK, closing this PR and opening #41366 with that fix.

rouault closed this Apr 24, 2024
@felipecrv
Contributor

I would argue that libarrow/libparquet should be robust against hostile/corrupted datasets even in Release builds, as those kinds of crashes are undesirable and may have security implications.

I don't disagree, but we can't and don't want to validate everywhere to ensure safety; some classes need to assume pre-conditions. These pre-conditions should be documented.

For instance, every array type has a complex Validate function that we can't afford to call in every compute kernel, but that doesn't mean we trust files or IPC streams to contain valid arrays. Validations should happen as close as possible to the point of communication with the untrusted external world (e.g. reading Parquet files).
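
As a concrete illustration of "validate at the boundary", here is a sketch using the public parquet::arrow reader API (the function and its error handling are assumptions for the example, not part of this PR):

#include <memory>
#include <string>

#include <arrow/api.h>
#include <arrow/io/api.h>
#include <parquet/arrow/reader.h>

// Illustrative only: validate once, right after ingesting untrusted bytes,
// so downstream consumers can assume the Table is valid.
arrow::Result<std::shared_ptr<arrow::Table>> ReadUntrustedParquet(
    const std::string& path) {
  ARROW_ASSIGN_OR_RAISE(auto input, arrow::io::ReadableFile::Open(path));
  std::unique_ptr<parquet::arrow::FileReader> reader;
  ARROW_RETURN_NOT_OK(parquet::arrow::OpenFile(input, arrow::default_memory_pool(),
                                               &reader));
  std::shared_ptr<arrow::Table> table;
  ARROW_RETURN_NOT_OK(reader->ReadTable(&table));
  ARROW_RETURN_NOT_OK(table->ValidateFull());  // full check at the trust boundary
  return table;
}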
