
[C++][Parquet] Crash / heap-use-after-free in ByteArrayChunkedRecordReader::ReadValuesSpaced() on a corrupted Parquet file #41321

Closed
rouault opened this issue Apr 21, 2024 · 15 comments

Comments

@rouault
Contributor

rouault commented Apr 21, 2024

Describe the bug, including details regarding any error messages, version, and platform.

While fuzzing the GDAL Parquet reader with a local run of OSS-Fuzz, I got the following crash in ByteArrayChunkedRecordReader::ReadValuesSpaced() on the attached fuzzed Parquet file (to be unzipped first): crash-34fd88d625cc5fef893bcba62aad402883d98f47.zip

==14==ERROR: AddressSanitizer: heap-use-after-free on address 0x60f000046e58 at pc 0x000007a43e97 bp 0x7f926c00d7e0 sp 0x7f926c00d7d8
READ of size 8 at 0x60f000046e58 thread T6
SCARINESS: 51 (8-byte-read-heap-use-after-free)
    #0 0x7a43e96 in parquet::internal::(anonymous namespace)::ByteArrayChunkedRecordReader::ReadValuesSpaced(long, long) /src/gdal/arrow/cpp/src/parquet/column_reader.cc:2135:51
    #1 0x7a3d3e4 in ReadSpacedForOptionalOrRepeated /src/gdal/arrow/cpp/src/parquet/column_reader.cc:1914:5
    #2 0x7a3d3e4 in ReadOptionalRecords /src/gdal/arrow/cpp/src/parquet/column_reader.cc:1870:7
    #3 0x7a3d3e4 in parquet::internal::(anonymous namespace)::TypedRecordReader >::ReadRecordData(long) /src/gdal/arrow/cpp/src/parquet/column_reader.cc:1940:22
    #4 0x7a1fce0 in parquet::internal::(anonymous namespace)::TypedRecordReader >::ReadRecords(long) /src/gdal/arrow/cpp/src/parquet/column_reader.cc
    #5 0x78acf06 in parquet::arrow::(anonymous namespace)::LeafReader::LoadBatch(long) /src/gdal/arrow/cpp/src/parquet/arrow/reader.cc:482:46
    #6 0x78d07a5 in NextBatch /src/gdal/arrow/cpp/src/parquet/arrow/reader.cc:109:5
    #7 0x78d07a5 in operator() /src/gdal/arrow/cpp/src/parquet/arrow/reader.cc:1036:9
    #8 0x78d07a5 in operator()<(lambda at /src/gdal/arrow/cpp/src/parquet/arrow/reader.cc:1036:9) &, int &, arrow::Status, arrow::Future > /src/gdal/arrow/cpp/src/arrow/util/future.h:150:23
    #9 0x78d07a5 in __invoke &, (lambda at /src/gdal/arrow/cpp/src/parquet/arrow/reader.cc:1036:9) &, int &> /usr/local/bin/../include/c++/v1/type_traits:3592:23
    #10 0x78d07a5 in __apply_functor, (lambda at /src/gdal/arrow/cpp/src/parquet/arrow/reader.cc:1036:9), int>, 0UL, 1UL, 2UL, std::__1::tuple<> > /usr/local/bin/../include/c++/v1/__functional/bind.h:263:12
    #11 0x78d07a5 in operator()<> /usr/local/bin/../include/c++/v1/__functional/bind.h:298:20
    #12 0x78d07a5 in arrow::internal::FnOnce::FnImpl&, parquet::arrow::(anonymous namespace)::FileReaderImpl::GetRecordBatchReader(std::__1::vector > const&, std::__1::vector > const&, std::__1::unique_ptr >*)::$_1::operator()()::'lambda'(int)&, int&> >::invoke() /src/gdal/arrow/cpp/src/arrow/util/functional.h:152:42
    #13 0x66b0845 in operator() /src/gdal/arrow/cpp/src/arrow/util/functional.h:140:17
    #14 0x66b0845 in WorkerLoop /src/gdal/arrow/cpp/src/arrow/util/thread_pool.cc:457:11
    #15 0x66b0845 in operator() /src/gdal/arrow/cpp/src/arrow/util/thread_pool.cc:618:7
    #16 0x66b0845 in __invoke<(lambda at /src/gdal/arrow/cpp/src/arrow/util/thread_pool.cc:616:23)> /usr/local/bin/../include/c++/v1/type_traits:3592:23
    #17 0x66b0845 in __thread_execute >, (lambda at /src/gdal/arrow/cpp/src/arrow/util/thread_pool.cc:616:23)> /usr/local/bin/../include/c++/v1/thread:281:5
    #18 0x66b0845 in void* std::__1::__thread_proxy >, arrow::internal::ThreadPool::LaunchWorkersUnlocked(int)::$_3> >(void*) /usr/local/bin/../include/c++/v1/thread:292:5
    #19 0x7f9272659608 in start_thread (/lib/x86_64-linux-gnu/libpthread.so.0+0x8608) (BuildId: 0c044ba611aeeeaebb8374e660061f341ebc0bac)
    #20 0x7f927240a352 in __clone (/lib/x86_64-linux-gnu/libc.so.6+0x11f352) (BuildId: eebe5d5f4b608b8a53ec446b63981bba373ca0ca)
DEDUP_TOKEN: parquet::internal::(anonymous namespace)::ByteArrayChunkedRecordReader::ReadValuesSpaced(long, long)--ReadSpacedForOptionalOrRepeated--ReadOptionalRecords
0x60f000046e58 is located 168 bytes inside of 176-byte region [0x60f000046db0,0x60f000046e60)
freed by thread T6 here:
    #0 0x5fb33d in operator delete(void*) /src/llvm-project/compiler-rt/lib/asan/asan_new_delete.cpp:152:3
    #1 0x7a245a9 in operator() /usr/local/bin/../include/c++/v1/__memory/unique_ptr.h:53:5
    #2 0x7a245a9 in reset /usr/local/bin/../include/c++/v1/__memory/unique_ptr.h:314:7
    #3 0x7a245a9 in ~unique_ptr /usr/local/bin/../include/c++/v1/__memory/unique_ptr.h:268:19
    #4 0x7a245a9 in ~pair /usr/local/bin/../include/c++/v1/__utility/pair.h:40:29
    #5 0x7a245a9 in destroy >, std::__1::default_delete > > > >, void, void> /usr/local/bin/../include/c++/v1/__memory/allocator_traits.h:319:15
    #6 0x7a245a9 in __deallocate_node /usr/local/bin/../include/c++/v1/__hash_table:1572:9
    #7 0x7a245a9 in clear /usr/local/bin/../include/c++/v1/__hash_table:1818:9
    #8 0x7a245a9 in clear /usr/local/bin/../include/c++/v1/unordered_map:1346:42
    #9 0x7a245a9 in ResetDecoders /src/gdal/arrow/cpp/src/parquet/column_reader.cc:1810:42
    #10 0x7a245a9 in SetPageReader /src/gdal/arrow/cpp/src/parquet/column_reader.cc:1802:5
    #11 0x7a245a9 in virtual thunk to parquet::internal::(anonymous namespace)::TypedRecordReader >::SetPageReader(std::__1::unique_ptr >) /src/gdal/arrow/cpp/src/parquet/column_reader.cc
    #12 0x78abf1d in parquet::arrow::(anonymous namespace)::LeafReader::NextRowGroup() /src/gdal/arrow/cpp/src/parquet/arrow/reader.cc:506:21
    #13 0x78acf3e in parquet::arrow::(anonymous namespace)::LeafReader::LoadBatch(long) /src/gdal/arrow/cpp/src/parquet/arrow/reader.cc:485:9
    #14 0x78d07a5 in NextBatch /src/gdal/arrow/cpp/src/parquet/arrow/reader.cc:109:5
    #15 0x78d07a5 in operator() /src/gdal/arrow/cpp/src/parquet/arrow/reader.cc:1036:9
    #16 0x78d07a5 in operator()<(lambda at /src/gdal/arrow/cpp/src/parquet/arrow/reader.cc:1036:9) &, int &, arrow::Status, arrow::Future > /src/gdal/arrow/cpp/src/arrow/util/future.h:150:23
    #17 0x78d07a5 in __invoke &, (lambda at /src/gdal/arrow/cpp/src/parquet/arrow/reader.cc:1036:9) &, int &> /usr/local/bin/../include/c++/v1/type_traits:3592:23
    #18 0x78d07a5 in __apply_functor, (lambda at /src/gdal/arrow/cpp/src/parquet/arrow/reader.cc:1036:9), int>, 0UL, 1UL, 2UL, std::__1::tuple<> > /usr/local/bin/../include/c++/v1/__functional/bind.h:263:12
    #19 0x78d07a5 in operator()<> /usr/local/bin/../include/c++/v1/__functional/bind.h:298:20
    #20 0x78d07a5 in arrow::internal::FnOnce::FnImpl&, parquet::arrow::(anonymous namespace)::FileReaderImpl::GetRecordBatchReader(std::__1::vector > const&, std::__1::vector > const&, std::__1::unique_ptr >*)::$_1::operator()()::'lambda'(int)&, int&> >::invoke() /src/gdal/arrow/cpp/src/arrow/util/functional.h:152:42
    #21 0x66b0845 in operator() /src/gdal/arrow/cpp/src/arrow/util/functional.h:140:17
    #22 0x66b0845 in WorkerLoop /src/gdal/arrow/cpp/src/arrow/util/thread_pool.cc:457:11
    #23 0x66b0845 in operator() /src/gdal/arrow/cpp/src/arrow/util/thread_pool.cc:618:7
    #24 0x66b0845 in __invoke<(lambda at /src/gdal/arrow/cpp/src/arrow/util/thread_pool.cc:616:23)> /usr/local/bin/../include/c++/v1/type_traits:3592:23
    #25 0x66b0845 in __thread_execute >, (lambda at /src/gdal/arrow/cpp/src/arrow/util/thread_pool.cc:616:23)> /usr/local/bin/../include/c++/v1/thread:281:5
    #26 0x66b0845 in void* std::__1::__thread_proxy >, arrow::internal::ThreadPool::LaunchWorkersUnlocked(int)::$_3> >(void*) /usr/local/bin/../include/c++/v1/thread:292:5
    #27 0x7f9272659608 in start_thread (/lib/x86_64-linux-gnu/libpthread.so.0+0x8608) (BuildId: 0c044ba611aeeeaebb8374e660061f341ebc0bac)

The bug isn't specific to the GDAL integration and can be reproduced with this simple pyarrow.parquet-based script:

import pyarrow.parquet as pq
parquet_file = pq.ParquetFile('crash-34fd88d625cc5fef893bcba62aad402883d98f47')
parquet_file.read()
==4171200== Invalid read of size 8
==4171200==    at 0xFBFDE22: parquet::internal::(anonymous namespace)::ByteArrayChunkedRecordReader::ReadValuesSpaced(long, long) (column_reader.cc:2180)
==4171200==    by 0xFC472DE: parquet::internal::(anonymous namespace)::TypedRecordReader >::ReadSpacedForOptionalOrRepeated(long, long*, long*) (column_reader.cc:1957)
==4171200==    by 0xFC3A479: parquet::internal::(anonymous namespace)::TypedRecordReader >::ReadOptionalRecords(long, long*, long*) (column_reader.cc:1910)
==4171200==    by 0xFC31E31: parquet::internal::(anonymous namespace)::TypedRecordReader >::ReadRecordData(long) (column_reader.cc:1983)
==4171200==    by 0xFC2334E: parquet::internal::(anonymous namespace)::TypedRecordReader >::ReadRecords(long) (column_reader.cc:1453)
==4171200==    by 0xFB0D0D1: parquet::arrow::(anonymous namespace)::LeafReader::LoadBatch(long) (reader.cc:482)
==4171200==    by 0xFB23E38: parquet::arrow::ColumnReaderImpl::NextBatch(long, std::shared_ptr*) (reader.cc:109)
==4171200==    by 0xFB0B951: parquet::arrow::(anonymous namespace)::FileReaderImpl::ReadColumn(int, std::vector > const&, parquet::arrow::ColumnReader*, std::shared_ptr*) (reader.cc:284)
==4171200==    by 0xFB12A48: parquet::arrow::(anonymous namespace)::FileReaderImpl::DecodeRowGroups(std::shared_ptr, std::vector > const&, std::vector > const&, arrow::internal::Executor*)::{lambda(unsigned long, std::shared_ptr)#1}::operator()(unsigned long, std::shared_ptr) const (reader.cc:1253)
==4171200==    by 0xFB216A8: std::enable_if<((!std::is_void > >::value)&&(!arrow::detail::is_future::value))&&((!arrow::Future::is_empty)||std::is_same::value), void>::type arrow::detail::ContinueFuture::operator(), std::vector > const&, std::vector > const&, arrow::internal::Executor*)::{lambda(unsigned long, std::shared_ptr)#1}&, unsigned long&, std::shared_ptr&, arrow::Result >, arrow::Future >(arrow::detail::is_future, parquet::arrow::(anonymous namespace)::FileReaderImpl::DecodeRowGroups(std::shared_ptr, std::vector > const&, std::vector > const&, arrow::internal::Executor*)::{lambda(unsigned long, std::shared_ptr)#1}&, unsigned long&, std::shared_ptr&) const (future.h:150)
==4171200==    by 0xFB21166: void std::__invoke_impl >&, parquet::arrow::(anonymous namespace)::FileReaderImpl::DecodeRowGroups(std::shared_ptr, std::vector > const&, std::vector > const&, arrow::internal::Executor*)::{lambda(unsigned long, std::shared_ptr)#1}&, unsigned long&, std::shared_ptr&>(std::__invoke_other, arrow::detail::ContinueFuture&, arrow::Future >&, parquet::arrow::(anonymous namespace)::FileReaderImpl::DecodeRowGroups(std::shared_ptr, std::vector > const&, std::vector > const&, arrow::internal::Executor*)::{lambda(unsigned long, std::shared_ptr)#1}&, unsigned long&, std::shared_ptr&) (invoke.h:60)
==4171200==    by 0xFB20A0A: std::__invoke_result >&, parquet::arrow::(anonymous namespace)::FileReaderImpl::DecodeRowGroups(std::shared_ptr, std::vector > const&, std::vector > const&, arrow::internal::Executor*)::{lambda(unsigned long, std::shared_ptr)#1}&, unsigned long&, std::shared_ptr&>::type std::__invoke >&, parquet::arrow::(anonymous namespace)::FileReaderImpl::DecodeRowGroups(std::shared_ptr, std::vector > const&, std::vector > const&, arrow::internal::Executor*)::{lambda(unsigned long, std::shared_ptr)#1}&, unsigned long&, std::shared_ptr&>(std::__invoke_result&&, (arrow::detail::ContinueFuture&)...) (invoke.h:95)
==4171200==  Address 0x2b08fff8 is 168 bytes inside a block of size 176 free'd
==4171200==    at 0x483D1CF: operator delete(void*, unsigned long) (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==4171200==    by 0xFD047C2: parquet::(anonymous namespace)::DictByteArrayDecoderImpl::~DictByteArrayDecoderImpl() (encoding.cc:1887)
==4171200==    by 0xFC59125: std::default_delete > >::operator()(parquet::TypedDecoder >*) const (unique_ptr.h:81)
==4171200==    by 0xFC58709: std::unique_ptr >, std::default_delete > > >::~unique_ptr() (unique_ptr.h:292)
==4171200==    by 0xFC57B73: std::pair >, std::default_delete > > > >::~pair() (stl_pair.h:208)
==4171200==    by 0xFC57B97: void __gnu_cxx::new_allocator >, std::default_delete > > > >, false> >::destroy >, std::default_delete > > > > >(std::pair >, std::default_delete > > > >*) (new_allocator.h:152)
==4171200==    by 0xFC56896: void std::allocator_traits >, std::default_delete > > > >, false> > >::destroy >, std::default_delete > > > > >(std::allocator >, std::default_delete > > > >, false> >&, std::pair >, std::default_delete > > > >*) (alloc_traits.h:496)
==4171200==    by 0xFC55418: std::__detail::_Hashtable_alloc >, std::default_delete > > > >, false> > >::_M_deallocate_node(std::__detail::_Hash_node >, std::default_delete > > > >, false>*) (hashtable_policy.h:2102)
==4171200==    by 0xFC53C9D: std::__detail::_Hashtable_alloc >, std::default_delete > > > >, false> > >::_M_deallocate_nodes(std::__detail::_Hash_node >, std::default_delete > > > >, false>*) (hashtable_policy.h:2124)
==4171200==    by 0xFC51DA1: std::_Hashtable >, std::default_delete > > > >, std::allocator >, std::default_delete > > > > >, std::__detail::_Select1st, std::equal_to, std::hash, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits >::clear() (hashtable.h:2063)
==4171200==    by 0xFC621AF: std::unordered_map >, std::default_delete > > >, std::hash, std::equal_to, std::allocator >, std::default_delete > > > > > >::clear() (unordered_map.h:844)
==4171200==    by 0xFC334DD: parquet::internal::(anonymous namespace)::TypedRecordReader >::ResetDecoders() (column_reader.cc:1850)
==4171200==  Block was alloc'd at
==4171200==    at 0x483BE63: operator new(unsigned long) (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==4171200==    by 0xFCF657B: std::_MakeUniq::__single_object std::make_unique(parquet::ColumnDescriptor const*&, arrow::MemoryPool*&) (unique_ptr.h:857)
==4171200==    by 0xFCE9BF8: parquet::detail::MakeDictDecoder(parquet::Type::type, parquet::ColumnDescriptor const*, arrow::MemoryPool*) (encoding.cc:3865)
==4171200==    by 0xFC64D07: std::unique_ptr >, std::default_delete > > > parquet::MakeDictDecoder >(parquet::ColumnDescriptor const*, arrow::MemoryPool*) (encoding.h:456)
==4171200==    by 0xFC46045: parquet::(anonymous namespace)::ColumnReaderImplBase >::ConfigureDictionary(parquet::DictionaryPage const*) (column_reader.cc:772)
==4171200==    by 0xFC39F2F: parquet::(anonymous namespace)::ColumnReaderImplBase >::ReadNewPage() (column_reader.cc:727)
==4171200==    by 0xFC316F6: parquet::(anonymous namespace)::ColumnReaderImplBase >::HasNextInternal() (column_reader.cc:700)
==4171200==    by 0xFC230ED: parquet::internal::(anonymous namespace)::TypedRecordReader >::ReadRecords(long) (column_reader.cc:1409)
==4171200==    by 0xFB0D0D1: parquet::arrow::(anonymous namespace)::LeafReader::LoadBatch(long) (reader.cc:482)
==4171200==    by 0xFB23E38: parquet::arrow::ColumnReaderImpl::NextBatch(long, std::shared_ptr*) (reader.cc:109)
==4171200==    by 0xFB0B951: parquet::arrow::(anonymous namespace)::FileReaderImpl::ReadColumn(int, std::vector > const&, parquet::arrow::ColumnReader*, std::shared_ptr*) (reader.cc:284)
==4171200==    by 0xFB12A48: parquet::arrow::(anonymous namespace)::FileReaderImpl::DecodeRowGroups(std::shared_ptr, std::vector > const&, std::vector > const&, arrow::internal::Executor*)::{lambda(unsigned long, std::shared_ptr)#1}::operator()(unsigned long, std::shared_ptr) const (reader.cc:1253)
==4171200== 
pure virtual method called
terminate called without an active exception

ParquetReader.scan_contents() detects an error, so there's likely a missing validation in the code path followed by DecodeRowGroups() (the fix I propose in #41320 (comment) doesn't help):

Traceback (most recent call last):
  File "test.py", line 6, in <module>
    parquet_file.scan_contents()
  File "/home/even/arrow/python/build/lib.linux-x86_64-3.8/pyarrow/parquet/core.py", line 662, in scan_contents
    return self.reader.scan_contents(column_indices,
  File "pyarrow/_parquet.pyx", line 1702, in pyarrow._parquet.ParquetReader.scan_contents
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
OSError: Invalid or corrupted bit_width 254. Maximum allowed is 32.

Component(s)

C++, Parquet

@mapleFU
Member

mapleFU commented Apr 22, 2024

Which version of the code are you using? When I read this file I get "Invalid or corrupted bit_width". Did you select some columns during the read?

@rouault
Contributor Author

rouault commented Apr 22, 2024

@mapleFU
Reproducible with v15.0.0 and latest master at time of writing (16e20b7)

Did you select some columns during the read?

The API used selects all columns.

@mapleFU
Member

mapleFU commented Apr 22, 2024

Ah nice, that's probably an exception-safety problem. I'll dive into it.

@mapleFU
Member

mapleFU commented Apr 22, 2024

I've checked the C++ side with a sanitizer:

parquet-reader --debug crash-34fd88d625cc5fef893bcba62aad402883d98f47.parquet

This raises the error but doesn't cause the bad memory access. I guess the problem is in the PyArrow wrapper.

@rouault
Contributor Author

rouault commented Apr 22, 2024

I guess the problem is in PyArrow wrapper

No, it is not PyArrow-specific. It can also be reproduced using the plain C++ Parquet Arrow API arrow::RecordBatchReader::ReadNext(), as done in GDAL.

Pseudo-code:

std::unique_ptr<parquet::arrow::FileReader> arrow_reader;
auto poMemoryPool = std::shared_ptr<arrow::MemoryPool>(
    arrow::MemoryPool::CreateDefault().release());
parquet::arrow::OpenFile(std::move(infile), poMemoryPool.get(), &arrow_reader);
const int nNumGroups = arrow_reader->num_row_groups();
std::vector<int> anRowGroups;
for (int i = 0; i < nNumGroups; ++i)
    anRowGroups.push_back(i);
std::shared_ptr<arrow::RecordBatchReader> poRecordBatchReader;
arrow_reader->GetRecordBatchReader(anRowGroups, &poRecordBatchReader);
std::shared_ptr<arrow::RecordBatch> poBatch;
while (true)
{
    poRecordBatchReader->ReadNext(&poBatch);
    if (!poBatch)
        break;
}

@mapleFU
Member

mapleFU commented Apr 22, 2024

Hmm, would you mind checking the status returned by poRecordBatchReader->ReadNext and breaking when an error is detected? I tried the code below and it still doesn't leak:

  std::unique_ptr<parquet::arrow::FileReader> arrow_reader;
  ARROW_ASSIGN_OR_RAISE(arrow_reader, reader_builder.Build());

  std::shared_ptr<::arrow::RecordBatchReader> rb_reader;
  ARROW_RETURN_NOT_OK(arrow_reader->GetRecordBatchReader(&rb_reader));

  for (arrow::Result<std::shared_ptr<arrow::RecordBatch>> maybe_batch : *rb_reader) {
    // Operate on each batch...
    if (!maybe_batch.ok()) {
      std::cout << "Error reading batch: " << maybe_batch.status().message() << std::endl;
    } else {
      std::shared_ptr<arrow::RecordBatch> batch = maybe_batch.ValueOrDie();
      std::cout << "Read batch with " << batch->num_rows() << " rows" << std::endl;
    }
  }

@rouault
Contributor Author

rouault commented Apr 22, 2024

I tried the code below

By chance, can you share the source code of the standalone .cpp you used, so I can start from that? That would make it easier for me to tune it into a full reproducer.

I tried the code below and it still doesn't leak

This is not a leak, but a heap-use-after-free. From my understanding, the error happens in an auxiliary reading thread, but I'm not familiar enough with the libarrow/libparquet internals.

@mapleFU
Member

mapleFU commented Apr 22, 2024

@mapleFU
Member

mapleFU commented Apr 22, 2024

Oh, I've reproduced the problem. Let me fix it.

@mapleFU
Member

mapleFU commented Apr 22, 2024

The root cause of this bad memory access is clear; it doesn't happen when reading a valid Parquet file.

The corrupt file has two row groups:

RowGroup1: [Meta: 3 rows] [ Levels: empty ]
RowGroup2: [Meta: 3 rows] [ Levels: data ]

When decoding the first row group, num_values_ is 3 [1], but because the levels are empty, no records are decoded, and this is not checked [2], so it returns 0 rows. The reader then switches to the next row group [3]. During the switch, decoder_ is cleared [4], and because num_values_ is still non-zero, a new decoder is not created [5]. The subsequent read then accesses the freed decoder [6].

Fix: check that the number of levels read equals the row-group metadata when reading levels (this check exists when reading values, but reading levels doesn't perform it).

[1] https://github.com/apache/arrow/blob/main/cpp/src/parquet/column_reader.cc#L794
[2] https://github.com/apache/arrow/blob/main/cpp/src/parquet/column_reader.cc#L1390-L1426
[3] https://github.com/apache/arrow/blob/main/cpp/src/parquet/arrow/reader.cc#L472-L491
[4] https://github.com/apache/arrow/blob/main/cpp/src/parquet/column_reader.cc#L1802
[5] https://github.com/apache/arrow/blob/main/cpp/src/parquet/column_reader.cc#L699

@mapleFU
Member

mapleFU commented Apr 22, 2024

@rouault I've found the reason; see #41321 (comment). I'm a bit tired today and will fix it tomorrow. This does not happen when the file is not corrupt.

@mapleFU
Member

mapleFU commented Apr 22, 2024

take

@mapleFU
Member

mapleFU commented May 8, 2024

@rouault I've verified my fix works on this file. Besides, you could also upload the corrupt file to https://github.com/apache/parquet-testing/tree/master/bad_data, which would help other subprojects test against the same problem.

@rouault
Contributor Author

rouault commented May 8, 2024

you can also upload some corrupt file to https://github.com/apache/parquet-testing/tree/master/bad_data and this can help other subproject for testing the same problem

Submitted as apache/parquet-testing#48

mapleFU added a commit that referenced this issue May 21, 2024
### Rationale for this change

In #41321, a user reports a crash when reading a corrupt Parquet file. This happens because some checking is missing. The current code works when reading a normal Parquet file, but it needs to be stricter when reading a corrupt one.

**Currently this patch only strengthens the checking at the Parquet level layer; the corresponding value checks will be added in later patches.**

### What changes are included in this PR?

Stricter Parquet checks on levels.

### Are these changes tested?

Tests already exist; we may also want to add the Parquet file as a test file.

### Are there any user-facing changes?

Stricter checks.

* GitHub Issue: #41321

Lead-authored-by: mwish <maplewish117@gmail.com>
Co-authored-by: mwish <anmmscs_maple@qq.com>
Signed-off-by: mwish <maplewish117@gmail.com>
@mapleFU mapleFU added this to the 17.0.0 milestone May 21, 2024
@mapleFU
Member

mapleFU commented May 21, 2024

Issue resolved by pull request 41346
#41346

@mapleFU mapleFU closed this as completed May 21, 2024
vibhatha pushed a commit to vibhatha/arrow that referenced this issue May 25, 2024