
[C++][Parquet] Crash / heap-use-after-free in ByteArrayChunkedRecordReader::ReadValuesSpaced() on a corrupted Parquet file #41321

Closed
rouault opened this issue Apr 21, 2024 · 15 comments

Comments

@rouault
Contributor

rouault commented Apr 21, 2024

Describe the bug, including details regarding any error messages, version, and platform.

While fuzzing the GDAL Parquet reader with a local run of OSS-Fuzz, I got the following crash in ByteArrayChunkedRecordReader::ReadValuesSpaced() on the attached fuzzed Parquet file (to be unzipped first): crash-34fd88d625cc5fef893bcba62aad402883d98f47.zip

==14==ERROR: AddressSanitizer: heap-use-after-free on address 0x60f000046e58 at pc 0x000007a43e97 bp 0x7f926c00d7e0 sp 0x7f926c00d7d8
READ of size 8 at 0x60f000046e58 thread T6
SCARINESS: 51 (8-byte-read-heap-use-after-free)
    #0 0x7a43e96 in parquet::internal::(anonymous namespace)::ByteArrayChunkedRecordReader::ReadValuesSpaced(long, long) /src/gdal/arrow/cpp/src/parquet/column_reader.cc:2135:51
    #1 0x7a3d3e4 in ReadSpacedForOptionalOrRepeated /src/gdal/arrow/cpp/src/parquet/column_reader.cc:1914:5
    #2 0x7a3d3e4 in ReadOptionalRecords /src/gdal/arrow/cpp/src/parquet/column_reader.cc:1870:7
    #3 0x7a3d3e4 in parquet::internal::(anonymous namespace)::TypedRecordReader >::ReadRecordData(long) /src/gdal/arrow/cpp/src/parquet/column_reader.cc:1940:22
    #4 0x7a1fce0 in parquet::internal::(anonymous namespace)::TypedRecordReader >::ReadRecords(long) /src/gdal/arrow/cpp/src/parquet/column_reader.cc
    #5 0x78acf06 in parquet::arrow::(anonymous namespace)::LeafReader::LoadBatch(long) /src/gdal/arrow/cpp/src/parquet/arrow/reader.cc:482:46
    #6 0x78d07a5 in NextBatch /src/gdal/arrow/cpp/src/parquet/arrow/reader.cc:109:5
    #7 0x78d07a5 in operator() /src/gdal/arrow/cpp/src/parquet/arrow/reader.cc:1036:9
    #8 0x78d07a5 in operator()<(lambda at /src/gdal/arrow/cpp/src/parquet/arrow/reader.cc:1036:9) &, int &, arrow::Status, arrow::Future > /src/gdal/arrow/cpp/src/arrow/util/future.h:150:23
    #9 0x78d07a5 in __invoke &, (lambda at /src/gdal/arrow/cpp/src/parquet/arrow/reader.cc:1036:9) &, int &> /usr/local/bin/../include/c++/v1/type_traits:3592:23
    #10 0x78d07a5 in __apply_functor, (lambda at /src/gdal/arrow/cpp/src/parquet/arrow/reader.cc:1036:9), int>, 0UL, 1UL, 2UL, std::__1::tuple<> > /usr/local/bin/../include/c++/v1/__functional/bind.h:263:12
    #11 0x78d07a5 in operator()<> /usr/local/bin/../include/c++/v1/__functional/bind.h:298:20
    #12 0x78d07a5 in arrow::internal::FnOnce::FnImpl&, parquet::arrow::(anonymous namespace)::FileReaderImpl::GetRecordBatchReader(std::__1::vector > const&, std::__1::vector > const&, std::__1::unique_ptr >*)::$_1::operator()()::'lambda'(int)&, int&> >::invoke() /src/gdal/arrow/cpp/src/arrow/util/functional.h:152:42
    #13 0x66b0845 in operator() /src/gdal/arrow/cpp/src/arrow/util/functional.h:140:17
    #14 0x66b0845 in WorkerLoop /src/gdal/arrow/cpp/src/arrow/util/thread_pool.cc:457:11
    #15 0x66b0845 in operator() /src/gdal/arrow/cpp/src/arrow/util/thread_pool.cc:618:7
    #16 0x66b0845 in __invoke<(lambda at /src/gdal/arrow/cpp/src/arrow/util/thread_pool.cc:616:23)> /usr/local/bin/../include/c++/v1/type_traits:3592:23
    #17 0x66b0845 in __thread_execute >, (lambda at /src/gdal/arrow/cpp/src/arrow/util/thread_pool.cc:616:23)> /usr/local/bin/../include/c++/v1/thread:281:5
    #18 0x66b0845 in void* std::__1::__thread_proxy >, arrow::internal::ThreadPool::LaunchWorkersUnlocked(int)::$_3> >(void*) /usr/local/bin/../include/c++/v1/thread:292:5
    #19 0x7f9272659608 in start_thread (/lib/x86_64-linux-gnu/libpthread.so.0+0x8608) (BuildId: 0c044ba611aeeeaebb8374e660061f341ebc0bac)
    #20 0x7f927240a352 in __clone (/lib/x86_64-linux-gnu/libc.so.6+0x11f352) (BuildId: eebe5d5f4b608b8a53ec446b63981bba373ca0ca)
DEDUP_TOKEN: parquet::internal::(anonymous namespace)::ByteArrayChunkedRecordReader::ReadValuesSpaced(long, long)--ReadSpacedForOptionalOrRepeated--ReadOptionalRecords
0x60f000046e58 is located 168 bytes inside of 176-byte region [0x60f000046db0,0x60f000046e60)
freed by thread T6 here:
    #0 0x5fb33d in operator delete(void*) /src/llvm-project/compiler-rt/lib/asan/asan_new_delete.cpp:152:3
    #1 0x7a245a9 in operator() /usr/local/bin/../include/c++/v1/__memory/unique_ptr.h:53:5
    #2 0x7a245a9 in reset /usr/local/bin/../include/c++/v1/__memory/unique_ptr.h:314:7
    #3 0x7a245a9 in ~unique_ptr /usr/local/bin/../include/c++/v1/__memory/unique_ptr.h:268:19
    #4 0x7a245a9 in ~pair /usr/local/bin/../include/c++/v1/__utility/pair.h:40:29
    #5 0x7a245a9 in destroy >, std::__1::default_delete > > > >, void, void> /usr/local/bin/../include/c++/v1/__memory/allocator_traits.h:319:15
    #6 0x7a245a9 in __deallocate_node /usr/local/bin/../include/c++/v1/__hash_table:1572:9
    #7 0x7a245a9 in clear /usr/local/bin/../include/c++/v1/__hash_table:1818:9
    #8 0x7a245a9 in clear /usr/local/bin/../include/c++/v1/unordered_map:1346:42
    #9 0x7a245a9 in ResetDecoders /src/gdal/arrow/cpp/src/parquet/column_reader.cc:1810:42
    #10 0x7a245a9 in SetPageReader /src/gdal/arrow/cpp/src/parquet/column_reader.cc:1802:5
    #11 0x7a245a9 in virtual thunk to parquet::internal::(anonymous namespace)::TypedRecordReader >::SetPageReader(std::__1::unique_ptr >) /src/gdal/arrow/cpp/src/parquet/column_reader.cc
    #12 0x78abf1d in parquet::arrow::(anonymous namespace)::LeafReader::NextRowGroup() /src/gdal/arrow/cpp/src/parquet/arrow/reader.cc:506:21
    #13 0x78acf3e in parquet::arrow::(anonymous namespace)::LeafReader::LoadBatch(long) /src/gdal/arrow/cpp/src/parquet/arrow/reader.cc:485:9
    #14 0x78d07a5 in NextBatch /src/gdal/arrow/cpp/src/parquet/arrow/reader.cc:109:5
    #15 0x78d07a5 in operator() /src/gdal/arrow/cpp/src/parquet/arrow/reader.cc:1036:9
    #16 0x78d07a5 in operator()<(lambda at /src/gdal/arrow/cpp/src/parquet/arrow/reader.cc:1036:9) &, int &, arrow::Status, arrow::Future > /src/gdal/arrow/cpp/src/arrow/util/future.h:150:23
    #17 0x78d07a5 in __invoke &, (lambda at /src/gdal/arrow/cpp/src/parquet/arrow/reader.cc:1036:9) &, int &> /usr/local/bin/../include/c++/v1/type_traits:3592:23
    #18 0x78d07a5 in __apply_functor, (lambda at /src/gdal/arrow/cpp/src/parquet/arrow/reader.cc:1036:9), int>, 0UL, 1UL, 2UL, std::__1::tuple<> > /usr/local/bin/../include/c++/v1/__functional/bind.h:263:12
    #19 0x78d07a5 in operator()<> /usr/local/bin/../include/c++/v1/__functional/bind.h:298:20
    #20 0x78d07a5 in arrow::internal::FnOnce::FnImpl&, parquet::arrow::(anonymous namespace)::FileReaderImpl::GetRecordBatchReader(std::__1::vector > const&, std::__1::vector > const&, std::__1::unique_ptr >*)::$_1::operator()()::'lambda'(int)&, int&> >::invoke() /src/gdal/arrow/cpp/src/arrow/util/functional.h:152:42
    #21 0x66b0845 in operator() /src/gdal/arrow/cpp/src/arrow/util/functional.h:140:17
    #22 0x66b0845 in WorkerLoop /src/gdal/arrow/cpp/src/arrow/util/thread_pool.cc:457:11
    #23 0x66b0845 in operator() /src/gdal/arrow/cpp/src/arrow/util/thread_pool.cc:618:7
    #24 0x66b0845 in __invoke<(lambda at /src/gdal/arrow/cpp/src/arrow/util/thread_pool.cc:616:23)> /usr/local/bin/../include/c++/v1/type_traits:3592:23
    #25 0x66b0845 in __thread_execute >, (lambda at /src/gdal/arrow/cpp/src/arrow/util/thread_pool.cc:616:23)> /usr/local/bin/../include/c++/v1/thread:281:5
    #26 0x66b0845 in void* std::__1::__thread_proxy >, arrow::internal::ThreadPool::LaunchWorkersUnlocked(int)::$_3> >(void*) /usr/local/bin/../include/c++/v1/thread:292:5
    #27 0x7f9272659608 in start_thread (/lib/x86_64-linux-gnu/libpthread.so.0+0x8608) (BuildId: 0c044ba611aeeeaebb8374e660061f341ebc0bac)

The bug isn't specific to the GDAL integration and can be reproduced with this simple pyarrow.parquet-based script:

import pyarrow.parquet as pq
parquet_file = pq.ParquetFile('crash-34fd88d625cc5fef893bcba62aad402883d98f47')
parquet_file.read()
==4171200== Invalid read of size 8
==4171200==    at 0xFBFDE22: parquet::internal::(anonymous namespace)::ByteArrayChunkedRecordReader::ReadValuesSpaced(long, long) (column_reader.cc:2180)
==4171200==    by 0xFC472DE: parquet::internal::(anonymous namespace)::TypedRecordReader >::ReadSpacedForOptionalOrRepeated(long, long*, long*) (column_reader.cc:1957)
==4171200==    by 0xFC3A479: parquet::internal::(anonymous namespace)::TypedRecordReader >::ReadOptionalRecords(long, long*, long*) (column_reader.cc:1910)
==4171200==    by 0xFC31E31: parquet::internal::(anonymous namespace)::TypedRecordReader >::ReadRecordData(long) (column_reader.cc:1983)
==4171200==    by 0xFC2334E: parquet::internal::(anonymous namespace)::TypedRecordReader >::ReadRecords(long) (column_reader.cc:1453)
==4171200==    by 0xFB0D0D1: parquet::arrow::(anonymous namespace)::LeafReader::LoadBatch(long) (reader.cc:482)
==4171200==    by 0xFB23E38: parquet::arrow::ColumnReaderImpl::NextBatch(long, std::shared_ptr*) (reader.cc:109)
==4171200==    by 0xFB0B951: parquet::arrow::(anonymous namespace)::FileReaderImpl::ReadColumn(int, std::vector > const&, parquet::arrow::ColumnReader*, std::shared_ptr*) (reader.cc:284)
==4171200==    by 0xFB12A48: parquet::arrow::(anonymous namespace)::FileReaderImpl::DecodeRowGroups(std::shared_ptr, std::vector > const&, std::vector > const&, arrow::internal::Executor*)::{lambda(unsigned long, std::shared_ptr)#1}::operator()(unsigned long, std::shared_ptr) const (reader.cc:1253)
==4171200==    by 0xFB216A8: std::enable_if<((!std::is_void > >::value)&&(!arrow::detail::is_future::value))&&((!arrow::Future::is_empty)||std::is_same::value), void>::type arrow::detail::ContinueFuture::operator(), std::vector > const&, std::vector > const&, arrow::internal::Executor*)::{lambda(unsigned long, std::shared_ptr)#1}&, unsigned long&, std::shared_ptr&, arrow::Result >, arrow::Future >(arrow::detail::is_future, parquet::arrow::(anonymous namespace)::FileReaderImpl::DecodeRowGroups(std::shared_ptr, std::vector > const&, std::vector > const&, arrow::internal::Executor*)::{lambda(unsigned long, std::shared_ptr)#1}&, unsigned long&, std::shared_ptr&) const (future.h:150)
==4171200==    by 0xFB21166: void std::__invoke_impl >&, parquet::arrow::(anonymous namespace)::FileReaderImpl::DecodeRowGroups(std::shared_ptr, std::vector > const&, std::vector > const&, arrow::internal::Executor*)::{lambda(unsigned long, std::shared_ptr)#1}&, unsigned long&, std::shared_ptr&>(std::__invoke_other, arrow::detail::ContinueFuture&, arrow::Future >&, parquet::arrow::(anonymous namespace)::FileReaderImpl::DecodeRowGroups(std::shared_ptr, std::vector > const&, std::vector > const&, arrow::internal::Executor*)::{lambda(unsigned long, std::shared_ptr)#1}&, unsigned long&, std::shared_ptr&) (invoke.h:60)
==4171200==    by 0xFB20A0A: std::__invoke_result >&, parquet::arrow::(anonymous namespace)::FileReaderImpl::DecodeRowGroups(std::shared_ptr, std::vector > const&, std::vector > const&, arrow::internal::Executor*)::{lambda(unsigned long, std::shared_ptr)#1}&, unsigned long&, std::shared_ptr&>::type std::__invoke >&, parquet::arrow::(anonymous namespace)::FileReaderImpl::DecodeRowGroups(std::shared_ptr, std::vector > const&, std::vector > const&, arrow::internal::Executor*)::{lambda(unsigned long, std::shared_ptr)#1}&, unsigned long&, std::shared_ptr&>(std::__invoke_result&&, (arrow::detail::ContinueFuture&)...) (invoke.h:95)
==4171200==  Address 0x2b08fff8 is 168 bytes inside a block of size 176 free'd
==4171200==    at 0x483D1CF: operator delete(void*, unsigned long) (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==4171200==    by 0xFD047C2: parquet::(anonymous namespace)::DictByteArrayDecoderImpl::~DictByteArrayDecoderImpl() (encoding.cc:1887)
==4171200==    by 0xFC59125: std::default_delete > >::operator()(parquet::TypedDecoder >*) const (unique_ptr.h:81)
==4171200==    by 0xFC58709: std::unique_ptr >, std::default_delete > > >::~unique_ptr() (unique_ptr.h:292)
==4171200==    by 0xFC57B73: std::pair >, std::default_delete > > > >::~pair() (stl_pair.h:208)
==4171200==    by 0xFC57B97: void __gnu_cxx::new_allocator >, std::default_delete > > > >, false> >::destroy >, std::default_delete > > > > >(std::pair >, std::default_delete > > > >*) (new_allocator.h:152)
==4171200==    by 0xFC56896: void std::allocator_traits >, std::default_delete > > > >, false> > >::destroy >, std::default_delete > > > > >(std::allocator >, std::default_delete > > > >, false> >&, std::pair >, std::default_delete > > > >*) (alloc_traits.h:496)
==4171200==    by 0xFC55418: std::__detail::_Hashtable_alloc >, std::default_delete > > > >, false> > >::_M_deallocate_node(std::__detail::_Hash_node >, std::default_delete > > > >, false>*) (hashtable_policy.h:2102)
==4171200==    by 0xFC53C9D: std::__detail::_Hashtable_alloc >, std::default_delete > > > >, false> > >::_M_deallocate_nodes(std::__detail::_Hash_node >, std::default_delete > > > >, false>*) (hashtable_policy.h:2124)
==4171200==    by 0xFC51DA1: std::_Hashtable >, std::default_delete > > > >, std::allocator >, std::default_delete > > > > >, std::__detail::_Select1st, std::equal_to, std::hash, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits >::clear() (hashtable.h:2063)
==4171200==    by 0xFC621AF: std::unordered_map >, std::default_delete > > >, std::hash, std::equal_to, std::allocator >, std::default_delete > > > > > >::clear() (unordered_map.h:844)
==4171200==    by 0xFC334DD: parquet::internal::(anonymous namespace)::TypedRecordReader >::ResetDecoders() (column_reader.cc:1850)
==4171200==  Block was alloc'd at
==4171200==    at 0x483BE63: operator new(unsigned long) (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==4171200==    by 0xFCF657B: std::_MakeUniq::__single_object std::make_unique(parquet::ColumnDescriptor const*&, arrow::MemoryPool*&) (unique_ptr.h:857)
==4171200==    by 0xFCE9BF8: parquet::detail::MakeDictDecoder(parquet::Type::type, parquet::ColumnDescriptor const*, arrow::MemoryPool*) (encoding.cc:3865)
==4171200==    by 0xFC64D07: std::unique_ptr >, std::default_delete > > > parquet::MakeDictDecoder >(parquet::ColumnDescriptor const*, arrow::MemoryPool*) (encoding.h:456)
==4171200==    by 0xFC46045: parquet::(anonymous namespace)::ColumnReaderImplBase >::ConfigureDictionary(parquet::DictionaryPage const*) (column_reader.cc:772)
==4171200==    by 0xFC39F2F: parquet::(anonymous namespace)::ColumnReaderImplBase >::ReadNewPage() (column_reader.cc:727)
==4171200==    by 0xFC316F6: parquet::(anonymous namespace)::ColumnReaderImplBase >::HasNextInternal() (column_reader.cc:700)
==4171200==    by 0xFC230ED: parquet::internal::(anonymous namespace)::TypedRecordReader >::ReadRecords(long) (column_reader.cc:1409)
==4171200==    by 0xFB0D0D1: parquet::arrow::(anonymous namespace)::LeafReader::LoadBatch(long) (reader.cc:482)
==4171200==    by 0xFB23E38: parquet::arrow::ColumnReaderImpl::NextBatch(long, std::shared_ptr*) (reader.cc:109)
==4171200==    by 0xFB0B951: parquet::arrow::(anonymous namespace)::FileReaderImpl::ReadColumn(int, std::vector > const&, parquet::arrow::ColumnReader*, std::shared_ptr*) (reader.cc:284)
==4171200==    by 0xFB12A48: parquet::arrow::(anonymous namespace)::FileReaderImpl::DecodeRowGroups(std::shared_ptr, std::vector > const&, std::vector > const&, arrow::internal::Executor*)::{lambda(unsigned long, std::shared_ptr)#1}::operator()(unsigned long, std::shared_ptr) const (reader.cc:1253)
==4171200== 
pure virtual method called
terminate called without an active exception

ParquetReader.scan_contents() detects an error, so there's likely a missing validation in the code path followed by DecodeRowGroups() (the fix I propose in #41320 (comment) doesn't help):

Traceback (most recent call last):
  File "test.py", line 6, in <module>
    parquet_file.scan_contents()
  File "/home/even/arrow/python/build/lib.linux-x86_64-3.8/pyarrow/parquet/core.py", line 662, in scan_contents
    return self.reader.scan_contents(column_indices,
  File "pyarrow/_parquet.pyx", line 1702, in pyarrow._parquet.ParquetReader.scan_contents
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
OSError: Invalid or corrupted bit_width 254. Maximum allowed is 32.

Component(s)

C++, Parquet

@mapleFU
Member

mapleFU commented Apr 22, 2024

Which version of the code are you using? When I read this file I get "Invalid or corrupted bit_width". Did you select some columns during the read?

@rouault
Contributor Author

rouault commented Apr 22, 2024

@mapleFU
Reproducible with v15.0.0 and latest master at time of writing (16e20b7)

Did you select some columns during the read?

The API used selects all columns.

@mapleFU
Member

mapleFU commented Apr 22, 2024

Ah nice, that's probably an exception-safety problem. I'll dive into it.

@mapleFU
Member

mapleFU commented Apr 22, 2024

I've checked the C++ side with a sanitizer:

parquet-reader --debug crash-34fd88d625cc5fef893bcba62aad402883d98f47.parquet

This raises the error but doesn't cause the bad memory access. I guess the problem is in the PyArrow wrapper.

@rouault
Contributor Author

rouault commented Apr 22, 2024

I guess the problem is in PyArrow wrapper

No, it is not PyArrow-specific. It can also be reproduced using the plain C++ Parquet Arrow API arrow::RecordBatchReader::ReadNext(), as done in GDAL.

Pseudo-code:

std::unique_ptr<parquet::arrow::FileReader> arrow_reader;
auto poMemoryPool = std::shared_ptr<arrow::MemoryPool>(
    arrow::MemoryPool::CreateDefault().release());
parquet::arrow::OpenFile(std::move(infile), poMemoryPool.get(), &arrow_reader);
const int nNumGroups = arrow_reader->num_row_groups();
std::vector<int> anRowGroups;
for (int i = 0; i < nNumGroups; ++i)
    anRowGroups.push_back(i);
std::shared_ptr<arrow::RecordBatchReader> poRecordBatchReader;
arrow_reader->GetRecordBatchReader(anRowGroups, &poRecordBatchReader);
std::shared_ptr<arrow::RecordBatch> poBatch;
while (true)
{
    poRecordBatchReader->ReadNext(&poBatch);
    if (!poBatch)
        break;
}

@mapleFU
Member

mapleFU commented Apr 22, 2024

Hmm, would you mind checking the status returned by poRecordBatchReader->ReadNext and breaking when an error is detected? I tried the code below and it still doesn't leak:

  std::unique_ptr<parquet::arrow::FileReader> arrow_reader;
  ARROW_ASSIGN_OR_RAISE(arrow_reader, reader_builder.Build());

  std::shared_ptr<::arrow::RecordBatchReader> rb_reader;
  ARROW_RETURN_NOT_OK(arrow_reader->GetRecordBatchReader(&rb_reader));

  for (arrow::Result<std::shared_ptr<arrow::RecordBatch>> maybe_batch : *rb_reader) {
    // Operate on each batch...
    if (!maybe_batch.ok()) {
      std::cout << "Error reading batch: " << maybe_batch.status().message() << std::endl;
    } else {
      std::shared_ptr<arrow::RecordBatch> batch = maybe_batch.ValueOrDie();
      std::cout << "Read batch with " << batch->num_rows() << " rows" << std::endl;
    }
  }

@rouault
Contributor Author

rouault commented Apr 22, 2024

I tried the code below

By chance, can you share the source code of the standalone .cpp you used, so I can start from that? That would make it easier for me to tune it into a full reproducer.

I tried the code below and it still doesn't leak

This is not a leak, but a heap-use-after-free. From my understanding, the error happens in an auxiliary reading thread, but I'm not familiar enough with the libarrow/libparquet internals.

@mapleFU
Member

mapleFU commented Apr 22, 2024

@mapleFU
Member

mapleFU commented Apr 22, 2024

Oh, I've reproduced the problem. Let me fix it.

@mapleFU
Member

mapleFU commented Apr 22, 2024

The root cause of this bad memory access is clear; it doesn't happen when reading a valid Parquet file.

The corrupt file has two row groups:

RowGroup1: [Meta: 3 rows] [ Levels: empty ]
RowGroup2: [Meta: 3 rows] [ Levels: data ]

When decoding the first row group, num_values_ is 3 [1], but because the levels are empty, no records are decoded, and this is not checked [2], so it returns 0 rows. The reader then switches to the next row group [3]. During the switch, decoder_ is cleared [4], and because num_values_ is still non-zero, a new decoder is not created [5]. The subsequent read then accesses the freed decoder [6].

Fix: check that the number of levels read equals the row-group metadata when reading levels (this check exists when reading values, but reading levels doesn't perform it).

[1] https://github.com/apache/arrow/blob/main/cpp/src/parquet/column_reader.cc#L794
[2] https://github.com/apache/arrow/blob/main/cpp/src/parquet/column_reader.cc#L1390-L1426
[3] https://github.com/apache/arrow/blob/main/cpp/src/parquet/arrow/reader.cc#L472-L491
[4] https://github.com/apache/arrow/blob/main/cpp/src/parquet/column_reader.cc#L1802
[5] https://github.com/apache/arrow/blob/main/cpp/src/parquet/column_reader.cc#L699

@mapleFU
Member

mapleFU commented Apr 22, 2024

@rouault I've found the reason; see #41321 (comment). I'm a bit tired today and will fix it tomorrow. This does not happen when the file is not corrupt.

@mapleFU
Member

mapleFU commented Apr 22, 2024

take

@mapleFU
Member

mapleFU commented May 8, 2024

@rouault I've verified my fix works on this file. Besides, you could also upload the corrupt file to https://github.com/apache/parquet-testing/tree/master/bad_data, which would help other subprojects test against the same problem.

@rouault
Contributor Author

rouault commented May 8, 2024

you can also upload some corrupt file to https://github.com/apache/parquet-testing/tree/master/bad_data and this can help other subproject for testing the same problem

Submitted as apache/parquet-testing#48

mapleFU added a commit that referenced this issue May 21, 2024
### Rationale for this change

In #41321, a user reports a crash when reading a corrupt Parquet file. This happens because some checking is missing. The current code works when reading a normal Parquet file, but it needs to be stricter when reading a corrupt one.

**Currently this patch only strengthens the checking at the Parquet level layer; the corresponding value checks will be added in later patches.**

### What changes are included in this PR?

Stricter Parquet checks on levels.

### Are these changes tested?

Tests already exist; we may also want to add the Parquet file as a test file.

### Are there any user-facing changes?

Stricter checks.

* GitHub Issue: #41321

Lead-authored-by: mwish <maplewish117@gmail.com>
Co-authored-by: mwish <anmmscs_maple@qq.com>
Signed-off-by: mwish <maplewish117@gmail.com>
@mapleFU mapleFU added this to the 17.0.0 milestone May 21, 2024
@mapleFU
Member

mapleFU commented May 21, 2024

Issue resolved by pull request 41346
#41346

@mapleFU mapleFU closed this as completed May 21, 2024
vibhatha pushed a commit to vibhatha/arrow that referenced this issue May 25, 2024