
[C++] [Parquet] Crash / heap-buffer-overflow in TableBatchReader::ReadNext() on a corrupted Parquet file #41317

Closed
rouault opened this issue Apr 21, 2024 · 2 comments



rouault commented Apr 21, 2024

Describe the bug, including details regarding any error messages, version, and platform.

While fuzzing the GDAL Parquet reader with a local run of OSS-Fuzz, I got the following crash in TableBatchReader::ReadNext() on the attached fuzzed Parquet file (to be unzipped first): crash-581a7ec06da982291398aa3f63361ecb69fe20fc.zip

==14==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x602000063500 at pc 0x000002c7899c bp 0x7ffdf52157f0 sp 0x7ffdf52157e8
READ of size 8 at 0x602000063500 thread T0
SCARINESS: 23 (8-byte-read-heap-buffer-overflow)
    #0 0x2c7899b in get /usr/local/bin/../include/c++/v1/__memory/shared_ptr.h:801:16
    #1 0x2c7899b in arrow::TableBatchReader::ReadNext(std::__1::shared_ptr*) /src/gdal/arrow/cpp/src/arrow/table.cc:652:60
    #2 0x36ca184 in arrow::RecordBatchReader::Next() /src/gdal/arrow/cpp/src/arrow/record_batch.h:302:5
    #3 0x78d182e in operator() /src/gdal/arrow/cpp/src/parquet/arrow/reader.cc:1051:58
    #4 0x78d182e in Next /src/gdal/arrow/cpp/src/arrow/util/iterator.h:346:29
    #5 0x78d182e in arrow::Result > arrow::Iterator >::Next > const&, std::__1::vector > const&, std::__1::unique_ptr >*)::$_1::operator()()::'lambda'(), std::__1::shared_ptr > >(void*) /src/gdal/arrow/cpp/src/arrow/util/iterator.h:200:40
    #6 0x78d1fd9 in Next /src/gdal/arrow/cpp/src/arrow/util/iterator.h:110:29
    #7 0x78d1fd9 in arrow::FlattenIterator >::Next() /src/gdal/arrow/cpp/src/arrow/util/iterator.h:541:5
    #8 0x78d1e67 in arrow::Result > arrow::Iterator >::Next > >(void*) /src/gdal/arrow/cpp/src/arrow/util/iterator.h:200:40
    #9 0x78cbe47 in Next /src/gdal/arrow/cpp/src/arrow/util/iterator.h:110:29
    #10 0x78cbe47 in parquet::arrow::(anonymous namespace)::RowGroupRecordBatchReader::ReadNext(std::__1::shared_ptr*) /src/gdal/arrow/cpp/src/parquet/arrow/reader.cc:392:21
    #11 0xa1d0ec in OGRParquetLayer::ReadNextBatch() /src/gdal/ogr/ogrsf_frmts/parquet/ogrparquetlayer.cpp:1782:46
    #12 0x9a7f7f in OGRArrowLayer::GetNextRawFeature() /src/gdal/ogr/ogrsf_frmts/parquet/../arrow_common/ograrrowlayer.hpp:3973:19
    #13 0x97ffc7 in GetNextFeature /src/gdal/ogr/ogrsf_frmts/ogrsf_frmts.h:466:45
    #14 0x97ffc7 in OGRArrowLayer::GetNextFeature() /src/gdal/ogr/ogrsf_frmts/parquet/../arrow_common/ogr_arrow.h:269:5
    #15 0xb462d4 in OGR_L_GetNextFeature /src/gdal/ogr/ogrsf_frmts/generic/ogrlayer.cpp:650:63
    #16 0x5fd3a0 in LLVMFuzzerTestOneInput /src/gdal/./fuzzers/ogr_fuzzer.cpp:155:30
    #17 0x4ceb03 in fuzzer::Fuzzer::ExecuteCallback(unsigned char const*, unsigned long) /src/llvm-project/compiler-rt/lib/fuzzer/FuzzerLoop.cpp:611:15
    #18 0x4ce2ea in fuzzer::Fuzzer::RunOne(unsigned char const*, unsigned long, bool, fuzzer::InputInfo*, bool, bool*) /src/llvm-project/compiler-rt/lib/fuzzer/FuzzerLoop.cpp:514:3
    #19 0x4cf9b9 in fuzzer::Fuzzer::MutateAndTestOne() /src/llvm-project/compiler-rt/lib/fuzzer/FuzzerLoop.cpp:757:19
    #20 0x4d0685 in fuzzer::Fuzzer::Loop(std::__Fuzzer::vector >&) /src/llvm-project/compiler-rt/lib/fuzzer/FuzzerLoop.cpp:895:5
    #21 0x4bf9ef in fuzzer::FuzzerDriver(int*, char***, int (*)(unsigned char const*, unsigned long)) /src/llvm-project/compiler-rt/lib/fuzzer/FuzzerDriver.cpp:912:6
    #22 0x4e9042 in main /src/llvm-project/compiler-rt/lib/fuzzer/FuzzerMain.cpp:20:10
    #23 0x7f4f72347082 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x24082) (BuildId: eebe5d5f4b608b8a53ec446b63981bba373ca0ca)
    #24 0x4b042d in _start (/out/parquet_fuzzer+0x4b042d)
DEDUP_TOKEN: get--arrow::TableBatchReader::ReadNext(std::__1::shared_ptr*)--arrow::RecordBatchReader::Next()
0x602000063500 is located 0 bytes to the right of 16-byte region [0x6020000634f0,0x602000063500)
allocated by thread T4 here:
    #0 0x5faadd in operator new(unsigned long) /src/llvm-project/compiler-rt/lib/asan/asan_new_delete.cpp:95:3
    #1 0x2c89571 in __libcpp_operator_new /usr/local/bin/../include/c++/v1/new:245:10
    #2 0x2c89571 in __libcpp_allocate /usr/local/bin/../include/c++/v1/new:271:10
    #3 0x2c89571 in allocate /usr/local/bin/../include/c++/v1/__memory/allocator.h:105:38
    #4 0x2c89571 in allocate /usr/local/bin/../include/c++/v1/__memory/allocator_traits.h:262:20
    #5 0x2c89571 in __vallocate /usr/local/bin/../include/c++/v1/vector:931:37
    #6 0x2c89571 in vector /usr/local/bin/../include/c++/v1/vector:1236:9
    #7 0x2c89571 in arrow::ChunkedArray::ChunkedArray(std::__1::shared_ptr) /src/gdal/arrow/cpp/src/arrow/chunked_array.h:80:22
    #8 0x7991577 in __shared_ptr_emplace > /usr/local/bin/../include/c++/v1/__memory/shared_ptr.h:294:37
    #9 0x7991577 in std::__1::shared_ptr std::__1::allocate_shared, std::__1::shared_ptr, void>(std::__1::allocator const&, std::__1::shared_ptr&&) /usr/local/bin/../include/c++/v1/__memory/shared_ptr.h:953:55
    #10 0x7947a65 in make_shared, void> /usr/local/bin/../include/c++/v1/__memory/shared_ptr.h:962:12
    #11 0x7947a65 in parquet::arrow::TransferColumnData(parquet::internal::RecordReader*, std::__1::shared_ptr const&, parquet::ColumnDescriptor const*, arrow::MemoryPool*, std::__1::shared_ptr*) /src/gdal/arrow/cpp/src/parquet/arrow/reader_internal.cc:873:12
    #12 0x78ac9c3 in parquet::arrow::(anonymous namespace)::LeafReader::LoadBatch(long) /src/gdal/arrow/cpp/src/parquet/arrow/reader.cc:488:5
    #13 0x78d0155 in NextBatch /src/gdal/arrow/cpp/src/parquet/arrow/reader.cc:109:5
    #14 0x78d0155 in operator() /src/gdal/arrow/cpp/src/parquet/arrow/reader.cc:1036:9
    #15 0x78d0155 in operator()<(lambda at /src/gdal/arrow/cpp/src/parquet/arrow/reader.cc:1036:9) &, int &, arrow::Status, arrow::Future > /src/gdal/arrow/cpp/src/arrow/util/future.h:150:23
    #16 0x78d0155 in __invoke &, (lambda at /src/gdal/arrow/cpp/src/parquet/arrow/reader.cc:1036:9) &, int &> /usr/local/bin/../include/c++/v1/type_traits:3592:23
    #17 0x78d0155 in __apply_functor, (lambda at /src/gdal/arrow/cpp/src/parquet/arrow/reader.cc:1036:9), int>, 0UL, 1UL, 2UL, std::__1::tuple<> > /usr/local/bin/../include/c++/v1/__functional/bind.h:263:12
    #18 0x78d0155 in operator()<> /usr/local/bin/../include/c++/v1/__functional/bind.h:298:20
    #19 0x78d0155 in arrow::internal::FnOnce::FnImpl&, parquet::arrow::(anonymous namespace)::FileReaderImpl::GetRecordBatchReader(std::__1::vector > const&, std::__1::vector > const&, std::__1::unique_ptr >*)::$_1::operator()()::'lambda'(int)&, int&> >::invoke() /src/gdal/arrow/cpp/src/arrow/util/functional.h:152:42
    #20 0x66b01f5 in operator() /src/gdal/arrow/cpp/src/arrow/util/functional.h:140:17
    #21 0x66b01f5 in WorkerLoop /src/gdal/arrow/cpp/src/arrow/util/thread_pool.cc:457:11
    #22 0x66b01f5 in operator() /src/gdal/arrow/cpp/src/arrow/util/thread_pool.cc:618:7
    #23 0x66b01f5 in __invoke<(lambda at /src/gdal/arrow/cpp/src/arrow/util/thread_pool.cc:616:23)> /usr/local/bin/../include/c++/v1/type_traits:3592:23
    #24 0x66b01f5 in __thread_execute >, (lambda at /src/gdal/arrow/cpp/src/arrow/util/thread_pool.cc:616:23)> /usr/local/bin/../include/c++/v1/thread:281:5
    #25 0x66b01f5 in void* std::__1::__thread_proxy >, arrow::internal::ThreadPool::LaunchWorkersUnlocked(int)::$_3> >(void*) /usr/local/bin/../include/c++/v1/thread:292:5
    #26 0x7f4f72691608 in start_thread (/lib/x86_64-linux-gnu/libpthread.so.0+0x8608) (BuildId: 0c044ba611aeeeaebb8374e660061f341ebc0bac)

The bug isn't specific to the GDAL integration and can be reproduced with this simple pyarrow.parquet-based script:

```
import pyarrow.parquet as pq
parquet_file = pq.ParquetFile('crash-581a7ec06da982291398aa3f63361ecb69fe20fc')
[_ for _ in parquet_file.iter_batches()]
```

which, under Valgrind, produces:

==4082742== Invalid read of size 8
==4082742==    at 0x71357DB: arrow::TableBatchReader::ReadNext(std::shared_ptr*) (in /home/even/install-arrow/lib/libarrow.so.1500.0.0)
==4082742==    by 0xD36D07A: arrow::Result > arrow::Iterator >::Next > const&, std::vector > const&, std::unique_ptr >*)::{lambda()#1}::operator()()::{lambda()#2}, std::shared_ptr > >(void*) (in /home/even/install-arrow/lib/libparquet.so.1500.0.0)
==4082742==    by 0xD381717: arrow::FlattenIterator >::Next() (in /home/even/install-arrow/lib/libparquet.so.1500.0.0)
==4082742==    by 0xD381921: arrow::Result > arrow::Iterator >::Next > >(void*) (in /home/even/install-arrow/lib/libparquet.so.1500.0.0)
==4082742==    by 0xD36F8B1: parquet::arrow::(anonymous namespace)::RowGroupRecordBatchReader::ReadNext(std::shared_ptr*) (in /home/even/install-arrow/lib/libparquet.so.1500.0.0)
==4082742==    by 0xD212735: __pyx_gb_7pyarrow_8_parquet_13ParquetReader_10generator(__pyx_CoroutineObject*, _ts*, _object*) (in /home/even/arrow/python/build/lib.linux-x86_64-3.8/pyarrow/_parquet.cpython-38-x86_64-linux-gnu.so)
==4082742==    by 0x6564152: __Pyx_Coroutine_SendEx(__pyx_CoroutineObject*, _object*, int) [clone .isra.0] (in /home/even/arrow/python/build/lib.linux-x86_64-3.8/pyarrow/lib.cpython-38-x86_64-linux-gnu.so)
==4082742==    by 0x547634: _PyEval_EvalFrameDefault (in /usr/bin/python3.8)
==4082742==    by 0x5D5845: _PyFunction_Vectorcall (in /usr/bin/python3.8)
==4082742==    by 0x547264: _PyEval_EvalFrameDefault (in /usr/bin/python3.8)
==4082742==    by 0x545529: _PyEval_EvalCodeWithName (in /usr/bin/python3.8)
==4082742==    by 0x684326: PyEval_EvalCode (in /usr/bin/python3.8)
==4082742==  Address 0x128daad0 is 0 bytes after a block of size 16 alloc'd
==4082742==    at 0x483BE63: operator new(unsigned long) (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==4082742==    by 0xD394AFE: parquet::arrow::TransferColumnData(parquet::internal::RecordReader*, std::shared_ptr const&, parquet::ColumnDescriptor const*, arrow::MemoryPool*, std::shared_ptr*) (in /home/even/install-arrow/lib/libparquet.so.1500.0.0)
==4082742==    by 0xD36FA1A: parquet::arrow::(anonymous namespace)::LeafReader::LoadBatch(long) (in /home/even/install-arrow/lib/libparquet.so.1500.0.0)
==4082742==    by 0xD36D189: arrow::internal::FnOnce::FnImpl, parquet::arrow::(anonymous namespace)::FileReaderImpl::GetRecordBatchReader(std::vector > const&, std::vector > const&, std::unique_ptr >*)::{lambda()#1}::operator()()::{lambda(int)#1}, int)> >::invoke() (in /home/even/install-arrow/lib/libparquet.so.1500.0.0)
==4082742==    by 0x7280716: std::thread::_State_impl > >::_M_run() (in /home/even/install-arrow/lib/libarrow.so.1500.0.0)
==4082742==    by 0x8735DF3: ??? (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.28)
==4082742==    by 0x4A78608: start_thread (pthread_create.c:477)
==4082742==    by 0x499B352: clone (clone.S:95)
==4082742== 

Reproducible with v15.0.0 and the latest master at the time of writing (16e20b7).
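For callers that must read untrusted Parquet files before a fixed Arrow is deployed, one way to contain a native crash like this is to run the read in a child interpreter. This is a minimal sketch, not part of Arrow or GDAL; `safe_iter_batches` is a hypothetical helper, and pyarrow is assumed to be importable in the child environment for real use:

```python
import subprocess
import sys


def safe_iter_batches(path, timeout=60):
    """Return True if iter_batches() over `path` completes without error.

    The read runs in a child interpreter, so a native crash (SIGSEGV,
    ASan abort) inside the C++ reader kills only the child process,
    never the caller. A Python-level exception in the child also yields
    a nonzero return code, so both failure modes are reported as False.
    """
    child_code = (
        "import sys\n"
        "import pyarrow.parquet as pq\n"
        "list(pq.ParquetFile(sys.argv[1]).iter_batches())\n"
    )
    proc = subprocess.run(
        [sys.executable, "-c", child_code, path],
        capture_output=True,
        timeout=timeout,
    )
    return proc.returncode == 0
```

This trades a process spawn per file for the guarantee that a heap overflow in the reader cannot take down the host application.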

I'm not sure if it is related, but trying to read with ParquetReader.read_all() triggers a clean exception:

```
>>> import pyarrow.parquet as pq
>>> parquet_file = pq.ParquetFile('crash-581a7ec06da982291398aa3f63361ecb69fe20fc')
>>> parquet_file.read()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/even/arrow/python/build/lib.linux-x86_64-3.8/pyarrow/parquet/core.py", line 624, in read
    return self.reader.read_all(column_indices=column_indices,
  File "pyarrow/_parquet.pyx", line 1675, in pyarrow._parquet.ParquetReader.read_all
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Column 18 named timestamp_us_no_tz expected length 5 but got length 2
```

and another one when using ParquetReader.scan_contents():

```
>>> import pyarrow.parquet as pq
>>> parquet_file = pq.ParquetFile('crash-581a7ec06da982291398aa3f63361ecb69fe20fc')
>>> parquet_file.scan_contents()
Traceback (most recent call last):
  File "test.py", line 4, in <module>
    parquet_file.scan_contents()
  File "/home/even/arrow/python/build/lib.linux-x86_64-3.8/pyarrow/parquet/core.py", line 662, in scan_contents
    return self.reader.scan_contents(column_indices,
  File "pyarrow/_parquet.pyx", line 1702, in pyarrow._parquet.ParquetReader.scan_contents
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
OSError: Parquet error: Total rows among columns do not match
```
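The interesting contrast above is that some read paths fail with a clean Python exception while iter_batches() crashes the process outright. That contrast can be made systematic with a small generic probe; this is a hedged sketch with hypothetical names, where the pyarrow-specific callables (e.g. `lambda: pq.ParquetFile(path)`) would be supplied by the caller:

```python
def probe_read_paths(open_file, readers):
    """Run each reader against a fresh file object and record the outcome.

    open_file: zero-argument callable returning a fresh reader object.
    readers:   dict mapping a label to a callable taking that object.

    A reader that raises a Python exception is recorded as a clean error;
    one that crashes the interpreter (like the pre-fix iter_batches here)
    never returns at all, which is exactly the difference this issue
    demonstrates. Run crash-prone paths under subprocess isolation.
    """
    results = {}
    for name, fn in readers.items():
        try:
            fn(open_file())
            results[name] = "ok"
        except Exception as exc:
            results[name] = type(exc).__name__
    return results
```

For this file one would expect `read` and `scan_contents` to be recorded as clean ArrowInvalid/OSError failures, while the pre-fix `iter_batches` aborts before any result is recorded.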

Component(s)

C++, Parquet


mapleFU commented Apr 24, 2024

I found the root cause:

```
/Users/mwish/workspace/CMakeLibs/arrow/cpp/src/arrow/table.cc:622:  Check failed: _s.ok() Operation failed: table_.ValidateFull()
Bad status: Invalid: Column 18 named timestamp_us_no_tz expected length 3 but got length 2
```

That's because less data than expected is read into the table. I may suggest another way to fix this.
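The failed check boils down to a single invariant: every column of a table must report the same length. A minimal Python sketch of that invariant (illustrative only, not Arrow's actual `ValidateFull()` implementation):

```python
def validate_equal_lengths(columns):
    """Raise if the named columns do not all share one length.

    columns: dict mapping column name to anything supporting len()
    (a stand-in for Arrow arrays). The error message mirrors the one
    added by the fix; the helper name is hypothetical.
    """
    lengths = {name: len(col) for name, col in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"columns do not have the same size: {lengths}")
```

On this corrupted file, one column decodes fewer rows than its siblings, so a check like this fires before the reader ever hands out a batch backed by a too-short buffer.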

rouault added a commit to rouault/arrow that referenced this issue Apr 24, 2024
rouault added a commit to rouault/arrow that referenced this issue Apr 24, 2024
rouault added a commit to rouault/arrow that referenced this issue Apr 27, 2024
rouault added a commit to rouault/arrow that referenced this issue Apr 28, 2024
mapleFU pushed a commit that referenced this issue Apr 30, 2024
### Rationale for this change

Fixes the crash detailed in #41317 in TableBatchReader::ReadNext() on a corrupted Parquet file

### What changes are included in this PR?

Add a validation that all read columns have the same size

### Are these changes tested?

I've tested on the reproducer I provided in #41317 that it now triggers a clean error:
```
Traceback (most recent call last):
  File "test.py", line 3, in <module>
    [_ for _ in parquet_file.iter_batches()]
  File "test.py", line 3, in <listcomp>
    [_ for _ in parquet_file.iter_batches()]
  File "pyarrow/_parquet.pyx", line 1587, in iter_batches
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: columns do not have the same size
```
I'm not sure if/how unit tests for corrupted datasets should be added

### Are there any user-facing changes?

No

**This PR contains a "Critical Fix".**

* GitHub Issue: #41317

Authored-by: Even Rouault <even.rouault@spatialys.com>
Signed-off-by: mwish <maplewish117@gmail.com>
@mapleFU mapleFU added this to the 17.0.0 milestone Apr 30, 2024

mapleFU commented Apr 30, 2024

Issue resolved by pull request #41366

@mapleFU mapleFU closed this as completed Apr 30, 2024
tolleybot pushed a commit to tmct/arrow that referenced this issue May 2, 2024
vibhatha pushed a commit to vibhatha/arrow that referenced this issue May 25, 2024