Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add test data for RLE with bit_width == 0 #57

Merged
merged 4 commits into from
Aug 23, 2024

Conversation

don4get
Copy link
Contributor

@don4get don4get commented Aug 8, 2024

Data to reproduce apache/arrow#43605.

It is used in this PR to test proper behavior:
apache/arrow#43607

@@ -22,3 +22,5 @@ These are files used for reproducing various bugs that have been reported.

* PARQUET-1481.parquet: tests a case where a schema Thrift value has been
corrupted

* GH_43605.parquet: In Go, file reader goroutine crashed.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Naming isn't terrific as it doesn't say which GH repo this is in. Perhaps ARROW-GH-43605.parquet?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, it is done.

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot @don4get

@pitrou
Copy link
Member

pitrou commented Aug 14, 2024

@mapleFU Does this look ok to you?

@mapleFU
Copy link
Member

mapleFU commented Aug 14, 2024

Sorry I just back home I guess we can use same file as previous testing, let me read and check it

@@ -22,3 +22,5 @@ These are files used for reproducing various bugs that have been reported.

* PARQUET-1481.parquet: tests a case where a schema Thrift value has been
corrupted

* ARROW-GH-43605.parquet: In Go, file reader goroutine crashed.
Copy link
Member

@mapleFU mapleFU Aug 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally I think we should describe more about why this causing crashed, like footer corrupt, page header metadata, invalid content like page size less than metadata described, less column chunk than in metadata description etc. We have lots of "corruput file", but the reason would be differ

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you give me some help to know which part of the file is corrupted? I am not expert enough to know why the file is corrupt. I only know it is generated with a specific version of polars, as described in this PR: apache/arrow#43607 (comment)
Latest version of polars does not cause the issue anymore.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait a minute, I'll test this on my local pc

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Emm I don't know, arrow-c++ can read this file, emm...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If arrow-cpp manages to open it is nice, arrow-go crashed with this before related PR on arrow. Should I move this into valid data folder then?

Copy link
Member

@pitrou pitrou Aug 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The criterion should be: is it a valid Parquet file or not?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's a valid Parquet file and the issue was a bug in Arrow Go preventing to read it, then it should go in data. If it's an invalid Parquet file and the issue was that Arrow Go crashed instead of reporting a regular error, then it should go in bad_data.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll check this later

Copy link
Member

@mapleFU mapleFU Aug 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See: #57 (comment)

Dictionary Rle page with RLE encoding bit-width is 0.

@mapleFU
Copy link
Member

mapleFU commented Aug 23, 2024

I finally catch this. This file has zero-sized dictionary bit-width. With diff:

diff --git a/cpp/src/arrow/util/bit_stream_utils_internal.h b/cpp/src/arrow/util/bit_stream_utils_internal.h
index 811694e43..176d4b1d9 100644
--- a/cpp/src/arrow/util/bit_stream_utils_internal.h
+++ b/cpp/src/arrow/util/bit_stream_utils_internal.h
@@ -22,6 +22,7 @@
 #include <algorithm>
 #include <cstdint>
 #include <cstring>
+#include <iostream>
 
 #include "arrow/util/bit_util.h"
 #include "arrow/util/bpacking.h"
@@ -312,6 +313,7 @@ inline bool BitReader::GetValue(int num_bits, T* v) {
 
 template <typename T>
 inline int BitReader::GetBatch(int num_bits, T* v, int batch_size) {
+  std::cout << "Unpack:" << num_bits << ", batch-size:" << batch_size << '\n';
   DCHECK(buffer_ != NULL);
   DCHECK_LE(num_bits, static_cast<int>(sizeof(T) * 8)) << "num_bits: " << num_bits;
 
diff --git a/cpp/src/parquet/encoding.cc b/cpp/src/parquet/encoding.cc
index 16a1e2492..b76ca1b2d 100644
--- a/cpp/src/parquet/encoding.cc
+++ b/cpp/src/parquet/encoding.cc
@@ -1626,6 +1626,7 @@ class DictDecoderImpl : public DecoderImpl, virtual public DictDecoder<Type> {
       throw ParquetException("Invalid or corrupted bit_width " +
                              std::to_string(bit_width) + ". Maximum allowed is 32.");
     }
+    std::cout << "Dictionary bit-width: " << int(bit_width) << '\n';
     idx_decoder_ = ::arrow::util::RleDecoder(++data, --len, bit_width);
   }

The dictionary bit-width is 0 here in the output.

@mapleFU mapleFU force-pushed the GH43605-test-data branch from abd050c to f061d72 Compare August 23, 2024 17:29
@@ -30,3 +30,4 @@ These are files used for reproducing various bugs that have been reported.
where decoded rep / def levels is less than num_values in page_header.
* ARROW-GH-41317.parquet: test case of https://github.com/apache/arrow/issues/41317
where all columns have not the same size.
* ARROW-GH-43605.parquet: dictionary index page uses rle encoding but 0 as rle bit-width.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll merge this first, but I don't know whether this is bad or corner of spec

@mapleFU mapleFU changed the title Add test data for crash in go file reader Add test data for rle with bit_width == 0 Aug 23, 2024
@mapleFU mapleFU changed the title Add test data for rle with bit_width == 0 Add test data for RLE with bit_width == 0 Aug 23, 2024
@mapleFU mapleFU merged commit 50af3d8 into apache:master Aug 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants