Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Fix](parquet-reader) Fix and optimize parquet min-max filtering. #39375

Merged
merged 1 commit into from
Aug 15, 2024

Conversation

kaka11chen
Copy link
Contributor

Backport #38277.

…ache#38277)

## Proposed changes

Refer to trino's implementation

- Some bugs in the historical version paquet-mr. Use
`CorruptStatistics::should_ignore_statistics()` to handle.

- The old version of parquet uses `min` and `max` stats, and later
implements `min_value` and `max_value`. `Min`/`max` stats cannot be used
for some types and in some cases. This is related to the comparison and
sorting method of values.

- If it is double or float, special cases such as NaN, -0, and 0 must be
handled.

- If the string type only has min and max stats, but no min_value or
max_value, use `ParquetPredicate::_try_read_old_utf8_stats()` to expand
the range reading optimization method for optimization.
@doris-robot
Copy link

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR

Since 2024-03-18, the Document has been moved to doris-website.
See Doris Document.

Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clang-tidy made some suggestions

@@ -17,10 +17,12 @@

#pragma once

#include <gen_cpp/parquet_types.h>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning: 'gen_cpp/parquet_types.h' file not found [clang-diagnostic-error]

#include <gen_cpp/parquet_types.h>
         ^

// specific language governing permissions and limitations
// under the License.

#include <gtest/gtest.h>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning: 'gtest/gtest.h' file not found [clang-diagnostic-error]

#include <gtest/gtest.h>
         ^

Comment on lines +24 to +25
namespace doris {
namespace vectorized {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning: nested namespaces can be concatenated [modernize-concat-nested-namespaces]

Suggested change
namespace doris {
namespace vectorized {
namespace doris::vectorized {

be/test/vec/exec/parquet/parquet_corrupt_statistics_test.cpp:132:

- } // namespace vectorized
- } // namespace doris
+ } // namespace doris

// specific language governing permissions and limitations
// under the License.

#include <gtest/gtest.h>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning: 'gtest/gtest.h' file not found [clang-diagnostic-error]

#include <gtest/gtest.h>
         ^

Comment on lines +24 to +25
namespace doris {
namespace vectorized {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning: nested namespaces can be concatenated [modernize-concat-nested-namespaces]

Suggested change
namespace doris {
namespace vectorized {
namespace doris::vectorized {

be/test/vec/exec/parquet/parquet_statistics_test.cpp:-1:

+ }

ParquetStatisticsTest() = default;
};

TEST_F(ParquetStatisticsTest, test_try_read_old_utf8_stats) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning: function 'TEST_F' exceeds recommended size/complexity thresholds [readability-function-size]

TEST_F(ParquetStatisticsTest, test_try_read_old_utf8_stats) {
^
Additional context

be/test/vec/exec/parquet/parquet_statistics_test.cpp:30: 121 lines including whitespace and comments (threshold 80)

TEST_F(ParquetStatisticsTest, test_try_read_old_utf8_stats) {
^

// specific language governing permissions and limitations
// under the License.

#include <gtest/gtest.h>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning: 'gtest/gtest.h' file not found [clang-diagnostic-error]

#include <gtest/gtest.h>
         ^

Comment on lines +24 to +25
namespace doris {
namespace vectorized {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning: nested namespaces can be concatenated [modernize-concat-nested-namespaces]

Suggested change
namespace doris {
namespace vectorized {
namespace doris::vectorized {

be/test/vec/exec/parquet/parquet_version_test.cpp:219:

- } // namespace vectorized
- } // namespace doris
+ } // namespace doris

ParquetVersionTest() = default;
};

TEST_F(ParquetVersionTest, test_version_parser) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning: function 'TEST_F' exceeds recommended size/complexity thresholds [readability-function-size]

TEST_F(ParquetVersionTest, test_version_parser) {
^
Additional context

be/test/vec/exec/parquet/parquet_version_test.cpp:30: 91 lines including whitespace and comments (threshold 80)

TEST_F(ParquetVersionTest, test_version_parser) {
^

@kaka11chen
Copy link
Contributor Author

run buildall

@doris-robot
Copy link

TeamCity be ut coverage result:
Function Coverage: 36.54% (9341/25562)
Line Coverage: 28.04% (76537/272968)
Region Coverage: 26.86% (39372/146556)
Branch Coverage: 23.61% (20010/84762)
Coverage Report: http://coverage.selectdb-in.cc/coverage/4f994ec0e9aa1f9c6a4df86054046d4cfcbbb4e4_4f994ec0e9aa1f9c6a4df86054046d4cfcbbb4e4/report/index.html

@yiguolei yiguolei merged commit a44a274 into apache:branch-2.1 Aug 15, 2024
20 of 22 checks passed
@yiguolei yiguolei mentioned this pull request Sep 5, 2024
3 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants