-
Notifications
You must be signed in to change notification settings - Fork 3.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Fix](parquet-reader) Fix and optimize parquet min-max filtering. #39375
[Fix](parquet-reader) Fix and optimize parquet min-max filtering. #39375
Conversation
…ache#38277) ## Proposed changes Refer to trino's implementation - Some bugs in the historical version paquet-mr. Use `CorruptStatistics::should_ignore_statistics()` to handle. - The old version of parquet uses `min` and `max` stats, and later implements `min_value` and `max_value`. `Min`/`max` stats cannot be used for some types and in some cases. This is related to the comparison and sorting method of values. - If it is double or float, special cases such as NaN, -0, and 0 must be handled. - If the string type only has min and max stats, but no min_value or max_value, use `ParquetPredicate::_try_read_old_utf8_stats()` to expand the range reading optimization method for optimization.
Thank you for your contribution to Apache Doris. Since 2024-03-18, the Document has been moved to doris-website. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
clang-tidy made some suggestions
@@ -17,10 +17,12 @@ | |||
|
|||
#pragma once | |||
|
|||
#include <gen_cpp/parquet_types.h> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
warning: 'gen_cpp/parquet_types.h' file not found [clang-diagnostic-error]
#include <gen_cpp/parquet_types.h>
^
// specific language governing permissions and limitations | ||
// under the License. | ||
|
||
#include <gtest/gtest.h> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
warning: 'gtest/gtest.h' file not found [clang-diagnostic-error]
#include <gtest/gtest.h>
^
namespace doris { | ||
namespace vectorized { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
warning: nested namespaces can be concatenated [modernize-concat-nested-namespaces]
namespace doris { | |
namespace vectorized { | |
namespace doris::vectorized { |
be/test/vec/exec/parquet/parquet_corrupt_statistics_test.cpp:132:
- } // namespace vectorized
- } // namespace doris
+ } // namespace doris
// specific language governing permissions and limitations | ||
// under the License. | ||
|
||
#include <gtest/gtest.h> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
warning: 'gtest/gtest.h' file not found [clang-diagnostic-error]
#include <gtest/gtest.h>
^
namespace doris { | ||
namespace vectorized { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
warning: nested namespaces can be concatenated [modernize-concat-nested-namespaces]
namespace doris { | |
namespace vectorized { | |
namespace doris::vectorized { |
be/test/vec/exec/parquet/parquet_statistics_test.cpp:-1:
+ }
ParquetStatisticsTest() = default; | ||
}; | ||
|
||
TEST_F(ParquetStatisticsTest, test_try_read_old_utf8_stats) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
warning: function 'TEST_F' exceeds recommended size/complexity thresholds [readability-function-size]
TEST_F(ParquetStatisticsTest, test_try_read_old_utf8_stats) {
^
Additional context
be/test/vec/exec/parquet/parquet_statistics_test.cpp:30: 121 lines including whitespace and comments (threshold 80)
TEST_F(ParquetStatisticsTest, test_try_read_old_utf8_stats) {
^
// specific language governing permissions and limitations | ||
// under the License. | ||
|
||
#include <gtest/gtest.h> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
warning: 'gtest/gtest.h' file not found [clang-diagnostic-error]
#include <gtest/gtest.h>
^
namespace doris { | ||
namespace vectorized { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
warning: nested namespaces can be concatenated [modernize-concat-nested-namespaces]
namespace doris { | |
namespace vectorized { | |
namespace doris::vectorized { |
be/test/vec/exec/parquet/parquet_version_test.cpp:219:
- } // namespace vectorized
- } // namespace doris
+ } // namespace doris
ParquetVersionTest() = default; | ||
}; | ||
|
||
TEST_F(ParquetVersionTest, test_version_parser) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
warning: function 'TEST_F' exceeds recommended size/complexity thresholds [readability-function-size]
TEST_F(ParquetVersionTest, test_version_parser) {
^
Additional context
be/test/vec/exec/parquet/parquet_version_test.cpp:30: 91 lines including whitespace and comments (threshold 80)
TEST_F(ParquetVersionTest, test_version_parser) {
^
run buildall |
TeamCity be ut coverage result: |
Backport #38277.