Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix the SortBuffer's noMoreInput called twice when enable smj #10614

Closed
wants to merge 2 commits into from

Conversation

JkSelf
Copy link
Collaborator

@JkSelf JkSelf commented Jul 31, 2024

We encountered an exception while executing Q22 on a 1TB TPC-H dataset using sort merge join. The issue arises because the SortBuffer#noMoreInput() method is invoked multiple times. By eliminating this redundant check, Q22 executes successfully.

Caused by: org.apache.gluten.exception.GlutenException: Exception: VeloxRuntimeError
Error Source: RUNTIME
Error Code: INVALID_STATE
Retriable: False
Expression: !noMoreInput_
Context: Operator: OrderBy[1] 1
Function: noMoreInput
File: /mnt/DP_disk3/jk/projects/gluten/ep/build-velox/build/velox_ep/velox/exec/SortBuffer.cpp
Line: 101
Stack trace:
# 0  facebook::velox::VeloxException::VeloxException(char const*, unsigned long, char const*, std::basic_string_view<char, std::char_traits<char> >, std::basic_string_view<char, std::char_traits<char> >, std::basic_string_view<char, std::char_traits<char> >, std::basic_string_view<char, std::char_traits<char> >, bool, facebook::velox::VeloxException::Type, std::basic_string_view<char, std::char_traits<char> >)
# 1  void facebook::velox::detail::veloxCheckFail<facebook::velox::VeloxRuntimeError, facebook::velox::detail::CompileTimeEmptyString>(facebook::velox::detail::VeloxCheckFailArgs const&, facebook::velox::detail::CompileTimeEmptyString)
# 2  facebook::velox::exec::SortBuffer::noMoreInput()
# 3  facebook::velox::exec::OrderBy::noMoreInput()
# 4  facebook::velox::exec::Driver::runInternal(std::shared_ptr<facebook::velox::exec::Driver>&, std::shared_ptr<facebook::velox::exec::BlockingState>&, std::shared_ptr<facebook::velox::RowVector>&)
# 5  facebook::velox::exec::Driver::next(std::shared_ptr<facebook::velox::exec::BlockingState>&)
# 6  facebook::velox::exec::Task::next(folly::SemiFuture<folly::Unit>*)

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jul 31, 2024
Copy link

netlify bot commented Jul 31, 2024

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit 70d65f6
🔍 Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/66bb141cbbcbe700088cc5bc

@JkSelf
Copy link
Collaborator Author

JkSelf commented Jul 31, 2024

@pedroerp Can you help to review? Thanks for your help.

@pedroerp
Copy link
Contributor

pedroerp commented Aug 1, 2024

@JkSelf thanks for looking into this. While this seems like a simple fix, I wonder in this case if the caller is doing something wrong? Did you check which code is actually calling this function twice (which it shouldn't)?

@JkSelf
Copy link
Collaborator Author

JkSelf commented Aug 2, 2024

@pedroerp The orderBy#noMoreInput() method appears to be invoked twice at here, with the second invocation resulting in this failure.

@jinchengchenghh
Copy link
Contributor

I think the check is right, we should not sort the data twice and release the pool twice https://github.com/facebookincubator/velox/pull/10614/files#diff-f80347d211a617e028feceac2e8d028932b9232c73fe841b8c2f4eb3cc61a870L130.

Why is it invoked twice, can we avoid invoking twice?

@JkSelf JkSelf force-pushed the noMoreInput branch 2 times, most recently from d8e1295 to 50c8cc7 Compare August 6, 2024 07:19
@jinchengchenghh
Copy link
Contributor

Thanks for your investigate. Looks good. Can you add a unit test?

@JkSelf JkSelf changed the title Remove the noMoreInput = false check in SortBuffer#noMoreInput() method Fix the SortBuffer's noMoreInput called twice when enable smj Aug 6, 2024
@JkSelf
Copy link
Collaborator Author

JkSelf commented Aug 6, 2024

@pedroerp @jinchengchenghh
This is a bug with the left anti join. When we call getOutput(), we need to follow the same logic as a non-anti join that includes a filter. We have to check if the result after applying the filter is null, and if it is, we need to continue processing the subsequent data instead of returning nullptr.

@JkSelf
Copy link
Collaborator Author

JkSelf commented Aug 8, 2024

@pedroerp Can you help to look this PR? Passed in my local test.

@pedroerp
Copy link
Contributor

Thanks @JkSelf for digging into this! Could you add a unit test that reproduces this error? I'm concerned that if we need to refactor this code in the future we may break this again unless we have a unit test to protect the behavior.

@JkSelf
Copy link
Collaborator Author

JkSelf commented Aug 13, 2024

Thanks @JkSelf for digging into this! Could you add a unit test that reproduces this error? I'm concerned that if we need to refactor this code in the future we may break this again unless we have a unit test to protect the behavior.

@pedroerp Yes. Added the failed unit test. Can you help to review again? Thanks.

Copy link
Contributor

@pedroerp pedroerp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good now, thanks @JkSelf

@pedroerp pedroerp added the ready-to-merge PR that have been reviewed and are ready for merging. PRs with this tag notify the Velox Meta oncall label Aug 13, 2024
@facebook-github-bot
Copy link
Contributor

@xiaoxmeng has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@xiaoxmeng merged this pull request in 37aac8a.

Copy link

Conbench analyzed the 1 benchmark run on commit 37aac8a9.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. Merged ready-to-merge PR that have been reviewed and are ready for merging. PRs with this tag notify the Velox Meta oncall
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants