fix crash problem when stopping the process. #705
Conversation
…pend io thread pool, but the io thread pool will be stopped first by gServer
The reason is that when gServer->stop() is called, gServer stops all of the io threads, but the meta client and the raft part still need the io thread pool to send messages out, so the process crashes.
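To make the ordering concrete, here is a minimal, hedged sketch of the shutdown order being described; metaClient, raftService and gServer are placeholder names standing in for the real objects in the daemons, not the PR's actual code.

```cpp
// Sketch only: the point is the stop order, not the exact API of the real classes.
#include <memory>
#include <folly/executors/IOThreadPoolExecutor.h>

void shutdownInOrder(std::shared_ptr<folly::IOThreadPoolExecutor> ioPool) {
    // 1. Stop every background user of the shared pool first
    //    (heartbeat thread in the meta client, raft status polling, ...).
    // metaClient->stop();
    // raftService->stop();

    // 2. Only then let the thrift server tear down the shared io threads.
    // gServer->stop();

    // 3. Finally join the pool itself.
    ioPool->join();
}
```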
Jenkins go.
Unit testing passed.
Unit testing failed.
Unit testing passed.
Could you please extract these shit logics out for us?
The meta client has a background thread that sends heartbeats and depends on the io thread pool created in main; the thrift server depends on the same io thread pool. When we stop the thrift server, it stops all the threads in the io pool, but the background thread in the meta client is not stopped. If the meta client then sends a heartbeat, it calls auto* evb = ioThreadPool_->getEventBase(); to get an event base, and inside that function me and ths are nullptr, so the process crashes.
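To illustrate the failure mode, here is a minimal repro sketch (not code from the PR) of a straggling caller hitting getEventBase() after the shared pool has already been stopped:

```cpp
// Repro sketch only: once the shared pool is stopped, a late caller of
// getEventBase() has no live io thread to pick from, which is the
// nullptr dereference described above.
#include <memory>
#include <folly/executors/IOThreadPoolExecutor.h>
#include <folly/io/async/EventBase.h>

int main() {
    auto ioPool = std::make_shared<folly::IOThreadPoolExecutor>(4);

    ioPool->stop();  // roughly what gServer->stop() does to the shared pool

    // The heartbeat thread in the meta client is still running and does this:
    folly::EventBase* evb = ioPool->getEventBase();  // invalid after stop()
    (void)evb;
    return 0;
}
```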
While fixing this problem, I found some complex cyclic dependencies, such as between the worker threads in thrift and the kvstore in main, and between the accept thread pool and the io thread pool.
Unit testing passed.
The PR looks good to me. Thanks for taking care of it.
Did you try it? Does it work?
auto ioPool = std::make_shared<folly::IOThreadPoolExecutor>(FLAGS_num_io_threads);
auto acceptThreadPool = std::make_shared<folly::IOThreadPoolExecutor>(1);
Make it configurable.
Good point. I thought about this, but concluded that there is no need to make it configurable; even 1 thread is more than enough, because the accept thread pool only accepts connections from clients which are our own processes, like graphd or metad, and those are long-lived connections, so the load on it is not high. What do you think?
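For reference, a hedged sketch of what making it configurable could look like; the flag name num_accept_threads is hypothetical and not something defined by this PR.

```cpp
// Hypothetical sketch only: num_accept_threads is an assumed flag name.
#include <memory>
#include <gflags/gflags.h>
#include <folly/executors/IOThreadPoolExecutor.h>

DEFINE_int32(num_accept_threads, 1,
             "Number of threads in the shared accept thread pool");

std::shared_ptr<folly::IOThreadPoolExecutor> makeAcceptThreadPool() {
    return std::make_shared<folly::IOThreadPoolExecutor>(FLAGS_num_accept_threads);
}
```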
void RaftexService::initThriftServer(std::shared_ptr<folly::IOThreadPoolExecutor> pool,
uint16_t port) {
void RaftexService::initThriftServer(std::shared_ptr<folly::IOThreadPoolExecutor> ioPool,
std::shared_ptr<folly::IOThreadPoolExecutor> acceptPool,
alignment
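For clarity, this is the kind of parameter alignment being requested, shown on a dummy declaration; the port parameter is assumed to be kept from the old signature.

```cpp
// Formatting illustration only, not the PR's real code: wrapped parameters
// aligned under the first parameter.
#include <cstdint>
#include <memory>
#include <folly/executors/IOThreadPoolExecutor.h>

class RaftexServiceSketch {
public:
    void initThriftServer(std::shared_ptr<folly::IOThreadPoolExecutor> ioPool,
                          std::shared_ptr<folly::IOThreadPoolExecutor> acceptPool,
                          uint16_t port);
};
```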
Yes. I tested the case where main sleeps for a few seconds after returning from gServer->serve(); when the meta client sends a heartbeat, the process does not crash.
, localHost_(localHost)
, sendHeartBeat_(sendHeartBeat) {
ioThreadPool_ = std::make_unique<folly::IOThreadPoolExecutor>(FLAGS_meta_client_io_thread_num);
CHECK(ioThreadPool_ != nullptr) << "IOThreadPool is required";
It doesn't seem necessary anymore.
auto ioPool = std::make_shared<folly::IOThreadPoolExecutor>(FLAGS_num_io_threads);
auto acceptThreadPool = std::make_shared<folly::IOThreadPoolExecutor>(1);
- Why does MetaDaemon share the acceptors with NebulaStore? Same question for StorageDaemon.
- Why do we create acceptors with 1 worker? The ThriftServer creates acceptors with 1 worker by default, after all.
- Can acceptors with 1 worker meet our performance requirements?
I already knew almost all of this. But what I meant is that we cannot live with fix after fix after fix forever. Generally, fixing it this way is not clean and not OK to me.
Thanks very much, that is a very good point of view. I have thought about it too: we cannot just fix a problem for the sake of fixing it; we need to find the underlying design problem and avoid the dependency issues at the level of the whole architecture.
As dutor said, this fix is very ugly and I am ashamed of it, so I will close this PR and open an issue instead. This is an open source project; we must bring good ideas into our code.
Fix crash problem when stopping the process: the meta client and the raft part depend on the io thread pool, but the io thread pool is stopped first by gServer:
1. raft backtrace:
(gdb) bt
#0 0x000000000208f587 in folly::IOThreadPoolExecutor::getEventBase (this=) at /usr/include/c++/8/bits/shared_ptr_base.h:1018
#1 0x0000000001b48970 in nebula::raftex::RaftPart::appendLogsInternal (this=0x7fbfdc168c10, iter=..., termId=8) at /home/wade.liu/rd/nebula/src/kvstore/raftex/RaftPart.cpp:512
#2 0x0000000001b47dd0 in nebula::raftex::RaftPart::appendLogAsync (this=0x7fbfdc168c10, source=0 '\000', logType=nebula::raftex::LogType::NORMAL, log="")
at /home/wade.liu/rd/nebula/src/kvstore/raftex/RaftPart.cpp:452
#3 0x0000000001b501ff in nebula::raftex::RaftPart::sendHeartbeat (this=0x7fbfdc168c10) at /home/wade.liu/rd/nebula/src/kvstore/raftex/RaftPart.cpp:1270
#4 0x0000000001b4ce4f in nebula::raftex::RaftPart::statusPolling (this=0x7fbfdc168c10) at /home/wade.liu/rd/nebula/src/kvstore/raftex/RaftPart.cpp:940
#5 0x0000000001b4c9d8 in nebula::raftex::RaftPart::<lambda()>::operator()(void) const (__closure=0x7fbfd7ce7400) at /home/wade.liu/rd/nebula/src/kvstore/raftex/RaftPart.cpp:949
#6 0x0000000001b5d36c in std::__invoke_impl<void, nebula::raftex::RaftPart::statusPolling()::<lambda()>&>(std::__invoke_other, nebula::raftex::RaftPart::<lambda()> &) (__f=...)
at /usr/include/c++/8/bits/invoke.h:60
#7 0x0000000001b5d2a1 in std::__invoke<nebula::raftex::RaftPart::statusPolling()::<lambda()>&>(nebula::raftex::RaftPart::<lambda()> &) (__fn=...) at /usr/include/c++/8/bits/invoke.h:95
#8 0x0000000001b5d19a in std::_Bind<nebula::raftex::RaftPart::statusPolling()::<lambda()>()>::__call(std::tuple<> &&, std::_Index_tuple<>) (this=0x7fbfd7ce7400, __args=...)
at /usr/include/c++/8/functional:400
#9 0x0000000001b5ccaa in std::_Bind<nebula::raftex::RaftPart::statusPolling()::<lambda()>()>::operator()<>(void) (this=0x7fbfd7ce7400) at /usr/include/c++/8/functional:484
#10 0x0000000001b5c647 in std::_Function_handler<void(), std::_Bind<nebula::raftex::RaftPart::statusPolling()::<lambda()>()> >::_M_invoke(const std::_Any_data &) (__functor=...)
at /usr/include/c++/8/bits/std_function.h:297
2. meta backtrace:
#0 0x000000000208dc37 in folly::IOThreadPoolExecutor::getEventBase (this=) at /usr/include/c++/8/bits/shared_ptr_base.h:1018
#1 0x00000000018169eb in nebula::meta::MetaClient::getResponse<nebula::meta::cpp2::HBReq, nebula::meta::MetaClient::heartbeat()::<lambda(auto:110, auto:111)>, nebula::meta::MetaClient::heartbeat()::<lambda(nebula::meta::cpp2::HBResp&&)> >(nebula::meta::cpp2::HBReq, nebula::meta::MetaClient::<lambda(auto:110, auto:111)>, nebula::meta::MetaClient::<lambda(nebula::meta::cpp2::HBResp&&)>, bool) (
this=0x7f7642d60600, req=..., remoteFunc=..., respGen=..., toLeader=true) at /home/wade.liu/rd/nebula/src/meta/client/MetaClient.cpp:254
#2 0x000000000180d6f6 in nebula::meta::MetaClient::heartbeat (this=0x7f7642d60600) at /home/wade.liu/rd/nebula/src/meta/client/MetaClient.cpp:987
#3 0x0000000001806204 in nebula::meta::MetaClient::heartBeatThreadFunc (this=0x7f7642d60600) at /home/wade.liu/rd/nebula/src/meta/client/MetaClient.cpp:85
#4 0x00000000018876da in std::__invoke_impl<void, void (nebula::meta::MetaClient::&)(), nebula::meta::MetaClient&> (
__f=@0x7f7642dc1ea0: (void (nebula::meta::MetaClient::*)(nebula::meta::MetaClient * const)) 0x18061e0 nebula::meta::MetaClient::heartBeatThreadFunc(), __t=@0x7f7642dc1eb0: 0x7f7642d60600)
at /usr/include/c++/8/bits/invoke.h:73