Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dead lock when adding child memory pool with already existing name #10342

Closed
zhztheplayer opened this issue Jun 28, 2024 · 4 comments
Closed
Labels
bug Something isn't working triage Newly created issue that needs attention.

Comments

@zhztheplayer
Copy link
Contributor

zhztheplayer commented Jun 28, 2024

This causes the intended error Child memory pool {} already exists in {} never get raised.

Code that acquired lock:

std::unique_lock guard{poolMutex_};
VELOX_CHECK_EQ(
children_.count(name),
0,
"Child memory pool {} already exists in {}",
name,
toString());

Code that awaits for lock:

void MemoryPool::visitChildren(
const std::function<bool(MemoryPool*)>& visitor) const {
std::vector<std::shared_ptr<MemoryPool>> children;
{
std::shared_lock guard{poolMutex_};
children.reserve(children_.size());
for (auto& entry : children_) {
auto child = entry.second.lock();
if (child != nullptr) {
children.push_back(std::move(child));
}
}
}

Call stack:

syscall 0x00007f09e70f073d
folly::detail::nativeFutexWaitImpl Futex.cpp:126
folly::detail::futexWaitImpl Futex.cpp:254
folly::detail::futexWait<…> Futex-inl.h:94
folly::SharedMutexImpl::WaitForever::doWait SharedMutex.h:738
folly::SharedMutexImpl::futexWaitForZeroBits<…> SharedMutex.h:1191
folly::SharedMutexImpl::yieldWaitForZeroBits<…> SharedMutex.h:1158
folly::SharedMutexImpl::waitForZeroBits<…> SharedMutex.h:1122
folly::SharedMutexImpl::waitForZeroBits<…> SharedMutex.h:1111
folly::SharedMutexImpl::lockSharedImpl<…> SharedMutex.h:1486
folly::SharedMutexImpl::lockSharedImpl<…> SharedMutex.h:1359
folly::SharedMutexImpl::lock_shared SharedMutex.h:491
std::shared_lock::lock SharedMutex.h:1684
std::shared_lock::shared_lock SharedMutex.h:1634
facebook::velox::memory::MemoryPool::visitChildren(const std::function<…> &) const MemoryPool.cpp:280
facebook::velox::memory::MemoryPoolImpl::usedBytes MemoryPool.cpp:699
facebook::velox::memory::MemoryPoolImpl::toStringLocked[abi:cxx11]() const MemoryPool.h:944
facebook::velox::memory::MemoryPoolImpl::toString[abi:cxx11]() const MemoryPool.h:670
facebook::velox::memory::MemoryPool::addAggregateChild MemoryPool.cpp:350
@zhztheplayer zhztheplayer added bug Something isn't working triage Newly created issue that needs attention. labels Jun 28, 2024
@zhztheplayer
Copy link
Contributor Author

cc @xiaoxmeng

@zhztheplayer zhztheplayer changed the title Dead lock when adding aggregate child memory pool with already existing name Dead lock when adding child memory pool with already existing name Jun 28, 2024
@lingbin
Copy link
Contributor

lingbin commented Jul 4, 2024

@zhztheplayer The call stack you gave is very helpful, and I feel like I already know the cause of the problem, I can write a cause analysis of this problem soon.

I can give a quick fix if you are not working on this bug.

In addition, I am curious: in what scenario would you try to add a child node with a duplicate name? Can you share a little bit?

@lingbin
Copy link
Contributor

lingbin commented Jul 4, 2024

The root cause of this problem comes from MemoryPool::poolMutex_ (not MemoryPoolImpl::mutex_).

MemoryPool::poolMutex_ is used to protect MemoryPool::children_ in the following 4 member methods: addLeafChild()/addAggregateChild()/getChildCount()/visitChildren.

In other words, to avoid deadlock:

  1. First of all, we can't call the above 4 methods while holding poolMutex_.
  2. Furthermore, we cannot call methods that use these 4 methods while holding poolMutex_;
  3. Finally, there cannot be any direct or indirect calls between these 4 methods. (This is the root cause of this problem).
addAggregateChild() ------ will acquire 'poolMutex_'
--> toString() [When duplicate names appear, used to print error messages]
    --> usedBytes()
        --> visitChildren() ------ [When reservedBytes() != 0] will acquire 'poolMutex_' again, **DEADLOCK!!**

So, addAggregateChild() indirectly calls visitChildren() [When reservedBytes() != 0], causing a deadlock.

To solve this problem, the easiest way is to simplify the error message printed when duplicate names appear.
After all, we only need to know the name of the parent pool, just like many other places in this class do.

@lingbin
Copy link
Contributor

lingbin commented Jul 4, 2024

@zhztheplayer I created #10400 to fix this bug.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working triage Newly created issue that needs attention.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants