Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Do not publish to connections without context #3873

Merged
merged 2 commits into from
Oct 8, 2024

Conversation

chakaz
Copy link
Collaborator

@chakaz chakaz commented Oct 6, 2024

This is a rare case where a closed connection is kept alive while the handling fiber yields, therefore leaving cc_ (the connection context) pointing to null for other fibers to see.

As far as I can see, this can only happen during server shutdown, but there could be other cases that I have missed.

The test on its own does not reproduce the crash, however with added ThisFiber::SleepFor()s I could reproduce the crash:

  • Right before DispatchBrief() here
  • Right after connection context reset() here

In any case, calling SendPubMessageAsync() to a connection where cc_ is null is a bug, and we fix that here.

This is a rare case where a closed connection is kept alive while the
handling fiber yields, therefore leaving `cc_` (the connection context)
pointing to null for other fibers to see.

As far as I can see, this can only happen during server shutdown, but
there could be other cases that I have missed.

The test on its own does _not_ reproduce the crash, however with added
`ThisFiber::SleepFor()`s I could reproduce the crash:

* Right before `DispatchBrief()`
  [here](https://github.com/dragonflydb/dragonfly/blob/e3214cb603fce42c8bb52a2c220c3115b7f8cecf/src/server/channel_store.cc#L154)
* Right after connection context `reset()`
  [here](https://github.com/dragonflydb/dragonfly/blob/2ab480e160dece516b908584f1933b25b760abfb/src/facade/dragonfly_connection.cc#L750)

In any case, calling `SendPubMessageAsync()` to a connection where `cc_`
is null is a bug, and we fix that here.
romange
romange previously approved these changes Oct 6, 2024
@@ -146,7 +146,8 @@ unsigned ChannelStore::SendMessages(std::string_view channel, facade::ArgRange m
auto it = lower_bound(subscribers_ptr->begin(), subscribers_ptr->end(), idx,
ChannelStore::Subscriber::ByThreadId);
while (it != subscribers_ptr->end() && it->Thread() == idx) {
if (auto* ptr = it->Get(); ptr)
// A connection might have closed or be in the process of closing
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// A connection might have closed or be in the process of closing
// if ptr->cntx() is null, a connection might have closed or be in the process of closing

@@ -146,7 +146,8 @@ unsigned ChannelStore::SendMessages(std::string_view channel, facade::ArgRange m
auto it = lower_bound(subscribers_ptr->begin(), subscribers_ptr->end(), idx,
ChannelStore::Subscriber::ByThreadId);
while (it != subscribers_ptr->end() && it->Thread() == idx) {
if (auto* ptr = it->Get(); ptr)
// if ptr->cntx() is null, a connection might have closed or be in the process of closing
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can it be not null yet the state is_closing ? If thast's the case shouldn't we also cover this here ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I'm not sure which is_closing state you are referring to, there doesn't seem to be such a state in facade::Connection nor in util::Connection?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when we close a connection we set the member field of conn_context called conn_closing. I am asking if we also need to poll that here?

Copy link
Contributor

@kostasrim kostasrim Oct 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ptr->cntx()->conn_closing

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotcha, conn_closing case is already supported and handled inside SendAsync, so we're good with that regards

@chakaz chakaz requested a review from adiholden October 7, 2024 19:16
@chakaz chakaz merged commit b2ebfd0 into main Oct 8, 2024
12 checks passed
@chakaz chakaz deleted the chakaz/pubsub-null-conn branch October 8, 2024 11:45
chakaz added a commit that referenced this pull request Nov 14, 2024
This is a rare condition, which we know can happen during shutdown (see
[here](#3873 (comment)))
chakaz added a commit that referenced this pull request Nov 14, 2024
* fix: Do not use `cc_` in connection if it's null

This is a rare condition, which we know can happen during shutdown (see
[here](#3873 (comment)))

* add comment
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants