Make graph-node more robust in the face of shards being down or restarting #2815
Force-pushed from 5e4d5f9 to c073e3a
First of all, great PR 👏 It took me some time to review haha.
Well, I've posted multiple questions and suggestions; some are more style-related, though.
Some questions I have about the PR description are:
> Even so, it's probably a good idea to restart graph-node after a shard restarted.
I'm curious, why is that?
> queries should only fail if they involve a failed shard
What part of this scenario is not covered today by #2727? Also, after reading through your code, it seems that the behavior is to try to get a connection from a shard and if that fails, we try on the next one. How would a failed shard get a query invoked? 🤔 (maybe I just misunderstood your statement above)
> indexing operations in a failed shard retry and block indefinitely
What happens today on the indexing part? I think I'm more aware of the query side.
Awesome test cases by the way, that must've been a lot of work 😅
Thanks!
There are at least two things that I am worried about: (1) I am not 100% confident that this catches all the ways in which indexing could fail when a shard goes down (especially when the shard with the block store goes down at the wrong moment), and (2) because listeners reconnect after some delay, it's possible that they miss assignment events, so that, e.g., a subgraph deployed during that time doesn't start syncing.
#2727 does not cover the scenario where the primary fails - without this PR, when the primary fails, most queries will fail, since the primary is needed to determine where a subgraph is stored. There is some in-memory caching of this data, but it has a 120s TTL and doesn't cover the resolution of subgraph names to hashes. The only data that we try getting from different shards is whatever is mentioned in
The subgraph will just fail and needs to be restarted.
Addressed all review comments (I think)
I'm approving because it seems correct and I think I understood what most of your code does. However, I'm not sure if you want anyone else to take a look just to be more sure that none of the new code can have unexpected behavior on the database connection/query handling/execution.
Force-pushed from 08e5bf8 to 78cb1a8
Rebased to latest master
@@ -81,6 +82,11 @@ impl NotificationListener {
/// channel.
///
/// Must call `.start()` to begin receiving notifications on the returned receiver.
//
Small thing, but I think there should be another / here.
Also, one thing I forgot to ask on the first review is about the
Instead of crashing the process, try to reconnect indefinitely if we lose the connection to the database.
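A rough sketch of that reconnect idea; the `connect()` closure and the backoff values are illustrative assumptions, not graph-node's actual listener code. The point is to keep retrying with a capped backoff instead of letting the process exit.

```rust
use std::{thread, time::Duration};

// Retry `connect` forever with capped exponential backoff; the caller only
// gets control back once a connection has been established again.
fn connect_with_retry<C, E: std::fmt::Display>(mut connect: impl FnMut() -> Result<C, E>) -> C {
    let mut delay = Duration::from_secs(1);
    loop {
        match connect() {
            Ok(conn) => return conn,
            Err(e) => {
                eprintln!("database connection lost, retrying in {delay:?}: {e}");
                thread::sleep(delay);
                // Cap the backoff so we keep probing regularly during a long outage.
                delay = (delay * 2).min(Duration::from_secs(30));
            }
        }
    }
}
```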
Manually insert/delete rows from the primary's deployment_schemas and chains tables. We can't use logical replication because we need to support Postgres 9.6. Add a job that refreshes the mirror every 5 minutes.
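A sketch of what such a refresh job could look like; the `mirror` and `primary_fdw` schema names and the use of the `postgres` crate are illustrative assumptions, not the actual graph-node implementation.

```rust
use std::{thread, time::Duration};

const REFRESH_INTERVAL: Duration = Duration::from_secs(300); // every 5 minutes

// Replace the mirrored copies wholesale inside one transaction; the tables
// are small, so delete + insert is simpler than tracking individual changes.
fn refresh_mirror(conn: &mut postgres::Client) -> Result<(), postgres::Error> {
    let mut tx = conn.transaction()?;
    for table in ["deployment_schemas", "chains"] {
        tx.batch_execute(&format!(
            "DELETE FROM mirror.{t}; INSERT INTO mirror.{t} SELECT * FROM primary_fdw.{t};",
            t = table
        ))?;
    }
    tx.commit()
}

// The job itself: refresh on a fixed schedule, and log rather than crash on errors.
fn run_refresh_job(mut conn: postgres::Client) {
    loop {
        if let Err(e) = refresh_mirror(&mut conn) {
            eprintln!("mirror refresh failed: {e}");
        }
        thread::sleep(REFRESH_INTERVAL);
    }
}
```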
Use that facility for the places where the BlockStore reads from the primary
We read the assignments for a node early during startup; with this change, a node will start up even when the primary is down
This is in preparation for retrying operations: if we pass owned data, we need to clone before every attempt. Instead of forcing a clone in the common case of the first try succeeding, work with references. This also requires that we use Cow<Entity> because in rare cases we have to modify the entity to add fulltext search fields; in the common case where there is no fulltext search, we do not want to clone.
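A minimal sketch of the Cow idea with a stand-in Entity type (graph-node's real Entity and fulltext handling differ): the borrowed entity is written as-is in the common case, and we only pay for a clone when a fulltext field has to be added.

```rust
use std::borrow::Cow;
use std::collections::BTreeMap;

// Stand-in for graph-node's Entity type.
type Entity = BTreeMap<String, String>;

fn prepare_for_write<'a>(
    entity: &'a Entity,
    fulltext_field: Option<(&str, &str)>,
) -> Cow<'a, Entity> {
    match fulltext_field {
        // Rare case: the entity must gain a derived fulltext field, so clone it.
        Some((name, value)) => {
            let mut owned = entity.clone();
            owned.insert(name.to_string(), value.to_string());
            Cow::Owned(owned)
        }
        // Common case: write the borrowed entity as-is; the caller keeps the
        // original and can hand us the same reference again on a retry.
        None => Cow::Borrowed(entity),
    }
}
```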
Force-pushed from 78cb1a8 to 18cb404
When we transact block operations, or revert a block, ignore errors when sending store events. Since store events sent for these operations are only used to update subscriptions, it is better to carry on than to fail the operation, and with it the subgraph.
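A toy, std-only sketch of that "log and carry on" behavior; the closure stands in for the actual store-event send, whose types are not shown here.

```rust
// Run the event send, log a failure, and deliberately swallow the error so
// the surrounding block write (and with it the subgraph) does not fail.
fn send_event_best_effort<E: std::fmt::Display>(send: impl FnOnce() -> Result<(), E>) {
    if let Err(e) = send() {
        eprintln!("failed to send store event (ignored): {e}");
    }
}

fn main() {
    // Simulate a send that fails because the events connection is gone.
    send_event_best_effort(|| Err::<(), &str>("connection to event channel lost"));
}
```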
This can block indefinitely and therefore lead to a lot of work queueing up on query nodes; rather than that, operations should fail when we cannot get a connection.
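A minimal sketch of that fail-fast behavior, assuming an r2d2-style connection pool; the timeout value and the `with_conn` helper are illustrative, not graph-node's actual pool handling.

```rust
use std::time::Duration;

// Get a connection with a bounded wait; if the shard is down or the pool is
// exhausted, return the pool error so the caller can fail the operation
// instead of queueing up behind a connection that may never arrive.
fn with_conn<M, T>(
    pool: &r2d2::Pool<M>,
    work: impl FnOnce(&mut r2d2::PooledConnection<M>) -> T,
) -> Result<T, r2d2::Error>
where
    M: r2d2::ManageConnection,
{
    let mut conn = pool.get_timeout(Duration::from_secs(5))?;
    Ok(work(&mut conn))
}
```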
Force-pushed from 18cb404 to 393ce06
What changed with retrying is that we need two ways to access the entities we are about to write to the store: one for forming the actual
I made a mistake when merging this (not pushing the updated
This PR is a continuation of PR #2727 and rounds out the story of making graph-node handle shards being down during startup and during normal operation more gracefully, including coming back to a sane state when a failed shard comes back up. Even so, it's probably a good idea to restart graph-node after a shard restarted. The basic ideas behind these changes are:
Testing these things is unfortunately a very manual process. Testing was done under very little load, and it's possible that this did not reveal issues that only appear under high loads. Here are my notes on what I tested:
Setup
- primary, sharda, and shardb
- primary
- A subgraph (dice2win) indexing in each shard

Notes
Scenarios
- sharda is down on startup
  - queries against primary and shardb work
  - queries against sharda fail fast
  - indexing in primary and shardb continues
  - queries against sharda work after it comes back up
  - indexing in sharda resumes when sharda comes back up
- sharda goes down during operation
  - queries against primary and shardb work
  - queries against sharda fail fast
  - indexing in primary and shardb continues
  - queries against sharda work after it comes back up
  - indexing in sharda resumes when sharda comes back up
- primary is down on startup
  - queries against sharda and shardb work
  - queries against primary fail fast
  - queries against primary work after it comes back up
  - indexing resumes when primary comes back up
- primary goes down during operation
  - queries against sharda and shardb work
  - queries against primary fail fast
  - queries against primary work after it comes back up
  - indexing resumes when primary comes back up