blacklist servers which are down for several days #5113
Comments
See https://gist.github.com/Sharparam/b144e294189d78ee6c73df0e109ee2af for one week of data showing the number of times per day that @Sharparam's server contacted a bunch of dead servers. See convo starting from https://matrix.to/#/!HsxjoYRFsDtWBgDQPh:matrix.org/$155655145216EFGaa:matrix.sharparam.com?via=matrix.org&via=chat.weho.st&via=hackerspaces.be |
Note that we already have an exponential-backoff algorithm, but it tops out at a 24hr retry period. After the first failed request, we back off for 10 minutes. We then increase the backoff by a factor of between 4 and 7 after each failed request, until we get to 24 hours. The retry intervals are therefore:
An easy way to implement this would be to go to 999 years after 24 hours. |
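To make the backoff description above concrete, here is a minimal sketch of the bounds on the per-host retry intervals. It assumes only what is stated in this thread (first backoff of 10 minutes, a multiplier between 4 and 7, a cap of 24 hours). The function name `backoff_schedule` and the constants are hypothetical and are not taken from Synapse's code. Since the real multiplier varies per failure, actual intervals should fall between the two extremes.

```python
# Sketch only: names and constants are illustrative assumptions, not Synapse's code.
MIN_INTERVAL_MIN = 10        # first backoff after a failed request (from the thread)
MAX_INTERVAL_MIN = 24 * 60   # the backoff tops out at 24 hours (from the thread)


def backoff_schedule(multiplier):
    """Return successive retry intervals in minutes for a fixed multiplier."""
    intervals = []
    interval = MIN_INTERVAL_MIN
    while interval < MAX_INTERVAL_MIN:
        intervals.append(interval)
        interval = interval * multiplier
    # Once the cap is reached, every later retry uses the same 24-hour interval.
    intervals.append(MAX_INTERVAL_MIN)
    return intervals


if __name__ == "__main__":
    for multiplier in (4, 7):
        print(multiplier, backoff_schedule(multiplier))
```

Under these assumptions, a constant multiplier of 4 gives 10, 40, 160 and 640 minutes before reaching 24 hours, and a constant multiplier of 7 gives 10, 70 and 490 minutes before reaching 24 hours. Either way a dead server is still contacted about once a day indefinitely.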
I should also mention that each request is retried several times (normally 10), with its own exponential-backoff loop. The per-host exponential backoff is only increased after a request fails completely. |
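The two levels of retrying described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the actual implementation: `HostBackoff`, `send_with_retries` and `send_once` are hypothetical names, the fixed multiplier of 5 stands in for the 4 to 7 range, and the sleep between per-request attempts is omitted.

```python
# Sketch only: all names here are hypothetical, not Synapse's API.
class HostBackoff:
    """Per-host backoff state; interval_s == 0 means the host is not backed off."""

    def __init__(self, min_s=600, max_s=86400, multiplier=5):
        self.min_s = min_s            # 10 minutes
        self.max_s = max_s            # 24 hours
        self.multiplier = multiplier  # assumed fixed; the thread says 4 to 7
        self.interval_s = 0

    def record_failure(self):
        if self.interval_s == 0:
            self.interval_s = self.min_s
        else:
            self.interval_s = min(self.interval_s * self.multiplier, self.max_s)

    def record_success(self):
        self.interval_s = 0


def send_with_retries(send_once, backoff, max_attempts=10):
    """Try one request up to max_attempts times.

    The per-host backoff is only increased when every attempt has failed.
    """
    for _ in range(max_attempts):
        try:
            result = send_once()
        except ConnectionError:
            # A real loop would wait here using its own, shorter backoff.
            continue
        backoff.record_success()
        return result
    backoff.record_failure()
    return None
```

The point of the split is that a single dropped connection does not penalise the host: only a request that fails all of its attempts moves the per-host interval up.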
The worst backoff time of a server should be calculated to know how long Also, if a server is down for one week or so, this shouldn't result in a blacklist if it is not easily reversible. An observation: To do that, I created a |
@mguentner What is your server hostname so I can check the logs more closely for errors related to it? |
@Sharparam: My server (roeckx.be) seems to have high numbers in that file, but the server has always been up. |
@kroeckx Your server is responding with Edit: The latest entry from the dumped logs:
|
Looking in my logs, I see lots of:
2019-05-01 21:48:21,978 - synapse.access.https.8448 - 233 - INFO - PUT-4120380- 80.245.199.234 - 8448 - Received request: PUT /_matrix/federation/v1/send/1554208235995
2019-05-01 21:48:21,979 - synapse.http.server - 85 - INFO - PUT-4120380- <SynapseRequest at 0x7f4ad26864a8 method='PUT' uri='/_matrix/federation/v1/send/1554208235995' clientproto='HTTP/1.1' site=8448> SynapseError: 400 - Unrecognized request
2019-05-01 21:48:21,980 - synapse.access.https.8448 - 302 - INFO - PUT-4120380- 80.245.199.234 - 8448 - {None} Processed request: 0.002sec/0.001sec (0.004sec, 0.000sec) (0.000sec/0.000sec/0) 59B 400 "PUT /_matrix/federation/v1/send/1554208235995 HTTP/1.1" "Synapse/0.99.3" [0 dbevts]
Why is it generating that error, and why doesn't the federation tester report any error?
|
For the record, some discussion related to this in #synapse:matrix.org starting from this message. Edit: Looks like you might need to update your Synapse instance @kroeckx. You are on 0.99.2 and the latest Synapse is 0.99.3, apparently something regarding federation was changed. |
@Sharparam Aha! My instance was still on 0.99.2 as well but now runs on 0.99.3 |
@mguentner Yeah, my analysis only looks at the errors logged by Synapse, and that doesn't take into account the special case where a failed call is immediately followed by a successful one. The blacklisting code would have to take this into account somehow. (This might actually resolve itself if servers are removed from the blacklist as soon as a successful call is made or received.) |
the current backoff code ignores 400s, ftr. |
I'm not sure if that's a solution though, there could be 400 errors that are not resolved by sending again with a slash. |
Seems like the things that cause a backoff have not been updated in several years https://github.com/matrix-org/synapse/blob/develop/synapse/util/retryutils.py#L177 I just pulled my last 4 hours of logs from 1.0.0rc1: 33% of the WARNING lines are and a few It doesn't seem like these back off, even though I believe most of them should. |
@aaronraimist: what makes you think that those do not cause backoff? As far as I know they all should (most of them do not derive from |
Essentially the intention here is to end up blacklisting servers which never respond to federation requests. Fixes #5113.
Did this happen? |
yes, valid requests received will reset the backoff. |
Why don't we add servers which fail to respond to federation requests for several days to a blacklist, and stop trying?
This could significantly reduce network traffic, CPU usage, and the amount of cruft that gets logged.
We would need to unblock such servers when we receive a valid request from them.
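As a rough sketch of the proposal, assuming the blacklist is driven by how long a server has been failing continuously: `DeadServerTracker` and its methods are hypothetical names, the threshold is left as a parameter because "several days" is not pinned down here, and timestamps are passed in explicitly to keep the example easy to test.

```python
# Sketch only: a hypothetical tracker, not how Synapse stores retry state.
class DeadServerTracker:
    def __init__(self, threshold_s):
        self.threshold_s = threshold_s
        # host -> timestamp of the first failure in the current failing streak
        self.failing_since = {}

    def record_failure(self, host, now):
        # Keep the earliest failure time so the streak length can be measured.
        self.failing_since.setdefault(host, now)

    def record_contact(self, host):
        # A successful outbound call or a valid inbound request unblocks the host.
        self.failing_since.pop(host, None)

    def is_blacklisted(self, host, now):
        since = self.failing_since.get(host)
        return since is not None and now - since >= self.threshold_s


if __name__ == "__main__":
    day = 86400
    tracker = DeadServerTracker(threshold_s=3 * day)
    tracker.record_failure("dead.example", now=0)
    print(tracker.is_blacklisted("dead.example", now=4 * day))
    tracker.record_contact("dead.example")
    print(tracker.is_blacklisted("dead.example", now=4 * day))
```

Clearing the entry on any valid contact keeps the blacklist easily reversible, which addresses the concern about servers that are only down for a week or so.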