Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Add metrics to track how the rate limiter is affecting requests (sleep/reject) #13534

Merged
merged 6 commits into from
Aug 17, 2022
Merged
Show file tree
Hide file tree
Changes from 3 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions changelog.d/13534.misc
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
Add metrics to track how the rate limiter is affecting requests (sleep/reject).
15 changes: 13 additions & 2 deletions synapse/util/ratelimitutils.py
Original file line number Diff line number Diff line change
Expand Up @@ -18,6 +18,8 @@
import typing
from typing import Any, DefaultDict, Iterator, List, Set

from prometheus_client.core import Counter

from twisted.internet import defer

from synapse.api.errors import LimitExceededError
Expand All @@ -35,6 +37,11 @@
logger = logging.getLogger(__name__)


rate_limit_sleep_counter = Counter("synapse_rate_limit_sleep", "", ["host"])

rate_limit_reject_counter = Counter("synapse_rate_limit_reject", "", ["host"])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd vote for not adding "host" here, given that is an unbounded variable. If we add some logging for when we're sleeping hosts then the metric would allow us to see that we've triggered it and the logging which hosts are affected?

Copy link
Contributor Author

@MadLittleMods MadLittleMods Aug 16, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense 👍. Was focused about what was in the doc, metrics per-host.

Currently, the logs are mostly debug here so we'll have to turn them on when we want to see.


Affected host count per time-period might be interesting to see so we can differentiate one really noisy homeserver from a general ratelimit tuning problem across the federation. I guess we would add a gauge for this. I can follow-up in another PR for this -> #13541



class FederationRateLimiter:
def __init__(self, clock: Clock, config: FederationRatelimitSettings):
def new_limiter() -> "_PerHostRatelimiter":
Expand All @@ -59,7 +66,7 @@ def ratelimit(self, host: str) -> "_GeneratorContextManager[defer.Deferred[None]
Returns:
context manager which returns a deferred.
"""
return self.ratelimiters[host].ratelimit()
return self.ratelimiters[host].ratelimit(host)
Copy link
Contributor Author

@MadLittleMods MadLittleMods Aug 16, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better way to pass the host string to the _PerHostRateLimiter?

There doesn't seem to be a good way to make DefaultDict pass the string key host on, https://stackoverflow.com/questions/2912231/is-there-a-clever-way-to-pass-the-key-to-defaultdicts-default-factory



class _PerHostRatelimiter:
Expand Down Expand Up @@ -94,12 +101,14 @@ def __init__(self, clock: Clock, config: FederationRatelimitSettings):
self.request_times: List[int] = []

@contextlib.contextmanager
def ratelimit(self) -> "Iterator[defer.Deferred[None]]":
def ratelimit(self, host: str) -> "Iterator[defer.Deferred[None]]":
# `contextlib.contextmanager` takes a generator and turns it into a
# context manager. The generator should only yield once with a value
# to be returned by manager.
# Exceptions will be reraised at the yield.

self.host = host

request_id = object()
ret = self._on_enter(request_id)
try:
Expand All @@ -119,6 +128,7 @@ def _on_enter(self, request_id: object) -> "defer.Deferred[None]":
# sleeping or in the ready queue).
queue_size = len(self.ready_request_queue) + len(self.sleeping_requests)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be good to track queue_size?

if queue_size > self.reject_limit:
rate_limit_reject_counter.labels(self.host).inc()
raise LimitExceededError(
retry_after_ms=int(self.window_size / self.sleep_limit)
)
Expand Down Expand Up @@ -146,6 +156,7 @@ def queue_request() -> "defer.Deferred[None]":

if len(self.request_times) > self.sleep_limit:
logger.debug("Ratelimiter: sleeping request for %f sec", self.sleep_sec)
rate_limit_sleep_counter.labels(self.host).inc()
ret_defer = run_in_background(self.clock.sleep, self.sleep_sec)

self.sleeping_requests.add(request_id)
Expand Down