feat: Add rate limiting to the lotus gateway #8517
Conversation
Ran a couple of quick and dirty tests. Without the rate limiter, queries to get the genesis tipset respond quickly at about 6.7k qps on my test setup. With the rate limiter enabled, the performance actually improved a little bit... but it's near enough to consider the performance identical.
Codecov Report
```diff
@@            Coverage Diff             @@
##           master    #8517      +/-   ##
==========================================
- Coverage   40.95%   40.77%    -0.18%
==========================================
  Files         687      686        -1
  Lines       75809    75817        +8
==========================================
- Hits        31049    30918      -131
- Misses      39418    39538      +120
- Partials     5342     5361       +19
```
gateway/node.go (Outdated)
```go
	"github.com/filecoin-project/lotus/node/impl/full"
)

const (
	DefaultLookbackCap            = time.Hour * 24
	DefaultStateWaitLookbackLimit = abi.ChainEpoch(20)
	DefaultRateLimitTimeout       = time.Minute * 10
```
Does this mean a hanging request for 10 minutes?
We would override this.
There is a leaky-bucket rate limiter, so when there is a burst of traffic, requests are queued and handled at a fixed rate to avoid overloading the backend.
Without the timeout, requests would be allowed to queue indefinitely. The timeout sets the maximum amount of time we permit a request to wait; past that, the gateway responds with a server busy error.
It's worth noting that requests do not actually wait for this entire time. Because they are handled at a fixed rate during bursty periods, the system can predict whether the context timeout will expire before a request would be handled, and in that case the error is returned immediately (see the sketch after this list):

- If a request comes in and there are enough tokens available in the bucket, the request is handled right away.
- If a request comes in during high load and there are not enough tokens, the request queues until enough tokens are available, then it is processed.
- If there are not enough tokens available in the bucket, and there are so many requests enqueued that the context timeout will expire before the request would be handled, an error is returned right away rather than waiting for the timeout.
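This fail-fast behavior is what Go's standard `golang.org/x/time/rate` limiter provides: `Wait` returns an error immediately when the context deadline would expire before a token becomes available. A minimal sketch of that behavior; the rates, burst size, and timeout are made-up illustration values, not this PR's defaults:

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"

	"golang.org/x/time/rate"
)

func main() {
	// Illustrative numbers: refill 10 tokens/sec, bucket holds a burst of 5.
	limiter := rate.NewLimiter(rate.Limit(10), 5)

	handle := func(i int) {
		// Stand-in for DefaultRateLimitTimeout; real values need tuning.
		ctx, cancel := context.WithTimeout(context.Background(), 200*time.Millisecond)
		defer cancel()

		// Wait queues for a token, but returns an error immediately if it
		// can already tell the deadline will expire before a token frees up.
		if err := limiter.Wait(ctx); err != nil {
			fmt.Printf("request %d: server busy (%v)\n", i, err)
			return
		}
		fmt.Printf("request %d: handled\n", i)
	}

	// Simulate a burst: the first few are served from the bucket, some queue
	// briefly, and the rest are rejected immediately without waiting.
	var wg sync.WaitGroup
	for i := 0; i < 20; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			handle(i)
		}(i)
	}
	wg.Wait()
}
```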
With that explanation... 10 minutes is unreasonable as a real timeout. A more reasonable timeout is probably 1-5 seconds. I didn't want the rate limiter to be enabled by default, but I need the context to be set to something. I can change this to 1 second if that makes more sense, but it's definitely something you'd have to tune to your situation.
I'd set this to some reasonably high value; 5s is probably good enough
Just a note: this appears to be a global rate limit. So while it will protect our backend, it does mean that the system can still be DoS'd very easily by simply making a request to a cheap method like Filecoin.Version.
This is true. This will protect the backend from being overworked so it can continue processing messages under heavy load, but it does not prevent DoS attacks. There is still some benefit to load shedding at the API layer, even if only to keep the backend from becoming overwhelmed. We can add some additional limits for each connection, which would partially mitigate the problem; at least then a single individual cannot DoS the service by issuing Filecoin.Version as you suggest. Is that what you are thinking @travisperson? DDoS mitigation would likely have to be handled with scalability. As you suggest, this rate limiter would not help in that case except to keep the backend working.
I think it's just something we need to understand and determine whether it's acceptable. If one user exhausting all the tokens is unacceptable, then we need a better solution to limit per connection / IP.
I'm moving this back to draft due to concerns about using only global rate limits. I'll implement per-connection rate limiting and produce some performance benchmarks before I put this back out for review.
@travisperson can you give this a try? There are two new rate limiters in addition to the global one. One limits the number of HTTP connections you're allowed to open in a single minute. The other limits RPC calls over an established websocket (a sketch of the per-connection idea is below).
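For anyone curious what limiting RPC calls over an established websocket can look like, here's a minimal sketch of a per-connection limiter. The type, names, and numbers are mine for illustration, not necessarily what this PR does:

```go
package main

import (
	"context"
	"fmt"

	"golang.org/x/time/rate"
)

// conn carries its own token bucket, so one chatty websocket client can't
// drain the shared budget for everyone else.
type conn struct {
	limiter *rate.Limiter
}

func newConn() *conn {
	// Illustrative budget: 50 calls/sec with bursts up to 100 per socket.
	return &conn{limiter: rate.NewLimiter(rate.Limit(50), 100)}
}

// handleRPC gates every call on the connection's bucket before dispatch.
func (c *conn) handleRPC(ctx context.Context, call func() error) error {
	if err := c.limiter.Wait(ctx); err != nil {
		return fmt.Errorf("connection rate limit exceeded: %w", err)
	}
	return call()
}

func main() {
	c := newConn()
	err := c.handleRPC(context.Background(), func() error {
		fmt.Println("dispatching RPC")
		return nil
	})
	fmt.Println("err:", err)
}
```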
Just a high-level review pass, generally looks good.
Conflicts need resolving.
gateway/node.go
```go
}

	h.mu.Lock()
	seen, ok := h.ipmap[host]
```
We should make this not abusable with IPv6.
This works the same way on IPv4 and IPv6; the map is keyed simply by a string, which will be something like "1.2.3.4" for IPv4 or "1111:2222:3333:4444:5555:6666:7777:8888" for IPv6.
This is the third kind of rate limiting on this PR, and I could be convinced that this should be done differently.
The global rate limiter and per-connection limiter protect the backend. The former persists for the lifetime of the process and the latter for the lifetime of a single connection.
The connection rate limiter, however, is intended to prevent abuse from scripts opening several connections in quick succession: the ipmap grows when there are new connections and shrinks again some time later (sketch below).
Alternatively, we could do this as a maximum number of simultaneous connections rather than "connections per minute".
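A minimal sketch of that "connections per minute" bookkeeping, assuming timestamps are tracked per host string. The field names echo the diff above, but the implementation details are guesses, not the PR's exact code:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type connLimiter struct {
	mu                sync.Mutex
	ipmap             map[string][]time.Time // recent connection times, keyed by host
	maxConnsPerMinute int
}

// allow records a new connection attempt from host and reports whether it
// is within the per-minute budget. Stale entries are pruned on each call,
// which is how the map shrinks again after a burst.
func (h *connLimiter) allow(host string) bool {
	h.mu.Lock()
	defer h.mu.Unlock()

	cutoff := time.Now().Add(-time.Minute)
	recent := h.ipmap[host][:0]
	for _, t := range h.ipmap[host] {
		if t.After(cutoff) {
			recent = append(recent, t)
		}
	}
	if len(recent) >= h.maxConnsPerMinute {
		h.ipmap[host] = recent
		return false
	}
	h.ipmap[host] = append(recent, time.Now())
	return true
}

func main() {
	h := &connLimiter{ipmap: map[string][]time.Time{}, maxConnsPerMinute: 2}
	fmt.Println(h.allow("1.2.3.4"), h.allow("1.2.3.4"), h.allow("1.2.3.4")) // true true false
}
```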
My main worry is that with IPv6 you get a /64 subnet, which is a lot of IPs that are tracked separately; but yes, the other limits should help here.
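One way to blunt that (a suggestion, not something this PR is shown to do) is to collapse IPv6 addresses to their /64 prefix before using them as limiter keys, so a client rotating through its subnet still maps to a single bucket. A hypothetical helper:

```go
package main

import (
	"fmt"
	"net"
)

// rateLimitKey normalizes IPv4 addresses and collapses IPv6 addresses to
// their /64 prefix before they are used as limiter-map keys.
func rateLimitKey(host string) string {
	ip := net.ParseIP(host)
	if ip == nil {
		return host // not a bare IP; key as-is
	}
	if v4 := ip.To4(); v4 != nil {
		return v4.String()
	}
	return ip.Mask(net.CIDRMask(64, 128)).String() + "/64"
}

func main() {
	fmt.Println(rateLimitKey("1.2.3.4"))
	// Both of these land in the same bucket:
	fmt.Println(rateLimitKey("1111:2222:3333:4444:5555:6666:7777:8888"))
	fmt.Println(rateLimitKey("1111:2222:3333:4444::1"))
}
```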
Add rate limiting to the lotus gateway.
This allows us to limit the number of requests the lotus gateway will serve.
Each API call can have its own cost depending on how much load it is likely to induce. API calls are permitted to go through as long as there are enough tokens available in the bucket before the timeout expires.
If there are insufficient tokens, an error is returned and the API call is not performed.
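As a sketch of the per-call-cost idea: `golang.org/x/time/rate`'s `WaitN` draws several tokens at once and gives the fail-fast behavior described above. The method names and costs below are illustrative, not the PR's actual table:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"golang.org/x/time/rate"
)

// Illustrative cost table: heavier state queries draw more tokens.
var apiCosts = map[string]int{
	"ChainHead":       1,
	"StateListActors": 10,
}

// waitForTokens reserves the call's cost from the shared bucket, or fails
// fast when the context deadline would expire before the tokens free up.
func waitForTokens(ctx context.Context, l *rate.Limiter, method string) error {
	cost, ok := apiCosts[method]
	if !ok {
		cost = 1 // default cost for unlisted methods
	}
	if err := l.WaitN(ctx, cost); err != nil {
		return fmt.Errorf("server busy: %w", err)
	}
	return nil
}

func main() {
	l := rate.NewLimiter(rate.Limit(100), 20)
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	fmt.Println(waitForTokens(ctx, l, "StateListActors")) // <nil>
}
```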