
feat: Add rate limiting to the lotus gateway #8517

Merged: 17 commits into master, Jun 10, 2022

Conversation

coryschwartz

Add rate limiting to the lotus gateway.

This lets us limit the number of requests the lotus gateway will accept.

Each API call can have its own cost, depending on how much load it is likely to induce. An API call is permitted to go through as long as enough tokens become available in the bucket before the timeout expires.

If there are insufficient tokens, an error is returned and the API call is not performed.
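
For illustration, a minimal sketch of that flow using Go's golang.org/x/time/rate token bucket; the method costs, rate, burst, and timeout below are made-up values, not the ones used in this PR:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"golang.org/x/time/rate"
)

// Hypothetical per-call costs: heavier calls draw more tokens from the bucket.
var apiCosts = map[string]int{
	"Filecoin.Version":        1,
	"Filecoin.ChainGetTipSet": 3,
}

func handleCall(limiter *rate.Limiter, timeout time.Duration, method string) error {
	cost, ok := apiCosts[method]
	if !ok {
		cost = 1
	}
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	// WaitN blocks until `cost` tokens are available, or fails immediately if
	// the context deadline would expire before they could be.
	if err := limiter.WaitN(ctx, cost); err != nil {
		return fmt.Errorf("server busy: %w", err)
	}
	// ... perform the actual API call here ...
	return nil
}

func main() {
	limiter := rate.NewLimiter(100, 200) // 100 tokens/s, burst of 200
	fmt.Println(handleCall(limiter, 5*time.Second, "Filecoin.Version"))
}
```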

@coryschwartz (Author)

Ran a couple of quick and dirty tests.

Without the rate limiter, queries to get the genesis tipset respond quickly, at about 6.7k qps on my test setup.

With the rate limiter in place but set to --rate-limit 0, there was no impact on performance.

Actually, performance improved a little bit, but it's close enough to consider it identical.


codecov bot commented Apr 21, 2022

Codecov Report

Merging #8517 (e679083) into master (effee8c) will decrease coverage by 0.17%.
The diff coverage is 14.03%.

❗ Current head e679083 differs from pull request most recent head b305483. Consider uploading reports for the commit b305483 to get more accurate results

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #8517      +/-   ##
==========================================
- Coverage   40.95%   40.77%   -0.18%     
==========================================
  Files         687      686       -1     
  Lines       75809    75817       +8     
==========================================
- Hits        31049    30918     -131     
- Misses      39418    39538     +120     
- Partials     5342     5361      +19     
Impacted Files Coverage Δ
cmd/lotus-gateway/main.go 0.00% <0.00%> (ø)
metrics/metrics.go 100.00% <ø> (ø)
gateway/node.go 34.72% <14.95%> (-10.42%) ⬇️
node/modules/lp2p/rcmgr.go 4.51% <0.00%> (-25.57%) ⬇️
itests/kit/blockminer.go 70.65% <0.00%> (-19.03%) ⬇️
extern/sector-storage/worker_tracked.go 80.50% <0.00%> (-6.78%) ⬇️
miner/miner.go 54.75% <0.00%> (-4.27%) ⬇️
chain/stmgr/call.go 69.18% <0.00%> (-3.25%) ⬇️
markets/loggers/loggers.go 89.28% <0.00%> (-1.90%) ⬇️
node/impl/full/mpool.go 46.95% <0.00%> (-1.74%) ⬇️
... and 34 more

@coryschwartz coryschwartz marked this pull request as ready for review April 21, 2022 04:10
@coryschwartz coryschwartz requested a review from a team as a code owner April 21, 2022 04:10
gateway/node.go Outdated
"github.com/filecoin-project/lotus/node/impl/full"
)

const (
DefaultLookbackCap = time.Hour * 24
DefaultStateWaitLookbackLimit = abi.ChainEpoch(20)
DefaultRateLimitTimeout = time.Minute * 10
Contributor

Does this mean a hanging request for 10 minutes?

@coryschwartz (Author) Apr 25, 2022

We would override this.

This is a leaky-bucket rate limiter, so when there is a burst of traffic, requests are queued and handled at a fixed rate to avoid overloading the backend.

Without the timeout, requests would be allowed to queue indefinitely. This timeout sets the maximum amount of time we permit a request to wait, after which the gateway responds with a "server busy" error.

It's worth noting that requests do not actually wait for this entire time. Because they are handled at a fixed rate during bursty periods, the system can predict whether the context timeout will expire before a request is handled, and in that case the error is returned immediately:

  • If a request comes in and there are enough tokens available in the bucket, the request is handled right away.

  • If a request comes in during high load and there are not enough tokens, the request is queued until enough tokens are available, and then processed.

  • If there are not enough tokens available in the bucket, and so many requests are already enqueued that the context timeout will expire before the request would be handled, an error is returned right away. Don't wait for the timeout.

With that explanation, 10 minutes is unreasonable for a real timeout; a more reasonable value is probably 1-5 seconds. By default I didn't want the rate limiter to be enabled, but I needed the context to be set to something. I can change this to 1 second if that makes more sense, but it's definitely something you'd have to tune to your situation.
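
To make the fail-fast behaviour concrete, here is a hedged sketch of that prediction step using x/time/rate reservations (WaitN performs an equivalent check internally; the limits and error messages are illustrative):

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"

	"golang.org/x/time/rate"
)

// acquire decides up front whether `cost` tokens can be obtained before the
// context deadline; if not, it rejects immediately instead of letting the
// request hang for the full timeout.
func acquire(ctx context.Context, limiter *rate.Limiter, cost int) error {
	deadline, hasDeadline := ctx.Deadline()
	if !hasDeadline {
		return limiter.WaitN(ctx, cost)
	}
	res := limiter.ReserveN(time.Now(), cost)
	if !res.OK() {
		return errors.New("server busy: cost exceeds burst size")
	}
	delay := res.Delay()
	if time.Now().Add(delay).After(deadline) {
		res.Cancel() // give the tokens back; we won't be using them
		return errors.New("server busy: queue longer than request timeout")
	}
	select { // otherwise wait our turn in the queue, then proceed
	case <-time.After(delay):
		return nil
	case <-ctx.Done():
		res.Cancel()
		return ctx.Err()
	}
}

func main() {
	limiter := rate.NewLimiter(10, 10) // 10 tokens/s, burst of 10
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	fmt.Println(acquire(ctx, limiter, 3))
}
```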

Contributor

I'd set this to some reasonably high value; 5s is probably good enough

@travisperson (Contributor)

Just a note: this appears to be a global rate limit. So while it will protect our backend, it does mean that the system can still be DoS'd very easily by simply making requests to Filecoin.Version, each of which eats 1 token. This can easily exhaust the system of tokens for all other users.

@coryschwartz (Author)

> Just a note: this appears to be a global rate limit. So while it will protect our backend, it does mean that the system can still be DoS'd very easily by simply making requests to Filecoin.Version, each of which eats 1 token. This can easily exhaust the system of tokens for all other users.

This is true. It will protect the backend from being overworked so it can continue processing messages under heavy load, but it does not prevent DoS attacks.

There is still some benefit to load shedding at the API layer, even if only to keep the backend from becoming overwhelmed.

We can add some additional limits for each connection, which would partially mitigate the problem. At least then a single individual cannot DoS the service by issuing Filecoin.Version as you suggest. Is that what you are thinking, @travisperson?

DDoS mitigation would likely have to be handled with scalability. As you suggest, this rate limiter would not help in that case, except to keep the backend working.

@travisperson (Contributor)

I think it's just something we need to understand and determine whether it's acceptable. If one user exhausting all tokens is unacceptable, then we need a better solution that limits per connection/IP.

@coryschwartz (Author)

I'm moving this back to draft due to concerns about using only global rate limits.

I'll implement per-connection rate limiting and produce some performance benchmarks before I put this back out for review.

@coryschwartz coryschwartz marked this pull request as draft May 9, 2022 14:26
@coryschwartz (Author)

@travisperson can you give this a try?

There are two new rate limiters in addition to the global one.

One limits the number of HTTP connections a client is allowed to open in a single minute.

The other limits RPC calls made over an established websocket.
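
Roughly, the per-connection RPC limiter can be thought of as a second bucket layered on top of the global one; this sketch uses hypothetical names and numbers rather than the PR's actual types:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"golang.org/x/time/rate"
)

// connLimits bundles the limiters that apply to a single websocket connection:
// the shared process-wide limiter plus one created when the connection opens.
type connLimits struct {
	global *rate.Limiter // shared by all connections
	conn   *rate.Limiter // lives and dies with this connection
}

// allow charges one RPC call against both buckets; either can reject it.
// (A real implementation might reserve from both before committing.)
func (c *connLimits) allow(ctx context.Context, cost int) error {
	if err := c.conn.WaitN(ctx, cost); err != nil {
		return err
	}
	return c.global.WaitN(ctx, cost)
}

func main() {
	global := rate.NewLimiter(1000, 2000) // shared across the whole gateway
	cl := &connLimits{global: global, conn: rate.NewLimiter(50, 100)}
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	fmt.Println(cl.allow(ctx, 1))
}
```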

@coryschwartz coryschwartz marked this pull request as ready for review May 20, 2022 10:45
@magik6k magik6k changed the title Add rate limiting to the lotus gateway feat: Add rate limiting to the lotus gateway May 31, 2022
@magik6k (Contributor) left a comment

Just a high-level review pass, generally looks good.

Conflicts need resolving.

}

h.mu.Lock()
seen, ok := h.ipmap[host]
Contributor

We should make this not abusable with IPv6.

Author

This works the same way on IPv4 and IPv6; the map is keyed simply by a string, which will be something like "1.2.3.4" for IPv4 or "1111:2222:3333:4444:5555:6666:7777:8888" for IPv6.

This is the third kind of rate limiting in this PR, and I could be convinced that it should be done differently.

The global rate limiter and per-connection limiter protect the backend. The former persists for the lifetime of the process and the latter for the lifetime of a single connection.

The connection rate limiter, however, is intended to prevent abuse from scripts opening several connections in quick succession. The ipmap grows as new connections arrive and shrinks again some time later.

Alternatively, we could do this as a maximum number of simultaneous connections rather than "connections per minute".
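
A hedged sketch of that connections-per-minute bookkeeping; the connTracker type, its fields, and the cleanup strategy here are illustrative simplifications, not the PR's actual code:

```go
package main

import (
	"net"
	"net/http"
	"sync"
	"time"
)

// connTracker counts new connections per remote host over a one-minute window.
type connTracker struct {
	mu     sync.Mutex
	ipmap  map[string]int // host string -> connections seen in the last minute
	perMin int            // new connections allowed per host per minute
}

func newConnTracker(perMin int) *connTracker {
	return &connTracker{ipmap: make(map[string]int), perMin: perMin}
}

// allow records a connection from remoteAddr and reports whether it is within
// budget. Each entry is decremented a minute later, so the map grows with new
// connections and shrinks back over time.
func (t *connTracker) allow(remoteAddr string) bool {
	host, _, err := net.SplitHostPort(remoteAddr)
	if err != nil {
		host = remoteAddr
	}
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.ipmap[host] >= t.perMin {
		return false
	}
	t.ipmap[host]++
	time.AfterFunc(time.Minute, func() {
		t.mu.Lock()
		defer t.mu.Unlock()
		if t.ipmap[host] <= 1 {
			delete(t.ipmap, host)
		} else {
			t.ipmap[host]--
		}
	})
	return true
}

func main() {
	tracker := newConnTracker(30) // e.g. 30 new connections per host per minute
	http.HandleFunc("/rpc/v0", func(w http.ResponseWriter, r *http.Request) {
		if !tracker.allow(r.RemoteAddr) {
			http.Error(w, "too many connections", http.StatusTooManyRequests)
			return
		}
		// ... upgrade to websocket / hand off to the JSON-RPC handler ...
	})
	// http.ListenAndServe would go here in a real server.
}
```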

Contributor

My main worry is that with IPv6 you get a /64 subnet, which is a lot of IPs that are tracked separately; but yes, the other limits should help here.

@magik6k magik6k enabled auto-merge June 10, 2022 11:01
@magik6k magik6k merged commit e899f73 into master Jun 10, 2022
@magik6k magik6k deleted the rate-limit-gateway branch June 10, 2022 11:18