Object Store: S3 IP address selection is biased #7117
Comments
Hi, I suggest using crates like reqwest-hickory-resolver (developed by me 😆). I've added shuffle support, but it hasn't been released yet. Feel free to share your feedback here: GitHub Commit. I believe this approach can help avoid adding more flags to |
@Xuanwo |
At the time when reqwest-hickory-resolver was being developed, reqwest was still using the old |
I think that I'll file a PR for that. |
Is there any way for downstream users of object_store to opt-in to hickory? |
I believe if you add a dependency on Although I guess following #7123 you will need to disable the random_address feature |
Ah, closer reading of the reqwest docstring the second time proves you're correct:
Ah good to know, esp because |
Problem Description
This is specific to AWS S3. Note that S3 only supports HTTP/1.1, so no connection multiplexing will happen. This means that two concurrent requests will use different TCP+TLS connections.
If you issue two or more requests to S3 at the same time (to the same region + bucket), all of them will use the same S3 IP address, even though S3 advertises multiple addresses in the DNS response (see DNS analysis below). This happens even when the requests are issued from different ObjectStore instances (see the resolver analysis for why). This behavior was confirmed by analyzing network traffic with Wireshark. This is bad for the following reasons:
Performance
It is way more likely that you overload a single S3 server.
Latency Racing (= Racing Reads)
In theory an object_store user could race two requests (esp. GET requests) to the same object hoping that one of them will be faster. There's evidence that this works.
Note that this trades cost (via number of requests) for improved tail latency. However, if all racing requests connect to the same S3 server, this is way less likely to work.
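As a rough illustration of the racing idea, here is a minimal sketch using only the standard library. The race helper is hypothetical and not part of object_store; each task stands in for one GET request, and the sketch assumes the caller is fine with the slower request running to completion in the background.

```rust
use std::sync::mpsc;
use std::thread;

/// Runs every task on its own thread and returns the first result to arrive.
/// Slower tasks keep running in the background; their results are discarded.
/// (Hypothetical helper for illustration, not an object_store API.)
fn race<T, F>(tasks: Vec<F>) -> Option<T>
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    for task in tasks {
        let tx = tx.clone();
        thread::spawn(move || {
            // The receiver may already be gone if another task won; ignore that.
            let _ = tx.send(task());
        });
    }
    // Drop the original sender so `recv` returns an error instead of blocking
    // forever when `tasks` is empty.
    drop(tx);
    rx.recv().ok()
}
```

Racing like this only improves tail latency if the two requests can actually take different paths. If both land on the same S3 server, they tend to be slow together.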
Fault Tolerance
Since an S3 server might be down, concentrating all requests on one server may amplify this issue.
Persistence
Since the HTTP/1.1 connections are kept alive (mostly until the AWS side terminates them), this server pinning can persist long after the first requests are made.
Technical Analysis
To understand why this is happening, we need to look at different parts of the stack.
DNS
Resolving the S3 IP looks like this on the DNS layer (captured using Wireshark)
i.e. that's 8 different IPs with a 5s TTL.
If we ask again later, we'll get a slightly different response:
I've searched through the DNS-related RFCs but couldn't find any guidance on whether the order is significant. The internet (1, 2, 3) suggests that most implementations use the IPs in order (moving on to the next one after a timeout), but that the standard actually makes NO claim on that front.
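To check locally which order the system resolver hands back, a small sketch like the following can help. It uses the standard library's ToSocketAddrs, which goes through the system resolver; resolve_all is a hypothetical helper name, and the system resolver may apply its own caching or sorting on top of what the DNS server returned.

```rust
use std::io;
use std::net::{SocketAddr, ToSocketAddrs};

/// Resolves `host` through the standard library (i.e. the system resolver)
/// and returns the addresses in exactly the order they were handed back.
/// (Hypothetical helper for illustration.)
fn resolve_all(host: &str, port: u16) -> io::Result<Vec<SocketAddr>> {
    Ok((host, port).to_socket_addrs()?.collect())
}
```

Calling resolve_all with an S3 endpoint host name (for example "s3.us-east-1.amazonaws.com", used here purely as an assumption) and port 443 a few times in a row, and printing the result, should show whether the first address changes between calls.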
Resolver
reqwest, which is the high-level HTTP client library that object_store uses, has a high-level interface called Resolve which resolves one host name to multiple IP addresses.
By default reqwest uses getaddrinfo (see 1, 2, 3), i.e. the system resolver. That one will very likely cache resolution based on the 5s TTL (see above). In fact I can see that behavior using Wireshark.
Address Usage
Now how are these multiple addresses used: If you search through the code, you'll eventually get here and see that hyper-util (used by reqwest for the wiring of low-level components) will try to connect to the IP addresses in order and will only continue if the connection cannot be established or a timeout occurs. So in the happy path this will always connect to the first address.
Solutions
I think we should keep using reqwest since in general it serves us well. So a natural way to change the current behavior would be to use the aforementioned Resolve interface. I see two general options, both as extensions to ClientOptions.
A: Expose Resolve
Add a way for users to specify their own Resolve implementation.
Pros:
Cons:
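To illustrate the kind of policy option A would let users plug in, here is a synchronous, std-only sketch. The AddrResolver trait and the RoundRobin type are hypothetical stand-ins invented for this example. They are not reqwest's Resolve trait (which is asynchronous) and not an existing object_store API.

```rust
use std::net::SocketAddr;
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical stand-in for a user-supplied resolver hook.
/// This is NOT reqwest's `Resolve` trait, which is asynchronous.
trait AddrResolver {
    fn resolve(&self, host: &str) -> Vec<SocketAddr>;
}

/// Example policy: rotate a fixed address list by one position per call, so
/// that consecutive lookups put a different address first.
struct RoundRobin {
    addrs: Vec<SocketAddr>,
    next: AtomicUsize,
}

impl RoundRobin {
    fn new(addrs: Vec<SocketAddr>) -> Self {
        RoundRobin { addrs, next: AtomicUsize::new(0) }
    }
}

impl AddrResolver for RoundRobin {
    fn resolve(&self, _host: &str) -> Vec<SocketAddr> {
        if self.addrs.is_empty() {
            return Vec::new();
        }
        // Pick the starting offset for this call and advance the counter.
        let start = self.next.fetch_add(1, Ordering::Relaxed) % self.addrs.len();
        let mut out = self.addrs.clone();
        out.rotate_left(start);
        out
    }
}
```

Because the connector tries addresses in order, whatever the resolver puts first is what gets used in the happy path, so a policy like this directly controls which server each new connection goes to.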
B: Add randomize_addrs flag
Add a flag randomize_addrs. If it is set to true (by default?), then object_store will wrap the default resolver and shuffle the addresses before returning them to reqwest.
Pros:
Cons:
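A rough, std-only sketch of what option B amounts to. connect_in_order is a simplified model of the in-order behavior described under Address Usage (it is not hyper-util's code), and shuffle_addrs plus the tiny XorShift64 generator are hypothetical helpers. A real implementation would more likely use the rand crate and wrap a reqwest Resolve implementation.

```rust
use std::collections::hash_map::RandomState;
use std::hash::{BuildHasher, Hasher};
use std::net::SocketAddr;

/// Simplified model of in-order connecting: return the first address for
/// which `try_connect` succeeds. (Not hyper-util's actual code.)
fn connect_in_order<F>(addrs: &[SocketAddr], mut try_connect: F) -> Option<SocketAddr>
where
    F: FnMut(&SocketAddr) -> bool,
{
    addrs.iter().copied().find(|a| try_connect(a))
}

/// Small xorshift64 generator so the sketch needs no external crate.
/// Not cryptographically secure; only meant for spreading load.
struct XorShift64(u64);

impl XorShift64 {
    fn from_seed(seed: u64) -> Self {
        // xorshift must not start from an all-zero state.
        XorShift64(if seed == 0 { 0x9E37_79B9_7F4A_7C15 } else { seed })
    }

    fn from_entropy() -> Self {
        // `RandomState` is randomly seeded by the standard library, so the
        // hash of an empty input serves as a cheap unpredictable seed.
        Self::from_seed(RandomState::new().build_hasher().finish())
    }

    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

/// Fisher-Yates shuffle of the resolved addresses. The modulo introduces a
/// tiny bias, which is negligible for a handful of addresses.
fn shuffle_addrs(addrs: &mut [SocketAddr], rng: &mut XorShift64) {
    for i in (1..addrs.len()).rev() {
        let j = (rng.next_u64() % (i as u64 + 1)) as usize;
        addrs.swap(i, j);
    }
}
```

Without the shuffle, connect_in_order always returns the first resolved address as long as that server is healthy. With a generator from XorShift64::from_entropy() and a shuffle before each connection attempt, new connections spread across all advertised addresses.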