Object Store: S3 IP address selection is biased #7117

Closed
crepererum opened this issue Feb 11, 2025 · 7 comments · Fixed by #7123
Labels
enhancement Any new improvement worthy of a entry in the changelog object-store Object Store Interface

Comments

@crepererum
Contributor

Problem Description

This is specific to AWS S3. Note that S3 only supports HTTP/1.1, so no connection multiplexing will happen. This means that two concurrent requests will use two different TCP+TLS connections.

If you issue two or more requests to S3 at the same time (to the same region + bucket), all of them will use the same S3 IP address, even though S3 advertises multiple addresses in the DNS response (see the DNS analysis below). This happens even when the requests are issued from different ObjectStore instances (see the resolver analysis for why). The behavior was confirmed via network traffic analysis with Wireshark. It is bad for the following reasons:

Performance

Concentrating all traffic on a single address makes it far more likely that you overload a single S3 server.

Latency Racing (= Racing Reads)

In theory, an object_store user could race two requests (especially GET requests) to the same object, hoping that one of them will be faster. There's evidence that this works.

Note that this trades cost (via the number of requests) for improved tail latency. However, if all racing requests connect to the same S3 server, this is far less likely to help.
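
For illustration, a racing read could look roughly like the sketch below (assuming an Arc<dyn ObjectStore> and the futures crate; racing_get is a hypothetical helper, not part of object_store's API):

```rust
use std::sync::Arc;

use futures::future::{select, Either};
use object_store::{path::Path, GetResult, ObjectStore};

/// Issue two GETs for the same object and keep whichever finishes first.
/// This trades an extra request for better tail latency -- but it only helps
/// if the two connections don't end up on the same S3 server.
async fn racing_get(store: Arc<dyn ObjectStore>, path: Path) -> object_store::Result<GetResult> {
    let a = Box::pin(store.get(&path));
    let b = Box::pin(store.get(&path));
    match select(a, b).await {
        // The losing future is dropped, which cancels the slower request.
        Either::Left((winner, _)) | Either::Right((winner, _)) => winner,
    }
}
```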

Fault Tolerance

Since any single S3 server might be down, concentrating all requests on one server amplifies the impact of such an outage.

Persistence

Since the HTTP/1.1 connections are kept alive (mostly until the AWS side terminates them), this server pinning can persist long after the first requests are made.

Technical Analysis

To understand why this is happening, we need to look at different parts of the stack.

DNS

Resolving the S3 IP looks like this on the DNS layer (captured using Wireshark):

Domain Name System (response)
    Transaction ID: 0x07d6
    Flags: 0x8180 Standard query response, No error
    Questions: 1
    Answer RRs: 8
    Authority RRs: 0
    Additional RRs: 1
    Queries
        s3.us-east-1.amazonaws.com: type A, class IN
            Name: s3.us-east-1.amazonaws.com
            [Name Length: 26]
            [Label Count: 4]
            Type: A (1) (Host Address)
            Class: IN (0x0001)
    Answers
        s3.us-east-1.amazonaws.com: type A, class IN, addr 16.182.97.32
            Name: s3.us-east-1.amazonaws.com
            Type: A (1) (Host Address)
            Class: IN (0x0001)
            Time to live: 5 (5 seconds)
            Data length: 4
            Address: 16.182.97.32
        s3.us-east-1.amazonaws.com: type A, class IN, addr 52.217.46.62
            Name: s3.us-east-1.amazonaws.com
            Type: A (1) (Host Address)
            Class: IN (0x0001)
            Time to live: 5 (5 seconds)
            Data length: 4
            Address: 52.217.46.62
        s3.us-east-1.amazonaws.com: type A, class IN, addr 52.217.4.118
            Name: s3.us-east-1.amazonaws.com
            Type: A (1) (Host Address)
            Class: IN (0x0001)
            Time to live: 5 (5 seconds)
            Data length: 4
            Address: 52.217.4.118
        s3.us-east-1.amazonaws.com: type A, class IN, addr 52.216.36.80
            Name: s3.us-east-1.amazonaws.com
            Type: A (1) (Host Address)
            Class: IN (0x0001)
            Time to live: 5 (5 seconds)
            Data length: 4
            Address: 52.216.36.80
        s3.us-east-1.amazonaws.com: type A, class IN, addr 52.216.38.224
            Name: s3.us-east-1.amazonaws.com
            Type: A (1) (Host Address)
            Class: IN (0x0001)
            Time to live: 5 (5 seconds)
            Data length: 4
            Address: 52.216.38.224
        s3.us-east-1.amazonaws.com: type A, class IN, addr 52.216.51.128
            Name: s3.us-east-1.amazonaws.com
            Type: A (1) (Host Address)
            Class: IN (0x0001)
            Time to live: 5 (5 seconds)
            Data length: 4
            Address: 52.216.51.128
        s3.us-east-1.amazonaws.com: type A, class IN, addr 3.5.1.11
            Name: s3.us-east-1.amazonaws.com
            Type: A (1) (Host Address)
            Class: IN (0x0001)
            Time to live: 5 (5 seconds)
            Data length: 4
            Address: 3.5.1.11
        s3.us-east-1.amazonaws.com: type A, class IN, addr 54.231.204.96
            Name: s3.us-east-1.amazonaws.com
            Type: A (1) (Host Address)
            Class: IN (0x0001)
            Time to live: 5 (5 seconds)
            Data length: 4
            Address: 54.231.204.96
    Additional records
        <Root>: type OPT
            Name: <Root>
            Type: OPT (41) 
            UDP payload size: 1232
            Higher bits in extended RCODE: 0x00
            EDNS0 version: 0
            Z: 0x0000
                0... .... .... .... = DO bit: Cannot handle DNSSEC security RRs
                .000 0000 0000 0000 = Reserved: 0x0000
            Data length: 285
            Option: PADDING
    [Request In: 1627]
    [Time: 0.072296785 seconds]

i.e. that's 8 different IPs with a 5s TTL.

If we ask again later, we'll get a slightly different response:

Domain Name System (response)
    Transaction ID: 0x817a
    Flags: 0x8180 Standard query response, No error
    Questions: 1
    Answer RRs: 8
    Authority RRs: 0
    Additional RRs: 1
    Queries
        s3.us-east-1.amazonaws.com: type A, class IN
            Name: s3.us-east-1.amazonaws.com
            [Name Length: 26]
            [Label Count: 4]
            Type: A (1) (Host Address)
            Class: IN (0x0001)
    Answers
        s3.us-east-1.amazonaws.com: type A, class IN, addr 16.182.102.192
            Name: s3.us-east-1.amazonaws.com
            Type: A (1) (Host Address)
            Class: IN (0x0001)
            Time to live: 5 (5 seconds)
            Data length: 4
            Address: 16.182.102.192
        s3.us-east-1.amazonaws.com: type A, class IN, addr 52.216.136.77
            Name: s3.us-east-1.amazonaws.com
            Type: A (1) (Host Address)
            Class: IN (0x0001)
            Time to live: 5 (5 seconds)
            Data length: 4
            Address: 52.216.136.77
        s3.us-east-1.amazonaws.com: type A, class IN, addr 52.217.204.8
            Name: s3.us-east-1.amazonaws.com
            Type: A (1) (Host Address)
            Class: IN (0x0001)
            Time to live: 5 (5 seconds)
            Data length: 4
            Address: 52.217.204.8
        s3.us-east-1.amazonaws.com: type A, class IN, addr 54.231.196.248
            Name: s3.us-east-1.amazonaws.com
            Type: A (1) (Host Address)
            Class: IN (0x0001)
            Time to live: 5 (5 seconds)
            Data length: 4
            Address: 54.231.196.248
        s3.us-east-1.amazonaws.com: type A, class IN, addr 54.231.236.208
            Name: s3.us-east-1.amazonaws.com
            Type: A (1) (Host Address)
            Class: IN (0x0001)
            Time to live: 5 (5 seconds)
            Data length: 4
            Address: 54.231.236.208
        s3.us-east-1.amazonaws.com: type A, class IN, addr 52.217.201.80
            Name: s3.us-east-1.amazonaws.com
            Type: A (1) (Host Address)
            Class: IN (0x0001)
            Time to live: 5 (5 seconds)
            Data length: 4
            Address: 52.217.201.80
        s3.us-east-1.amazonaws.com: type A, class IN, addr 3.5.31.42
            Name: s3.us-east-1.amazonaws.com
            Type: A (1) (Host Address)
            Class: IN (0x0001)
            Time to live: 5 (5 seconds)
            Data length: 4
            Address: 3.5.31.42
        s3.us-east-1.amazonaws.com: type A, class IN, addr 52.217.128.224
            Name: s3.us-east-1.amazonaws.com
            Type: A (1) (Host Address)
            Class: IN (0x0001)
            Time to live: 5 (5 seconds)
            Data length: 4
            Address: 52.217.128.224
    Additional records
        <Root>: type OPT
            Name: <Root>
            Type: OPT (41) 
            UDP payload size: 1232
            Higher bits in extended RCODE: 0x00
            EDNS0 version: 0
            Z: 0x0000
                0... .... .... .... = DO bit: Cannot handle DNSSEC security RRs
                .000 0000 0000 0000 = Reserved: 0x0000
            Data length: 285
            Option: PADDING
    [Request In: 67958]
    [Time: 0.140540958 seconds]

I've searched through the DNS-related RFCs but couldn't find any statement on whether the order of the answers is significant. The internet (1, 2, 3) suggests that most implementations use the IPs in order (falling back to the next one on timeout), but that the standard makes no claim on that front.
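
The multiple A records are easy to observe from Rust itself via the same system resolver (getaddrinfo) that reqwest uses by default; a minimal sketch:

```rust
use std::net::ToSocketAddrs;

fn main() -> std::io::Result<()> {
    // The port is required by `to_socket_addrs` but irrelevant for the lookup;
    // run this repeatedly to see the address set and ordering change.
    for addr in "s3.us-east-1.amazonaws.com:443".to_socket_addrs()? {
        println!("{addr}");
    }
    Ok(())
}
```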

Resolver

reqwest -- the high-level HTTP client library that object_store uses -- provides an interface called Resolve, which resolves one host name to multiple IP addresses.

By default, reqwest uses getaddrinfo (see 1, 2, 3), i.e. the system resolver. That one will very likely cache resolutions based on the 5s TTL (see above). In fact, I can see that behavior in Wireshark.

Address Usage

Now, how are these multiple addresses used? If you search through the code, you'll eventually get here and see that hyper-util (used by reqwest for the wiring of low-level components) will try to connect to the IP addresses in order and will only move on to the next one if the connection cannot be established or a timeout occurs. So in the happy path it will always connect to the first address.
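
The behavior described above is roughly equivalent to this sketch (illustrative only, not hyper-util's actual code):

```rust
use std::io;
use std::net::SocketAddr;
use std::time::Duration;

use tokio::net::TcpStream;
use tokio::time::timeout;

/// Try each resolved address in order, moving on only if the connection
/// attempt fails or times out -- so in the happy path the first address
/// always wins, which is exactly the pinning behavior described above.
async fn connect_first_working(addrs: &[SocketAddr]) -> io::Result<TcpStream> {
    let mut last_err = None;
    for addr in addrs {
        match timeout(Duration::from_secs(5), TcpStream::connect(*addr)).await {
            Ok(Ok(stream)) => return Ok(stream),
            Ok(Err(e)) => last_err = Some(e),
            Err(_elapsed) => {
                last_err = Some(io::Error::new(io::ErrorKind::TimedOut, "connect timed out"))
            }
        }
    }
    Err(last_err.unwrap_or_else(|| io::Error::new(io::ErrorKind::InvalidInput, "no addresses")))
}
```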

Solutions

I think we should keep using reqwest, since in general it serves us well. So a natural way to change the current behavior would be to use the aforementioned Resolve interface. I see two general options, both as extensions to ClientOptions.

A: Expose Resolve

Add a way for users to specify their own Resolve implementation.

Pros:

  • users can also implement other resolver sources, caching, metrics & logs (e.g. to debug broken DNS setup)

Cons:

  • users need to write more code to get an arguably "reasonable" behavior

B: Add randomize_addrs flag

Add a flag randomize_addrs. If it is set to true (by default?), then object_store will wrap the default resolver and shuffle the addresses before returning them to reqwest. A minimal sketch of this approach follows the pros/cons below.

Pros:

  • sensible default

Cons:

  • less extensible
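
For illustration, a minimal sketch of option B, wrapping the system resolver behind reqwest's Resolve interface and shuffling the results (the ShufflingResolver name is hypothetical, and this assumes the rand 0.8 and tokio APIs; the actual implementation may differ):

```rust
use std::net::ToSocketAddrs;

use rand::seq::SliceRandom;
use reqwest::dns::{Addrs, Name, Resolve, Resolving};

/// Resolves via the system resolver (getaddrinfo) and shuffles the returned
/// addresses so that concurrent connections spread across the advertised
/// S3 servers instead of all pinning to the first one.
#[derive(Debug, Default)]
struct ShufflingResolver;

impl Resolve for ShufflingResolver {
    fn resolve(&self, name: Name) -> Resolving {
        Box::pin(async move {
            // `to_socket_addrs` needs a port; reqwest later replaces it with
            // the request's real port, so any value works here.
            let host = format!("{}:0", name.as_str());
            // getaddrinfo blocks, so run it off the async executor.
            let mut addrs: Vec<_> = tokio::task::spawn_blocking(move || host.to_socket_addrs())
                .await??
                .collect();
            addrs.shuffle(&mut rand::thread_rng());
            let shuffled: Addrs = Box::new(addrs.into_iter());
            Ok(shuffled)
        })
    }
}

// Wiring it up on the reqwest side (roughly what object_store would do
// internally when the flag is enabled):
//
//     let client = reqwest::Client::builder()
//         .dns_resolver(std::sync::Arc::new(ShufflingResolver))
//         .build()?;
```
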
crepererum added the enhancement and object-store labels on Feb 11, 2025
@Xuanwo
Member

Xuanwo commented Feb 11, 2025

Hi, I suggest using crates like reqwest-hickory-resolver (developed by me 😆).

I've added shuffle support, but it hasn't been released yet. Feel free to share your feedback here: GitHub Commit.

I believe this approach can help avoid adding more flags to object_store.

@crepererum
Contributor Author

@Xuanwo reqwest also has a builtin hickory-based resolver. How's your crate different? (my guess is shuffling, but are there other differences?)

@Xuanwo
Member

Xuanwo commented Feb 12, 2025

@Xuanwo reqwest also has a builtin hickory-based resolver. How's your crate different? (my guess is shuffling, but are there other differences?)

When reqwest-hickory-resolver was developed, reqwest was still using the old trust_dns. Additionally, reqwest-hickory-resolver allows users to share the same DNS cache across different client instances.

@crepererum
Contributor Author

I think hickory is a great piece of software and I've used it for other clients as well. However, I don't think it's a good fit here. object_store in general tries to maintain a very low dependency footprint, and hickory isn't exactly light on that front (and it doesn't have to be; it is, after all, a full-blown DNS implementation). That said, I think we can just use the system DNS resolver -- which, by the way, already implements caching sufficiently well -- and shuffle the results it returns. That also ensures that DNS resolution uses the very same mechanism as any other standard Rust or C library call and -- by extension -- the rest of the ecosystem.

I'll file a PR for that.

crepererum added a commit to crepererum/arrow-rs that referenced this issue Feb 12, 2025
crepererum added a commit to crepererum/arrow-rs that referenced this issue Feb 12, 2025
crepererum added a commit to influxdata/arrow-rs that referenced this issue Feb 12, 2025
@kylebarron
Contributor

Is there any way for downstream users of object_store to opt-in to hickory?

@tustvold
Contributor

Is there any way for downstream users of object_store to opt-in to hickory?

I believe if you add a dependency on reqwest and enable the hickory-dns feature, it will default to using that.

Although I guess following #7123 you will need to disable the random_address feature

@kylebarron
Contributor

I believe if you add a dependency on reqwest and enable the hickory-dns feature, it will default to using that.

Ah, a closer reading of the reqwest docstring proves you're correct:

If the hickory-dns feature is turned on, the default option is enabled.

Although I guess following #7123 you will need to disable the random_address feature

Ah, good to know, especially because random_address is true by default.
