Object Store: S3 IP address selection is biased #7117

Closed
crepererum opened this issue Feb 11, 2025 · 7 comments · Fixed by #7123
Labels
enhancement Any new improvement worthy of a entry in the changelog object-store Object Store Interface

Comments

@crepererum
Contributor

Problem Description

This is specific to AWS S3. Note that S3 only supports HTTP/1.1, so no connection multiplexing will happen. This means that two concurrent requests will use two different TCP+TLS connections.

If you issue two or more requests to S3 at the same time (to the same region + bucket), all of them will use the same S3 IP address, even though S3 advertises multiple addresses in the DNS response (see the DNS analysis below). This happens even when the requests are issued from different ObjectStore instances (see the resolver analysis for why). The behavior was confirmed via network traffic analysis with Wireshark. It is bad for the following reasons:

Performance

Concentrating all traffic on a single address makes it far more likely that you overload a single S3 server.

Latency Racing (= Racing Reads)

In theory, an object_store user could race two requests (especially GET requests) to the same object, hoping that one of them will be faster. There's evidence that this works.

Note that this trades cost (via the number of requests) for improved tail latency. However, if all racing requests connect to the same S3 server, this is far less likely to help.
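
For illustration, a racing read could look roughly like the sketch below (assuming an Arc<dyn ObjectStore> and the futures crate; racing_get is a hypothetical helper, not part of object_store's API):

```rust
use std::sync::Arc;

use futures::future::{select, Either};
use object_store::{path::Path, GetResult, ObjectStore};

/// Issue two GETs for the same object and keep whichever finishes first.
/// This trades an extra request for better tail latency -- but it only helps
/// if the two connections don't end up on the same S3 server.
async fn racing_get(store: Arc<dyn ObjectStore>, path: Path) -> object_store::Result<GetResult> {
    let a = Box::pin(store.get(&path));
    let b = Box::pin(store.get(&path));
    match select(a, b).await {
        // The losing future is dropped, which cancels the slower request.
        Either::Left((winner, _)) | Either::Right((winner, _)) => winner,
    }
}
```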

Fault Tolerance

Since any single S3 server might be down, concentrating all requests on one server amplifies the impact of such an outage.

Persistence

Since the HTTP/1.1 connections are kept alive (mostly until the AWS side terminates them), this server pinning can persist long after the first requests are made.

Technical Analysis

To understand why this is happening, we need to look at different parts of the stack.

DNS

Resolving the S3 IP looks like this on the DNS layer (captured using Wireshark):

Domain Name System (response)
    Transaction ID: 0x07d6
    Flags: 0x8180 Standard query response, No error
    Questions: 1
    Answer RRs: 8
    Authority RRs: 0
    Additional RRs: 1
    Queries
        s3.us-east-1.amazonaws.com: type A, class IN
            Name: s3.us-east-1.amazonaws.com
            [Name Length: 26]
            [Label Count: 4]
            Type: A (1) (Host Address)
            Class: IN (0x0001)
    Answers
        s3.us-east-1.amazonaws.com: type A, class IN, addr 16.182.97.32
            Name: s3.us-east-1.amazonaws.com
            Type: A (1) (Host Address)
            Class: IN (0x0001)
            Time to live: 5 (5 seconds)
            Data length: 4
            Address: 16.182.97.32
        s3.us-east-1.amazonaws.com: type A, class IN, addr 52.217.46.62
            Name: s3.us-east-1.amazonaws.com
            Type: A (1) (Host Address)
            Class: IN (0x0001)
            Time to live: 5 (5 seconds)
            Data length: 4
            Address: 52.217.46.62
        s3.us-east-1.amazonaws.com: type A, class IN, addr 52.217.4.118
            Name: s3.us-east-1.amazonaws.com
            Type: A (1) (Host Address)
            Class: IN (0x0001)
            Time to live: 5 (5 seconds)
            Data length: 4
            Address: 52.217.4.118
        s3.us-east-1.amazonaws.com: type A, class IN, addr 52.216.36.80
            Name: s3.us-east-1.amazonaws.com
            Type: A (1) (Host Address)
            Class: IN (0x0001)
            Time to live: 5 (5 seconds)
            Data length: 4
            Address: 52.216.36.80
        s3.us-east-1.amazonaws.com: type A, class IN, addr 52.216.38.224
            Name: s3.us-east-1.amazonaws.com
            Type: A (1) (Host Address)
            Class: IN (0x0001)
            Time to live: 5 (5 seconds)
            Data length: 4
            Address: 52.216.38.224
        s3.us-east-1.amazonaws.com: type A, class IN, addr 52.216.51.128
            Name: s3.us-east-1.amazonaws.com
            Type: A (1) (Host Address)
            Class: IN (0x0001)
            Time to live: 5 (5 seconds)
            Data length: 4
            Address: 52.216.51.128
        s3.us-east-1.amazonaws.com: type A, class IN, addr 3.5.1.11
            Name: s3.us-east-1.amazonaws.com
            Type: A (1) (Host Address)
            Class: IN (0x0001)
            Time to live: 5 (5 seconds)
            Data length: 4
            Address: 3.5.1.11
        s3.us-east-1.amazonaws.com: type A, class IN, addr 54.231.204.96
            Name: s3.us-east-1.amazonaws.com
            Type: A (1) (Host Address)
            Class: IN (0x0001)
            Time to live: 5 (5 seconds)
            Data length: 4
            Address: 54.231.204.96
    Additional records
        <Root>: type OPT
            Name: <Root>
            Type: OPT (41) 
            UDP payload size: 1232
            Higher bits in extended RCODE: 0x00
            EDNS0 version: 0
            Z: 0x0000
                0... .... .... .... = DO bit: Cannot handle DNSSEC security RRs
                .000 0000 0000 0000 = Reserved: 0x0000
            Data length: 285
            Option: PADDING
    [Request In: 1627]
    [Time: 0.072296785 seconds]

i.e. that's 8 different IPs with a 5s TTL.

If we ask again later, we'll get a slightly different response:

Domain Name System (response)
    Transaction ID: 0x817a
    Flags: 0x8180 Standard query response, No error
    Questions: 1
    Answer RRs: 8
    Authority RRs: 0
    Additional RRs: 1
    Queries
        s3.us-east-1.amazonaws.com: type A, class IN
            Name: s3.us-east-1.amazonaws.com
            [Name Length: 26]
            [Label Count: 4]
            Type: A (1) (Host Address)
            Class: IN (0x0001)
    Answers
        s3.us-east-1.amazonaws.com: type A, class IN, addr 16.182.102.192
            Name: s3.us-east-1.amazonaws.com
            Type: A (1) (Host Address)
            Class: IN (0x0001)
            Time to live: 5 (5 seconds)
            Data length: 4
            Address: 16.182.102.192
        s3.us-east-1.amazonaws.com: type A, class IN, addr 52.216.136.77
            Name: s3.us-east-1.amazonaws.com
            Type: A (1) (Host Address)
            Class: IN (0x0001)
            Time to live: 5 (5 seconds)
            Data length: 4
            Address: 52.216.136.77
        s3.us-east-1.amazonaws.com: type A, class IN, addr 52.217.204.8
            Name: s3.us-east-1.amazonaws.com
            Type: A (1) (Host Address)
            Class: IN (0x0001)
            Time to live: 5 (5 seconds)
            Data length: 4
            Address: 52.217.204.8
        s3.us-east-1.amazonaws.com: type A, class IN, addr 54.231.196.248
            Name: s3.us-east-1.amazonaws.com
            Type: A (1) (Host Address)
            Class: IN (0x0001)
            Time to live: 5 (5 seconds)
            Data length: 4
            Address: 54.231.196.248
        s3.us-east-1.amazonaws.com: type A, class IN, addr 54.231.236.208
            Name: s3.us-east-1.amazonaws.com
            Type: A (1) (Host Address)
            Class: IN (0x0001)
            Time to live: 5 (5 seconds)
            Data length: 4
            Address: 54.231.236.208
        s3.us-east-1.amazonaws.com: type A, class IN, addr 52.217.201.80
            Name: s3.us-east-1.amazonaws.com
            Type: A (1) (Host Address)
            Class: IN (0x0001)
            Time to live: 5 (5 seconds)
            Data length: 4
            Address: 52.217.201.80
        s3.us-east-1.amazonaws.com: type A, class IN, addr 3.5.31.42
            Name: s3.us-east-1.amazonaws.com
            Type: A (1) (Host Address)
            Class: IN (0x0001)
            Time to live: 5 (5 seconds)
            Data length: 4
            Address: 3.5.31.42
        s3.us-east-1.amazonaws.com: type A, class IN, addr 52.217.128.224
            Name: s3.us-east-1.amazonaws.com
            Type: A (1) (Host Address)
            Class: IN (0x0001)
            Time to live: 5 (5 seconds)
            Data length: 4
            Address: 52.217.128.224
    Additional records
        <Root>: type OPT
            Name: <Root>
            Type: OPT (41) 
            UDP payload size: 1232
            Higher bits in extended RCODE: 0x00
            EDNS0 version: 0
            Z: 0x0000
                0... .... .... .... = DO bit: Cannot handle DNSSEC security RRs
                .000 0000 0000 0000 = Reserved: 0x0000
            Data length: 285
            Option: PADDING
    [Request In: 67958]
    [Time: 0.140540958 seconds]

I've searched through the DNS-related RFCs but couldn't find any statement on whether the order of the answers is significant. The internet (1, 2, 3) suggests that most implementations use the IPs in order (falling back to the next one on timeout), but that the standard makes no claim on that front.
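
The multiple A records are easy to observe from Rust itself via the same system resolver (getaddrinfo) that reqwest uses by default; a minimal sketch:

```rust
use std::net::ToSocketAddrs;

fn main() -> std::io::Result<()> {
    // The port is required by `to_socket_addrs` but irrelevant for the lookup;
    // run this repeatedly to see the address set and ordering change.
    for addr in "s3.us-east-1.amazonaws.com:443".to_socket_addrs()? {
        println!("{addr}");
    }
    Ok(())
}
```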

Resolver

reqwest -- the high-level HTTP client library that object_store uses -- provides an interface called Resolve, which resolves one host name to multiple IP addresses.

By default, reqwest uses getaddrinfo (see 1, 2, 3), i.e. the system resolver. That one will very likely cache resolutions based on the 5s TTL (see above). In fact, I can see that behavior in Wireshark.

Address Usage

Now, how are these multiple addresses used? If you search through the code, you'll eventually get here and see that hyper-util (used by reqwest for the wiring of low-level components) will try to connect to the IP addresses in order and will only move on to the next one if the connection cannot be established or a timeout occurs. So in the happy path it will always connect to the first address.
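
The behavior described above is roughly equivalent to this sketch (illustrative only, not hyper-util's actual code):

```rust
use std::io;
use std::net::SocketAddr;
use std::time::Duration;

use tokio::net::TcpStream;
use tokio::time::timeout;

/// Try each resolved address in order, moving on only if the connection
/// attempt fails or times out -- so in the happy path the first address
/// always wins, which is exactly the pinning behavior described above.
async fn connect_first_working(addrs: &[SocketAddr]) -> io::Result<TcpStream> {
    let mut last_err = None;
    for addr in addrs {
        match timeout(Duration::from_secs(5), TcpStream::connect(*addr)).await {
            Ok(Ok(stream)) => return Ok(stream),
            Ok(Err(e)) => last_err = Some(e),
            Err(_elapsed) => {
                last_err = Some(io::Error::new(io::ErrorKind::TimedOut, "connect timed out"))
            }
        }
    }
    Err(last_err.unwrap_or_else(|| io::Error::new(io::ErrorKind::InvalidInput, "no addresses")))
}
```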

Solutions

I think we should keep using reqwest, since in general it serves us well. So a natural way to change the current behavior would be to use the aforementioned Resolve interface. I see two general options, both as extensions to ClientOptions.

A: Expose Resolve

Add a way for users to specify their own Resolve implementation.

Pros:

  • users can also implement other resolver sources, caching, metrics & logs (e.g. to debug broken DNS setup)

Cons:

  • users need to write more code to get an arguably "reasonable" behavior

B: Add randomize_addrs flag

Add a flag randomize_addrs. If it is set to true (by default?), then object_store will wrap the default resolver and shuffle the addresses before returning them to reqwest. A minimal sketch of this approach follows the pros/cons below.

Pros:

  • sensible default

Cons:

  • less extensible
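
For illustration, a minimal sketch of option B, wrapping the system resolver behind reqwest's Resolve interface and shuffling the results (the ShufflingResolver name is hypothetical, and this assumes the rand 0.8 and tokio APIs; the actual implementation may differ):

```rust
use std::net::ToSocketAddrs;

use rand::seq::SliceRandom;
use reqwest::dns::{Addrs, Name, Resolve, Resolving};

/// Resolves via the system resolver (getaddrinfo) and shuffles the returned
/// addresses so that concurrent connections spread across the advertised
/// S3 servers instead of all pinning to the first one.
#[derive(Debug, Default)]
struct ShufflingResolver;

impl Resolve for ShufflingResolver {
    fn resolve(&self, name: Name) -> Resolving {
        Box::pin(async move {
            // `to_socket_addrs` needs a port; reqwest later replaces it with
            // the request's real port, so any value works here.
            let host = format!("{}:0", name.as_str());
            // getaddrinfo blocks, so run it off the async executor.
            let mut addrs: Vec<_> = tokio::task::spawn_blocking(move || host.to_socket_addrs())
                .await??
                .collect();
            addrs.shuffle(&mut rand::thread_rng());
            let shuffled: Addrs = Box::new(addrs.into_iter());
            Ok(shuffled)
        })
    }
}

// Wiring it up on the reqwest side (roughly what object_store would do
// internally when the flag is enabled):
//
//     let client = reqwest::Client::builder()
//         .dns_resolver(std::sync::Arc::new(ShufflingResolver))
//         .build()?;
```
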
crepererum added the enhancement and object-store labels on Feb 11, 2025
@Xuanwo
Member

Xuanwo commented Feb 11, 2025

Hi, I suggest using crates like reqwest-hickory-resolver (developed by me 😆).

I've added shuffle support, but it hasn't been released yet. Feel free to share your feedback here: GitHub Commit.

I believe this approach can help avoid adding more flags to object_store.

@crepererum
Contributor Author

@Xuanwo reqwest also has a builtin hickory-based resolver. How's your crate different? (my guess is shuffling, but are there other differences?)

@Xuanwo
Member

Xuanwo commented Feb 12, 2025

@Xuanwo reqwest also has a builtin hickory-based resolver. How's your crate different? (my guess is shuffling, but are there other differences?)

When reqwest-hickory-resolver was developed, reqwest was still using the old trust_dns. Additionally, reqwest-hickory-resolver allows users to share the same DNS cache across different client instances.

@crepererum
Contributor Author

I think hickory is a great piece of software and I've used it for other clients as well. However, I don't think it's a good fit here. object_store in general tries to maintain a very low dependency footprint, and hickory isn't exactly light on that front (and it doesn't have to be; it is, after all, a full-blown DNS implementation). That said, I think we can just use the system DNS resolver -- which, by the way, already implements caching sufficiently well -- and shuffle the results it returns. That also ensures that DNS resolution uses the very same mechanism as any other standard Rust or C library call and -- by extension -- the rest of the ecosystem.

I'll file a PR for that.

crepererum added a commit to crepererum/arrow-rs that referenced this issue Feb 12, 2025
crepererum added a commit to crepererum/arrow-rs that referenced this issue Feb 12, 2025
crepererum added a commit to influxdata/arrow-rs that referenced this issue Feb 12, 2025
@kylebarron
Contributor

Is there any way for downstream users of object_store to opt-in to hickory?

@tustvold
Contributor

Is there any way for downstream users of object_store to opt-in to hickory?

I believe if you add a dependency on reqwest and enable the hickory-dns feature, it will default to using that.

Although I guess following #7123 you will need to disable the random_address feature

@kylebarron
Contributor

I believe if you add a dependency on reqwest and enable the hickory-dns feature, it will default to using that.

Ah, a closer reading of the reqwest docstring proves you're correct:

If the hickory-dns feature is turned on, the default option is enabled.

Although I guess following #7123 you will need to disable the random_address feature

Ah, good to know, especially because random_address is true by default.
