HTTPS GET requests to S3 hang forever when doing a lot of them #517
I can't seem to reproduce this locally; I ran the script 4-5 times and saw 0 errors, so it might be network-connection related? The broken pipe error indeed shows up when there's some kind of network interruption. It might be worth having some other people try to reproduce this on various kinds of networks.
Hi, thanks for trying to reproduce! I just ran it multiple times on my local machine, and it indeed finished. Then I tried running on an Azure machine and an AWS machine, and both hang as described. I observed that the local run is a lot slower (6 Mbps) than Azure (93 Mbps) or AWS (364 Mbps); by the time they reach a hanging state, the local run isn't even halfway done. I'm guessing that at this high download rate S3 starts to throttle or close the socket, which doesn't happen when running over a slower connection. Would it be possible for you to try to reproduce on an AWS/Azure machine?
I can reproduce this on a local machine that has high bandwidth to AWS. I was hoping #528 might aid in debugging.
So first things first, let's look at what's up with all these EPIPE errors. Here's the packet capture for this:
So we're acking the server response, and then we send the next request as two packets, the first being:
and the second:
And we get a TCP FIN/ACK in the middle, so the server really did shut down the connection there. That's mostly fine; we should just retry with a new connection and not even be verbose about it.
That said, it's a bit odd that we're splitting the HTTP header over two TCP packets. I wouldn't be surprised if the server would have still served the request had it arrived as one packet.
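The "just retry with a new connection" idea can be sketched roughly as follows. This is a hypothetical helper, not HTTP.jl's actual retry logic; `do_request` stands in for something like an `HTTP.get` issued on a fresh connection:

```julia
# Hypothetical sketch: retry an idempotent request when the pooled
# connection turns out to be dead (EPIPE / connection reset), by simply
# redoing it - in a real client this would mean taking a new connection.
function with_retries(do_request; retries=3)
    local err
    for attempt in 1:retries
        try
            return do_request()
        catch e
            err = e
            # Only I/O-level failures (like a broken pipe from a server
            # that closed the connection) are worth retrying quietly.
            e isa Base.IOError || rethrow()
        end
    end
    throw(err)
end
```

Note that such a retry is only safe for idempotent requests (GET/HEAD): the server may have partially processed a non-idempotent request before closing the connection.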
I applied this patch, which helped cut the number of EPIPE errors quite a bit:
I did still see it here:
So we got a TLS-level connection abort, but apparently ignored it and sent another request. I assume what happened here is that we just didn't look for any more pending data on the SSL socket, so we didn't pick up the close message.
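The symptom suggests a check along these lines before reusing an idle connection. This is a toy illustration over a plain `IO`, not the actual MbedTLS fix: if the peer has already queued up data before we've sent anything, that data is most likely a TLS close_notify (or the connection has been FINed), and the connection shouldn't be written to again.

```julia
# Toy check (illustrative only): an idle pooled connection that already
# has readable bytes waiting is probably being closed by the server -
# writing the next request into it would produce exactly this EPIPE.
probably_closing(io::IO) = bytesavailable(io) > 0
```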
With this patch to MbedTLS (in addition to the patch above), I don't see any errors logged at all:
@Keno thanks a lot! I updated all relevant packages. Now I don't see any errors, but unfortunately I'm still able to reproduce the original problem: running the above example on an AWS machine usually reproduces it on the second run. It just hangs forever. :( (I thought maybe it's some threading thing, so I tried on Julia 1.4 and 1.5 - same results.)
We (the JuliaHub team) have just hit something rather similar-looking to this in production. Our application wasn't doing a massive pile of S3 requests, but at some point it appeared as if all requests to S3 started hanging (seemingly indefinitely). I've also observed something similar for PUTs when uploading a smaller number of large files. I can confirm that a minor variation of the original script hangs for me, though even more easily - I only need N=1000 jobs:

```julia
using HTTP

# 25 byte file (for no particular reason)
# aws s3 presign s3://some-bucket/blah --expires-in 604800
const _url = "..."

function get_data(done_list, i)
    try
        data = HTTP.get(_url, require_ssl_verification=false)
        @info "got" i=i data=size(data.body)
        done_list[i] = true
    catch e
        @error "error" exception=e
    end
end

done_list = nothing

function main(N=1000)
    global done_list = falses(N)
    asyncmap(i->get_data(done_list, i), 1:N, ntasks=20)
end

main()
```
I had a go at reproducing this with a local nginx server handling the HTTPS traffic rather than S3, but it seems to work fine in that case. (I managed to get SSL working with the
I was wondering whether this could be related to the connection pool. Setting `reuse_limit=0`:

```julia
using HTTP

# Dummy data
const _url = "https://datasets-jl-test.s3.amazonaws.com/foo.txt?AWSAccessKeyId=ASIAZK43HGP34BZT5BTD&Signature=EYFYnxfvnV%2F2yrk8MBfqTtgbPfs%3D&x-amz-security-token=FwoGZXIvYXdzECwaDITnkSRSAAnM6zSHSiKGAfY4EPAezrGovUqdVuBloHpxfc9x2Onm4Y1v2mO8Hop%2FPVVCTcNinYfLIzfjrfzEo2QKXACH%2B4hrAsYp%2FqK9%2FmeGwI0OthUjerZbr8mIbNjYx3qsKYqWz44%2BL%2BalIi%2B17eA9kaGBcsNQO6opsQe1FvFH9eZrwkooF%2FRBEmouQdv9MCX9Ag%2FRKJj3uocGMijhHdYdt5vWmAy4yFF4f1vcowW7GslhSnoJ9290DDNRN0vr5VSnzE7Y&Expires=1626863359"

function main2(; N=1000, ntasks=20)
    M = N÷ntasks
    tasks = []
    for i=1:ntasks
        let ti = i
            t = @async for j=1:M
                try
                    data = HTTP.get(_url, require_ssl_verification=false, reuse_limit=0)
                    @info "got" i=ti j data=sizeof(data.body)
                catch
                    @error "task $ti failed"
                    rethrow()
                end
            end
            push!(tasks, t)
        end
    end
    tasks
end
```

After running
Looks like there's something messed up with the connection pool. I'm trying to figure out what's actually going on now; the connection pooling code is fairly subtle.
Reading through the ConnectionPool code, I've noticed a few weird-seeming things:
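For readers following along, the essential shape of a connection pool can be sketched with a `Channel`. This is a simplified model, not HTTP.jl's actual `ConnectionPool`, but it shows why a pooling bug manifests as an indefinite hang: if a connection is acquired and never becomes available again, later acquires block forever.

```julia
# Simplified connection-pool model (not HTTP.jl's implementation):
# a fixed number of reusable connections guarded by a Channel.
struct Pool{T}
    conns::Channel{T}
end

function Pool(make, n::Integer)
    ch = Channel{Any}(n)
    for _ in 1:n
        put!(ch, make())   # pre-create n connections
    end
    Pool{Any}(ch)
end

acquire(p::Pool) = take!(p.conns)        # blocks if the pool is exhausted
release(p::Pool, conn) = put!(p.conns, conn)
```

If a task takes a connection and a bug prevents it from ever being released, every subsequent `acquire` blocks forever, which matches the "hangs forever" behaviour in this issue.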
I have some changes which seem to fix this by disabling HTTP/1.1 pipelining. Rather than trying to fix pipelining, I think we should just remove it, because major HTTP implementations have deemed it not worthwhile in practice.
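One of the standard arguments against pipelining is head-of-line blocking: on a single pipelined connection, responses must come back in request order, so one slow response delays everything queued behind it. A toy model (abstract service times, not real HTTP traffic) makes the arithmetic concrete:

```julia
# Toy model of head-of-line blocking on one pipelined connection:
# request i can only complete after all earlier responses, so its
# completion time is the running sum of service times.
pipelined_completion(times) = cumsum(times)

# On independent connections each request completes on its own schedule.
independent_completion(times) = collect(times)
```

With service times `[30, 1, 1]`, pipelining finishes the requests at `[30, 31, 32]`, while independent connections finish them at `[30, 1, 1]`: the two fast requests are held hostage by the slow one only in the pipelined case.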
Follows up on the discussion and work started in #732. The gist is this: supporting "pipelined requests" is not well-supported across the web these days and severely complicates the threadsafe client implementation in HTTP.jl. And as has been pointed out in various discussions, the attempt to support pipelining is affecting our ability to avoid thread-safety issues in general (see #517). This commit has a few key pieces:

* Splits "connection pooling" logic into a new connectionpools.jl file
* Removes pipeline-related fields/concepts from the ConnectionPool.jl file, specifically the `Connection`/`Transaction` data structures
* Attempts to simplify a lot of the work/logic around managing a `Transaction`'s lifecycle

Things I haven't done yet:

* Actually deprecate the `pipeline_limit` keyword argument
* Get all tests passing; I think the websockets/server code isn't quite ironed out yet, but I'll dig into it some more tomorrow

Big thanks to @nickrobinson251 for help reviewing the connectionpools.jl logic. Pinging people to help review: @c42f, @vtjnash, @s2maki, @fredrikekre
* Remove client support for pipelined requests (see the commit message above)
* Get more tests passing; remove `Transaction`
* Get all tests passing
* Update Julia compat to 1.6
* Fix 32-bit and docs
* Address review comments:
  * Added `GC.@preserve` annotations
  * Removed calls to `hash` in `hashconn` and renamed it to `connectionkey`
  * Removed enforcement of `reuse_limit`
  * Renamed some fields/variables for clarity
  * Cleaned up comments for clarification in connectionpools.jl
  * Added an additional `acquire` method that accepts an already created connection object for `Pod` insertion/tracking
Closing because we've overhauled the request internals and connection pool logic, and we've heard reports of similar errors being resolved when upgrading to HTTP#master.
The following code never finishes:
While running, the following exception pops up:

The file being downloaded: https://public-test-bucket-quatrix.s3.amazonaws.com/hey.txt

Some observations:

* I get the `ArgumentError` exception, but not the broken pipe exception.
* Changing `https` to `http` seems to make it always not hang; with `http` I also tried doubling the iterations and increasing the payload to 400k, and everything finished many times.
* `readtimeout=5` doesn't seem to have an effect; it just hangs forever regardless.
* Decreasing `ntasks` to 2 also hangs, but takes more time.

I'm testing this from an Azure machine downloading objects from AWS.
This is the output when hitting Ctrl+C:
Julia 1.3.1
HTTP.jl 0.8.12
MbedTLS.jl 0.7.0