Pip not retrying downloads #11150
Comments
Well, yes; the problem is that it's most likely not the server in this case. It's probably an iffy cable in a LAG on a switch, or some specific firewall in that cluster having issues. Unfortunately we changed our PyPI hosting at the same time IT did some upgrades to our WAN connections, and in general getting them to figure out network hiccups is always a challenge, even when it happens often enough that we can reproduce it consistently. We're also seeing occasional slow connections in other services. It's just super rare, and most of the other services get past it with a longer timeout or retries. We download thousands of packages a day and only see this error a handful of times, so I'm pretty confident a retry will suffice in this case, and the docs indicate it is the expected behavior of pip.
@uranusjr Why did you close this without even acknowledging that the retry logic isn't behaving as expected? Regardless of what you think the root cause of the error might be, the retry logic should be working. I think I have the time to go through the code and dig into it myself if need be and put up a PR, but my first step is always going to be to appeal to the subject matter experts first and potentially get some insight and a leg up on figuring out the problem.
You're framing this issue as an error. Can you explain why you think it's an error? I checked the docs and couldn't find anywhere where we claim that timeouts get retried, so I don't think it's reasonable to call this an error - at best it's a difference of opinion on what's the most useful behaviour. Is there any reason you can't simply use --timeout to set a longer timeout?

I'm against retrying on timeouts, because that would mean that with default settings, pip would wait for an unresponsive server for 75 seconds (5 retries, 15-second timeout). That's not a reasonable length of time. And in my experience, unresponsive servers are much more common than servers that have occasional "blips" of long response time. Plus, such a "blip" is easily addressed by manually retrying the pip command. (And an automated script can easily check for a timeout error and implement its own retry.) So in my view, pip's current behaviour is an entirely reasonable compromise, and we shouldn't change it.
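For illustration, a minimal sketch of the kind of external retry wrapper suggested above - the attempt count, delay, and arguments are arbitrary placeholders, not something pip itself provides:

```python
import subprocess
import sys
import time


def pip_install_with_retries(args, attempts=3, delay=10):
    """Run `pip install` and retry the whole command if it fails.

    `args` is the list of arguments after `pip install`, e.g.
    ["-r", "requirements.txt"]. Retries on any nonzero exit, which
    covers read timeouts as well as other transient network errors.
    """
    cmd = [sys.executable, "-m", "pip", "install", *args]
    for attempt in range(1, attempts + 1):
        result = subprocess.run(cmd)
        if result.returncode == 0:
            return
        if attempt < attempts:
            print(f"pip install failed (attempt {attempt}), retrying in {delay}s...")
            time.sleep(delay)
    raise RuntimeError(f"pip install failed after {attempts} attempts: {cmd}")


if __name__ == "__main__":
    # Example: forward everything after the script name to pip install.
    pip_install_with_retries(sys.argv[1:])
```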
So I'm reading the documentation for retries here, which simply states "Maximum number of retries each connection should attempt (default 5 times)", which to me means it's going to retry anything and everything. I don't see anywhere in the doc where it explicitly places any restrictions on the type of connection errors it will retry. It's a download operation, it's idempotent, let it try; all you're losing is time. 75 seconds is absolutely fine in the context of a 5-hour build, and I think it's even more, actually, with the 0.25 backoff. The point being, we'd infinitely rather give it a few extra seconds for a retry 2 hours into a 5-hour build process than be picky about why it's retrying. Also, it's a network issue; networks fail, often intermittently, and retries are expected and designed into every layer of the network stack. In my experience with APIs and network operations in general, it is super, super dangerous to make any assumptions about what errors may or may not be recoverable. Load balancers, firewalls, CDNs, and caching are all facts of life and SOP in web architecture today, and it's nearly impossible to predict all their failure modes. We could also look at adding retries in our code that does the pip install; that's next on my list.

As far as the timeout option goes, it's possible that might work. I'm hitting this problem from multiple angles. An intermittent error like this should be recoverable at multiple layers, so I'm looking at fixing multiple layers and dealing with a lot of finger pointing from everyone at each layer. The speeds I've seen on other services on our network when we've hit this are pretty abysmal, so I'd rather let it hit a timeout and retry. There's also not quite enough information in the pip failure message for me to gauge what timeout value might be appropriate, and it's also just a huge PITA to go find everywhere we invoke pip and fix that timeout when the retries should be covering it. It's working fine for our Poetry installs - no errors from them - but we've also been preferring to do pip installs lately for other reasons.

The only place I can find in the code that configures the retry behavior is here. My reading of the Retry class docstring is that by setting only the total retry count it will retry all error types. Looking at the call stack of the error in the debugger and the log output I pasted above, I don't see that Retry class anywhere, which is what I haven't quite figured out yet.
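For context, the construction being discussed looks something like the standard requests/urllib3 pattern below. This is an approximation based on the comment above (including the 0.25 backoff factor), not pip's actual code, which vendors its own copies of these libraries:

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Setting only `total` means the connect, read, redirect, and status retry
# counts all fall back to this single budget.
retry = Retry(total=5, backoff_factor=0.25)

session = requests.Session()
adapter = HTTPAdapter(max_retries=retry)
session.mount("https://", adapter)
session.mount("http://", adapter)

# Retries configured this way apply while establishing the connection and
# fetching the response headers; they do not restart a transfer that fails
# partway through streaming the response body.
response = session.get("https://pypi.org/simple/pip/", timeout=15)
print(response.status_code)
```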
I've also been trying to find any unit tests for the retry logic but am coming up empty - am I missing them somewhere? I was hoping they'd provide an easier way to reproduce the issue, or some insight into how that Retry class is actually supposed to be invoked by the various libraries involved in the requests.
It looks like the retry logic just relies on whatever requests/urllib3 provides for retries, which really isn't enough based on our observations and some comments in the respective GitHub repos for those projects. I'm not sure what the existing behavior was really designed to catch; it looks like maybe it was designed primarily for iffy status codes? The retries really need to be in download.py. Assuming you all aren't particularly keen on fundamentally changing the existing behavior for everybody, would you accept a PR for a new --download-retries argument that retries the download as a whole if anything fails, along with some doc updates and unit tests for both options? I'd recommend migrating the existing --retries argument to something that better describes its behavior, say --connect-retries, but that's neither here nor there.
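As a rough illustration of the whole-download retry being proposed - the function name and parameters are hypothetical, and this is a sketch of the idea rather than a patch against pip's download.py:

```python
import time

import requests


def download_with_retries(url, dest_path, attempts=5, timeout=15, backoff=0.25):
    """Fetch `url` to `dest_path`, restarting the whole transfer on failure.

    Unlike connection-level retries, this also covers errors raised while
    streaming the response body (e.g. a read timeout halfway through a wheel).
    """
    last_error = None
    for attempt in range(attempts):
        try:
            with requests.get(url, stream=True, timeout=timeout) as response:
                response.raise_for_status()
                with open(dest_path, "wb") as fh:
                    for chunk in response.iter_content(chunk_size=64 * 1024):
                        fh.write(chunk)
            return
        except requests.RequestException as exc:
            last_error = exc
            time.sleep(backoff * (2 ** attempt))
    raise RuntimeError(f"Failed to download {url} after {attempts} attempts") from last_error
```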
I'd like to see evidence that this would be useful to more users than just yourself before accepting such a PR. There's a maintenance cost to dealing with issues caused by people not understanding the difference between --retries and --download-retries. To my knowledge, before now no-one has ever raised an issue claiming that the current behaviour is a problem (suggesting that you're in a tiny minority here).
@pfmoore Yeah, the more I thought about it, the more I dislike the split options. I have no doubt we're in a minority here, given what I've seen of the broader Python ecosystem. While our corporate internet connection certainly has some hiccups (I'm having fun running those down) and I can look at deploying some config changes to extend timeouts, as I sketch out what a high-availability setup would look like for a Pip repo, this retry hole keeps rearing its head. For example, we're looking to stand up some local caches with round-robin DNS, which provides some fault tolerance; but as long as the client can't retry a failed partial download, there will always be failures whenever a server goes down without cleanly wrapping up its existing connections, which simply won't always happen.

We run thousands of builds a day, and Pip issues consistently rise to the top of our daily infrastructure failure reports any time the network hiccups at all. We simply don't see these issues with other packaging systems, and standing up a local cache server near all of our build farms is going to be a PITA and way more work, when just getting retries working would put us in much better shape - and, I maintain, is what the client should be doing. It's an idempotent network operation; it needs retry support for any failure mode, full stop. I've talked with colleagues who work on our internal packaging systems and they've said the same thing: you simply can't account for all the random errors you might see. You can handle some specially if they're recognizable, but to actually be reliable you just have to retry anything.
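For what it's worth, the partial-download concern could in principle be handled by resuming with an HTTP Range request rather than restarting from scratch. A minimal sketch, assuming the mirror honours Range headers (not all index servers do), with a hypothetical helper name:

```python
import os

import requests


def resume_download(url, dest_path, timeout=15):
    """Continue a partially-downloaded file from where it left off.

    Sends a Range request starting at the current file size. If the server
    ignores the Range header (plain 200 response), the file is rewritten
    from scratch instead of appended to.
    """
    offset = os.path.getsize(dest_path) if os.path.exists(dest_path) else 0
    headers = {"Range": f"bytes={offset}-"} if offset else {}
    with requests.get(url, headers=headers, stream=True, timeout=timeout) as response:
        response.raise_for_status()
        mode = "ab" if response.status_code == 206 else "wb"  # 206 = Partial Content
        with open(dest_path, mode) as fh:
            for chunk in response.iter_content(chunk_size=64 * 1024):
                fh.write(chunk)
```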
Technically, though, I don't see any opinion in your response. Does the retry hole concern you all? Is it expected behavior? Did I miss the unit tests somewhere?
My (personal) opinion is that we should do nothing in pip and you should use the existing --timeout option.
Description
For whatever reason we're seeing timeouts when downloading from our private PyPI repository every so often - still running down the root cause on that - but pip doesn't seem to be retrying like I thought it should. It's dying with:
pip._vendor.urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='jfrog-prod-use1-shared-virginia-main.s3.amazonaws.com', port=443): Read timed out.
Expected behavior
Pip should retry the download operation a few times - 5 by default, according to the documentation.
pip version
21.3.1
Python version
3.7
OS
Ubuntu
How to Reproduce
I'm seeing this in various build logs across our build farms; I don't have an easily reproducible test case at the moment, unfortunately.
Output