Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pip not retrying downloads #11150

Closed
1 task done
davidcorrigan714 opened this issue May 27, 2022 · 11 comments
Closed
1 task done

Pip not retrying downloads #11150

davidcorrigan714 opened this issue May 27, 2022 · 11 comments
Labels
type: bug A confirmed bug or unintended behavior

Comments

@davidcorrigan714
Copy link

Description

For whatever reason we're seeing timeouts when downloading from our private PyPi repository every so often, still running down the root cause on that, but pip doesn't seem to be retrying like I thought it should be. It's dying with:

pip._vendor.urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='jfrog-prod-use1-shared-virginia-main.s3.amazonaws.com', port=443): Read timed out.

Expected behavior

Pip should retry the download operation a few times, 5 by default according to the documentation.

pip version

21.3.1

Python version

3.7

OS

Ubuntu

How to Reproduce

I'm seeing this in various build logs across our build farms, don't have an easily reproducible test case at the moment unfortunately.

Output

2022-05-25T12:25:57.6306129Z   Downloading https://internal-stuff.url.systems/packages/packages/6c/10/a7d0fa5baea8fe7b50f448ab742f26f52b80bfca85ac2be9d35cdd9a3246/pyparsing-3.0.9-py3-none-any.whl (98 kB)
2022-05-25T12:25:58.3305288Z Collecting tenacity
2022-05-25T12:25:58.5194483Z   Downloading https://internal-stuff.url.systems/packages/packages/f2/a5/f86bc8d67c979020438c8559cc70cfe3a1643fd160d35e09c9cca6a09189/tenacity-8.0.1-py3-none-any.whl (24 kB)
2022-05-25T12:25:59.2982655Z Collecting p4python
2022-05-25T12:25:59.7298319Z   Downloading https://internal-stuff.url.systems/p4python/2019.1.1858212/p4python-2019.1.1858212-cp37-cp37m-manylinux2010_x86_64.whl (5.3 MB)
2022-05-25T12:26:32.7190243Z ERROR: Exception:
2022-05-25T12:26:32.7191013Z Traceback (most recent call last):
2022-05-25T12:26:32.7196379Z   File "/home/rfmibuild/myagent/_work/_temp/checkout_venv/lib/python3.7/site-packages/pip/_vendor/urllib3/response.py", line 438, in _error_catcher
2022-05-25T12:26:32.7196894Z     yield
2022-05-25T12:26:32.7197438Z   File "/home/rfmibuild/myagent/_work/_temp/checkout_venv/lib/python3.7/site-packages/pip/_vendor/urllib3/response.py", line 519, in read
2022-05-25T12:26:32.7197818Z     data = self._fp.read(amt) if not fp_closed else b""
2022-05-25T12:26:32.7198385Z   File "/home/rfmibuild/myagent/_work/_temp/checkout_venv/lib/python3.7/site-packages/pip/_vendor/cachecontrol/filewrapper.py", line 62, in read
2022-05-25T12:26:32.7198725Z     data = self.__fp.read(amt)
2022-05-25T12:26:32.7198999Z   File "/home/rfmibuild/.pyenv/versions/3.7.13/lib/python3.7/http/client.py", line 465, in read
2022-05-25T12:26:32.7199285Z     n = self.readinto(b)
2022-05-25T12:26:32.7199555Z   File "/home/rfmibuild/.pyenv/versions/3.7.13/lib/python3.7/http/client.py", line 509, in readinto
2022-05-25T12:26:32.7199846Z     n = self.fp.readinto(b)
2022-05-25T12:26:32.7200114Z   File "/home/rfmibuild/.pyenv/versions/3.7.13/lib/python3.7/socket.py", line 589, in readinto
2022-05-25T12:26:32.7200392Z     return self._sock.recv_into(b)
2022-05-25T12:26:32.7200695Z   File "/home/rfmibuild/.pyenv/versions/3.7.13/lib/python3.7/ssl.py", line 1071, in recv_into
2022-05-25T12:26:32.7200978Z     return self.read(nbytes, buffer)
2022-05-25T12:26:32.7201249Z   File "/home/rfmibuild/.pyenv/versions/3.7.13/lib/python3.7/ssl.py", line 929, in read
2022-05-25T12:26:32.7201543Z     return self._sslobj.read(len, buffer)
2022-05-25T12:26:32.7201773Z socket.timeout: The read operation timed out
2022-05-25T12:26:32.7201883Z 
2022-05-25T12:26:32.7202105Z During handling of the above exception, another exception occurred:
2022-05-25T12:26:32.7202244Z 
2022-05-25T12:26:32.7202410Z Traceback (most recent call last):
2022-05-25T12:26:32.7202973Z   File "/home/rfmibuild/myagent/_work/_temp/checkout_venv/lib/python3.7/site-packages/pip/_internal/cli/base_command.py", line 164, in exc_logging_wrapper
2022-05-25T12:26:32.7203322Z     status = run_func(*args)
2022-05-25T12:26:32.7203930Z   File "/home/rfmibuild/myagent/_work/_temp/checkout_venv/lib/python3.7/site-packages/pip/_internal/cli/req_command.py", line 205, in wrapper
2022-05-25T12:26:32.7204303Z     return func(self, options, args)
2022-05-25T12:26:32.7204820Z   File "/home/rfmibuild/myagent/_work/_temp/checkout_venv/lib/python3.7/site-packages/pip/_internal/commands/install.py", line 339, in run
2022-05-25T12:26:32.7205201Z     reqs, check_supported_wheels=not options.target_dir
2022-05-25T12:26:32.7205765Z   File "/home/rfmibuild/myagent/_work/_temp/checkout_venv/lib/python3.7/site-packages/pip/_internal/resolution/resolvelib/resolver.py", line 93, in resolve
2022-05-25T12:26:32.7206173Z     collected.requirements, max_rounds=try_to_avoid_resolution_too_deep
2022-05-25T12:26:32.7206757Z   File "/home/rfmibuild/myagent/_work/_temp/checkout_venv/lib/python3.7/site-packages/pip/_vendor/resolvelib/resolvers.py", line 482, in resolve
2022-05-25T12:26:32.7207140Z     state = resolution.resolve(requirements, max_rounds=max_rounds)
2022-05-25T12:26:32.7207877Z   File "/home/rfmibuild/myagent/_work/_temp/checkout_venv/lib/python3.7/site-packages/pip/_vendor/resolvelib/resolvers.py", line 374, in resolve
2022-05-25T12:26:32.7208621Z     failure_causes = self._attempt_to_pin_criterion(name)
2022-05-25T12:26:32.7209242Z   File "/home/rfmibuild/myagent/_work/_temp/checkout_venv/lib/python3.7/site-packages/pip/_vendor/resolvelib/resolvers.py", line 214, in _attempt_to_pin_criterion
2022-05-25T12:26:32.7209634Z     criteria = self._get_updated_criteria(candidate)
2022-05-25T12:26:32.7210265Z   File "/home/rfmibuild/myagent/_work/_temp/checkout_venv/lib/python3.7/site-packages/pip/_vendor/resolvelib/resolvers.py", line 205, in _get_updated_criteria
2022-05-25T12:26:32.7210670Z     self._add_to_criteria(criteria, requirement, parent=candidate)
2022-05-25T12:26:32.7211246Z   File "/home/rfmibuild/myagent/_work/_temp/checkout_venv/lib/python3.7/site-packages/pip/_vendor/resolvelib/resolvers.py", line 173, in _add_to_criteria
2022-05-25T12:26:32.7211623Z     if not criterion.candidates:
2022-05-25T12:26:32.7212140Z   File "/home/rfmibuild/myagent/_work/_temp/checkout_venv/lib/python3.7/site-packages/pip/_vendor/resolvelib/structs.py", line 151, in __bool__
2022-05-25T12:26:32.7212501Z     return bool(self._sequence)
2022-05-25T12:26:32.7213053Z   File "/home/rfmibuild/myagent/_work/_temp/checkout_venv/lib/python3.7/site-packages/pip/_internal/resolution/resolvelib/found_candidates.py", line 155, in __bool__
2022-05-25T12:26:32.7213407Z     return any(self)
2022-05-25T12:26:32.7213974Z   File "/home/rfmibuild/myagent/_work/_temp/checkout_venv/lib/python3.7/site-packages/pip/_internal/resolution/resolvelib/found_candidates.py", line 143, in <genexpr>
2022-05-25T12:26:32.7214386Z     return (c for c in iterator if id(c) not in self._incompatible_ids)
2022-05-25T12:26:32.7215002Z   File "/home/rfmibuild/myagent/_work/_temp/checkout_venv/lib/python3.7/site-packages/pip/_internal/resolution/resolvelib/found_candidates.py", line 47, in _iter_built
2022-05-25T12:26:32.7215358Z     candidate = func()
2022-05-25T12:26:32.7215917Z   File "/home/rfmibuild/myagent/_work/_temp/checkout_venv/lib/python3.7/site-packages/pip/_internal/resolution/resolvelib/factory.py", line 206, in _make_candidate_from_link
2022-05-25T12:26:32.7216303Z     version=version,
2022-05-25T12:26:32.7216829Z   File "/home/rfmibuild/myagent/_work/_temp/checkout_venv/lib/python3.7/site-packages/pip/_internal/resolution/resolvelib/candidates.py", line 287, in __init__
2022-05-25T12:26:32.7217196Z     version=version,
2022-05-25T12:26:32.7217726Z   File "/home/rfmibuild/myagent/_work/_temp/checkout_venv/lib/python3.7/site-packages/pip/_internal/resolution/resolvelib/candidates.py", line 156, in __init__
2022-05-25T12:26:32.7218083Z     self.dist = self._prepare()
2022-05-25T12:26:32.7218635Z   File "/home/rfmibuild/myagent/_work/_temp/checkout_venv/lib/python3.7/site-packages/pip/_internal/resolution/resolvelib/candidates.py", line 225, in _prepare
2022-05-25T12:26:32.7219006Z     dist = self._prepare_distribution()
2022-05-25T12:26:32.7219603Z   File "/home/rfmibuild/myagent/_work/_temp/checkout_venv/lib/python3.7/site-packages/pip/_internal/resolution/resolvelib/candidates.py", line 292, in _prepare_distribution
2022-05-25T12:26:32.7220050Z     return preparer.prepare_linked_requirement(self._ireq, parallel_builds=True)
2022-05-25T12:26:32.7220664Z   File "/home/rfmibuild/myagent/_work/_temp/checkout_venv/lib/python3.7/site-packages/pip/_internal/operations/prepare.py", line 482, in prepare_linked_requirement
2022-05-25T12:26:32.7221104Z     return self._prepare_linked_requirement(req, parallel_builds)
2022-05-25T12:26:32.7221703Z   File "/home/rfmibuild/myagent/_work/_temp/checkout_venv/lib/python3.7/site-packages/pip/_internal/operations/prepare.py", line 528, in _prepare_linked_requirement
2022-05-25T12:26:32.7222133Z     link, req.source_dir, self._download, self.download_dir, hashes
2022-05-25T12:26:32.7222696Z   File "/home/rfmibuild/myagent/_work/_temp/checkout_venv/lib/python3.7/site-packages/pip/_internal/operations/prepare.py", line 217, in unpack_url
2022-05-25T12:26:32.7223101Z     hashes=hashes,
2022-05-25T12:26:32.7223654Z   File "/home/rfmibuild/myagent/_work/_temp/checkout_venv/lib/python3.7/site-packages/pip/_internal/operations/prepare.py", line 94, in get_http_url
2022-05-25T12:26:32.7224076Z     from_path, content_type = download(link, temp_dir.path)
2022-05-25T12:26:32.7224640Z   File "/home/rfmibuild/myagent/_work/_temp/checkout_venv/lib/python3.7/site-packages/pip/_internal/network/download.py", line 145, in __call__
2022-05-25T12:26:32.7224970Z     for chunk in chunks:
2022-05-25T12:26:32.7225481Z   File "/home/rfmibuild/myagent/_work/_temp/checkout_venv/lib/python3.7/site-packages/pip/_internal/cli/progress_bars.py", line 144, in iter
2022-05-25T12:26:32.7225797Z     for x in it:
2022-05-25T12:26:32.7226287Z   File "/home/rfmibuild/myagent/_work/_temp/checkout_venv/lib/python3.7/site-packages/pip/_internal/network/utils.py", line 87, in response_chunks
2022-05-25T12:26:32.7226639Z     decode_content=False,
2022-05-25T12:26:32.7227139Z   File "/home/rfmibuild/myagent/_work/_temp/checkout_venv/lib/python3.7/site-packages/pip/_vendor/urllib3/response.py", line 576, in stream
2022-05-25T12:26:32.7227514Z     data = self.read(amt=amt, decode_content=decode_content)
2022-05-25T12:26:32.7228066Z   File "/home/rfmibuild/myagent/_work/_temp/checkout_venv/lib/python3.7/site-packages/pip/_vendor/urllib3/response.py", line 541, in read
2022-05-25T12:26:32.7228446Z     raise IncompleteRead(self._fp_bytes_read, self.length_remaining)
2022-05-25T12:26:32.7228796Z   File "/home/rfmibuild/.pyenv/versions/3.7.13/lib/python3.7/contextlib.py", line 130, in __exit__
2022-05-25T12:26:32.7229089Z     self.gen.throw(type, value, traceback)
2022-05-25T12:26:32.7229620Z   File "/home/rfmibuild/myagent/_work/_temp/checkout_venv/lib/python3.7/site-packages/pip/_vendor/urllib3/response.py", line 443, in _error_catcher
2022-05-25T12:26:32.7230018Z     raise ReadTimeoutError(self._pool, None, "Read timed out.")
2022-05-25T12:26:32.7230608Z pip._vendor.urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='blah.s3.amazonaws.com', port=443): Read timed out.

Code of Conduct

@davidcorrigan714 davidcorrigan714 added S: needs triage Issues/PRs that need to be triaged type: bug A confirmed bug or unintended behavior labels May 27, 2022
@uranusjr
Copy link
Member

A ReadTimeoutError means your server is likely too slow to respond, and a retry is not likely to help (because it still needs the server to respond). You should probably do something with the server you’re working with, or try --timeout.

@uranusjr uranusjr closed this as not planned Won't fix, can't repro, duplicate, stale May 27, 2022
@davidcorrigan714
Copy link
Author

A ReadTimeoutError means your server is likely too slow to respond, and a retry is not likely to help (because it still needs the server to respond). You should probably do something with the server you’re working with, or try --timeout.

Well yes, problem is that it's most likely not the server in this case. It's probably an iffy cable in a LAG on a switch or some specific firewall in that cluster having issues. Unfortunately we changed our PyPi hosting at the same time IT did some upgrades to our WAN connections and in general getting them to figure out network hiccups is always challenge even if it's often enough that we can reproduce it consistently. We're also seeing occasionally slow connections in other services. It's just super rare and most of the other services get past it with a longer timeout or retries. We download thousands of packages a day and only see this error a handful of times so I'm pretty confident a retry will suffice in this case, and the docs indicate it is the expected behavior of pip.

@davidcorrigan714
Copy link
Author

@uranusjr Why did you close this without even acknowledging that the retry logic isn't behaving as expected? Regardless of what you think the root cause of the error might be, the retry logic should be working. Think I have the time to go through the code and dig into myself if need be and put up a PR but my first step is always going to be to appeal to the subject matter experts first and potentially get some insights and leg up on figuring out the problem.

@pfmoore
Copy link
Member

pfmoore commented May 27, 2022

You're framing this issue as an error. Can you explain why you think it's an error? I checked the docs and couldn't find anywhere where we claim that timeouts get retried, so I don't think it's reasonable to call this an error - at best it's a difference of opinion in what's the most useful behaviour.

Is there any reason you can't simply use --timeout to set a timeout more appropriate for your server?

I'm against retrying on timeouts, because that would mean that with default settings, pip would wait for an unresponsive server for 75 seconds (5 retries, 15 second timeout). That's not a reasonable length of time. And in my experience, unresponsive servers are much more common than servers that have occasional "blips" of long response time. Plus, such a "blip" is easily addressed by manually retrying the pip command. (And an automated script can easily check for a timeout error and implement its own retry).

So in my view, pip's current behaviour is an entirely reasonable compromise, and we shouldn't change it.

@davidcorrigan714
Copy link
Author

So I'm reading the documentation for retries here which simply states that "Maximum number of retries each connection should attempt (default 5 times)." which to me means it's going to retry anything and everything. I don't see anywhere in the doc where it explicitly places any restrictions on the type of connection errors it will retry. It's a download operation, it's idempotent, let it try, all you're losing is time. 75 seconds is absolutely fine in the context of a 5 hour build, think it's even more actually with the .25 back off. Point being we'd infinitely rather give it a few extra seconds for a retry 2 hours into a 5 hour build process than be picky about why it'd retrying, also it's a network issue, networks fail, often intermittently, retries are expected and are designed into every layer of the network stack. In my experience with APIs & network operations in general it is super super dangerous to make any assumptions about what errors may or may not be recoverable. Load balancers, firewalls, CDNs, caching are all facts of life and SOP in web architecture today and it's nearly impossible to predict all their failure modes. We could also look at adding retries in our code that does the pip install, that's next on my list.

As far as the timeout option, it's possible that might work. I'm hitting this problem from multiple angles. An intermittent error like this should be recoverable at multiple layers so I'm looking at fixing multiple layers and dealing with a lot of finger pointing with everyone at each layer. The speeds I've seen on other services on our network that we've hit this on are pretty abysmal so I'd rather let it hit a timeout and retry. There's also not quite enough information in the pip failure message for me to gauge what timeout value might be appropriate, and it's also just a huge PITA to go find everywhere we invoke pip and fix that timeout when the retries should be covering it - its working fine for our poetry installs, no errors from them but we've also been preferring to do pip installs lately for other reasons.

The only place I can find in the code that configures the retry behavior is here. My reading of the Retry class docstring is that by only the total retry count it will retry all error types. Looking at the call stack of the error in the debugger and the log output I pasted above, I don't see that Retry class anywhere which is what I haven't quite figured out yet.

@davidcorrigan714
Copy link
Author

I've also been trying to find any unit tests for the retry logic but coming up empty, am I missing them somewhere? Was hoping they'd provide an easier way to reproduce the issue or provide insights on how that Retry class is actually supposed to be invoked by the various libraries involved in the requests.

@davidcorrigan714
Copy link
Author

Looks like the retry logic just relies on whatever request/urllib3 provides for retries which really isn't enough based on our observations and some comments in the respective GitHub repos for those projects. Not sure what the existing behavior was really designed to catch, looks like maybe it was designed primarily for iffy status codes? The retries really need to be in download.py.

Assuming you all aren't particularly keen on fundamentally changing the existing behavior for everybody, would you all accept a PR for a new --download-retries argument that retries the download as a whole if anything fails as well as some doc updates & unit tests for both options? I'd recommend migrating the existing --retries argument to something that better describes its behavior, say --connect-retries but that's really hear nor there.

@pfmoore
Copy link
Member

pfmoore commented May 28, 2022

I'd like to see evidence that this would be useful to more users than just yourself before accepting such a PR. There's a maintenance cost to dealing with issues caused by people not understanding the difference between --retries and --download-retries (even if we renamed --retries, which would have a compatibility cost as well).

To my knowledge, before now no-one has ever raised an issue claiming that the current behaviour is a problem (suggesting that you're in a tiny minority here).

@davidcorrigan714
Copy link
Author

@pfmoore Yeah, the more I thought about it the more I dislike the split options.

I have no doubts were in a minority with what I've seen of the broader Python ecosystem. While our corporate internet connection certainly has some hiccups (having fun running those down) and I can look at deploying some config changes to extend timeouts, as I sketch out what a high availability setup would look like for a Pip repo, this retry hole keeps rearing its head. For example we're looking to stand up some local caches with round-robin DNS, which provides some fault tolerance but as long as the client can't retry a failed partial download there will always be failures if a server goes down without cleanly wrapping up existing connections which simply won't always happen. We run thousands of builds a day, and Pip issues consistently rise to the top of our daily infrastructure failure reports any time the network hiccups at all. We simply don't see these issues with other packaging systems and standing up a local cache server near all of our build farms is going to be a PITA and way more work when just getting retries working would get us in much better shape and I maintain is what the client should be doing. It's an idempotent network operation, it needs retry support for any failure modes, full stop. I've talked with colleagues that work on our internal packaging systems and they've said the same thing: You simply can't account for all the random errors you might see, you can handle some specially if it's recognizable but to actually be reliable you just gotta retry anything.

@davidcorrigan714
Copy link
Author

Technically though I don't see any opinion in your response. Does the retry hole concern you all? Is it expected behavior? Did I miss the unit tests somewhere?

@pfmoore
Copy link
Member

pfmoore commented Jun 6, 2022

My (personal) opinion is that we should do nothing in pip and you should use the existing --timeout option. If it were a serious issue, we would have seen more people report it. And it’s not at all clear to me that the proposed change won’t be a problem for our other users, as @uranusjr mentioned. Sorry, I appreciate that this is a non-trivial problem for you, but we have to prioritise here - pip is a purely volunteer developed project with extremely limited resources.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jul 7, 2022
@pradyunsg pradyunsg removed the S: needs triage Issues/PRs that need to be triaged label Mar 17, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
type: bug A confirmed bug or unintended behavior
Projects
None yet
Development

No branches or pull requests

4 participants