Repeated timeouts in GitHub Actions fetching wheel for large packages #1912

Closed
adamtheturtle opened this issue Feb 23, 2024 · 18 comments
Labels
bug Something isn't working

Comments

@adamtheturtle
Contributor

In the last few days since switching to uv, I have seen errors that I have not seen before with pip.

I see:

error: Failed to download distributions
  Caused by: Failed to fetch wheel: torch==2.2.1
  Caused by: Failed to extract source distribution
  Caused by: request or response body error: operation timed out
  Caused by: operation timed out
Error: Process completed with exit code 2.

I see this on the CI for vws-python-mock, which requires installing 150 packages:

uv pip install --upgrade --editable .[dev]
...
Resolved 150 packages in 1.65s
Downloaded 141 packages in 21.41s
Installed 150 packages in 283ms

I do this in parallel across many jobs on GitHub Actions, mostly on ubuntu-latest.

This happened with torch 2.2.0 before the recent release of torch 2.2.1.
It has not happened with any other dependencies.
The wheels for torch are pretty huge: https://pypi.org/project/torch/#files.

uv is always at the latest version as I run curl -LsSf https://astral.sh/uv/install.sh | sh. In the most recent example, this is uv 0.1.9.

Failures:

@adamtheturtle
Contributor Author

Perhaps I just need to use UV_HTTP_TIMEOUT and I will, but I thought that this might be worth pointing out:

  • If so, the error message could helpfully point to UV_HTTP_TIMEOUT
  • Perhaps the default is too small if installing a popular package on GitHub Actions times out
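
As a rough illustration of the workaround mentioned above, here is a minimal sketch that raises the timeout for a single install by setting UV_HTTP_TIMEOUT in the child process environment. The helper names and the 600-second value are assumptions chosen for the example, not anything uv prescribes; the only detail taken from this thread is that the variable is read in seconds.

```python
import os
import subprocess


def uv_env(timeout_seconds, base_env=None):
    """Return a copy of the environment with UV_HTTP_TIMEOUT set (in seconds).

    Hypothetical helper: it only prepares the environment and runs nothing.
    """
    if timeout_seconds <= 0:
        raise ValueError("timeout_seconds must be positive")
    env = dict(os.environ if base_env is None else base_env)
    env["UV_HTTP_TIMEOUT"] = str(timeout_seconds)
    return env


def install_with_timeout(args, timeout_seconds=600):
    """Run `uv pip install <args>` with the raised timeout (600 s is an assumed value)."""
    cmd = ["uv", "pip", "install", *args]
    return subprocess.run(cmd, env=uv_env(timeout_seconds), check=True)
```

For the CI job described above, this would be called as `install_with_timeout(["--upgrade", "--editable", ".[dev]"])`. In a GitHub Actions workflow the same effect is usually simpler to get by setting the variable in the step's environment.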

@zanieb
Member

zanieb commented Feb 23, 2024

Thanks for the feedback, I've opened issues for your requests

@adamtheturtle
Contributor Author

Thank you @zanieb! I don't know the value of having this issue open, but I'll leave it to you to close if desired.

@zanieb
Member

zanieb commented Feb 23, 2024

In #1921 my co-worker noted that this might be a bug in the way we're specifying the timeout so I'll recategorize this one and leave it open.

@zanieb zanieb added bug Something isn't working and removed question Asking for clarification or support labels Feb 23, 2024
@konstin konstin self-assigned this Feb 28, 2024
@konstin
Member

konstin commented Feb 28, 2024

Looking at the Actions runs, all the passing jobs take ~30s, while the failing ones error after 5 min, which is our default timeout, so this looks like a network failure (in either GitHub Actions or Rust).

@konstin
Member

konstin commented Mar 1, 2024

I'm not seeing any timeouts anymore with the two most recent versions (https://github.com/konstin/vws-python-mock/actions). Could you check if this is now solved?

@adamtheturtle
Contributor Author

I have not seen this issue since posting. Thank you for looking into this.

@konstin
Member

konstin commented Mar 1, 2024

I'll close it for now; please feel free to reopen should it recur.

@konstin konstin closed this as completed Mar 1, 2024
@adamtheturtle
Contributor Author

@konstin I do not have permissions to re-open this issue. I can create a new one, but it is probably easier if you re-open this.

This failure has reoccurred:

@konstin konstin reopened this Mar 4, 2024
@konstin konstin removed their assignment Mar 4, 2024
@hmc-cs-mdrissi

hmc-cs-mdrissi commented Mar 5, 2024

I'm seeing a very similar error message for a non-PyTorch package that's also pretty large. It's a ~400 MB wheel and consistently gives me:

(bento_uv2) pa-loaner@C02DVAQNMD6R training-platform % uv pip install --index-url=$REGISTRY_INDEX data-mesh-cli==0.0.66
error: Failed to download: data-mesh-cli==0.0.66
  Caused by: The wheel data_mesh_cli-0.0.66-py3-none-any.whl is not a valid zip file
  Caused by: an upstream reader returned an error: request or response body error: operation timed out
  Caused by: request or response body error: operation timed out
  Caused by: operation timed out

The package is a company-internal one, but I think the only notable thing about it is its very large size (it vendors Spark/Java stuff).

Edit: PyTorch, oddly, installs fine and pretty fast for me.

@adamtheturtle adamtheturtle changed the title from "Repeated timeouts in GitHub Actions fetching wheel for torch" to "Repeated timeouts in GitHub Actions fetching wheel for large packages" Mar 13, 2024
@adamtheturtle
Contributor Author

I have changed the title of this to not reference torch. It recently happened with nvidia-cudnn-cu12, another large download.

As another example, https://github.com/VWS-Python/vws-python-mock/actions/runs/8262236134 has 7 failures in one run.

@astrojuanlu

It can happen on Read the Docs as well, not only on GHA: https://beta.readthedocs.org/projects/kedro-datasets/builds/23790543/

@astrojuanlu

Spotted it locally today inside a local Docker image running under QEMU:

error: Failed to download distributions
  Caused by: Failed to fetch wheel: nvidia-cublas-cu12==12.1.3.1
  Caused by: Failed to extract archive
  Caused by: Failed to download distribution due to network timeout. Try increasing UV_HTTP_TIMEOUT (current value: 300s).

eginhard added a commit to idiap/coqui-ai-TTS that referenced this issue Apr 2, 2024
Reverts c59f0ca (#13)

Too many CI test timeouts from installing torch/nvidia packages with uv:
astral-sh/uv#1912
eginhard added a commit to idiap/coqui-ai-TTS that referenced this issue Apr 3, 2024
Reverts c59f0ca (#13)

Too many CI test timeouts from installing torch/nvidia packages with uv:
astral-sh/uv#1912
charliermarsh added a commit that referenced this issue Apr 19, 2024
…3144)

## Summary

This leverages the new `read_timeout` property, which ensures that (like
pip) our timeout is not applied to the _entire_ request, but rather, to
each individual read operation.

Closes: #1921.

See: #1912.
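
To make the distinction in that commit message concrete, here is a small simulation of the two timeout semantics. It is a sketch only and not uv's actual Rust implementation: the function name and the list of per-chunk delays are assumptions made for illustration, and no real network I/O or sleeping takes place.

```python
def download_succeeds(chunk_delays, timeout, per_read):
    """Simulate a download whose chunks arrive after the given delays (seconds).

    per_read=False: the timeout bounds the entire request.
    per_read=True:  the timeout bounds each individual read (pip-like).
    """
    elapsed = 0.0
    for delay in chunk_delays:
        elapsed += delay
        if per_read:
            # Only a single stalled read can trip the timeout.
            if delay > timeout:
                return False
        elif elapsed > timeout:
            # The whole transfer must finish within the timeout.
            return False
    return True


# A large wheel that streams steadily for 400 s, one chunk per second.
steady = [1.0] * 400
# A connection that stalls for 301 s on its third read.
stalled = [1.0, 1.0, 301.0]
```

With a 300-second timeout, `download_succeeds(steady, 300, per_read=False)` is `False` even though data never stops flowing, while `per_read=True` lets the same transfer finish. The `stalled` case fails under both semantics, which is the situation a timeout is meant to catch.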
@njzjz

njzjz commented Apr 19, 2024

I encountered the problem when I used either uv or pip to download large wheels (for pip, the issue is pypa/pip#4796 and pypa/pip#11153), so I think the root cause is the network. However, I am wondering if uv can be smarter to retry automatically, like something in pypa/pip#11180.

@astrojuanlu

Worth trying 0.1.35, which includes #3144

@zanieb
Member

zanieb commented Apr 21, 2024

It seems likely that this is resolved by #3144

@OneCyrus

> I encountered the problem when I used either uv or pip to download large wheels (for pip, the issue is pypa/pip#4796 and pypa/pip#11153), so I think the root cause is the network. However, I am wondering if uv can be smarter to retry automatically, like something in pypa/pip#11180.

That would be a great feature. We have our dev environments behind TLS inspection, and some packages often run into a timeout due to slow inspection. We can reproduce this with a browser: the download gets stuck until it times out. In the browser we can just click resume, and the browser reconnects and downloads the remaining part. uv has no retry with resume, so it starts from scratch and gets stuck again.

+1 for retry with resume
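
The "retry with resume" idea requested here can be sketched as follows. This is an illustration of the technique, not a feature of uv or pip: `fetch_from` is a hypothetical callable standing in for an HTTP request that sends a `Range: bytes=<offset>-` header, and the approach assumes the server honours range requests.

```python
def download_with_resume(fetch_from, total_size, max_attempts=5):
    """Download total_size bytes, resuming from the last received byte on timeout.

    fetch_from(offset) is a hypothetical callable that yields byte chunks
    starting at `offset` and may raise TimeoutError part-way through.
    """
    data = bytearray()
    attempts = 0
    while len(data) < total_size:
        attempts += 1
        if attempts > max_attempts:
            raise TimeoutError(f"gave up after {max_attempts} attempts")
        try:
            # Ask only for the bytes that are still missing.
            for chunk in fetch_from(len(data)):
                data.extend(chunk)
        except TimeoutError:
            # Keep what was received; the next attempt resumes from len(data).
            continue
    return bytes(data)
```

The point of the design is that a stall costs only the bytes still missing, so a connection that stalls at roughly the same elapsed time on every attempt can still make progress, unlike a retry that restarts from byte zero.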

@charliermarsh
Member

Going to close for now, but we can re-open if this comes up again post-changing the timeout semantics.


8 participants