Some dependencies of pytorch missing from lockfile #18936

Closed
gautiervarjo opened this issue May 8, 2023 · 5 comments
Labels
backend: Python, bug

Comments

@gautiervarjo

gautiervarjo commented May 8, 2023

Describe the bug
After adding torch==2.0.0 to my Python requirements and regenerating the lockfile, I get missing-requirement errors from the PEX environment when running code or tests:

Failed to resolve requirements from PEX environment @ /home/me/.cache/pants/named_caches/pex_root/unzipped_pexes/2896a51e4b74da2fdbe283e0122b36807e7275ff.
Needed cp38-cp38-manylinux_2_31_x86_64 compatible dependencies for: 
 1: nvidia-cuda-nvrtc-cu11==11.7.99; platform_system == "Linux" and platform_machine == "x86_64"                                        
    Required by:                                                    
      torch 2.0.0                                                   
    But this pex had no ProjectName(raw='nvidia-cuda-nvrtc-cu11', normalized='nvidia-cuda-nvrtc-cu11') distributions.                   

<... and many more ...>

Pants version
2.14
In the repro repository below I used 2.17.0.dev4 to check if this was fixed, but no luck.

OS
Linux, Ubuntu 20.04.

Additional info

  • I made a minimal repository for reproducing the issue: https://github.com/gautiervarjo/pants-torch-missing-reqs
  • The generated lockfile indeed does not contain the various nvidia-XXX requirements, so the error above seems to be a surface-level symptom.
  • The METADATA file inside the downloaded wheel in the pants cache does list these requirements, so I'm not sure why they don't make it into the lockfile. Is it because of those platform_system and platform_machine attributes?
    Requires-Dist: nvidia-cuda-nvrtc-cu11 (==11.7.99) ; platform_system == "Linux" and platform_machine == "x86_64"
    Requires-Dist: nvidia-cuda-runtime-cu11 (==11.7.99) ; platform_system == "Linux" and platform_machine == "x86_64"
    Requires-Dist: nvidia-cuda-cupti-cu11 (==11.7.101) ; platform_system == "Linux" and platform_machine == "x86_64"
    Requires-Dist: nvidia-cudnn-cu11 (==8.5.0.96) ; platform_system == "Linux" and platform_machine == "x86_64"
    Requires-Dist: nvidia-cublas-cu11 (==11.10.3.66) ; platform_system == "Linux" and platform_machine == "x86_64"
    Requires-Dist: nvidia-cufft-cu11 (==10.9.0.58) ; platform_system == "Linux" and platform_machine == "x86_64"
    Requires-Dist: nvidia-curand-cu11 (==10.2.10.91) ; platform_system == "Linux" and platform_machine == "x86_64"
    Requires-Dist: nvidia-cusolver-cu11 (==11.4.0.1) ; platform_system == "Linux" and platform_machine == "x86_64"
    Requires-Dist: nvidia-cusparse-cu11 (==11.7.4.91) ; platform_system == "Linux" and platform_machine == "x86_64"
    Requires-Dist: nvidia-nccl-cu11 (==2.14.3) ; platform_system == "Linux" and platform_machine == "x86_64"
    Requires-Dist: nvidia-nvtx-cu11 (==11.7.91) ; platform_system == "Linux" and platform_machine == "x86_64"
    Requires-Dist: triton (==2.0.0) ; platform_system == "Linux" and platform_machine == "x86_64"
    
  • Installing torch==2.0.0 in a virtual environment with pip does pull all of these nvidia-XXX packages, so everything works.
  • Previously I had been installing pytorch from the project's own python index. This worked because those wheels bundle all the CUDA they need. But now I'm trying to use the regular PyPI pytorch and running into this issue.

Apologies if this is a duplicate issue; I've found plenty of mentions of pytorch and of missing requirements, but nothing that seemed to match!

@thejcannon added the "backend: Python" label May 8, 2023
@jsirois
Contributor

jsirois commented May 9, 2023

@gautiervarjo this is because torch has inconsistent metadata.

I downloaded all 2.0.0 artifacts:

jsirois@Gill-Windows:~/support/pants/issue-18936 $ ls -lrt
total 4395400
-rw-r--r-- 1 jsirois jsirois  55834652 May  8 17:29 torch-2.0.0-cp311-none-macosx_11_0_arm64.whl
-rw-r--r-- 1 jsirois jsirois  63204594 May  8 17:30 torch-2.0.0-cp311-cp311-manylinux2014_aarch64.whl
-rw-r--r-- 1 jsirois jsirois  55832064 May  8 17:30 torch-2.0.0-cp310-none-macosx_11_0_arm64.whl
-rw-r--r-- 1 jsirois jsirois 139533501 May  8 17:30 torch-2.0.0-cp311-none-macosx_10_9_x86_64.whl
-rw-r--r-- 1 jsirois jsirois  63204166 May  8 17:31 torch-2.0.0-cp310-cp310-manylinux2014_aarch64.whl
-rw-r--r-- 1 jsirois jsirois  55833824 May  8 17:31 torch-2.0.0-cp39-none-macosx_11_0_arm64.whl
-rw-r--r-- 1 jsirois jsirois  63203859 May  8 17:31 torch-2.0.0-cp39-cp39-manylinux2014_aarch64.whl
-rw-r--r-- 1 jsirois jsirois  55830206 May  8 17:31 torch-2.0.0-cp38-none-macosx_11_0_arm64.whl
-rw-r--r-- 1 jsirois jsirois  63206408 May  8 17:32 torch-2.0.0-cp38-cp38-manylinux2014_aarch64.whl
-rw-r--r-- 1 jsirois jsirois  74257960 May  8 17:32 torch-2.0.0-1-cp311-cp311-manylinux2014_aarch64.whl
-rw-r--r-- 1 jsirois jsirois  74256435 May  8 17:32 torch-2.0.0-1-cp310-cp310-manylinux2014_aarch64.whl
-rw-r--r-- 1 jsirois jsirois  74259956 May  8 17:32 torch-2.0.0-1-cp39-cp39-manylinux2014_aarch64.whl
-rw-r--r-- 1 jsirois jsirois  74263470 May  8 17:32 torch-2.0.0-1-cp38-cp38-manylinux2014_aarch64.whl
-rw-r--r-- 1 jsirois jsirois 172305868 May  8 17:32 torch-2.0.0-cp311-cp311-win_amd64.whl
-rw-r--r-- 1 jsirois jsirois 139828552 May  8 17:32 torch-2.0.0-cp310-none-macosx_10_9_x86_64.whl
-rw-r--r-- 1 jsirois jsirois 139828281 May  8 17:33 torch-2.0.0-cp39-none-macosx_10_9_x86_64.whl
-rw-r--r-- 1 jsirois jsirois 139531957 May  8 17:33 torch-2.0.0-cp38-none-macosx_10_9_x86_64.whl
-rw-r--r-- 1 jsirois jsirois 172307267 May  8 17:33 torch-2.0.0-cp310-cp310-win_amd64.whl
-rw-r--r-- 1 jsirois jsirois 172333273 May  8 17:33 torch-2.0.0-cp39-cp39-win_amd64.whl
-rw-r--r-- 1 jsirois jsirois 172333281 May  8 17:33 torch-2.0.0-cp38-cp38-win_amd64.whl
-rw-r--r-- 1 jsirois jsirois 619895084 May  8 17:36 torch-2.0.0-cp311-cp311-manylinux1_x86_64.whl
-rw-r--r-- 1 jsirois jsirois 619894634 May  8 17:36 torch-2.0.0-cp310-cp310-manylinux1_x86_64.whl
-rw-r--r-- 1 jsirois jsirois 619883578 May  8 17:36 torch-2.0.0-cp39-cp39-manylinux1_x86_64.whl
-rw-r--r-- 1 jsirois jsirois 619877846 May  8 17:36 torch-2.0.0-cp38-cp38-manylinux1_x86_64.whl

That's 4.2 GB worth:

$ du -sh
4.2G    .

And the deps are:

jsirois@Gill-Windows:~/support/pants/issue-18936 $ for wheel in *.whl; do echo -e "\n\n${wheel} ->" && unzip -qc $wheel torch-2.0.0.dist-info/METADATA | grep Requires-Dist | sort; done


torch-2.0.0-1-cp310-cp310-manylinux2014_aarch64.whl ->
Requires-Dist: filelock
Requires-Dist: jinja2
Requires-Dist: networkx
Requires-Dist: opt-einsum (>=3.3) ; extra == 'opt-einsum'
Requires-Dist: sympy
Requires-Dist: typing-extensions


torch-2.0.0-1-cp311-cp311-manylinux2014_aarch64.whl ->
Requires-Dist: filelock
Requires-Dist: jinja2
Requires-Dist: networkx
Requires-Dist: opt-einsum (>=3.3) ; extra == 'opt-einsum'
Requires-Dist: sympy
Requires-Dist: typing-extensions


torch-2.0.0-1-cp38-cp38-manylinux2014_aarch64.whl ->
Requires-Dist: filelock
Requires-Dist: jinja2
Requires-Dist: networkx
Requires-Dist: opt-einsum (>=3.3) ; extra == 'opt-einsum'
Requires-Dist: sympy
Requires-Dist: typing-extensions


torch-2.0.0-1-cp39-cp39-manylinux2014_aarch64.whl ->
Requires-Dist: filelock
Requires-Dist: jinja2
Requires-Dist: networkx
Requires-Dist: opt-einsum (>=3.3) ; extra == 'opt-einsum'
Requires-Dist: sympy
Requires-Dist: typing-extensions


torch-2.0.0-cp310-cp310-manylinux1_x86_64.whl ->
Requires-Dist: filelock
Requires-Dist: jinja2
Requires-Dist: networkx
Requires-Dist: nvidia-cublas-cu11 (==11.10.3.66) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-cuda-cupti-cu11 (==11.7.101) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-cuda-nvrtc-cu11 (==11.7.99) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-cuda-runtime-cu11 (==11.7.99) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-cudnn-cu11 (==8.5.0.96) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-cufft-cu11 (==10.9.0.58) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-curand-cu11 (==10.2.10.91) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-cusolver-cu11 (==11.4.0.1) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-cusparse-cu11 (==11.7.4.91) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-nccl-cu11 (==2.14.3) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-nvtx-cu11 (==11.7.91) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: opt-einsum (>=3.3) ; extra == 'opt-einsum'
Requires-Dist: sympy
Requires-Dist: triton (==2.0.0) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: typing-extensions


torch-2.0.0-cp310-cp310-manylinux2014_aarch64.whl ->
Requires-Dist: filelock
Requires-Dist: jinja2
Requires-Dist: networkx
Requires-Dist: opt-einsum (>=3.3) ; extra == 'opt-einsum'
Requires-Dist: sympy
Requires-Dist: typing-extensions


torch-2.0.0-cp310-cp310-win_amd64.whl ->
Requires-Dist: filelock
Requires-Dist: jinja2
Requires-Dist: networkx
Requires-Dist: opt-einsum (>=3.3) ; extra == 'opt-einsum'
Requires-Dist: sympy
Requires-Dist: typing-extensions


torch-2.0.0-cp310-none-macosx_10_9_x86_64.whl ->
Requires-Dist: filelock
Requires-Dist: jinja2
Requires-Dist: networkx
Requires-Dist: opt-einsum (>=3.3) ; extra == 'opt-einsum'
Requires-Dist: sympy
Requires-Dist: typing-extensions


torch-2.0.0-cp310-none-macosx_11_0_arm64.whl ->
Requires-Dist: filelock
Requires-Dist: jinja2
Requires-Dist: networkx
Requires-Dist: opt-einsum (>=3.3) ; extra == 'opt-einsum'
Requires-Dist: sympy
Requires-Dist: typing-extensions


torch-2.0.0-cp311-cp311-manylinux1_x86_64.whl ->
Requires-Dist: filelock
Requires-Dist: jinja2
Requires-Dist: networkx
Requires-Dist: nvidia-cublas-cu11 (==11.10.3.66) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-cuda-cupti-cu11 (==11.7.101) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-cuda-nvrtc-cu11 (==11.7.99) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-cuda-runtime-cu11 (==11.7.99) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-cudnn-cu11 (==8.5.0.96) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-cufft-cu11 (==10.9.0.58) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-curand-cu11 (==10.2.10.91) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-cusolver-cu11 (==11.4.0.1) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-cusparse-cu11 (==11.7.4.91) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-nccl-cu11 (==2.14.3) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-nvtx-cu11 (==11.7.91) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: opt-einsum (>=3.3) ; extra == 'opt-einsum'
Requires-Dist: sympy
Requires-Dist: triton (==2.0.0) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: typing-extensions


torch-2.0.0-cp311-cp311-manylinux2014_aarch64.whl ->
Requires-Dist: filelock
Requires-Dist: jinja2
Requires-Dist: networkx
Requires-Dist: opt-einsum (>=3.3) ; extra == 'opt-einsum'
Requires-Dist: sympy
Requires-Dist: typing-extensions


torch-2.0.0-cp311-cp311-win_amd64.whl ->
Requires-Dist: filelock
Requires-Dist: jinja2
Requires-Dist: networkx
Requires-Dist: opt-einsum (>=3.3) ; extra == 'opt-einsum'
Requires-Dist: sympy
Requires-Dist: typing-extensions


torch-2.0.0-cp311-none-macosx_10_9_x86_64.whl ->
Requires-Dist: filelock
Requires-Dist: jinja2
Requires-Dist: networkx
Requires-Dist: opt-einsum (>=3.3) ; extra == 'opt-einsum'
Requires-Dist: sympy
Requires-Dist: typing-extensions


torch-2.0.0-cp311-none-macosx_11_0_arm64.whl ->
Requires-Dist: filelock
Requires-Dist: jinja2
Requires-Dist: networkx
Requires-Dist: opt-einsum (>=3.3) ; extra == 'opt-einsum'
Requires-Dist: sympy
Requires-Dist: typing-extensions


torch-2.0.0-cp38-cp38-manylinux1_x86_64.whl ->
Requires-Dist: filelock
Requires-Dist: jinja2
Requires-Dist: networkx
Requires-Dist: nvidia-cublas-cu11 (==11.10.3.66) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-cuda-cupti-cu11 (==11.7.101) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-cuda-nvrtc-cu11 (==11.7.99) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-cuda-runtime-cu11 (==11.7.99) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-cudnn-cu11 (==8.5.0.96) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-cufft-cu11 (==10.9.0.58) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-curand-cu11 (==10.2.10.91) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-cusolver-cu11 (==11.4.0.1) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-cusparse-cu11 (==11.7.4.91) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-nccl-cu11 (==2.14.3) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-nvtx-cu11 (==11.7.91) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: opt-einsum (>=3.3) ; extra == 'opt-einsum'
Requires-Dist: sympy
Requires-Dist: triton (==2.0.0) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: typing-extensions


torch-2.0.0-cp38-cp38-manylinux2014_aarch64.whl ->
Requires-Dist: filelock
Requires-Dist: jinja2
Requires-Dist: networkx
Requires-Dist: opt-einsum (>=3.3) ; extra == 'opt-einsum'
Requires-Dist: sympy
Requires-Dist: typing-extensions


torch-2.0.0-cp38-cp38-win_amd64.whl ->
Requires-Dist: filelock
Requires-Dist: jinja2
Requires-Dist: networkx
Requires-Dist: opt-einsum (>=3.3) ; extra == 'opt-einsum'
Requires-Dist: sympy
Requires-Dist: typing-extensions


torch-2.0.0-cp38-none-macosx_10_9_x86_64.whl ->
Requires-Dist: filelock
Requires-Dist: jinja2
Requires-Dist: networkx
Requires-Dist: opt-einsum (>=3.3) ; extra == 'opt-einsum'
Requires-Dist: sympy
Requires-Dist: typing-extensions


torch-2.0.0-cp38-none-macosx_11_0_arm64.whl ->
Requires-Dist: filelock
Requires-Dist: jinja2
Requires-Dist: networkx
Requires-Dist: opt-einsum (>=3.3) ; extra == 'opt-einsum'
Requires-Dist: sympy
Requires-Dist: typing-extensions


torch-2.0.0-cp39-cp39-manylinux1_x86_64.whl ->
Requires-Dist: filelock
Requires-Dist: jinja2
Requires-Dist: networkx
Requires-Dist: nvidia-cublas-cu11 (==11.10.3.66) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-cuda-cupti-cu11 (==11.7.101) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-cuda-nvrtc-cu11 (==11.7.99) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-cuda-runtime-cu11 (==11.7.99) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-cudnn-cu11 (==8.5.0.96) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-cufft-cu11 (==10.9.0.58) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-curand-cu11 (==10.2.10.91) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-cusolver-cu11 (==11.4.0.1) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-cusparse-cu11 (==11.7.4.91) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-nccl-cu11 (==2.14.3) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-nvtx-cu11 (==11.7.91) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: opt-einsum (>=3.3) ; extra == 'opt-einsum'
Requires-Dist: sympy
Requires-Dist: triton (==2.0.0) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: typing-extensions


torch-2.0.0-cp39-cp39-manylinux2014_aarch64.whl ->
Requires-Dist: filelock
Requires-Dist: jinja2
Requires-Dist: networkx
Requires-Dist: opt-einsum (>=3.3) ; extra == 'opt-einsum'
Requires-Dist: sympy
Requires-Dist: typing-extensions


torch-2.0.0-cp39-cp39-win_amd64.whl ->
Requires-Dist: filelock
Requires-Dist: jinja2
Requires-Dist: networkx
Requires-Dist: opt-einsum (>=3.3) ; extra == 'opt-einsum'
Requires-Dist: sympy
Requires-Dist: typing-extensions


torch-2.0.0-cp39-none-macosx_10_9_x86_64.whl ->
Requires-Dist: filelock
Requires-Dist: jinja2
Requires-Dist: networkx
Requires-Dist: opt-einsum (>=3.3) ; extra == 'opt-einsum'
Requires-Dist: sympy
Requires-Dist: typing-extensions


torch-2.0.0-cp39-none-macosx_11_0_arm64.whl ->
Requires-Dist: filelock
Requires-Dist: jinja2
Requires-Dist: networkx
Requires-Dist: opt-einsum (>=3.3) ; extra == 'opt-einsum'
Requires-Dist: sympy
Requires-Dist: typing-extensions

This presents an untenable situation for resolvers, which must either assume the requirement METADATA in one artifact is accurate for all of them or be forced to download ~4 GB of artifacts (per version!) to do an accurate resolve. What torch should be doing is listing the same METADATA in each artifact and using environment markers to say when each requirement applies (see https://peps.python.org/pep-0508/#environment-markers). They very nearly do this! I can only assume they are unaware of the damage they're doing in the real Python ecosystem that exists on the ground. Perhaps raise a voice over there?

I don't see any viable path forward to solve this on the Pants / Pex end that is not extremely hacky / specialized to the torch case.
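For anyone who wants to repeat the comparison above without the shell loop, here is a minimal Python sketch using only the standard library. It reads the METADATA file from each wheel in a directory and groups the wheels by their Requires-Dist set; more than one group means the metadata is inconsistent across artifacts. The function names `requires_dist` and `group_by_requirements` are illustrative and not part of Pants or Pex.

```python
import zipfile
from collections import defaultdict
from email.parser import Parser
from pathlib import Path


def requires_dist(wheel_path):
    """Return the sorted Requires-Dist entries from a wheel's METADATA file."""
    with zipfile.ZipFile(wheel_path) as whl:
        # The dist-info directory name depends on project and version, so search for it.
        metadata_name = next(
            name
            for name in whl.namelist()
            if name.endswith(".dist-info/METADATA") and name.count("/") == 1
        )
        # METADATA uses RFC 822-style headers, which the email parser understands.
        metadata = Parser().parsestr(whl.read(metadata_name).decode("utf-8"))
    return tuple(sorted(metadata.get_all("Requires-Dist") or []))


def group_by_requirements(wheel_dir):
    """Map each distinct Requires-Dist set to the wheel file names that declare it."""
    groups = defaultdict(list)
    for wheel in sorted(Path(wheel_dir).glob("*.whl")):
        groups[requires_dist(wheel)].append(wheel.name)
    return dict(groups)
```

Going by the listings above, calling `group_by_requirements(".")` in the directory holding the downloaded torch 2.0.0 wheels should produce two groups: the four manylinux1_x86_64 wheels with the nvidia-* and triton requirements, and every other wheel without them.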

@gautiervarjo
Author

@jsirois thanks for taking a look so quickly! Do you know offhand how pip ends up pulling those deps? Is it because the pip resolver picks 1 specific platform (x86 linux) immediately whereas Pex tries to build a multi-platform lockfile?
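If that is the explanation, the difference can be illustrated with a toy model. This is only a sketch of the idea, not Pex's or pip's actual algorithm: the requirement sets are abridged from the wheel listings earlier in the thread, the names `lock_from_one_wheel` and `missing_at_install` are made up for illustration, and which artifact a real resolver samples is an assumption here.

```python
# Requires-Dist as declared by two torch 2.0.0 wheels (abridged from the listings in this thread).
WHEEL_REQUIRES = {
    "manylinux2014_aarch64": {
        "filelock", "jinja2", "networkx", "sympy", "typing-extensions",
    },
    "manylinux1_x86_64": {
        "filelock", "jinja2", "networkx", "sympy", "typing-extensions",
        "nvidia-cuda-nvrtc-cu11", "triton",
    },
}


def lock_from_one_wheel(sampled_platform):
    """Build a multi-platform lock by trusting one wheel's metadata for every platform."""
    return set(WHEEL_REQUIRES[sampled_platform])


def missing_at_install(lock, target_platform):
    """Requirements the target platform's wheel declares but the lock never recorded."""
    return WHEEL_REQUIRES[target_platform] - lock


# Assumption for illustration: the lock was built from the aarch64 wheel's metadata.
lock = lock_from_one_wheel("manylinux2014_aarch64")
print(sorted(missing_at_install(lock, "manylinux1_x86_64")))
# prints ['nvidia-cuda-nvrtc-cu11', 'triton']
```

A single-platform install never hits this gap, because it only ever reads the metadata of the one wheel it actually installs.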

I'll poke the pytorch people to see about getting this fixed.

As for workarounds in my repo, I suppose the most straightforward way is to add those dependencies to my requirements and tell Pants torch depends on them.
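As a rough sketch of the first half of that workaround, the Requires-Dist lines from the wheel's METADATA can be turned into requirements.txt-style lines with a small helper. The function name `to_requirement_line` and the regular expression are my own and only cover the simple `name (spec) ; marker` shape seen in the METADATA above, not the full PEP 508 grammar.

```python
import re

# Matches lines such as:
#   Requires-Dist: nvidia-nccl-cu11 (==2.14.3) ; platform_system == "Linux" and ...
REQUIRES_DIST = re.compile(
    r"^Requires-Dist:\s*(?P<name>[A-Za-z0-9._-]+)"
    r"\s*(?:\((?P<spec>[^)]*)\))?"
    r"\s*(?:;\s*(?P<marker>.*\S))?\s*$"
)


def to_requirement_line(metadata_line):
    """Convert one Requires-Dist METADATA line into a requirements.txt-style line."""
    match = REQUIRES_DIST.match(metadata_line.strip())
    if match is None:
        raise ValueError(f"not a Requires-Dist line: {metadata_line!r}")
    # Drop the parentheses around the version specifier, keep the environment marker.
    line = match["name"] + (match["spec"] or "")
    if match["marker"]:
        line += "; " + match["marker"]
    return line


print(to_requirement_line(
    'Requires-Dist: nvidia-cuda-nvrtc-cu11 (==11.7.99) ; '
    'platform_system == "Linux" and platform_machine == "x86_64"'
))
# prints nvidia-cuda-nvrtc-cu11==11.7.99; platform_system == "Linux" and platform_machine == "x86_64"
```

Keeping the environment markers means the extra pins only apply on Linux x86_64. The second half, telling Pants that torch depends on these requirements, still has to be done separately in the BUILD files.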

Previously I used torch wheels with statically linked CUDA from the torch package index, but this no longer works in Pants for torch 2.0. I'm apparently hitting issue #13401, but when running tests (not just when building PEX files), so the layout="packed" workaround doesn't apply. It seems this issue will be fixed in Pants 2.17, so that'll be nice!

@thejcannon
Member

To chime in, it's quite dirty of torch to silently say "if your platform happens to download the Linux x86_64 wheel, congrats, you downloaded CUDA."

@jsirois
Contributor

jsirois commented May 9, 2023

@gautiervarjo exactly.

@gautiervarjo
Author

Alright then, thanks again for your help! Closing this.
