pytorch-bin: 1.10.0 -> 1.11.0 #164712

Merged Mar 24, 2022 (4 commits)

@rehno-lindeque
Contributor

rehno-lindeque commented Mar 18, 2022

Description of changes

This is still a work in progress (needs testing, etc)

1.11 Release of PyTorch:

Nix changes:

  • Bumped pytorch as well as pytorch-bin to 1.11
  • Added darwin downloads to the prefetch script
  • Added cp310 to the prefetch script for linux
  • Enabled pytorch-bin for (linux, python 3.10)

TODO

-- Found CUDA: /nix/store/rclmj9izhxwhgwrgjk97mka3ndm4bzyn-cudatoolkit-10.1.243 (found version "10.1") 
-- Caffe2: CUDA detected: 10.1
-- Caffe2: CUDA nvcc is: /nix/store/rclmj9izhxwhgwrgjk97mka3ndm4bzyn-cudatoolkit-10.1.243/bin/nvcc
-- Caffe2: CUDA toolkit directory: /nix/store/rclmj9izhxwhgwrgjk97mka3ndm4bzyn-cudatoolkit-10.1.243
CMake Error at cmake/public/cuda.cmake:42 (message):
  PyTorch requires CUDA 10.2 or above.
Call Stack (most recent call first):
  cmake/Dependencies.cmake:1191 (include)
  CMakeLists.txt:653 (include)
  • Needs building, testing, etc. (in progress)

Things done
  • Built on platform(s)
    • x86_64-linux
    • aarch64-linux
    • x86_64-darwin
    • aarch64-darwin
  • For non-Linux: Is sandbox = true set in nix.conf? (See Nix manual)
  • Tested, as applicable:
  • Tested compilation of all packages that depend on this change using nix-shell -p nixpkgs-review --run "nixpkgs-review rev HEAD". Note: all changes have to be committed, also see nixpkgs-review usage
  • Tested basic functionality of all binary files (usually in ./result/bin/)
  • 22.05 Release Notes (or backporting 21.11 Release notes)
    • (Package updates) Added a release notes entry if the change is major or breaking
    • (Module updates) Added a release notes entry if the change is significant
    • (Module addition) Added a release notes entry if adding a new NixOS module
    • (Release notes changes) Ran nixos/doc/manual/md-to-db.sh to update generated release notes
  • Fits CONTRIBUTING.md.

@ofborg bot added the "8.has: package (new)" label Mar 18, 2022
Member

@junjihashimoto left a comment

Thank you for creating a PR!
torchvision-bin and torchaudio-bin also depend on this.
We should include them in another commit or require another PR.

@junjihashimoto
Member

Did prefetch.sh work fine?
When I tried it with libtorch, it stopped for some reason.
The problem is not well understood, but it is reproducible.

@junjihashimoto
Member

Technically, the binary does not need nvcc with cudatoolkit.

@rehno-lindeque
Contributor Author

> Did prefetch.sh work fine? When I tried it with libtorch, it stopped for some reason. The problem is not well understood, but it is reproducible.

I'll double check it (I ran it in two steps before, so it's possible I made a mistake somewhere in updating the script)

@rehno-lindeque changed the title from "pytorch: 1.10.2 -> 1.11.0" to "pytorch: 1.10.2 -> 1.11.0 [WIP]" Mar 19, 2022
@junjihashimoto
Member

I think the title and commit should be 'python3Packages.pytorch-bin: 1.10.2 -> 1.11.0'.
pytorch is another derivation.

@junjihashimoto
Member

I misunderstood slightly. This PR includes both the full-build and binary derivations.

@samuela
Member

samuela commented Mar 21, 2022

> CUDA appears to need updating

cc @NixOS/cuda-maintainers

This failure is due to the CUDA version being picked up from the cudnn package instead of cudatoolkit. It's a bit messy but the latest cuDNN version supported by pytorch v1.10.2 was cudnn v7.6.5 AFAICT. And cudnn 7.6.5 supported at most CUDA 10.1 officially (it seems to unofficially work with CUDA 10.2 as well). In nix we try to enforce the official version constraints as much as possible.

Long story short, try the following in python-packages.nix:

  pytorch = callPackage ../development/python-modules/pytorch {
    cudaSupport = pkgs.config.cudaSupport or false;
    cudatoolkit = pkgs.cudatoolkit_11;
    cudnn = pkgs.cudnn_8_3_cudatoolkit_11;
  };

If that doesn't work, we'll need to figure out the latest version of cuDNN that pytorch supports compiling against. cuDNN v8.3.5 is currently the latest upstream and the latest packaged in nixpkgs. But there are other options as well. Don't hesitate to reach out with any questions! I guarantee that the CUDA/cuDNN versions you need are packaged... just a matter of figuring out what works.
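
The same overrides can also be tried from a scratch file, without editing python-packages.nix. This is a minimal sketch, assuming a nixpkgs checkout from around the time of this PR in which `cudatoolkit_11` and `cudnn_8_3_cudatoolkit_11` exist and `pytorch` accepts these arguments; the file name `try-pytorch-cuda.nix` is made up for illustration.

```nix
# try-pytorch-cuda.nix (hypothetical file name)
# Build with: nix-build try-pytorch-cuda.nix
let
  # CUDA and cuDNN are unfree, so they have to be allowed explicitly.
  pkgs = import <nixpkgs> { config.allowUnfree = true; };
in
pkgs.python3Packages.pytorch.override {
  cudaSupport = true;
  # Take CUDA and cuDNN from the same CUDA 11 family so their versions agree.
  cudatoolkit = pkgs.cudatoolkit_11;
  cudnn = pkgs.cudnn_8_3_cudatoolkit_11;
}
```

Expect a source build of pytorch with CUDA enabled to take a long time.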

@mweinelt
Member

mweinelt commented Mar 22, 2022

Source build is already on staging. 5446ad8

@rehno-lindeque
Contributor Author

rehno-lindeque commented Mar 22, 2022

Ah I missed comments here - glad to see others are beating me to it

(I'll turn this into a pytorch-bin only bump unless I'm too slow again :))

@junjihashimoto
Member

@mweinelt
Awesome!
Could you tell me where the job for pytorch-1.11 is?
It doesn't seem to be among Hydra's jobs.
https://hydra.nixos.org/search?query=pytorch

@samuela
Member

samuela commented Mar 22, 2022

@junjihashimoto nixpkgs-unstable has been failing for the last 7 days, so we likely won't be able to see any builds until that is cleared up. Check out https://status.nixos.org for more info.

@rehno-lindeque force-pushed the pytorch-1.11.0 branch 2 times, most recently from deba706 to b8c24e1 March 22, 2022 13:26
@rehno-lindeque changed the title from "pytorch: 1.10.2 -> 1.11.0 [WIP]" to "pytorch-bin: 1.10.2 -> 1.11.0 [WIP]" Mar 22, 2022
@rehno-lindeque changed the title from "pytorch-bin: 1.10.2 -> 1.11.0 [WIP]" to "pytorch-bin: 1.10.0 -> 1.11.0 [WIP]" Mar 22, 2022
@rehno-lindeque
Contributor Author

rehno-lindeque commented Mar 23, 2022

Is pytorch and/or pytorch-bin important enough to have a pytorch_1_10 and/or pytorch-bin_1_10? I'm happy to keep at it, just curious if others would prefer more granular PRs for rolling pytorch ecosystem forward.

@samuela
Member

samuela commented Mar 23, 2022

> Is pytorch and/or pytorch-bin important enough to have a pytorch_1_10 and/or pytorch-bin_1_10? I'm happy to keep at it, just curious if others would prefer more granular PRs for rolling pytorch ecosystem forward.

My 2c is that I think it's best to avoid the complexity associated with multiple versions if possible, but I'm not a pytorch maintainer so I'll leave that to @junjihashimoto @teh @thoughtpolice and @tscholak.

Member

@junjihashimoto left a comment

Keeping multiple versions may just put off the problem.
We may need more than one version for another reason.
I'm not sure if we should manage the set of packages or follow conda or anything else.

@rehno-lindeque
Contributor Author

rehno-lindeque commented Mar 23, 2022

> There is no source build for torchaudio.

I had a draft PR for torchaudio source that I never quite got to finishing up properly. #160210 with some test dependencies #160206 and #160197

It had a lot of failing tests that took crazy times to run, so it was a bit difficult to verify. (I don't use torchaudio myself either, and really don't want to be a maintainer).

Anyway, if anyone is so inclined, feel free to reappropriate it

@junjihashimoto
Member

I also tried it, but I couldn't get many of the non-Python dependencies to build.

@ofborg bot requested a review from junjihashimoto March 23, 2022 15:26
@rehno-lindeque
Contributor Author

Squashed superfluous commits & rebased

@samuela
Member

samuela commented Mar 23, 2022

Result of nixpkgs-review pr 164712 run on x86_64-linux

1 package failed to build:
  • python310Packages.coqui-trainer
8 packages built:
  • python310Packages.pytorch-bin
  • python310Packages.torchaudio-bin
  • python310Packages.torchvision-bin
  • python39Packages.coqui-trainer
  • python39Packages.pytorch-bin
  • python39Packages.torchaudio-bin
  • python39Packages.torchvision-bin
  • tts

@samuela
Member

samuela commented Mar 23, 2022

coqui-trainer isn't building on python310 but I don't view that as a blocker for merging this. Lots of packages still failing on python310.

error: builder for '/nix/store/1x2jysbhgkd22dpaaa5rwq0z7zjwpfxr-python3.10-coqui-trainer-0.0.5.drv' failed with exit code 1;
       last 10 log lines:
       >     File "/nix/store/5gdg9a31yh18gsky303g6id7w2da9pvk-python3.10-setuptools-57.2.0/lib/python3.10/site-packages/setuptools/build_meta.py", line 258, in run_setup
       >       super(_BuildMetaLegacyBackend,
       >     File "/nix/store/5gdg9a31yh18gsky303g6id7w2da9pvk-python3.10-setuptools-57.2.0/lib/python3.10/site-packages/setuptools/build_meta.py", line 150, in run_setup
       >       exec(compile(code, __file__, 'exec'), locals())
       >     File "setup.py", line 36, in <module>
       >       raise RuntimeError(
       >   RuntimeError: Coqui-Trainer requires python >= 3.6 and <=3.10 but your Python version is 3.10.2 (main, Jan 13 2022, 19:06:22) [GCC 10.3.0]
       >   Preparing metadata (pyproject.toml) ... error
       > WARNING: Discarding file:///build/source. Command errored out with exit status 1: /nix/store/7mv9crg6y9bxgn39ynhkkwi3lhhsqhaj-python3-3.10.2/bin/python3.10 /nix/store/gximmp563q6s55srwds37hvx5mhacsxy-python3.10-pip-21.3.1/lib/python3.10/site-packages/pip/_vendor/pep517/in_process/_in_process.py prepare_metadata_for_build_wheel /build/tmp3s5v5if6 Check the logs for full command output.
       > ERROR: Command errored out with exit status 1: /nix/store/7mv9crg6y9bxgn39ynhkkwi3lhhsqhaj-python3-3.10.2/bin/python3.10 /nix/store/gximmp563q6s55srwds37hvx5mhacsxy-python3.10-pip-21.3.1/lib/python3.10/site-packages/pip/_vendor/pep517/in_process/_in_process.py prepare_metadata_for_build_wheel /build/tmp3s5v5if6 Check the logs for full command output.
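
The RuntimeError above looks contradictory, since 3.10.2 appears to satisfy "<=3.10". One plausible explanation, which is an assumption and not confirmed in this thread, is that the upstream check compares the full interpreter version, and a full version such as 3.10.2 sorts after a bare 3.10. Nix's `builtins.compareVersions` can be used to see how such versions are ordered; the file name below is hypothetical.

```nix
# version-order.nix (hypothetical file name)
# Evaluate with: nix-instantiate --eval --strict version-order.nix
{
  # compareVersions returns -1, 0 or 1: first argument older, equal, or newer.
  # A full interpreter version against the bare upper bound:
  fullVersion = builtins.compareVersions "3.10.2" "3.10";

  # Major.minor only, as a check on the minor series would compare:
  majorMinor = builtins.compareVersions "3.10" "3.10";
}
```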

Member

@samuela left a comment

let's use fetchpatch and then I think this PR is good to go!

@mweinelt
Member

Let's add a disable for python3.10 on coqui-trainer with a reference to coqui-ai/Trainer#22.

  disabled = pythonAtLeast "3.10"; # https://github.com/coqui-ai/Trainer/issues/22

I agree with the remarks made by @samuela, patches need to go, fetchpatch is the way.
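
For reference, the two suggestions might look roughly like this in a Python package expression. This is only a sketch and not the diff that was merged: the package name, version, patch URL and hashes are placeholders (`lib.fakeSha256` is the nixpkgs dummy hash and has to be replaced with the real one), and it combines both attributes in one hypothetical package purely for illustration.

```nix
# Hypothetical package expression; every name, URL and hash below is a placeholder.
{ lib, buildPythonPackage, fetchPypi, fetchpatch, pythonAtLeast }:

buildPythonPackage rec {
  pname = "example-package";
  version = "0.0.0";

  # Skip Python versions that upstream does not support yet.
  disabled = pythonAtLeast "3.10";

  src = fetchPypi {
    inherit pname version;
    sha256 = lib.fakeSha256;
  };

  patches = [
    # Fetch the upstream change as a patch instead of vendoring a .patch file in nixpkgs.
    (fetchpatch {
      name = "example-fix.patch";
      url = "https://example.invalid/owner/repo/commit/0000000.patch";
      sha256 = lib.fakeSha256;
    })
  ];
}
```

With a dummy hash in place, the first build attempt fails with a hash mismatch that reports the real hash to paste in.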

@samuela
Member

samuela commented Mar 23, 2022

Result of nixpkgs-review pr 164712 run on x86_64-linux

8 packages built:
  • python310Packages.pytorch-bin
  • python310Packages.torchaudio-bin
  • python310Packages.torchvision-bin
  • python39Packages.coqui-trainer
  • python39Packages.pytorch-bin
  • python39Packages.torchaudio-bin
  • python39Packages.torchvision-bin
  • tts

Member

@samuela left a comment

Ok LGTM! I'll go ahead and merge tomorrow unless anyone objects

@rehno-lindeque
Contributor Author

Thanks for all the help!

@samuela merged commit 077f078 into NixOS:master Mar 24, 2022
@samuela
Member

samuela commented Mar 24, 2022

Great work, and thank you for your persistence with this @rehno-lindeque! It's not easy getting PRs through for some of these larger packages but it's absolutely crucial work!

@nixos-discourse

This pull request has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/nixpkgss-current-development-workflow-is-not-sustainable/18741/53
