Fix CMake metadata for CUDA-enabled libtorch #339
Conversation
Hi! This is the friendly automated conda-forge-linting service. I just wanted to let you know that I linted all conda-recipes in your PR. I do have some suggestions for making it better though... For recipe/meta.yaml:
This message was generated by GitHub Actions workflow run https://github.com/conda-forge/conda-forge-webservices/actions/runs/13102492026. Examine the logs at this URL for more detail. |
(force-pushed from ffcd410 to 37380b3)
@danpetry, if you're feeling motivated: Next step is getting rid of |
FWICS that comment dates back to 2021, but they've started requiring CMake 3.18 in 2022 — so perhaps it just meant they couldn't bump minimum CMake version yet back then? |
@h-vetinari the motivation is there but unfortunately my daughter has got sick and is out of daycare for the next couple of days. The next time I'll be able to work on it will be Monday now. I'd be happy to know about the state/next steps then. |
Yeah, a lot changed over the years, and the CMake code there is still pretty crusty. Mainly I want to keep the surgery minimal to not introduce behaviour changes; and making all the places where they set CUDA archs explicitly a no-op seems... excessive? One idea I just had would be vendoring CMake's implementation of |
Wishing a speedy recovery to your daughter @danpetry, take your time! |
Are you aiming to submit these changes upstream? While being conservative for conda-forge patching makes sense, I think it'd be fine to go all the way in |
TBH, it's unlikely. I can throw something over the fence for upstream to use as a jumping off point, but the CMake files are pretty crusty (and sprawling), plus I don't know the codebase, so I feel this would require an extraordinary amount of time to get into mergeable shape. |
For context: we've pushed on these types of build system changes in PyTorch in the past - it's close to impossible to land a structural change like moving away from |
Heh, so I guess my success with pytorch/pytorch#145487 was either exceptional (or I'm judging it prematurely). |
Or maybe things improved - but let's see after it gets merged:) |
This reverts commit 53ab2c8.
There's something very strange going on with the CMake cache. For e2c551d, the cache worked fine (logs), meaning that the
At first I thought this was caused by removing

--- a/recipe/build.sh
+++ b/recipe/build.sh
@@ -219,6 +219,8 @@ elif [[ ${cuda_compiler_version} != "None" ]]; then
export USE_STATIC_CUDNN=0
export MAGMA_HOME="${PREFIX}"
export USE_MAGMA=1
+ # turn off noisy nvcc warnings
+ export CUDAFLAGS="-w --ptxas-options=-w"
else
if [[ "$target_platform" != *-64 ]]; then
# Breakpad seems to not work on aarch64 or ppc64le
diff --git a/recipe/meta.yaml b/recipe/meta.yaml
index e3b8b81..e110190 100644
--- a/recipe/meta.yaml
+++ b/recipe/meta.yaml
@@ -69,10 +69,11 @@ source:
- patches/0016-point-include-paths-to-PREFIX-include.patch
- patches/0017-Add-conda-prefix-to-inductor-include-paths.patch
- patches/0018-make-ATEN_INCLUDE_DIR-relative-to-TORCH_INSTALL_PREF.patch
- - patches/0019-remove-DESTINATION-lib-from-CMake-install-TARGETS-di.patch # [win]
+ - patches/0019-remove-DESTINATION-lib-from-CMake-install-TARGETS-di.patch # [win]
- patches/0020-make-library-name-in-test_mutable_custom_op_fixed_la.patch
- patches/0021-avoid-deprecated-find_package-CUDA-in-caffe2-CMake-m.patch
- - patches_submodules/0001-remove-DESTINATION-lib-from-CMake-install-directives.patch # [win]
+ - patches_submodules/fbgemm/0001-remove-DESTINATION-lib-from-CMake-install-directives.patch # [win]
+ - patches_submodules/tensorpipe/0001-switch-away-from-find_package-CUDA.patch
build:
number: {{ build }}
diff --git a/recipe/patches_submodules/tensorpipe/0001-switch-away-from-find_package-CUDA.patch b/recipe/patches_submodules/tensorpipe/0001-switch-away-from-find_package-CUDA.patch
new file mode 100644
index 0000000..fe411d7
--- /dev/null
+++ b/recipe/patches_submodules/tensorpipe/0001-switch-away-from-find_package-CUDA.patch
@@ -0,0 +1,22 @@
+From 9a1de62dd1b3d816d6fb87c2041f4005ab5c683d Mon Sep 17 00:00:00 2001
+From: "H. Vetinari" <h.vetinari@gmx.com>
+Date: Sun, 2 Feb 2025 08:54:01 +1100
+Subject: [PATCH] switch away from find_package(CUDA)
+
+---
+ tensorpipe/CMakeLists.txt | 2 +-
+ 1 file changed, 1 insertion(+), 1 deletion(-)
+
+diff --git a/third_party/tensorpipe/tensorpipe/CMakeLists.txt b/third_party/tensorpipe/tensorpipe/CMakeLists.txt
+index efcffc2..1c3b2ca 100644
+--- a/third_party/tensorpipe/tensorpipe/CMakeLists.txt
++++ b/third_party/tensorpipe/tensorpipe/CMakeLists.txt
+@@ -234,7 +234,7 @@ if(TP_USE_CUDA)
+ # TP_INCLUDE_DIRS is list of include path to be used
+ set(TP_CUDA_INCLUDE_DIRS)
+
+- find_package(CUDA REQUIRED)
++ find_package(CUDAToolkit REQUIRED)
+ list(APPEND TP_CUDA_LINK_LIBRARIES ${CUDA_LIBRARIES})
+ list(APPEND TP_CUDA_INCLUDE_DIRS ${CUDA_INCLUDE_DIRS})
none of which are plausible as a cache-busting mechanism. Maybe this is somehow similar to #343, but I really don't know what could be causing this. I've compared the host/build environments between the two runs. Here's the comparison for the good cache:

--- a/cache_good.txt
+++ b/cache_good.txt
@@ -3,11 +3,11 @@ The following NEW packages will be INSTALLED:
_libgcc_mutex: 0.1-conda_forge conda-forge
_openmp_mutex: 4.5-2_gnu conda-forge
attr: 2.5.1-h166bdaf_1 conda-forge
- brotli-python: 1.1.0-py312h2ec8cdc_2 conda-forge
+ brotli-python: 1.1.0-py313h46c70d0_2 conda-forge
bzip2: 1.0.8-h4bc722e_7 conda-forge
ca-certificates: 2025.1.31-hbcca054_0 conda-forge
certifi: 2024.12.14-pyhd8ed1ab_0 conda-forge
- cffi: 1.17.1-py312h06ac9bb_0 conda-forge
+ cffi: 1.17.1-py313hfab6e84_0 conda-forge
charset-normalizer: 3.4.1-pyhd8ed1ab_0 conda-forge
cuda-cccl_linux-64: 12.6.77-ha770c72_0 conda-forge
cuda-crt-dev_linux-64: 12.6.85-ha770c72_0 conda-forge
@@ -67,35 +67,35 @@ The following NEW packages will be INSTALLED:
liblzma: 5.6.3-hb9d3cd8_1 conda-forge
libmagma: 2.8.0-h566cb83_2 conda-forge
libmagma_sparse: 2.8.0-h0af6554_0 conda-forge
+ libmpdec: 4.0.0-h4bc722e_0 conda-forge
libnl: 3.11.0-hb9d3cd8_0 conda-forge
- libnsl: 2.0.1-hd590300_0 conda-forge
libnvjitlink: 12.6.85-hbd13f7d_0 conda-forge
libprotobuf: 5.28.3-h6128344_1 conda-forge
libsqlite: 3.48.0-hee588c1_1 conda-forge
libstdcxx: 14.2.0-hc0a3c3a_1 conda-forge
libstdcxx-ng: 14.2.0-h4852527_1 conda-forge
libsystemd0: 257.2-h3dc2cb9_0 conda-forge
+ libtorch: 2.5.1-cuda126_generic_h744fda7_212 local
libudev1: 257.2-h9a4d06a_0 conda-forge
libuuid: 2.38.1-h0b41bf4_0 conda-forge
libuv: 1.50.0-hb9d3cd8_0 conda-forge
- libxcrypt: 4.4.36-hd590300_1 conda-forge
libzlib: 1.3.1-hb9d3cd8_2 conda-forge
lz4-c: 1.10.0-h5888daf_1 conda-forge
magma: 2.8.0-h51420fd_0 conda-forge
nccl: 2.25.1.1-ha44e49d_0 conda-forge
ncurses: 6.5-h2d0b736_3 conda-forge
- numpy: 2.2.2-py312h72c5963_0 conda-forge
+ numpy: 2.2.2-py313h17eae1a_0 conda-forge
nvtx-c: 3.1.0-ha770c72_1 conda-forge
openssl: 3.4.0-h7b32b05_1 conda-forge
- pip: 25.0-pyh8b19718_0 conda-forge
+ pip: 25.0-pyh145f28c_0 conda-forge
pkg-config: 0.29.2-h4bc722e_1009 conda-forge
pybind11: 2.13.6-pyh1ec8472_2 conda-forge
pybind11-global: 2.13.6-pyh415d2e4_2 conda-forge
pycparser: 2.22-pyh29332c3_1 conda-forge
pysocks: 1.7.1-pyha55dd90_7 conda-forge
- python: 3.12.8-h9e4cc4f_1_cpython conda-forge
- python_abi: 3.12-5_cp312 conda-forge
- pyyaml: 6.0.2-py312h178313f_2 conda-forge
+ python: 3.13.1-ha99a958_105_cp313 conda-forge
+ python_abi: 3.13-5_cp313 conda-forge
+ pyyaml: 6.0.2-py313h8060acc_2 conda-forge
rdma-core: 55.0-h5888daf_0 conda-forge
readline: 8.2-h8228510_1 conda-forge
requests: 2.32.3-pyhd8ed1ab_1 conda-forge
@@ -106,10 +106,8 @@ The following NEW packages will be INSTALLED:
typing_extensions: 4.12.2-pyha770c72_1 conda-forge
tzdata: 2025a-h78e105d_0 conda-forge
urllib3: 2.3.0-pyhd8ed1ab_0 conda-forge
- wheel: 0.45.1-pyhd8ed1ab_1 conda-forge
yaml: 0.2.5-h7f98852_2 conda-forge
- zlib: 1.3.1-hb9d3cd8_2 conda-forge
- zstandard: 0.23.0-py312hef9b889_1 conda-forge
+ zstandard: 0.23.0-py313h80202fe_1 conda-forge
zstd: 1.5.6-ha6fb4c9_0 conda-forge
The following NEW packages will be INSTALLED:
@@ -158,7 +156,6 @@ The following NEW packages will be INSTALLED:
libgcc-devel_linux-64: 13.3.0-h84ea5a7_101 conda-forge
libgcc-ng: 14.2.0-h69a702a_1 conda-forge
libgomp: 14.2.0-h77fa898_1 conda-forge
- libiconv: 1.17-hd590300_2 conda-forge
liblzma: 5.6.3-hb9d3cd8_1 conda-forge
libmpdec: 4.0.0-h4bc722e_0 conda-forge
libnghttp2: 1.64.0-h161d5f1_0 conda-forge
@@ -172,20 +169,16 @@ The following NEW packages will be INSTALLED:
libuuid: 2.38.1-h0b41bf4_0 conda-forge
libuv: 1.50.0-hb9d3cd8_0 conda-forge
libzlib: 1.3.1-hb9d3cd8_2 conda-forge
- lz4-c: 1.10.0-h5888daf_1 conda-forge
make: 4.4.1-hb9d3cd8_2 conda-forge
ncurses: 6.5-h2d0b736_3 conda-forge
ninja: 1.12.1-h297d8ca_0 conda-forge
openssl: 3.4.0-h7b32b05_1 conda-forge
- popt: 1.16-h0b475e3_2002 conda-forge
protobuf: 5.28.3-py313h46c70d0_0 conda-forge
python: 3.13.1-ha99a958_105_cp313 conda-forge
python_abi: 3.13-5_cp313 conda-forge
readline: 8.2-h8228510_1 conda-forge
rhash: 1.4.5-hb9d3cd8_0 conda-forge
- rsync: 3.4.1-h168f954_0 conda-forge
sysroot_linux-64: 2.17-h0157908_18 conda-forge
tk: 8.6.13-noxft_h4845f30_101 conda-forge
tzdata: 2025a-h78e105d_0 conda-forge
- xxhash: 0.8.3-hb9d3cd8_0 conda-forge
zstd: 1.5.6-ha6fb4c9_0 conda-forge

Here's the comparison for the bad cache:

--- a/cache_bad.txt
+++ b/cache_bad.txt
@@ -3,11 +3,11 @@ The following NEW packages will be INSTALLED:
_libgcc_mutex: 0.1-conda_forge conda-forge
_openmp_mutex: 4.5-2_gnu conda-forge
attr: 2.5.1-h166bdaf_1 conda-forge
- brotli-python: 1.1.0-py312h2ec8cdc_2 conda-forge
+ brotli-python: 1.1.0-py311hfdbb021_2 conda-forge
bzip2: 1.0.8-h4bc722e_7 conda-forge
ca-certificates: 2025.1.31-hbcca054_0 conda-forge
certifi: 2024.12.14-pyhd8ed1ab_0 conda-forge
- cffi: 1.17.1-py312h06ac9bb_0 conda-forge
+ cffi: 1.17.1-py311hf29c0ef_0 conda-forge
charset-normalizer: 3.4.1-pyhd8ed1ab_0 conda-forge
cuda-cccl_linux-64: 12.6.77-ha770c72_0 conda-forge
cuda-crt-dev_linux-64: 12.6.85-ha770c72_0 conda-forge
@@ -75,6 +75,7 @@ The following NEW packages will be INSTALLED:
libstdcxx: 14.2.0-hc0a3c3a_1 conda-forge
libstdcxx-ng: 14.2.0-h4852527_1 conda-forge
libsystemd0: 257.2-h3dc2cb9_0 conda-forge
+ libtorch: 2.5.1-cuda126_generic_h744fda7_212 local
libudev1: 257.2-h9a4d06a_0 conda-forge
libuuid: 2.38.1-h0b41bf4_0 conda-forge
libuv: 1.50.0-hb9d3cd8_0 conda-forge
@@ -84,7 +85,7 @@ The following NEW packages will be INSTALLED:
magma: 2.8.0-h51420fd_0 conda-forge
nccl: 2.25.1.1-ha44e49d_0 conda-forge
ncurses: 6.5-h2d0b736_3 conda-forge
- numpy: 2.2.2-py312h72c5963_0 conda-forge
+ numpy: 2.0.2-py311h71ddf71_1 conda-forge
nvtx-c: 3.1.0-ha770c72_1 conda-forge
openssl: 3.4.0-h7b32b05_1 conda-forge
pip: 25.0-pyh8b19718_0 conda-forge
@@ -93,9 +94,9 @@ The following NEW packages will be INSTALLED:
pybind11-global: 2.13.6-pyh415d2e4_2 conda-forge
pycparser: 2.22-pyh29332c3_1 conda-forge
pysocks: 1.7.1-pyha55dd90_7 conda-forge
- python: 3.12.8-h9e4cc4f_1_cpython conda-forge
- python_abi: 3.12-5_cp312 conda-forge
- pyyaml: 6.0.2-py312h178313f_2 conda-forge
+ python: 3.11.11-h9e4cc4f_1_cpython conda-forge
+ python_abi: 3.11-5_cp311 conda-forge
+ pyyaml: 6.0.2-py311h2dc5d0c_2 conda-forge
rdma-core: 55.0-h5888daf_0 conda-forge
readline: 8.2-h8228510_1 conda-forge
requests: 2.32.3-pyhd8ed1ab_1 conda-forge
@@ -108,8 +109,7 @@ The following NEW packages will be INSTALLED:
urllib3: 2.3.0-pyhd8ed1ab_0 conda-forge
wheel: 0.45.1-pyhd8ed1ab_1 conda-forge
yaml: 0.2.5-h7f98852_2 conda-forge
- zlib: 1.3.1-hb9d3cd8_2 conda-forge
- zstandard: 0.23.0-py312hef9b889_1 conda-forge
+ zstandard: 0.23.0-py311hbc35293_1 conda-forge
zstd: 1.5.6-ha6fb4c9_0 conda-forge
The following NEW packages will be INSTALLED:
@@ -158,10 +158,9 @@ The following NEW packages will be INSTALLED:
libgcc-devel_linux-64: 13.3.0-h84ea5a7_101 conda-forge
libgcc-ng: 14.2.0-h69a702a_1 conda-forge
libgomp: 14.2.0-h77fa898_1 conda-forge
- libiconv: 1.17-hd590300_2 conda-forge
liblzma: 5.6.3-hb9d3cd8_1 conda-forge
- libmpdec: 4.0.0-h4bc722e_0 conda-forge
libnghttp2: 1.64.0-h161d5f1_0 conda-forge
+ libnsl: 2.0.1-hd590300_0 conda-forge
libprotobuf: 5.28.3-h6128344_1 conda-forge
libsanitizer: 13.3.0-heb74ff8_1 conda-forge
libsqlite: 3.48.0-hee588c1_1 conda-forge
@@ -171,21 +170,18 @@ The following NEW packages will be INSTALLED:
libstdcxx-ng: 14.2.0-h4852527_1 conda-forge
libuuid: 2.38.1-h0b41bf4_0 conda-forge
libuv: 1.50.0-hb9d3cd8_0 conda-forge
+ libxcrypt: 4.4.36-hd590300_1 conda-forge
libzlib: 1.3.1-hb9d3cd8_2 conda-forge
- lz4-c: 1.10.0-h5888daf_1 conda-forge
make: 4.4.1-hb9d3cd8_2 conda-forge
ncurses: 6.5-h2d0b736_3 conda-forge
ninja: 1.12.1-h297d8ca_0 conda-forge
openssl: 3.4.0-h7b32b05_1 conda-forge
- popt: 1.16-h0b475e3_2002 conda-forge
- protobuf: 5.28.3-py313h46c70d0_0 conda-forge
- python: 3.13.1-ha99a958_105_cp313 conda-forge
- python_abi: 3.13-5_cp313 conda-forge
+ protobuf: 5.28.3-py311hfdbb021_0 conda-forge
+ python: 3.11.11-h9e4cc4f_1_cpython conda-forge
+ python_abi: 3.11-5_cp311 conda-forge
readline: 8.2-h8228510_1 conda-forge
rhash: 1.4.5-hb9d3cd8_0 conda-forge
- rsync: 3.4.1-h168f954_0 conda-forge
sysroot_linux-64: 2.17-h0157908_18 conda-forge
tk: 8.6.13-noxft_h4845f30_101 conda-forge
tzdata: 2025a-h78e105d_0 conda-forge
- xxhash: 0.8.3-hb9d3cd8_0 conda-forge
zstd: 1.5.6-ha6fb4c9_0 conda-forge

Final comparison, here's the
|
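For reference, comparisons like the ones above can be produced by diffing the solver output of the two CI runs; a rough sketch in shell, where the log file names are placeholders rather than the actual artifacts:

# Illustrative only: pull the "NEW packages will be INSTALLED" sections out of
# the two build logs and compare them.
grep -A 200 "The following NEW packages will be INSTALLED:" good_run.log > cache_good.txt
grep -A 200 "The following NEW packages will be INSTALLED:" bad_run.log  > cache_bad.txt
diff -u cache_good.txt cache_bad.txt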
Sigh, this is the second time I see this - at first I thought it was just a flake:
|
Hmm, it may be a flake — in Gentoo I'm also seeing |
I'm going to reproduce the cache problem locally, and see if I can figure something out. |
Ok, I don't think it's "just" cache. FWICS only the CUDA objects get rebuilt and ccache doesn't match. Figuring out how to get verbose CMake output… |
Ok, I'm seeing some weird things. For a start, for some reason the pytorch build doesn't get
Anyway, for some reason
They definitely get set and are passed to CMake — but they don't appear in the "CUDA flags" output in the "libtorch" part and aren't used in the ninja file. But they do appear and are used when reconfiguring for "pytorch". |
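For background (my understanding of the relevant CMake behaviour, not something verified against this particular build): CMake only consults the CUDAFLAGS environment variable when the CUDA language is first enabled, uses it to seed the CMAKE_CUDA_FLAGS cache entry, and ignores the environment on later configures that reuse the cache. A minimal illustration with a standalone project, outside the pytorch build:

# First configure: the environment seeds CMAKE_CUDA_FLAGS in the cache.
CUDAFLAGS="-w --ptxas-options=-w" cmake -S . -B build
# Reconfigure: the environment is ignored, the cached value wins.
CUDAFLAGS="-Xptxas -O3" cmake -S . -B build
# Explicitly overriding the cache entry always takes effect.
cmake -S . -B build -DCMAKE_CUDA_FLAGS="-w --ptxas-options=-w"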
Review comment on recipe/bld.bat:

set CUDNN_INCLUDE_DIR=%LIBRARY_PREFIX%\include

@REM turn off very noisy nvcc warnings
set "CUDAFLAGS=-w --ptxas-options=-w"
set "CUDAFLAGS=-w --ptxas-options=-w" | |
set "CMAKE_CUDA_FLAGS=-w --ptxas-options=-w" |
I think it works. The new flags are passed in the "libtorch" build, and according to diff, that's the only change in the CUDA invocations. I'll know for sure when libtorch recompiles and it starts building pytorch.
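Presumably this works because pytorch's setup.py forwards CMAKE_*-prefixed environment variables to the cmake invocation as cache definitions; that's my reading of the build helpers rather than something stated in this thread. A quick spot-check that the flags actually reached the generated build files (paths illustrative):

# If the flags took effect, they show up in the CMake cache and the ninja rules.
grep -n "ptxas-options=-w" build/CMakeCache.txt build/build.ninja \
  || echo "flags not present in generated build files"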
Ok, confirmed that |
Thanks so much for this! I didn't really consider this possible because the variable is set in an identical way between libtorch and pytorch, but I guess unusual things are possible. 😅 |
OK, with the caching issue solved, and the test failure hopefully being a fluke, I'm going to call this "good enough" for merging now, to finally get those CMake fixes out the door. I'll open a PR for pytorch to see if we can get the ball rolling for this work upstream. |
I think you need to do that in |
Windows builds haven't changed behaviour (if anything, they seem to be a bit faster, despite adding ~30min of new test run time). I double-checked, and on windows the warnings already don't appear, so I didn't touch
But if there are some caching issues to fix on windows, I'd be very glad for any support there. |
But I do see it added there: pytorch-cpu-feedstock/recipe/bld.bat, lines 96 to 97 in be20390.
|
Yeah, no question that it's there, but it didn't have the same effect as on linux (neither w.r.t. caching, nor w.r.t. being actually necessary). In any case, I'm attempting removal within |
Gah, |
I've sometimes resorted to |
Wow, the merges here have been a bit cursed recently. Both linux-64+CUDA jobs passed here before, and now we're getting a whole host of new failures: 20 new failures on linux-64 + CUDA + openblas
Looking at some of the errors
same for
At least the caching works again, but it's one step forward, two steps back. 😑

PS. Here's the diff between the passing run and the merge; it's incomprehensible to me how this could cause compilation errors:

--- a/recipe/build.sh
+++ b/recipe/build.sh
@@ -220,7 +220,7 @@ elif [[ ${cuda_compiler_version} != "None" ]]; then
export MAGMA_HOME="${PREFIX}"
export USE_MAGMA=1
# turn off noisy nvcc warnings
- export CUDAFLAGS="-w --ptxas-options=-w"
+ export CMAKE_CUDA_FLAGS="-w --ptxas-options=-w"
else
if [[ "$target_platform" != *-64 ]]; then
# Breakpad seems to not work on aarch64 or ppc64le
@@ -253,7 +253,7 @@ case ${PKG_NAME} in
cp build/CMakeCache.txt build/CMakeCache.txt.orig
;;
pytorch)
- $PREFIX/bin/python -m pip install . --no-deps --no-build-isolation -vvv --no-clean \
+ $PREFIX/bin/python -m pip install . --no-deps --no-build-isolation -v --no-clean \
| sed "s,${CXX},\$\{CXX\},g" \
| sed "s,${PREFIX},\$\{PREFIX\},g"
# Keep this in ${PREFIX}/lib so that the library can be found by
diff --git a/recipe/meta.yaml b/recipe/meta.yaml
index e110190..e768515 100644
--- a/recipe/meta.yaml
+++ b/recipe/meta.yaml
@@ -334,6 +334,7 @@ outputs:
- {{ pin_subpackage('libtorch', exact=True) }}
- pybind11
- eigen
+ - zlib
run:
- llvm-openmp # [osx]
- intel-openmp {{ mkl }} # [win] |
On MKL the same INTERNALERROR remained - so not flaky, after all. 🥲 |
In fact, the linux-64 + CUDA + MKL build hasn't passed since dfadf15. The only real change since then (aside from all the patches 0017 and onwards, which don't change much) was 9fcb3a7. Here are the unix-relevant changes since dfadf15 (minus CMake tests, patches, & test skips):

diff --git a/recipe/build.sh b/recipe/build.sh
index 57044b0..22dde8f 100644
--- a/recipe/build.sh
+++ b/recipe/build.sh
@@ -1,9 +1,11 @@
#!/bin/bash
-echo "=== Building ${PKG_NAME} (py: ${PY_VER}) ==="
-
set -ex
+echo "#########################################################################"
+echo "Building ${PKG_NAME} (py: ${PY_VER}) using BLAS implementation $blas_impl"
+echo "#########################################################################"
+
# This is used to detect if it's in the process of building pytorch
export IN_PYTORCH_BUILD=1
@@ -20,9 +22,22 @@ rm -rf pyproject.toml
export USE_CUFILE=0
export USE_NUMA=0
export USE_ITT=0
+
+#################### ADJUST COMPILER AND LINKER FLAGS #####################
+# Pytorch's build system doesn't like us setting the c++ standard through CMAKE_CXX_FLAGS
+# and will issue a warning. We need to use at least C++17 to match the abseil ABI, see
+# https://github.com/conda-forge/abseil-cpp-feedstock/issues/45, which pytorch 2.5 uses already:
+# https://github.com/pytorch/pytorch/blob/v2.5.1/CMakeLists.txt#L36-L48
+export CXXFLAGS="$(echo $CXXFLAGS | sed 's/-std=c++[0-9][0-9]//g')"
+# The below three lines expose symbols that would otherwise be hidden or
+# optimised away. They were here before, so removing them would potentially
+# break users' programs
export CFLAGS="$(echo $CFLAGS | sed 's/-fvisibility-inlines-hidden//g')"
export CXXFLAGS="$(echo $CXXFLAGS | sed 's/-fvisibility-inlines-hidden//g')"
export LDFLAGS="$(echo $LDFLAGS | sed 's/-Wl,--as-needed//g')"
+# The default conda LDFLAGs include -Wl,-dead_strip_dylibs, which removes all the
+# MKL sequential, core, etc. libraries, resulting in a "Symbol not found: _mkl_blas_caxpy"
+# error on osx-64.
export LDFLAGS="$(echo $LDFLAGS | sed 's/-Wl,-dead_strip_dylibs//g')"
export LDFLAGS_LD="$(echo $LDFLAGS_LD | sed 's/-dead_strip_dylibs//g')"
if [[ "$c_compiler" == "clang" ]]; then
@@ -45,6 +60,7 @@ fi
# can be imported on system without a GPU
LDFLAGS="${LDFLAGS//-Wl,-z,now/-Wl,-z,lazy}"
+################ CONFIGURE CMAKE FOR CONDA ENVIRONMENT ###################
export CMAKE_GENERATOR=Ninja
export CMAKE_LIBRARY_PATH=$PREFIX/lib:$PREFIX/include:$CMAKE_LIBRARY_PATH
export CMAKE_PREFIX_PATH=$PREFIX
@@ -73,6 +89,8 @@ export USE_SYSTEM_SLEEF=1
# use our protobuf
export BUILD_CUSTOM_PROTOBUF=OFF
rm -rf $PREFIX/bin/protoc
+export USE_SYSTEM_PYBIND11=1
+export USE_SYSTEM_EIGEN_INSTALL=1
# prevent six from being downloaded
> third_party/NNPACK/cmake/DownloadSix.cmake
@@ -98,18 +116,29 @@ if [[ "${CI}" == "github_actions" ]]; then
# reduce parallelism to avoid getting OOM-killed on
# cirun-openstack-gpu-2xlarge, which has 32GB RAM, 8 CPUs
export MAX_JOBS=4
-else
+elif [[ "${CI}" == "azure" ]]; then
export MAX_JOBS=${CPU_COUNT}
-fi
-
-if [[ "$blas_impl" == "generic" ]]; then
- # Fake openblas
- export BLAS=OpenBLAS
- export OpenBLAS_HOME=${PREFIX}
else
- export BLAS=MKL
+ # Leave a spare core for other tasks, per common practice.
+ # Reducing further can help with out-of-memory errors.
+ export MAX_JOBS=$((CPU_COUNT > 1 ? CPU_COUNT - 1 : 1))
fi
+case "$blas_impl" in
+ "generic")
+ # Fake openblas
+ export BLAS=OpenBLAS
+ export OpenBLAS_HOME=${PREFIX}
+ ;;
+ "mkl")
+ export BLAS=MKL
+ ;;
+ *)
+ echo "[ERROR] Unsupported BLAS implementation '${blas_impl}'" >&2
+ exit 1
+ ;;
+esac
+
if [[ "$PKG_NAME" == "pytorch" ]]; then
# Trick Cmake into thinking python hasn't changed
sed "s/3\.12/$PY_VER/g" build/CMakeCache.txt.orig > build/CMakeCache.txt
@@ -147,11 +176,9 @@ elif [[ ${cuda_compiler_version} != "None" ]]; then
# all of them.
export CUDAToolkit_BIN_DIR=${BUILD_PREFIX}/bin
export CUDAToolkit_ROOT_DIR=${PREFIX}
- if [[ "${target_platform}" != "${build_platform}" ]]; then
- export CUDA_TOOLKIT_ROOT=${PREFIX}
- fi
# for CUPTI
export CUDA_TOOLKIT_ROOT_DIR=${PREFIX}
+ export CUDAToolkit_ROOT=${PREFIX}
case ${target_platform} in
linux-64)
export CUDAToolkit_TARGET_DIR=${PREFIX}/targets/x86_64-linux
@@ -163,12 +190,24 @@ elif [[ ${cuda_compiler_version} != "None" ]]; then
echo "unknown CUDA arch, edit build.sh"
exit 1
esac
+
+ # Compatibility matrix for update: https://en.wikipedia.org/wiki/CUDA#GPUs_supported
+ # Warning from pytorch v1.12.1: In the future we will require one to
+ # explicitly pass TORCH_CUDA_ARCH_LIST to cmake instead of implicitly
+ # setting it as an env variable.
+ # Doing this is nontrivial given that we're using setup.py as an entry point, but should
+ # be addressed to pre-empt upstream changing it, as it probably won't result in a failed
+ # configuration.
+ #
+ # See:
+ # https://pytorch.org/docs/stable/cpp_extension.html (Compute capabilities)
+ # https://github.com/pytorch/pytorch/blob/main/.ci/manywheel/build_cuda.sh
case ${cuda_compiler_version} in
- 12.6)
+ 12.[0-6])
export TORCH_CUDA_ARCH_LIST="5.0;6.0;6.1;7.0;7.5;8.0;8.6;8.9;9.0+PTX"
;;
*)
- echo "unsupported cuda version. edit build.sh"
+ echo "No CUDA architecture list exists for CUDA v${cuda_compiler_version}. See build.sh for information on adding one."
exit 1
esac
export TORCH_NVCC_FLAGS="-Xfatbin -compress-all"
@@ -180,6 +219,8 @@ elif [[ ${cuda_compiler_version} != "None" ]]; then
export USE_STATIC_CUDNN=0
export MAGMA_HOME="${PREFIX}"
export USE_MAGMA=1
+ # turn off noisy nvcc warnings
+ export CMAKE_CUDA_FLAGS="-w --ptxas-options=-w"
else
if [[ "$target_platform" != *-64 ]]; then
# Breakpad seems to not work on aarch64 or ppc64le
@@ -203,7 +244,8 @@ case ${PKG_NAME} in
mv build/lib.*/torch/bin/* ${PREFIX}/bin/
mv build/lib.*/torch/lib/* ${PREFIX}/lib/
- mv build/lib.*/torch/share/* ${PREFIX}/share/
+ # need to merge these now because we're using system pybind11, meaning the destination directory is not empty
+ rsync -a build/lib.*/torch/share/* ${PREFIX}/share/
mv build/lib.*/torch/include/{ATen,caffe2,tensorpipe,torch,c10} ${PREFIX}/include/
rm ${PREFIX}/lib/libtorch_python.*
@@ -211,7 +253,7 @@ case ${PKG_NAME} in
cp build/CMakeCache.txt build/CMakeCache.txt.orig
;;
pytorch)
- $PREFIX/bin/python -m pip install . --no-deps -vvv --no-clean \
+ $PREFIX/bin/python -m pip install . --no-deps --no-build-isolation -v --no-clean \
| sed "s,${CXX},\$\{CXX\},g" \
| sed "s,${PREFIX},\$\{PREFIX\},g"
# Keep this in ${PREFIX}/lib so that the library can be found by
diff --git a/recipe/meta.yaml b/recipe/meta.yaml
index d5fc48f..e1c2a2d 100644
--- a/recipe/meta.yaml
+++ b/recipe/meta.yaml
@@ -1,7 +1,10 @@
# if you wish to build release candidate number X, append the version string with ".rcX"
{% set version = "2.5.1" %}
-{% set build = 10 %}
+{% set build = 12 %}
+# Use a higher build number for the CUDA variant, to ensure that it's
+# preferred by conda's solver, and it's preferentially
+# installed where the platform supports it.
{% if cuda_compiler_version != "None" %}
{% set build = build + 200 %}
{% endif %}
@@ -64,6 +67,13 @@ source:
- patches/0015-simplify-torch.utils.cpp_extension.include_paths-use.patch
# point to headers that are now living in $PREFIX/include instead of $SP_DIR/torch/include
- patches/0016-point-include-paths-to-PREFIX-include.patch
+ - patches/0017-Add-conda-prefix-to-inductor-include-paths.patch
+ - patches/0018-make-ATEN_INCLUDE_DIR-relative-to-TORCH_INSTALL_PREF.patch
+ - patches/0019-remove-DESTINATION-lib-from-CMake-install-TARGETS-di.patch # [win]
+ - patches/0020-make-library-name-in-test_mutable_custom_op_fixed_la.patch
+ - patches/0021-avoid-deprecated-find_package-CUDA-in-caffe2-CMake-m.patch
+ - patches_submodules/fbgemm/0001-remove-DESTINATION-lib-from-CMake-install-directives.patch # [win]
+ - patches_submodules/tensorpipe/0001-switch-away-from-find_package-CUDA.patch
build:
number: {{ build }}
@@ -117,6 +127,7 @@ requirements:
- protobuf
- make # [linux]
- sccache # [win]
+ - rsync # [unix]
host:
# GPU requirements
- cudnn # [cuda_compiler_version != "None"]
@@ -167,6 +178,9 @@ requirements:
- libuv
- pkg-config # [unix]
- typing_extensions
+ - pybind11
+ - eigen
+ - zlib
run:
# GPU requirements without run_exports
- {{ pin_compatible('cudnn') }} # [cuda_compiler_version != "None"]
@@ -299,6 +330,9 @@ outputs:
- pkg-config # [unix]
- typing_extensions
- {{ pin_subpackage('libtorch', exact=True) }}
+ - pybind11
+ - eigen
+ - zlib
run:
- llvm-openmp # [osx]
- intel-openmp {{ mkl }} # [win]
@@ -314,6 +348,7 @@ outputs:
- filelock
- jinja2
- networkx
+ - pybind11
- nomkl # [blas_impl != "mkl"]
- fsspec
# avoid that people without GPUs needlessly download ~0.5-1GB
@@ -335,6 +370,8 @@ outputs:
requires:
- {{ compiler('c') }}
- {{ compiler('cxx') }}
+ # for torch.compile tests
+ - {{ compiler('cuda') }} # [cuda_compiler_version != "None"]
- ninja
- boto3
- hypothesis |
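A diff like the one above can be reproduced roughly as follows; dfadf15 is the last passing commit named above, and the exact command is only illustrative:

# Unix-relevant recipe changes since the last passing commit.
git diff dfadf15..HEAD -- recipe/build.sh recipe/meta.yaml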
I had restarted the openblas job, and this time the test suite simply hung indefinitely (note timestamps):
|
Can we please revert much of the added testing? Ensuring that scientific software passes on CIs is a job of its own. I think we can have very abbreviated tests that mostly ensure correct linkage: trying to load as many libraries as possible and ensuring none have dangling links to missing SO files is likely enough. |
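A rough sketch of the kind of linkage-only check being suggested, with illustrative paths (not taken from the recipe):

# Minimal linkage smoke test: import the package, then make sure no installed
# torch library has unresolved shared-object dependencies.
python -c "import torch; print(torch.__version__)"
for lib in "$PREFIX"/lib/libtorch*.so*; do
  if ldd "$lib" | grep -q "not found"; then
    echo "dangling link in $lib"
  fi
done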
The only recently-added testing (for 2.5) was pytorch-cpu-feedstock/recipe/meta.yaml, line 464 in be20390,
which found some issues with the torch.compile setup, and that is a feature that I'd like to keep working (and thus tested), at least going forward (we can remove some tests for 2.5 just for the sake of publishing something, but I don't feel great about that, though OTOH the failures above occurred outside of test_torchinductor).
With conda-forge taking over from the pytorch channel, I'd like to test more than the bare minimum. I don't want users running into something like
It was running fine in this PR as well until 162a7eb (minus the build cache issue and the pytest-internal crash). I'm starting to think that e1f50ac from #318 might have something to do with all that. I will try that next. Finally, I did take your input and removed the smoke test in #326. |
As an update, I tried reverting the unvendoring of pybind and the non-isolation changes in #344, and it still fails with the same pytorch error, i.e.
I double-checked the pytest versions, and there's no difference either between passing:
and failing
I've started a last-ditch resort to checking whether a hard-reset back to the last passing commit still passes in #345 |
Follow-up to #318.
Fixes #333 (when tests are passing 🤞)
Doesn't address #334: tensorpipe is not supported on windows, apparently