errors with cuda ops installation #2

Closed
dvschultz opened this issue Feb 1, 2021 · 22 comments

Comments

@dvschultz

Tested on fresh Colab instances with both V100 and P100 GPUs.

[Screenshot: Screen Shot 2021-02-01 at 12 07 00 PM]

@dvschultz
Author

Fixed by installing ninja. I'd recommend adding it to the README as a requirement.

@woctezuma

I have encountered the same issue on Colab, and your fix works!

%pip install ninja
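
A minimal sketch for checking up front whether the build tools are visible, assuming that PyTorch locates them through `PATH`. The helper name `missing_tools` is hypothetical, and `nvcc` is included only as an assumption about what else the extension build needs; only `ninja` is confirmed by the fix above.

```python
import shutil


def missing_tools(tools=("ninja", "nvcc")):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All build tools found.")
```

Running this in the same environment (notebook kernel, virtualenv, or conda env) that later builds the extensions matters, because `PATH` can differ between shells and kernels.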

@nurpax
Contributor

nurpax commented Feb 2, 2021

@dvschultz Thanks for the report! README.md will be updated.

@futscdav

futscdav commented Feb 4, 2021

Also note that nvcc doesn't work with newer gcc versions. If your system default gcc is newer than 8, pytorch will honor the CC environment variable, so run
export CC=g++-8
before you run any scripts that build the CUDA kernels.
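
The same idea can be sketched from inside Python, for cases where the script is launched from a notebook rather than a shell. This is only an illustration: the helper names are hypothetical, and the `g++-7` fallback is an assumption not mentioned in this thread.

```python
import os
import shutil


def pick_host_compiler(candidates=("g++-8", "g++-7"), which=shutil.which):
    """Return the first candidate compiler found on PATH, or None."""
    for name in candidates:
        if which(name) is not None:
            return name
    return None


def export_cc(env=os.environ, **kwargs):
    """Set CC in `env` to the chosen compiler; leave `env` untouched if none is found."""
    compiler = pick_host_compiler(**kwargs)
    if compiler is not None:
        env["CC"] = compiler
    return compiler
```

`export_cc()` would have to run before the first call that triggers the extension build, since the variable is read when the compiler is invoked.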

@tasinislam21

I did install ninja, but then I got: OMP: Error #15: Initializing libiomp5md.dll, but found libiomp5md.dll already initialized. This happened while evaluating the metrics.
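
For reference, the hint that Intel's OpenMP runtime prints alongside this error describes an unsafe, unsupported workaround: setting the environment variable `KMP_DUPLICATE_LIB_OK=TRUE`. A minimal sketch follows, with a hypothetical helper name. It is not a fix confirmed anywhere in this thread, and the runtime itself warns that it may cause crashes or incorrect results.

```python
import os


def allow_duplicate_openmp(env=os.environ):
    """Set KMP_DUPLICATE_LIB_OK=TRUE unless the user already chose a value."""
    env.setdefault("KMP_DUPLICATE_LIB_OK", "TRUE")
    return env["KMP_DUPLICATE_LIB_OK"]


# Call this before importing torch or numpy so the OpenMP runtime sees the setting.
allow_duplicate_openmp()
```

The cleaner route is usually to find out why two copies of the OpenMP runtime end up loaded, for example from mixed conda and pip packages.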

@TheodoreGalanos

Still getting this error on a vast.ai VM with pytorch 1.7.1 and cuda 11 installed. I installed all required packages, but whatever I try, I keep getting errors when compiling the custom cuda ops.

Is there perhaps a guide for Linux and ada-pytorch? It seems that it should work out of the box, but unfortunately it does not.

P.S. I have already made it work on Windows by installing VC2019 and cuda 11. I would love to make it run on the VM so that I can train larger models.

@wuyuyu1024

wuyuyu1024 commented Feb 9, 2021

Having the same error. I'm on Win10 with an RTX 3070 GPU and torch 1.7.1+cu110. I also installed the required packages and deleted torch_extensions.
Here is my log:
log.txt
Could anyone help me? 🙏

@gokhanbaydar

If it's still not working, try installing the Windows 10 SDK. I had the same problem, installed the Windows 10 SDK, and now it's working fine.

@Dhruva-Storz

I still have this issue on Linux with CUDA 11.0 after installing ninja. Is there a specific version of ninja we need to install? I get the same error with both conda and pip (after pasting the line from the README).

@nurpax
Contributor

nurpax commented Feb 15, 2021

The original poster in this bug filed this for Colab. Not sure what @Dhruva-Storz is running on.

Try changing this line to get more details about what could be going wrong:

verbosity = 'brief' # Verbosity level: 'none', 'brief', 'full'

to

verbosity = 'full'

and check if you get anything relevant in the log.

Remember to completely remove your torch extensions dir (search for TORCH_EXTENSIONS_DIR on https://pytorch.org/docs/stable/cpp_extension.html for details) when re-running the code.
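
A minimal sketch of that cleanup step, assuming the cache lives either where `TORCH_EXTENSIONS_DIR` points or at `~/.cache/torch_extensions` (the location a Linux user reports later in this thread; other platforms may differ). The function names are hypothetical.

```python
import os
import shutil
from pathlib import Path


def extensions_dir(env=os.environ):
    """Return TORCH_EXTENSIONS_DIR if set, else the assumed Linux default."""
    override = env.get("TORCH_EXTENSIONS_DIR")
    if override:
        return Path(override)
    return Path.home() / ".cache" / "torch_extensions"  # assumption: Linux default


def clear_extensions(env=os.environ):
    """Delete the extensions build cache; return True if something was removed."""
    target = extensions_dir(env)
    if target.is_dir():
        shutil.rmtree(target)
        return True
    return False
```

Calling `clear_extensions()` before re-running forces a full rebuild, so a stale failed build is not silently reused.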

Usually this is a matter of CUDA SDK (the one you have to install yourself, not the pytorch bundled cuda toolkit) not being installed properly, or there being multiple versions of it and some old or otherwise incompatible version gets used when building our custom extensions.
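
One way to spot the "multiple versions" situation is to list every `nvcc` on `PATH` in lookup order. The sketch below assumes a POSIX-style layout where the executable is named exactly `nvcc` (on Windows it would be `nvcc.exe`); the helper name is hypothetical.

```python
import os


def find_all_on_path(name="nvcc", path=None):
    """Return every executable called `name` on PATH, in lookup order."""
    if path is None:
        path = os.environ.get("PATH", "")
    hits = []
    for entry in path.split(os.pathsep):
        candidate = os.path.join(entry, name)
        if entry and os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            if candidate not in hits:
                hits.append(candidate)
    return hits


if __name__ == "__main__":
    for i, hit in enumerate(find_all_on_path()):
        # The first hit is the one a plain `nvcc` call resolves to.
        print("used    " if i == 0 else "shadowed", hit)
```

If more than one path is printed, the first entry is the toolkit that would be picked up through `PATH`, which may not be the one you intended.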

@Dhruva-Storz

My apologies for not giving enough info.

I'm running on:

ubuntu 20.04.1,
CUDA 11.1,
RTX 3090

My pytorch installation is 1.7.1 with cuda toolkit 11.0.
All packages are installed in a Python virtual environment; I get the same errors when using a conda virtual environment.

To reproduce the error:

python generate.py --outdir=out --trunc=1 --seeds=85,265,297,849 --network=models/network-snapshot-000144.pkl

Important bits of the error after deleting the pytorch extensions dir and setting verbosity to full:

FAILED: bias_act.cuda.o 
...
nvcc fatal   : Unsupported gpu architecture 'compute_86'
...
FAILED: upfirdn2d.cuda.o 
...
nvcc fatal   : Unsupported gpu architecture 'compute_86'
...
Error building extension 'upfirdn2d_plugin'
  warnings.warn('Failed to build CUDA kernels for upfirdn2d. Falling back to slow reference implementation. Details:\n\n' + str(sys.exc_info()[1]))
Setting up PyTorch plugin "upfirdn2d_plugin"...
Using /media/SharedUsers/DhruvG/home/.cache/torch_extensions as PyTorch extensions root...
No modifications detected for re-loaded extension module upfirdn2d_plugin, skipping build step...
Loading extension module upfirdn2d_plugin...
/media/SharedUsers/DhruvG/home/Documents/stylegan2-ada-pytorch/torch_utils/ops/upfirdn2d.py:34: UserWarning: Failed to build CUDA kernels for upfirdn2d. Falling back to slow reference implementation. Details:

You mentioned in the README that we need to use cuda-toolkit 11.1, but the website has no installation instructions for 11.1. This may be the source of the problem. Do you have any suggestions on what might be causing this?

@nurpax
Contributor

nurpax commented Feb 15, 2021

I now see that the README is quite confusing about this.

In order to run on RTX 3090, you need to install:

  • pytorch 1.7 built for cuda 11.0 (or later, but at the time of writing cuda 11.0 build is the latest).
  • CUDA 11.1 toolkit (from NVIDIA's website). If you can't find CUDA 11.1, CUDA 11.2 probably works too.

The latter is required to build our custom pytorch extensions. Nvcc from CUDA 11.0 will fail with the error you saw above if you're running it on RTX 3090. Nvcc from CUDA 11.1 should work.
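
The version requirement above can be checked mechanically. The sketch below parses the `release X.Y` part of `nvcc --version` output and compares it against 11.1; the function names are hypothetical, and the sample strings in the usage note are illustrative rather than copied from anyone's log.

```python
import re


def parse_nvcc_release(version_text):
    """Extract (major, minor) from the 'release X.Y' part of `nvcc --version` output."""
    match = re.search(r"release (\d+)\.(\d+)", version_text)
    if match is None:
        raise ValueError("no 'release X.Y' found in nvcc output")
    return int(match.group(1)), int(match.group(2))


def can_target_compute_86(version_text):
    """True if this nvcc is CUDA 11.1 or newer, which the comment above requires for RTX 3090."""
    return parse_nvcc_release(version_text) >= (11, 1)
```

The input would typically come from `subprocess.run(["nvcc", "--version"], capture_output=True, text=True).stdout`. A line such as `Cuda compilation tools, release 11.0, ...` makes `can_target_compute_86` return `False`, which matches the `Unsupported gpu architecture 'compute_86'` failure reported earlier.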

@ghost

ghost commented Feb 17, 2021

No solution yet?

@ghost

ghost commented Feb 17, 2021

TORCH_EXTENSIONS_DIR

this solution does not work either.

@ghost

ghost commented Feb 17, 2021

"If still not working try installing Windows 10 SDK, I had the same problem, installed Windows 10 SDK and now its working fine." (quoting @gokhanbaydar)

your solution did not fix my problem either.

@Dhruva-Storz

Dhruva-Storz commented Feb 17, 2021

TORCH_EXTENSIONS_DIR

Can you please elaborate on "Remember to completely remove your torch extensions dir (search for TORCH_EXTENSIONS_DIR on https://pytorch.org/docs/stable/cpp_extension.html for details) when re-running the code"? I am not sure I understand what you want us to do.

I haven't found a way to safely install cuda 11.1 on my work computer because it might interfere with the work of others, so I haven't been able to test nurpax's solution. However, it seems like this should fix the problem, as the build errors seem to be related to nvcc. If not, the code still runs; you just have to disable warnings with

python -W ignore foo.py
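
A narrower alternative to `-W ignore` is to filter only the fallback warning from inside the script. This is a sketch with a hypothetical function name; the message prefix is taken from the log quoted earlier in this thread.

```python
import warnings


def silence_cuda_fallback_warning():
    """Ignore only the UserWarning about falling back to the reference implementation."""
    warnings.filterwarnings(
        "ignore",
        message=r"Failed to build CUDA kernels",  # matched against the start of the message
        category=UserWarning,
    )
```

This only hides the message. The slow reference implementation is still what runs, so it does not replace fixing the build.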

When they say remove torch_extensions_dir, I believe they mean that you should delete the folder where the custom torch extensions were installed. Mine was in ~/.cache/torch_extensions

I'm probably going to wait for official cuda 11.1 support from pytorch so I can safely install it in an environment. However, if anyone has solutions on how to install two different cuda toolkits safely, do let me know.

@nurpax
Contributor

nurpax commented Feb 17, 2021

I’m not sure if installing the CUDA toolkit from Conda is enough (i.e. as part of the pytorch installation). I think you really do need a separate full CUDA installation with nvcc, headers, the whole nine yards. Not from Conda or Pip but using NVIDIA’s packages/installers. I recall trying without it, using just what’s bundled with the pytorch installation, and I don’t think it contained everything that’s required to build our CUDA kernels.

I’d be happy to be shown wrong on this as it’d simplify the installation instructions.

At least on Windows, you can have multiple CUDA versions installed simultaneously. It is of course safer to match the CUDA version in your PATH with the one your pytorch was built with.

If you do end up installing different CUDA SDKs, don’t let the installers touch your GPU drivers. Those are best kept at your most recent version.

@ghost

ghost commented Feb 17, 2021

@nurpax I am using a separate full CUDA installation with nvcc, cuDNN, etc.

nurpax pushed a commit that referenced this issue Feb 19, 2021
Print full traceback when custom extension build fails.

Also allow pytorch 1.9 so that this runs against pytorch upstream
devel builds.

issues #2, #28, #35, #37, #39
@dokluch

dokluch commented Mar 18, 2021

Stuck here big time with ImportError: No module named 'upfirdn2d_plugin'

I am using a vast.ai instance nvidia/cuda:11.2.1-cudnn8-runtime-ubuntu18.04

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  Off  | 00000000:00:07.0 Off |                    0 |
| N/A   30C    P0    35W / 250W |      0MiB / 16160MiB |      0%      Default |

Conda environment is set with
conda install pytorch==1.7.1 torchvision==0.8.2 torchaudio==0.7.2 cudatoolkit=11.0 -c pytorch --yes
(doesn't matter if I try a newer one)

What I've tried

First I made sure my VM has CUDA 11.2 installed. Then I installed a newer torch with CUDA 11.1.1, which did not help, so I rolled back.

Removed torch_extensions
Just as described here:
#11

Didn't help

gcc
I found this thread and
#35

And tried installing gcc7
conda install -c conda-forge/label/gcc7 gcc_linux-64 (didn't help)

and even gcc5
conda install -c psi4 gcc-5
The latter sent me into a weird loop, so I abandoned this path.

This does not help either
#2 (comment)

Google Colab works fine and has Ubuntu 18.04 with gcc 7.5.0 installed, which I am trying to mimic. I hope that is the correct logic.

UPD:
Another instance with gcc 7.5.0 throws the same error as well

gcc --version
gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Copyright (C) 2017 Free Software Foundation, Inc.
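
A small sketch for turning `gcc --version` output like the above into a comparable version tuple, so it can be checked against the "gcc > 8" limit mentioned earlier in this thread. The function names are hypothetical, and the parser assumes the version number is the last token on the first line, as it is in the output shown here.

```python
import re


def parse_gcc_version(version_text):
    """Return the version tuple from the first line of `gcc --version` output."""
    first_line = version_text.splitlines()[0]
    match = re.search(r"(\d+)\.(\d+)\.(\d+)\s*$", first_line)
    if match is None:
        raise ValueError("could not find a version number at the end of the first line")
    return tuple(int(part) for part in match.groups())


def is_newer_than_gcc8(version_text):
    """True if the major version exceeds 8, the limit mentioned earlier in this thread."""
    return parse_gcc_version(version_text)[0] > 8
```

For the output above, `parse_gcc_version` yields `(7, 5, 0)`, so the compiler version itself is within that limit and is unlikely to be the blocker on this instance.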

UPD2
Installing gcc 5 as described here: https://askubuntu.com/questions/1087150/install-gcc-5-on-ubuntu-18-04
Did not help either

Please advise on any possible next steps. I have no idea where to go next.

@zjgt

zjgt commented Jun 1, 2021

Thanks for all the discussions above. They have been very helpful. I probably had all of the above problems. Installing Visual Studio definitely took care of the C++ compilation issues, and installing the whole CUDA 11.3 package (2.7G) from the NVIDIA website took care of the upfirdn2d bug. Now my program is running in PyCharm with pytorch 1.7.1, cuda 11.3, python 3.7.

snakch pushed a commit to snakch/stylegan2-ada-pytorch that referenced this issue Jun 20, 2021
Add index and seed feature to image and video generation
@halfjoe

halfjoe commented Jul 6, 2021

Thanks for all the discussions above.

I have successfully set up the environment with 3090, and would like to share my settings.
Ubuntu 18.04.4, gcc 7.5.0, CUDA 11.1, CUDNN 8.0.5, python 3.7, pytorch 1.7.1

Here CUDA and CUDNN are installed manually, and pytorch is built from source (https://github.com/pytorch/pytorch/tree/v1.7.1). After installing pytorch, print(torch.__version__) returns 1.7.0a0+57bffc3, which is OK.

@tasinislam21

I have encountered the same issue on Colab, and your fix works!

%pip install ninja

This works on Colab and Windows, but not on Ubuntu 20.04.
