smilei v5.0 problems with compiling on GPU on new HPC #674
Hello, typically for an A100 you can look at the machine file:
Your make command would look something like:
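As an illustration only, a hedged sketch of such a command (the machine file name, job count and config keywords are assumptions based on the Smilei build documentation, not the exact command from the original reply):

```bash
# assumed invocation: pick the machine file that matches your cluster
make -j 8 machine="jean_zay_gpu_A100" config="gpu_nvidia verbose"
```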
An example of a working environment we can recommend would be:
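As a rough idea only, such an environment could look like the following (module names and versions are placeholders, not the exact ones from the original reply):

```bash
module purge
module load nvhpc/23.1 cuda/11.3 openmpi        # compiler + CUDA + MPI
module load hdf5/parallel-nvhpc python/3.10     # parallel HDF5 built with nvc++, Python for the namelist
```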
Regarding your specific error, it looks like you did not compile with nvc++, likely because no machine file for GPU was specified in the make command. |
Hi, sorry for the late answer, I will try to summarise what we tried. Loaded modules:
Created new machine file containing:
and used the make command: But we still didn't have luck with the compilation; the error we are getting is:
Do you have an idea what the problem could be? Wrong compilation flags, a missing or incorrect module? |
Hi,
means your HDF5 module was not compiled with your nvhpc module / the NVIDIA compiler nvc++. I can also already predict some issues with your nvhpc module, which is quite recent:
it will require using '-gpu=cc70', as -ta=tesla:cc70 is deprecated after nvhpc 23.5; -Mcudalib=curand should also be removed, as it is deprecated; and we know there is an issue with the newest curand library, so you will need a fix for the header file gpuRandom.h in src/tools/ ... Finally, for your specific error, could you print the command make is trying to execute (which you should be seeing thanks to the "verbose" configuration) to be sure there is nothing else? To sum things up: the quickest way for you to use Smilei on GPU would be to:
|
So, concerning HDF5: there is no module on the cluster compiled with the NVIDIA compiler and with the parallel option enabled, which means I have to download and compile it myself, right? About CUDA: there is no CUDA 11.8 nor 11.2; will 11.3, 11.4 or 11.7 do? Concerning the error, I am not sure where I can find this, but these are the last lines, and the error occurred, in fact, multiple times |
The way to go is normally to ask the administrator to make it available to you. It will benefit other potential users too. |
Regarding HDF5, as beck-llr said, that should be the job of your support team/admins (you would do something like in this comment: https://forums.developer.nvidia.com/t/how-to-build-parallel-hdf5-with-nvhpc/181361/4). For the CUDA versions you mention, it should not be a problem. You have to watch out for the nvhpc module though: do you have anything <= 23.1? Finally, the errors you mentioned are due to make terminating because of the error you showed part of previously.
returns you? |
Hi, yes, there is NVHPC/23.1, 22.7 and 22.2. I'm still not sure what exactly you want me to show... this is everything written by the terminal:
Edit: only the correct part of the terminal message was kept, so the post is not too long. |
For future reference, this is what i meant:
Because you were using a recent module (nvhpc 23.7) you were missing flags such as -gpu=cc70,cc80 -acc, etc. To simplify your first compilation, please use nvhpc 23.1 and CUDA 11.3, as you have these, and have support compile HDF5 with the compiler that comes with it. Finally, your machine file should look like this (I saw that the Karolina cluster is using AMD CPUs + A100s):
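The exact file is not reproduced here; a minimal sketch based only on the flags discussed in this thread (AMD CPU + A100, nvhpc 23.1, where the -ta=tesla syntax is still accepted) could be:

```makefile
# Hedged sketch of a machine file for an A100 with nvhpc <= 23.5 -- flags are assumptions from this thread
CXXFLAGS += -std=c++14 -ta=tesla:cc80
LDFLAGS  += -ta=tesla:cc80 -lcurand
GPU_COMPILER_FLAGS += -arch=sm_80   # compute capability 8.0 for the A100
```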
If you want to try nvhpc 23.7, it should look like this:
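Again as a sketch only, the nvhpc 23.7 variant would swap the deprecated -ta=tesla syntax for -acc / -gpu:

```makefile
# Hedged sketch for nvhpc >= 23.7, where -ta=tesla:ccXX is deprecated
CXXFLAGS += -std=c++14 -acc -gpu=cc80
LDFLAGS  += -acc -gpu=cc80 -lcurand
GPU_COMPILER_FLAGS += -arch=sm_80
```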
|
Ok, thank you. I will let you know, once the proper HDF5 module is ready and I try the compilation again. |
Hi, |
Something is very wrong in your setup. Can you show the result of |
also a |
make env: and I used the machine file that was recommended a few comments above |
remove the line:
Then make clean, and try again. I should have removed this line from your script when I adapted it. |
Hello, Do you think there is a problem with some of the modules I used to compile the code? These are the modules I used:
|
There should be nothing wrong with your modules. We are encountering a completely different class of problems, which are runtime issues. From your message (please try to format it if you can; EDIT: thanks for the formatting), it crashes while computing a scalar diag. First, what test case are you trying to run? What diags are in the namelist? EDIT: are you using the latest version of Smilei? Post-November we added some fixes. |
I tried to run two of the basic tutorials: thermal plasma (this is the crash from the previous comment) and laser propagation in vacuum (this one failed at the Fields diagnostics). |
In the smilei.out.txt you just provided, the reason for the failure is clear: you do not have the numpy package in the Python module that is loaded. Make sure you have the packages required as in the doc: sphinx, h5py, numpy, matplotlib, pint. For the other tutorial that failed (thermal plasma, I think), please provide the exact input and output files. You may want to do that after you have installed the Python packages and run it again. |
yeah sorry, I loaded the wrong module. Sending the current error file |
Does it still occur with non-frozen species? It could be that the |
I'd like to look at your input file as well to check. |
input file: input.txt |
You are running a test case in 1D when it is not currently supported on GPU :) (it might be soon-ish) (check the list of currently supported features here ) |
ok, that was a pretty silly mistake... I tried another case (input file: [input.txt](https://github.com/SmileiPIC/Smilei/files/13980089/input.txt)) |
So we are back to the CUDA device error. In your slurm script I don't see you loading the environment you used at compile time. Typically mine looks like this:
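The original script is not reproduced here; a minimal, hypothetical sketch of such a Slurm script, with module names and options as placeholders, would be:

```bash
#!/bin/bash
#SBATCH --job-name=smilei_gpu
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=1
#SBATCH --time=01:00:00

# load the *same* environment that was used at compile time (placeholder module names)
module purge
module load nvhpc/23.1 cuda/11.3 openmpi hdf5/parallel-nvhpc python/3.10

srun ./bind_gpu.sh ./smilei input.py
```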
while bind_gpu.sh (might not be required here though):
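A typical binding wrapper of this kind (a hypothetical sketch, not the original file) simply exposes one GPU to each MPI rank via its node-local rank:

```bash
#!/bin/bash
# bind_gpu.sh -- sketch: expose a single GPU per rank using Slurm's local rank ID
export CUDA_VISIBLE_DEVICES=$SLURM_LOCALID
exec "$@"
```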
Try again with sourcing the compilation environment in your slurm script, you might just be missing that. |
Hi, sorry, I have never used an environment for the compilation before; I just loaded the modules and did the same thing in the submission script. Therefore, I don't really know what such an environment should look like. I did some googling but it did not help me much... Could you please provide me with an example or some guidelines? |
In your slurm script I can only see:
Ergo, unless by default the running environment includes nvhpc, cuda & openmpi, I don't see how your executable can access its dependencies. Can you add "module list" to your slurm script and run it, so we can see what is available at runtime? Also, in your latest output I see one MPI process and 8 patches. Are you trying to run on 1 or 8 GPUs? |
Here is the output file with the module list (the HDF5 module loads a lot of other modules as its dependencies): out.txt. I am trying to run on 8 GPUs, as I am only able to allocate a full node, which has 8 GPUs. I also had 8 MPI processes in the slurm script and the error I got was the same; every process printed the same error message in the output file, so for testing purposes I set only 1 MPI process so the output file wouldn't be so long. |
You should load NVHPC when you compile Smilei |
That seems to be the case, although the fact that there is another CUDA module loaded is not great. @spadova-a Can you do make clean and recompile + execute with the new binary, just to be sure? |
Hi, sorry for the inactivity, right now I have a lot of work to do. I will give the installation a new try soon. |
@spadova-a The makefile has been modified to make GPU compilation easier. |
Dear colleagues, I tried the compilation on an accelerated node with 8 A100 GPUs.
There, I tried to load the proper modules, and there is a good candidate indeed.
Let's try the last one then.
These are all the loaded modules:
We have OpenMPI,
and a proper Python module
I created a primitive machine file "karolina" including just two lines
Then I tried
It seems to me that, despite the name of the module, I might eventually have to try compiling HDF5 myself according to your instructions as well. |
Hi, Preface: no test has been done with the latest nvhpc versions (i.e. 24.0 and above) but it "should" work. Here is an example of how I do it on my machine with nvhpc 23.11 that you can use as a reference:
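The exact recipe from that machine is not reproduced here; a hedged sketch of a parallel HDF5 build against the nvhpc toolchain (version numbers, module names and paths are placeholders) could look like:

```bash
# build a parallel HDF5 with the nvhpc MPI wrappers (CMake route; placeholder paths/versions)
module load nvhpc/23.11 cuda openmpi
tar xf hdf5-1.14.3.tar.gz && cd hdf5-1.14.3
mkdir build && cd build
cmake .. -DCMAKE_C_COMPILER=mpicc \
         -DHDF5_ENABLE_PARALLEL=ON \
         -DCMAKE_INSTALL_PREFIX=$HOME/opt/hdf5-nvhpc
make -j 8 && make install
```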
It seems the person who installed your HDF5 module did not include the "-DHDF5_ENABLE_PARALLEL=ON" option in their build. Once your hdf5 install is finished, you should point your environment to the new install
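(a sketch only; HDF5_ROOT_DIR is the variable Smilei's makefile reads, and the prefix is the assumed one from the build sketch above):

```bash
export HDF5_ROOT_DIR=$HOME/opt/hdf5-nvhpc                     # assumed install prefix
export LD_LIBRARY_PATH=$HDF5_ROOT_DIR/lib:$LD_LIBRARY_PATH    # so the runtime finds the same library
```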
at compile time and runtime.
|
Hi Charles,
This installation was successful. Then I prepared this machine file
Then I attempted to compile Smilei
Typical errors are:
and
I think I need to specify flags better. However, I do not know how. |
Try to add |
Done. Different errors popped up; they are of this kind:
|
Assuming you did a "make clean" before compiling again, I am thinking you do not have the -arch option specified in your machine file for GPU_COMPILER_FLAGS.
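A hedged example of what that addition to the machine file could look like (sm_80 matching the A100):

```makefile
GPU_COMPILER_FLAGS += -arch=sm_80   # target compute capability 8.0 (A100)
```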
|
Thank you both. We can move forward, as the compilation was successful. It failed at runtime though.
I took a slightly modified example file for 2D LWFA with GPU computing on.
It runs for half a minute, writes some outputs, and then fails. I watched error_smilei.log. I think it is only a question of a proper submission now. Accelerated nodes at Karolina have 128 cores and 8 x NVIDIA A100, i.e. 16 cores per GPU. For some reason, 16 processes run with the running command shown above. |
You probably want -arch=sm_80 instead of -arch=sm_86, as suggested by the error |
I edited the previous reply. |
It could be a memory issue. Try with fewer particles? |
I tried now even with one particle per cell. The same error still. |
As far as I can see, the input file contains not-yet-supported features, such as the filter and the load balancing, for instance |
Oh, I did not think about that! Could you please recommend some safe input for a test? |
Here is a namelist that I used to benchmark an A100 (note that this is in 3D with no moving window; also, we use one patch as it is best for GPUs, and for multiple GPUs you have to increase the number of patches proportionally):
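The exact benchmark namelist is not reproduced here; below is a minimal, hypothetical sketch of a 3D maxwellian-plasma namelist following the constraints described above (3D, no moving window, a single patch). Grid size, densities and temperatures are placeholders to adjust to your GPU memory:

```python
import math

dx = 0.2                      # cell size (placeholder)
nx = 128                      # cells per direction (placeholder, scale to GPU memory)

Main(
    geometry = "3Dcartesian",
    interpolation_order = 2,
    cell_length = [dx, dx, dx],
    grid_length = [nx*dx, nx*dx, nx*dx],
    number_of_patches = [1, 1, 1],          # one patch per GPU
    timestep = 0.95 * dx / math.sqrt(3.),   # CFL condition in 3D
    simulation_time = 100 * 0.95 * dx / math.sqrt(3.),
    EM_boundary_conditions = [["periodic"]],
)

# a neutral electron/proton maxwellian plasma
for name, mass, charge in [("electron", 1., -1.), ("proton", 1836., 1.)]:
    Species(
        name = name,
        position_initialization = "random",
        momentum_initialization = "maxwell-juettner",
        particles_per_cell = 8,
        mass = mass,
        charge = charge,
        number_density = 1.,
        temperature = [0.01],
        boundary_conditions = [["periodic"]],
    )

DiagScalar(every = 50)
```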
|
Great, this one runs till the end!
|
You must define a binding between processes and GPUs, typically using a binding file, or using the proper options of your queue manager (such as Slurm) |
Well, thank you. I am not sure if I am capable of figuring it out myself. I guess I should try to discuss it with cluster user support. |
The fact that it ran on one GPU was what we asked for in the input file since there was only one patch. As for the binding script it may not be necessary in your case, simply change
to
(increasing the size of the problem and the number of patches to have an equivalent load on each GPU) and in your slurm command you would specify something like:
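An illustrative launch line only (exact Slurm options depend on the cluster; this is not the verbatim command from the original reply):

```bash
# one MPI rank per GPU on a single 8-GPU node
srun --nodes=1 --ntasks-per-node=8 --gpus-per-task=1 ./smilei input.py
```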
See if that crashes / works |
I could not do [...] I will try to do some more testing tomorrow! Thanks.
|
Glad we could help :) |
Side note: we really need to explain, in the documentation for GPU, that there should be 1 process per GPU, and that it is better to have 1 patch per GPU or so |
Ok, I think we really need one page dedicated to GPU, with links to other places if necessary |
Agreed |
Hi, could you include the machine file in the code? Here are my suggestions, the comments include the installation description. |
We might add it in /scripts/compile_tools/machine/ with the other machine scripts , likely under "karolina_gpu" |
Hello there,
I would like to use Smilei with GPU on the Karolina cluster at IT4I in Ostrava and I am not sure how to compile it. So I asked the administrator to help me with it, but he encountered the following problem: GPU compilation for A100 fails with:
I will share this issue with the administrator, since I don't know the details of his procedure. Could you please help us find the problem? Note that there were no problems with the CPU compilation.