Checkpoint module makes the code crash due to "Invalid tag" in intelMPI2019 and above #307
Comments
This problem is related to intel-mpi-2019. See issue #270 |
Dear @mccoys
I gave I am asking the sysadmin for help now and will keep you updated. Many thanks! |
You can avoid the MPI_THREAD_MULTIPLE issue by compiling smilei with the Note that this may cause some slowdown. |
Dear @mccoys , Thanks for the advice. Besides, strangely, after I re-compile the code with I will keep you updated with the intelmpi-2020 attempt. PS, as for the MPI-tag issue, I also find it strange because for my simulation I only use [32, 32] patches. |
There are many more tags than patches. I think we have fixed issues we used to have on tag numbering. But we never know. There might be unforeseen situations. |
I should add that the above namelist used to run fine when the collision module included only a single species.
Then I ran into the MPI-tag issue. I am still in the queue now and I will keep you updated with the intelmpi/2020u2 results. |
Dear @mccoys, the simulation crashed again, but with a new and stranger error.
In the .out file, the simulation didn't even start, only creating the directory for the run. Any idea what is happening? |
Since I have a working model without ionization or collisions, I repeated the procedure step by step with the former intelmpi/2019u3. Now, with ionization and collisions enabled, the code seems fine at a reduced scale on Niagara. So I guess the problem lies in the checkpoint module rather than in the collisions? |
Concerning the segfault, is there any more information on the error ? No other lines ? I would guess this is a compilation issue. Have you done Concerning checkpoints, we will try to reproduce the issue first. |
Dear @mccoys , Yes, I
And there's no But as you can see from the above information, I am not using the new version of intelmpi/2020u2, because it won't compile, giving the error: |
you should check |
Dear @iltommi , Yes, Besides, after I compile the code with Thanks! |
No. The problem is not the checkpoints. It is true that checkpoints require more MPI tags, but there is no other way. The problem is that your version of intelmpi is problematic with smilei. Now, the problem is that the new intelmpi2020 has crashed. Could you show your
The command |
Dear @mccoys , On the machine Niagara, when I compile Smilei with the 2020u2 module:
Since I can't compile it now with 2020u2, I can't show you the And I can always compile it with 2019u3:
|
Why can't you show the result of make env ? |
It complains about your empty locale environment try |
Sorry, I thought the results of |
Sorry for the late updates:
|
Hi Yao,
int flag;
int* tag_ub_ptr;
MPI_Comm_get_attr(MPI_COMM_WORLD, MPI_TAG_UB, &tag_ub_ptr, &flag);
cout << "Max tag of current MPI library : " << (*tag_ub_ptr) << endl;
You can also add this after line 58 of the main program.
Julien |
Hello @jderouillat Thanks for your reply, I will do it now. Thanks |
Dear @iltommi The problem of the locale environment is solved by changing |
Hi @jderouillat I did what you suggested, and the output is
And here's more info.:
It seems that intelmpi/2020u2 still has a lower maximum tag than the checkpoint module needs. |
Some updates from the sysadmin of Scinet:
Hope that this helps. And thanks to the discussion with @jderouillat ,
I am trying to see if replacing |
We are currently trying to make checkpoints need a lower tag. It is not necessary to have such a high number, and may help your situation. |
Many thanks to your sysadmin for the 2nd link. Did you test what is recommended?
$ export MPIR_CVAR_CH4_OFI_TAG_BITS=25
$ export MPIR_CVAR_CH4_OFI_RANK_BITS=14
It's of course not a long-term solution, but it could be a nice workaround. |
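To get a feel for what the tag-bits setting above buys, here is a minimal C++ sketch. It assumes that the MPI library derives its tag upper bound as the largest value representable in the configured number of tag bits; that is an assumption about the library, not something confirmed in this thread. The helper names `max_tag_for_bits` and `tag_fits` are hypothetical and are not part of Smilei or MPI.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: largest value representable in `bits` bits.
// Assumption: the MPI library sets its tag upper bound (MPI_TAG_UB) this way.
constexpr std::int64_t max_tag_for_bits(int bits) {
    return (std::int64_t{1} << bits) - 1;
}

// Hypothetical helper: a tag is usable only if 0 <= tag <= tag_ub.
constexpr bool tag_fits(std::int64_t tag, std::int64_t tag_ub) {
    return tag >= 0 && tag <= tag_ub;
}

// Under the assumption above, 25 tag bits allow tags up to 2^25 - 1.
static_assert(max_tag_for_bits(25) == 33554431, "2^25 - 1");
```

For example, `tag_fits(t, max_tag_for_bits(25))` tells whether a tag `t` requested by the code would be accepted under that assumption. The value actually in effect is best read at run time with the `MPI_Comm_get_attr(..., MPI_TAG_UB, ...)` query shown earlier in the thread; the MPI standard only guarantees that it is at least 32767.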
Dear @jderouillat, sorry for the late reply. I tested this on a single node, and it works! Many thanks! |
The checkpoint module now is good with updated I am closing this issue. Thanks again for your support. |
Descriptions
Dear Developers,
I have 6 species in a typical laser-solid interaction model (3 ions & 3 electrons), and I want to have both ion-electron collisions and electron-electron collisions.
The namelist is here (with lines 139-153 uncommented and 156-186 commented)
input.py.txt
However, the code keeps crashing with a Seg. fault when I run it on the Niagara cluster.
Reproduce the error
I tried to investigate this with a reduced-scale case (smaller box size, fewer ppc, and lower resolution) on my laptop and found that the crash (or at least one cause of it) is due to multiple species in the collision module.
In other words, if I use just a single species in species1 and species2, there is no Seg. fault and the code runs.
With this in mind, I modified the original namelist by adding several collision modules, each containing only a single species (in the attached input file, with lines 139-153 commented and 156-186 uncommented).
Unfortunately, the code crashed again.
The .out and .err files are as follows:
No23_SG_Coll_eei-4225525.out.txt
No23_SG_Coll_eei-4225525.err.txt
Now I am a bit confused and lost.
Can you take a look and give me some help with this?
According to the documentation, a collision module with multiple species should not be a problem.
Parameters
make env
gives the following results: