SRW - Discrepancies in the animated flux report on different servers #335

Closed
mrakitin opened this issue Aug 31, 2016 · 20 comments

@mrakitin

When I try to perform a calculation within the animated flux report on different installations of Sirepo (alpha vs. my dev installation) with the same input JSON file and related magn_meas_esm.zip, I get different results on different servers:

[Screenshots: flux report on alpha vs. a dev machine, and on alpha vs. localhost]

Note that on alpha the maximum flux value is ~8e15 [ph/s/.1%bw], but the correct result should be ~8e14 [ph/s/.1%bw].

I checked the folders with the calculations from both servers and didn't find any differences in the .py files; all the parameters look the same except:

  • v.wm_ns = v.sm_ns = 2 on alpha;
  • v.wm_ns = v.sm_ns = 1 on localhost.

Here are the corresponding archives:

It's hard to reproduce the situation. Originally @ochubar noticed this issue on our internal installation at BNL (the nsls2expdev1 server), then I reproduced it on alpha. During our meeting @moellep reproduced it on his dev installation, but @robnagler didn't see the wrong result with exactly the same inputs. We need to find the reason for this strange bug.
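
A quick way to compare the two result sets is to look at the maximum flux value stored in the saved data files from each archive. A minimal sketch (the file names and the '#'-prefixed header convention are assumptions about the SRW output; adjust to whatever files are actually in the archives):

import numpy as np

def max_flux(dat_file):
    # SRW writes the computed spectrum as plain text columns; lines starting
    # with '#' are header/metadata, the rest are the flux values.
    return np.loadtxt(dat_file, comments='#').max()

# Hypothetical paths into the two unpacked archives:
print('alpha:     %.3e ph/s/.1%%bw' % max_flux('alpha/res_spec_me.dat'))      # ~8e15 (wrong)
print('localhost: %.3e ph/s/.1%%bw' % max_flux('localhost/res_spec_me.dat'))  # ~8e14 (correct)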

mrakitin added this to the SRW - bugs milestone on Aug 31, 2016
@mrakitin

The calculation can be found here: https://alpha.sirepo.com/srw#/source/nJvy1IIy.

mrakitin self-assigned this on Aug 31, 2016
@mrakitin

This is weird. I tried to run it with and without mpiexec in the console and got different results. When no mpiexec -np ... is used, the calculation result looks correct, but when it is run with mpiexec -np ..., the flux is incorrect (~8e15).
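
For the record, a minimal sketch of that comparison, assuming it is run from the report's run directory where the generated mpi_run.py lives (the core count is arbitrary):

import subprocess, sys

# Serial run: the resulting flux is ~8e14 ph/s/.1%bw (looks correct).
subprocess.check_call([sys.executable, 'mpi_run.py'])

# Run under MPI: the flux comes out ~8e15 ph/s/.1%bw, an order of magnitude too high.
subprocess.check_call(['mpiexec', '-np', '4', sys.executable, 'mpi_run.py'])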

@mrakitin

The mpiexec version on my Linux dev Vagrant box:

$ mpiexec -version
mpiexec (OpenRTE) 1.8.3

Report bugs to http://www.open-mpi.org/community/help/

@robnagler

robnagler commented Aug 31, 2016 via email

mrakitin added a commit to mrakitin/sirepo_bugs that referenced this issue Aug 31, 2016
@mrakitin

Did you try to run it with MPI in Sirepo on your machine this morning?

@mrakitin

Here is test data to check on different servers:
https://github.com/mrakitin/sirepo_bugs/tree/master/issue_335

@mrakitin

With HYDRA mpiexec I got the same wrong results.

$ mpiexec -version
HYDRA build details:
    Version:                                 3.1
    Release Date:                            Thu Feb 20 11:41:13 CST 2014
    CC:                              gcc -D_FORTIFY_SOURCE=2 -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -Wl,-z,relro
    CXX:                             g++ -D_FORTIFY_SOURCE=2 -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -Wl,-z,relro
    F77:                             gfortran -g -O2 -fstack-protector-strong -Wl,-z,relro
    F90:                             gfortran -g -O2 -fstack-protector-strong -Wl,-z,relro
    Configure options:                       '--disable-option-checking' '--prefix=/usr' '--build=x86_64-linux-gnu' '--includedir=${prefix}/include' '--mandir=${prefix}/share/man' '--infodir=${prefix}/share/info' '--sysconfdir=/etc' '--localstatedir=/var' '--libdir=${prefix}/lib/x86_64-linux-gnu' '--libexecdir=${prefix}/lib/x86_64-linux-gnu' '--disable-maintainer-mode' '--disable-dependency-tracking' '--enable-shared' '--enable-fc' '--disable-rpath' '--disable-wrapper-rpath' '--sysconfdir=/etc/mpich' '--libdir=/usr/lib/x86_64-linux-gnu' '--includedir=/usr/include/mpich' '--docdir=/usr/share/doc/mpich' '--with-hwloc-prefix=system' '--enable-checkpointing' '--with-hydra-ckpointlib=blcr' 'build_alias=x86_64-linux-gnu' 'MPICHLIB_CFLAGS=-g -O2 -fstack-protector-strong -Wformat -Werror=format-security' 'MPICHLIB_CXXFLAGS=-g -O2 -fstack-protector-strong -Wformat -Werror=format-security' 'MPICHLIB_FFLAGS=-g -O2 -fstack-protector-strong' 'MPICHLIB_FCFLAGS=-g -O2 -fstack-protector-strong' 'CFLAGS=-g -O2 -fstack-protector-strong -Wformat -Werror=format-security -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -O2' 'LDFLAGS=-Wl,-z,relro ' 'CPPFLAGS=-D_FORTIFY_SOURCE=2 -I/build/mpich-Lgqv02/mpich-3.1/src/mpl/include -I/build/mpich-Lgqv02/mpich-3.1/src/mpl/include -I/build/mpich-Lgqv02/mpich-3.1/src/openpa/src -I/build/mpich-Lgqv02/mpich-3.1/src/openpa/src -I/build/mpich-Lgqv02/mpich-3.1/src/mpi/romio/include' 'CXXFLAGS=-g -O2 -fstack-protector-strong -Wformat -Werror=format-security -g -O2 -fstack-protector-strong -Wformat -Werror=format-security' 'F77=gfortran' 'FFLAGS=-g -O2 -fstack-protector-strong -g -O2 -fstack-protector-strong -O2' 'FC=gfortran' 'FCFLAGS=-g -O2 -fstack-protector-strong -g -O2 -fstack-protector-strong' '--cache-file=/dev/null' '--srcdir=.' 'CC=gcc' 'LIBS=-lrt -lcr -lpthread '
    Process Manager:                         pmi
    Launchers available:                     ssh rsh fork slurm ll lsf sge manual persist
    Topology libraries available:            hwloc
    Resource management kernels available:   user slurm ll lsf sge pbs cobalt
    Checkpointing libraries available:       blcr
    Demux engines available:                 poll select

@mrakitin

Microsoft MPI Startup Program [Version 7.1.12437.25] also gives wrong data.

@mrakitin

I tried it on different OSes and with different MPI installations. Conclusion: the calculation is always wrong when mpiexec is used, on both Windows and Linux; when pure Python is executed, the result looks correct. It's a bug in SRW. I will create a separate ticket for it in the SRW project, then link it to this ticket and close this one.

@robnagler

Confirmed: I'm running with mpirun on 4 cores, and it is showing ~8e15. This morning I was running with one core.

@mrakitin

mrakitin commented Sep 6, 2016

I created an issue in mrakitin/SRW to debug this: mrakitin/SRW#6.

mrakitin closed this as completed on Sep 6, 2016
@mrakitin

mrakitin commented Sep 6, 2016

I debugged the issue and found that the incorrect values appear when v.wm_na = v.sm_na = 1 is used. If it's set to 2 or more, there is no such problem. I'm going to fix it on the SRW side, but meanwhile it can be worked around in Sirepo by commenting out the line where v.sm_na is set to 1 (a sketch follows below).

@robnagler, what do you think?
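
A sketch of the workaround in the generated mpi_run.py (the same lines as quoted further down in this thread; the only change is the commented-out assignment):

import srwl_bl
v = srwl_bl.srwl_uti_parse_options(varParam, use_sys_argv=False)
source_type, mag = srwl_bl.setup_source(v)
# Workaround: don't force one macro-electron per "iteration" -- SRW returns
# a wrong flux under MPI when v.wm_na = v.sm_na = 1.
# v.wm_na = v.sm_na = 1
v.wm_ns = v.sm_ns = 24  # number of "iterations" per save, best set to the number of processes
op = set_optics()
srwl_bl.SRWLBeamline(_name=v.name).calc_all(v, op)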

@mrakitin

mrakitin commented Sep 7, 2016

Fixed in 813df49. Now 5 particles_per_core are used for these calculations.

@robnagler, it looks strange that v.sm_na was set to 1 on cpu-001/alpha (I noticed that in the generated mpi_run.py files):

import srwl_bl
v = srwl_bl.srwl_uti_parse_options(varParam, use_sys_argv=False)
source_type, mag = srwl_bl.setup_source(v)
v.wm_na = v.sm_na = 1
# Number of "iterations" per save is best set to num processes
v.wm_ns = v.sm_ns = 24
op = set_optics()
srwl_bl.SRWLBeamline(_name=v.name).calc_all(v, op)

The following condition should be False on these systems in https://github.com/radiasoft/sirepo/blob/master/sirepo/pkcli/srw.py#L57:

if pkconfig.channel_in('dev'):
    p['particles_per_core'] = 1

Are we on the dev channel on these servers?

Anyway, according to the information from Oleg, v.sm_na = 1 should not be used, so I set it to 5.
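
To answer the channel question, something like this could be run on each server (a sketch; it assumes pykern is importable in the service's Python environment, and PYKERN_PKCONFIG_CHANNEL is the variable checked with grep further down):

import os
from pykern import pkconfig

# The generated script gets particles_per_core = 1 only on the 'dev' channel,
# so both of these should indicate a non-dev channel on alpha/cpu-001.
print('dev channel?', pkconfig.channel_in('dev'))
print('PYKERN_PKCONFIG_CHANNEL =', os.environ.get('PYKERN_PKCONFIG_CHANNEL'))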

@robnagler

robnagler commented Sep 7, 2016 via email

@mrakitin

mrakitin commented Sep 7, 2016

Thanks Rob, I see the point about "service". However, on cpu-001 I have the same configuration as on alpha:

bivio_service_base_dir=/var/lib
bivio_service_channel=alpha
rabbitmq_host=rabbitmq
sirepo_db_dir=/var/db/sirepo
sirepo_beaker_secret=$sirepo_db_dir/beaker_secret
sirepo_port=7000

However, the generated file contained v.wm_na = v.sm_na = 1.

@robnagler

I think I fixed this in radiasoft/devops@7106b33. Download and install /etc/init.d/bivio-service.functions. Verify after restart with:

grep PYKERN_PKCONFIG_CHANNEL /var/lib/sirepo/init.log

@robnagler

Restart both celery and sirepo.

@mrakitin

mrakitin commented Sep 7, 2016

Done, thanks.

@mrakitin

mrakitin commented Sep 7, 2016

Strange, the number of macro-electrons is not what limits the calculation (the reported progress goes well beyond 100%):

fluxAnimation# tail -f srw_mpi.log
[2016-09-07 21:17:58]: Done 479472 out of 99935 (479.78% complete)
[2016-09-07 21:17:58]: Done 479496 out of 99935 (479.81% complete)
[2016-09-07 21:17:58]: Done 479520 out of 99935 (479.83% complete)
[2016-09-07 21:17:58]: Done 479544 out of 99935 (479.86% complete)
[2016-09-07 21:17:58]: Done 479568 out of 99935 (479.88% complete)
[2016-09-07 21:17:59]: Done 479592 out of 99935 (479.90% complete)
[2016-09-07 21:17:59]: Done 479616 out of 99935 (479.93% complete)
[2016-09-07 21:17:59]: Done 479640 out of 99935 (479.95% complete)
[2016-09-07 21:18:00]: Done 479664 out of 99935 (479.98% complete)
[2016-09-07 21:18:00]: Done 479688 out of 99935 (480.00% complete)
fluxAnimation# cat run.log
../../../../../../../../home/vagrant/.pyenv/versions/2.7.10/lib/python2.7/site-packages/pykern-20160902.164834-py2.7.egg/pykern/pksubprocess.py:63:check_call_with_signals 1078: started: ['mpiexec', '--bind-to', 'none', '-n', '24', '/home/vagrant/.pyenv/versions/2.7.10/bin/python', 'mpi_run.py']
../../../../../../../../home/vagrant/.pyenv/versions/2.7.10/lib/python2.7/site-packages/pykern-20160902.164834-py2.7.egg/pykern/pksubprocess.py:72:check_call_with_signals 1078: exception: ['mpiexec', '--bind-to', 'none', '-n', '24', '/home/vagrant/.pyenv/versions/2.7.10/bin/python', 'mpi_run.py'] RuntimeError: error exit(1)
  File "/home/vagrant/.pyenv/versions/2.7.10/bin/sirepo", line 9, in <module>
    load_entry_point('sirepo==20160907.133350', 'console_scripts', 'sirepo')()
  File "/home/vagrant/.pyenv/versions/2.7.10/lib/python2.7/site-packages/sirepo-20160907.133350-py2.7.egg/sirepo/sirepo_console.py", line 18, in main
    return pkcli.main('sirepo')
  File "/home/vagrant/.pyenv/versions/2.7.10/lib/python2.7/site-packages/pykern-20160902.164834-py2.7.egg/pykern/pkcli/__init__.py", line 131, in main
    argh.dispatch(parser, argv=argv)
  File "/home/vagrant/.pyenv/versions/2.7.10/lib/python2.7/site-packages/argh/dispatching.py", line 174, in dispatch
    for line in lines:
  File "/home/vagrant/.pyenv/versions/2.7.10/lib/python2.7/site-packages/argh/dispatching.py", line 277, in _execute_command
    for line in result:
  File "/home/vagrant/.pyenv/versions/2.7.10/lib/python2.7/site-packages/argh/dispatching.py", line 260, in _call
    result = function(*positional, **keywords)
  File "/home/vagrant/.pyenv/versions/2.7.10/lib/python2.7/site-packages/sirepo-20160907.133350-py2.7.egg/sirepo/pkcli/srw.py", line 71, in run_background
    mpi.run_script(script)
  File "/home/vagrant/.pyenv/versions/2.7.10/lib/python2.7/site-packages/sirepo-20160907.133350-py2.7.egg/sirepo/mpi.py", line 67, in run_script
    return run_program([sys.executable or 'python', fn])
  File "/home/vagrant/.pyenv/versions/2.7.10/lib/python2.7/site-packages/sirepo-20160907.133350-py2.7.egg/sirepo/mpi.py", line 41, in run_program
    env=env,
  File "/home/vagrant/.pyenv/versions/2.7.10/lib/python2.7/site-packages/pykern-20160902.164834-py2.7.egg/pykern/pksubprocess.py", line 67, in check_call_with_signals
    raise RuntimeError('error exit({})'.format(rc))
Traceback (most recent call last):
  File "/home/vagrant/.pyenv/versions/2.7.10/bin/sirepo", line 9, in <module>
    load_entry_point('sirepo==20160907.133350', 'console_scripts', 'sirepo')()
  File "/home/vagrant/.pyenv/versions/2.7.10/lib/python2.7/site-packages/sirepo-20160907.133350-py2.7.egg/sirepo/sirepo_console.py", line 18, in main
    return pkcli.main('sirepo')
  File "/home/vagrant/.pyenv/versions/2.7.10/lib/python2.7/site-packages/pykern-20160902.164834-py2.7.egg/pykern/pkcli/__init__.py", line 131, in main
    argh.dispatch(parser, argv=argv)
  File "/home/vagrant/.pyenv/versions/2.7.10/lib/python2.7/site-packages/argh/dispatching.py", line 174, in dispatch
    for line in lines:
  File "/home/vagrant/.pyenv/versions/2.7.10/lib/python2.7/site-packages/argh/dispatching.py", line 277, in _execute_command
    for line in result:
  File "/home/vagrant/.pyenv/versions/2.7.10/lib/python2.7/site-packages/argh/dispatching.py", line 260, in _call
    result = function(*positional, **keywords)
  File "/home/vagrant/.pyenv/versions/2.7.10/lib/python2.7/site-packages/sirepo-20160907.133350-py2.7.egg/sirepo/pkcli/srw.py", line 71, in run_background
    mpi.run_script(script)
  File "/home/vagrant/.pyenv/versions/2.7.10/lib/python2.7/site-packages/sirepo-20160907.133350-py2.7.egg/sirepo/mpi.py", line 67, in run_script
    return run_program([sys.executable or 'python', fn])
  File "/home/vagrant/.pyenv/versions/2.7.10/lib/python2.7/site-packages/sirepo-20160907.133350-py2.7.egg/sirepo/mpi.py", line 41, in run_program
    env=env,
  File "/home/vagrant/.pyenv/versions/2.7.10/lib/python2.7/site-packages/pykern-20160902.164834-py2.7.egg/pykern/pksubprocess.py", line 67, in check_call_with_signals
    raise RuntimeError('error exit({})'.format(rc))
RuntimeError: error exit(1)

@mrakitin

mrakitin commented Sep 8, 2016

The progress wasn't reported correctly in the srw_mpi.log file. Fixed via radiasoft/SRW-light@afcdfd4. Alpha will need to be rebuilt to pick up the latest version of SRW.
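
For illustration only (this is not the actual SRW code): a "percent complete" can exceed 100% when the numerator and the denominator are counted on different bases, e.g. a counter that keeps accumulating across MPI ranks divided by the single-run total:

# Toy example of the kind of mismatch that produces the >100% values above.
total = 99935              # the per-run total printed in srw_mpi.log
done_accumulated = 479688  # a counter that kept growing past the total
print('%.2f%% complete' % (100.0 * done_accumulated / total))  # prints 480.00% complete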

mrakitin closed this as completed on Sep 8, 2016