srw: fluxAnimation broken on beta #547

Closed
moellep opened this issue Nov 23, 2016 · 20 comments
@moellep
Member

moellep commented Nov 23, 2016

The fluxAnimation only works about 1 out of 10 tries on beta. Usually it shows "Initialization Simulation" and does not complete. For example, open the Tabulated Undulator Example and select Start New Simulation on the Spectral Flux report.

Possibly related: the particle count shown on the report does not match alpha - alpha shows a result such as "Completed particle: 20 / 100000" while beta shows "Completed particle: 20 / 99990". Both reports had 100000 particles selected.

@moellep
Member Author

moellep commented Nov 23, 2016

mrakitin noted that the different particle count is probably because beta uses a different number of cores for the calculation than alpha, so that is unrelated to this problem.

When the problem occurs, one CPU is maxed at 100% while the others remain unused. When the process works correctly, at least two CPUs are at 100%. One difference between alpha and beta is that beta uses NFS for the file system.

@robnagler
Member

Interim notes... The main process is stuck in an endless loop of:

epoll_wait(4, [], 32, 0)                = 0

No files are written by the process. It doesn't seem to matter if it is apa14 or apa19.

When I start the command manually inside a container (using su - vagrant) it runs fine. It seems to be repeatable.

@robnagler
Member

I think the problem is in srwl_wfr_emit_prop_multi_e in srwlib.py. Only one process should create the directory. And the directory creation should be wrapped in an exception handler, because there's a race condition between the existence test and the mkdir. Using a library, the code should look like:

    if rank == 0:
        pkio.mkdir_parent(log_dir)

If you don't want a dependency on pykern and py.path, the code needs to be written carefully. pkio uses py.path.ensure, which handles the race condition properly.
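
For illustration, a careful version without pykern might look roughly like the sketch below (mkdir_if_missing is just a hypothetical helper name, not something in srwlib.py):

    import errno
    import os

    def mkdir_if_missing(path):
        # Assert that the directory exists; tolerate another process creating
        # it between any existence check and this call.
        try:
            os.makedirs(path)
        except OSError as e:
            if e.errno != errno.EEXIST:
                raise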

After some debugging, I was able to reproduce the error by deleting the directory on apa11 and then running the command on apa19, simulating how it would be started by Celery. I think MPI gets confused when there is an exception in a child and the process loops. I suspect some coding is required in srwlib.py to handle errors properly when an exception occurs. In any event, I was able to force the stack trace by sending a SIGINT to the child that was spinning:

MPI_ABORT was invoked on rank 2 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpiexec noticed that process rank 1 with PID 233 on node e420c5cca08d exited on signal 2 (Interrupt).
--------------------------------------------------------------------------
[e420c5cca08d:00230] 1 more process has sent help message help-mpi-api.txt / mpi-abort
[e420c5cca08d:00230] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
DEBUG MESSAGE: Actual number of macro-electrons: 99990
Traceback (most recent call last):
  File "mpi_run.py", line 222, in <module>
    main()
  File "mpi_run.py", line 220, in main
    srwl_bl.SRWLBeamline(_name=v.name).calc_all(v, op)
  File "/home/vagrant/.pyenv/versions/py2/lib/python2.7/site-packages/srwl_bl.py", line 1558, in calc_all
    _fname = os.path.join(_v.fdir, _v.sm_fn) if(len(_v.sm_fn) > 0) else '')
  File "/home/vagrant/.pyenv/versions/py2/lib/python2.7/site-packages/srwl_bl.py", line 845, in calc_arb_spec_me
    _file_path = _fname, _char = charMultiE)
  File "/home/vagrant/.pyenv/versions/py2/lib/python2.7/site-packages/srwlib.py", line 4279, in srwl_wfr_emit_prop_multi_e
    os.mkdir(log_dir)
OSError: [Errno 17] File exists: '/var/db/sirepo/user/53mly0X5/srw/w0dWgbfV/fluxAnimation/__srwl_logs__'

@mrakitin
Collaborator

Thanks for the analysis @robnagler, I'll change the SRW code to avoid the race condition.

@mrakitin
Collaborator

@robnagler, I think I fixed the issue on the SRW side - mrakitin/SRW#24. The new code is available in radiasoft/SRW-light@695a602.

@moellep, could you please release it on alpha and if possible on beta to check how it works now? Thanks.

@robnagler
Member

It's not quite right. You should copy what py.path.local does:

            try:
                self.mkdir()
            except py.error.EEXIST:
                # race condition: file/dir created by another thread/process.
                # complain if it is not a dir
                if self.check(dir=0):
                    raise

The race condition is that the directory may exist already (for whatever reason) and therefore it is not an error to recreate it.

@mrakitin
Collaborator

I don't think we need to recreate it. It may contain a bunch of previous logs if you run SRW examples apart from Sirepo. The code will attempt to create a dir only if it's the master process and the directory does not exist. What's wrong with that?

@moellep
Member Author

moellep commented Nov 30, 2016

The current code has been released to beta and the report runs OK. I did notice that occasionally the error below occurs and halts the UI, even though reloading the page shows the simulation is still running.
[screenshot: error message in the UI, 2016-11-30]

@mrakitin
Collaborator

Thanks Paul! The error you observe has already been reported - #555. I also saw it. If you switch to the beamline page and go back, the error disappears, but it's still not good to have it, and I need to catch and solve it.

@mrakitin
Collaborator

@moellep, I need to note that the error does not stop the simulation; the error appears only in the UI.

@robnagler
Member

In a distributed system, race conditions are amplified so they show up more frequently, which is why this problem was reproducible. The problem is that operations have to be atomic and, if at all possible, idempotent. This is similar to what I was saying about writing files. In order to write files in a distributed system, one should write them to a temporary location and then rename them to the well-known location. Rename is atomic. The same thing is true for directory creation.

In this particular case, the directory needs to exist in order for you to write the logs. Creating it is merely stating that a directory must exist, so if the mkdir fails with EEXIST that's fine, because you have accomplished the goal. If the problem you are trying to solve is that the files must be unique, I would recommend creating them in the current directory with a timestamp and a non-cryptographic nonce and skipping the subdirectory creation entirely. The reading process can then find the most recent file by a simple name search (put the date first in the name). In the case of Sirepo, the main directory is created each time.
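
As a rough sketch of the write-to-temp-then-rename idea (the helper and the file-name format below are illustrative, not the actual SRW or Sirepo code):

    import os
    import random
    import tempfile
    from datetime import datetime

    def write_atomically(dirname, basename, data):
        # Write to a temporary file in the same directory, then rename it into
        # place; rename within one filesystem is atomic, so readers never see
        # a partially written file.
        fd, tmp_path = tempfile.mkstemp(dir=dirname)
        try:
            os.write(fd, data)
        finally:
            os.close(fd)
        os.rename(tmp_path, os.path.join(dirname, basename))

    # Date-first name so the newest file sorts last; the random suffix is a
    # non-cryptographic nonce to keep names unique across processes.
    log_name = '{}_{:06d}_srwl_stat.log'.format(
        datetime.now().strftime('%Y-%m-%d_%H-%M-%S'), random.randint(0, 999999))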

The fundamental principle is that you cannot assume that two sequential operations will see the same state unless there is some type of locking. NFS locking is available, but it has specific semantics that would require us to write special code, and not all implementations of NFS turn on locking or respect it. In this specific case, the object we want to lock may or may not exist, so it cannot itself be used for locking; you would need to lock the parent directory, and all processes would need to cooperate in the locking protocol properly or you might end up with deadlocks or errors. Since that means implementing cross-library locking, I strongly recommend we avoid this issue. Fundamentally, we would want to build a distributed system using synchronized IPC instead of file I/O, but again, that's not going to happen without even more code.

This problem shows up quite often. It is essentially a cache consistency issue. When a client asks for something and assumes that state will hold across calls to the server, it is assuming its local copy (cache) is consistent with the server, which it is not. What might happen in our case is that apa11 deletes the directory, but the client (apa19) might still think the directory exists, so the mkdir might fail. Therefore, the code probably should have a retry to handle this. You can't simply test for existence and then make the directory, because this too may fail. What you want is to assert existence with mkdir and only fail if the mkdir returns some other error than EEXIST.

BTW, this is often a hard problem to understand. Apple's WebDAV implementation has this problem, for example. Writing distributed systems is hard, unfortunately, but there are many tools out there which pretend it is easy, e.g. MPI, which end up failing in ways that are difficult to debug.

@mrakitin
Collaborator

@robnagler, thanks for the valuable information. I see that I need to use low-level system calls rather than a simple os.path.isdir() check. What about this kind of solution - http://stackoverflow.com/a/600612/4143531? It seems similar to what you explained.
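
For reference, the pattern in that answer is roughly along these lines (names here are illustrative):

    import errno
    import os

    def mkdir_p(path):
        # Equivalent of `mkdir -p`: create the directory, and treat "already
        # exists as a directory" as success rather than an error.
        try:
            os.makedirs(path)
        except OSError as exc:
            if exc.errno == errno.EEXIST and os.path.isdir(path):
                pass
            else:
                raise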

@mrakitin
Collaborator

I wonder if we need this additional test:

    if ... and os.path.isdir(path):

@robnagler
Member

robnagler commented Nov 30, 2016

The code is fine.

You need the additional test to be "correct" (sort of). This comes down to a user interface issue: Should the program fail if a mkdir fails because the path is actually a file? That's what that test is for.

I still strongly recommend you skip the subdirectory creation. From a UI perspective, you are an expert, and you know that this directory will be created. A new end user is not likely to know what is in __srwl_logs__ or even to know to look for it. Rather, just do the obvious thing in the current directory, and you will make things easier for the end user. From Sirepo's perspective, this is all hidden, but it does add complexity and, as we have seen, bugs. By eliminating the directory creation, you eliminate the possibility of this particular bug, and, in general, you make the code more robust by having fewer moving parts.

@mrakitin
Collaborator

Thanks. I agree that it's simpler and more transparent to have a log file without sub-directories. However, @ochubar requested that the logs be placed in that directory, so I cannot avoid creating the dir at this point. Probably I can check if the directory exists and, if so, place the logs in it; otherwise write the logs to the current dir. @robnagler, what are your thoughts about it? I'm a bit constrained by fitting the code to two projects with different approaches, so I cannot simply use external dependencies or introduce new behavior.

@robnagler
Member

I wouldn't write more code. :( Sometimes what a user wants has unintended consequences, and it is up to the programmer to communicate that. That's one of the reasons we say "simple ain't easy". :)

Perhaps you and @ochubar should have a conversation about this. My belief is that simpler structures (no extra directory) yield less complex and therefore more reliable code. There are good reasons to have the extra directory, but there are also good reasons not to have it. That's the hard part of good design: balancing tradeoffs.

@mrakitin
Collaborator

I think I'll have this code:

    log_dir = os.path.abspath('__srwl_logs__')  #MR04112016
    if not os.path.isdir(log_dir):  #MR30112016
        log_dir = os.getcwd()
    timestamp = '{:%Y-%m-%d_%H-%M-%S}'.format(datetime.datetime.now())
    log_file = 'srwl_stat_wfr_emit_prop_multi_e_{}'.format(timestamp)
    log_path = os.path.join(log_dir, log_file)

In this case, if the directory was created beforehand, the logs will go there; otherwise they will be written to the current dir. This way we avoid creating the directory and the possible race conditions. It will be simpler for Sirepo and more convenient for stand-alone SRW simulations.

@robnagler
Member

I realize this works, but it is too complicated imho.

mrakitin added a commit that referenced this issue Nov 30, 2016
- mrakitin/SRW#24 - avoid creation of the logs dir (a user can create it manually if it's necessary to store the logs there).
- #547 - should fix the issue with hanging animated flux report on beta due to there is no race conditions with directory creation.
- #555 - the problem should be fixed since the JSON logs are now created atomically.
mrakitin added a commit to radiasoft/SRW-light that referenced this issue Nov 30, 2016
- mrakitin/SRW#24 - avoid creation of the logs dir (a user can create it manually if it's necessary to store the logs there).
- radiasoft/sirepo#547 - should fix the issue with hanging animated flux report on beta due to there is no race conditions with directory creation.
- radiasoft/sirepo#555 - the problem should be fixed since the JSON logs are now created atomically.
mrakitin added a commit to mrakitin/SRW that referenced this issue Nov 30, 2016
- #24 - avoid creation of the logs dir (a user can create it manually if it's necessary to store the logs there).
- radiasoft/sirepo#547 - should fix the issue with hanging animated flux report on beta due to there is no race conditions with directory creation.
- radiasoft/sirepo#555 - the problem should be fixed since the JSON logs are now created atomically.
@mrakitin
Collaborator

mrakitin commented Dec 1, 2016

I think I fixed it in 54b4134 and in SRW:

Should also fix #555. Need to test it on beta.

@mrakitin
Collaborator

mrakitin commented Dec 2, 2016

I've been asked to bring back the code that creates the directories. I'll add checks with errno.EEXIST, and the dir will be created only by the master process.
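
A minimal sketch of how that could fit together, assuming mpi4py's MPI.COMM_WORLD is used to obtain the rank (the actual SRW code may get the rank differently):

    import errno
    import os
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    log_dir = os.path.abspath('__srwl_logs__')
    if comm.Get_rank() == 0:
        try:
            os.mkdir(log_dir)
        except OSError as e:
            # The directory already existing is the desired end state.
            if e.errno != errno.EEXIST:
                raise
    # No rank proceeds until the master has ensured the directory exists.
    comm.Barrier()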
