Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use copy(deepcopy=true) for checkpointing #172

Merged
merged 6 commits into from
Aug 23, 2024

Conversation

BenjaminRodenberg
Copy link
Member

There seems to be a bug when using checkpointing. I stumbled across this with subcycling:

  • Perpendicular flap case with FEniCS and OpenFOAM as given in precice/tutorials@c20ed5b. Use subcycling in solid.py by setting fenics_dt = precice_dt / 10. Here, OpenFOAM runs into an exception after some time. The vtk files also start to obviously look faulty after some time.
  • Perpendicular flap case with FEniCS and Fake fluid as given in precice/tutorials@c20ed5b. Use subcycling like above. Simulation runs through, but vtk results are incorrect.

Looking at the value of u_n in any of these cases shows that we do not correctly load the checkpoint we originally stored. I assume there is some unintended pointer-magic happening. I used the following debugging statements in solid.py to sample the solution at the tip of the flap:

if precice.requires_writing_checkpoint():  # write checkpoint
    print(f"Store checkpoint: {(u_n(0,H), v_n(0,H), a_n(0,H)),t,n}")
...
if precice.requires_reading_checkpoint():  # roll back to checkpoint
    print(f"Read checkpoint: {(u_cp(0,H), v_cp(0,H), a_cp(0,H)),t_cp,n_cp}")
    print(f"Overwrites: {(u_n(0,H), v_n(0,H), a_n(0,H)),t,n}")
    print(f"Ignores: {u_np1(0,H)}")
    ...
else:
    print(f"u_n: {(u_n(0,H), v_n(0,H), a_n(0,H)),t,n}")
...

The fixes applied here seem to avoid the error.

There are still some todos:

  • Test this to make sure we don't get a similar error again.
  • Understand performance implications. I have the impression that creating a deep-copy slows down the process. Would be nice to find a more efficient way, but the primary goal is to fix the bug.

@NiklasVin
Copy link
Collaborator

NiklasVin commented Jun 30, 2024

Before I start working on the TODOs of this PR, I wanted to understand what was going on here since I didn't experience any issues in the partitioned heat tutorial.
To this end, I ran some tutorials with subcycling and did some research about the copy function.

Tests (without the suggested fix)

  1. Partitioned Heat equation
    I modified the heat.py file such that only one participant does subcycling.
    It seems like in this tutorial, the checkpointing doesn't affect the solution (or at least not so severely)
  2. Elastic Tube 3D
    Similar to the perpendicular flap case, I set fenics_dt=precice_dt/5. The simulation crashed as well
  3. Perpendicular Flap OpenFOAM-OpenFOAM
    Again, the solid participant does subcycling and this is what you'd get shortly before the program crash:
    flap
  4. Perpendicular Flap Fluid Fake-FEniCS
    The picture speaks for itself:
    flap_fake
  5. Perpendicular Flap Fluid Fake-OpenFOAM
    After some time, the simulation explodes as well, but before that the flap does not become so wiggly as the one from 4. It is just bending to the right and grows.

What's wrong with copy() ?

Though in the current dolfin documentation about Function.copy(), the dof are deep copied, but as we use legacy FEniCS this is how the copy function is implemented (when I started looking at the documentation, I haven't considered that the dolfin version FEniCS uses is outdated; that's why I am pointing that out).
So, I think it is necessary to use a copy(deepcopy=true) with the current state of the adapter.

@NiklasVin
Copy link
Collaborator

Since the odd behavior isn't solely restricted to the OpenFOAM-FEniCS perpendicular flap tutorial but also to cases where the FEniCS adapter isn't used at all, my guess would be that there is a bug elsewhere as well.

What do you think?

@IshaanDesai
Copy link
Member

Since the odd behavior isn't solely restricted to the OpenFOAM-FEniCS perpendicular flap tutorial but also to cases where the FEniCS adapter isn't used at all, my guess would be that there is a bug elsewhere as well.

The OpenFOAM-OpenFOAM failure definitely makes this more confusing. Otherwise I could imagine that something goes wrong in the Python bindings, and hence all Python-based participants produce failures.

@uekerman
Copy link
Member

uekerman commented Jul 2, 2024

I could imagine that you have two independent issues here. Maybe ignoring OpenFOAM for the moment is the more sane strategy.

@BenjaminRodenberg
Copy link
Member Author

@NiklasVin I assume you already did this: Can you also do the experiment from above including the changes from precice/tutorials#554 but without deepcopy=true? I would assume that you do not see any problems in this case because you are avoiding the problem on the side of the user.

@NiklasVin
Copy link
Collaborator

@NiklasVin I assume you already did this: Can you also do the experiment from above including the changes from precice/tutorials#554 but without deepcopy=true? I would assume that you do not see any problems in this case because you are avoiding the problem on the side of the user.

Exactly. I tested it without deepcopy=true and everything works fine.

@BenjaminRodenberg
Copy link
Member Author

Based on the current state I would still suggest to merge this PR and use deepcopy by default. Main reason: It makes the usage of the FEniCS adapter more secure and we cannot run that easily in issues like the one fixed via precice/tutorials#554. So generally I would consider this as a usability improvement at cost of performance. If we learn that the performance cost is too high, we could still offer an option to set deepcopy=false for experienced users that implement safe update schemes like in precice/tutorials#554.

@IshaanDesai, @uekerman, @NiklasVin: Any comments? If not, I would like to merge this PR.

@IshaanDesai
Copy link
Member

Makes sense, lets merge this 👍

@NiklasVin
Copy link
Collaborator

A short comment regarding the performance aspect: I tested the runtime of copying a dolfin function both with and without deep-copying. On average, the deep-copy variant is about 13% slower (only copying the function, not the whole execution of the simulation). So, the overall runtime shouldn't be affected too much, at least, it wasn't when comparing precice/tutorials#554 with the suggested changes here.
So, it makes sense to merge this.

Copy link
Collaborator

@NiklasVin NiklasVin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes look good to me. 👍
I am opening an issue regarding the crash of the OpenFOAM-OpenFOAM example from above.

@BenjaminRodenberg BenjaminRodenberg merged commit 8716694 into develop Aug 23, 2024
7 checks passed
@BenjaminRodenberg BenjaminRodenberg added this to the v2.2.0 milestone Oct 31, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants