Bugfix in distributed fill halos #3714
Why is a counter the best solution? This seems implicit and indirect. Can we write the code to be more obvious and direct? Where is |
Maybe more specifically I don't understand the motivation for this: Oceananigans.jl/src/DistributedComputations/halo_communication.jl Lines 44 to 48 in 315e66b
This is a kind of abstraction for a send/recv event, but it's super implicit, just consisting of integers, rather than simply recording this information as symbols or strings (is a number the only way to generate a unique ID for a field?). I can't figure out why we would record the Z location, this seems random. Why not record the 3D location directly (not use a digit)? Otherwise this code is seemingly more specific than it needs to be. It would be annoying to debug this too since you'd have to be constantly computing these codes. |
The MPI tag must be an integer. The maximum value is vendor dependent but it is quite strict. We could probably record the whole location without running into integer size issues. We cannot do
(Center, Center, Center) -> 0
(Center, Center, Face) -> 1
(Center, Face, Center) -> 2
(Face, Center, Center) -> 3
(Face, Face, Center) -> 4
(Center, Face, Face) -> 5
(Face, Center, Face) -> 6
(Center, Center, Nothing) -> 7
(Face, Center, Nothing) -> 8
(Center, Face, Nothing) -> 9
...
If all the permutations fit into 99 values, we consume only 2 digits which would probably fit within the limits. We can also compress the
west to east -> 0
east to west -> 1
south to north -> 2
north to south -> 3
top to bottom -> 4
bottom to top -> 5 |
But why does the ID for the process in julia have to be an integer? It's only MPI that needs an integer --- not us. You can distinguish the problem of creating an abstraction for a process, and the problem of generating an integer tag from that abstraction. We skip the intermediate step of a human-understandable abstraction here. I'm not suggesting we use a different process for creating an MPI tag. The point is that we have two distinct problems that are being conflated into one:
We should do 1 separately from 2, rather than combining them into one thing that is pretty hard to understand. |
I think this is fine and could be encapsulated in some function But first we need to create the concept of the "ID" for a fill halo event. The rest is detail. |
location_counter = 0
for LX in (:Face, :Center, :Nothing)
    for LY in (:Face, :Center, :Nothing)
        for LZ in (:Face, :Center, :Nothing)
            @eval loc_id(::$LX, ::$LY, ::$LZ) = $location_counter
            location_counter += 1
        end
    end
end
location_counter = 0
for LX in (:Face, :Center, :Nothing)
    for LY in (:Face, :Center, :Nothing)
        for LZ in (:Face, :Center, :Nothing)
            @eval loc_id(::$LX, ::$LY, ::$LZ) = $location_counter
            location_counter += 1
        end
    end
end
loc_id(lx, ly, lz) = loc_id(lx) + loc_id(ly) + loc_id(lz)
But why don't you reserve a digit for each location in xyz rather than adding them together? like
Nothing, Nothing, Nothing = "000"
Center, Nothing, Nothing = "100"
Center, Nothing, Face = "102"
etc
Oh I see, the integer has to be small
This does not lead to a unique identifier for each location combination
Wait, there are 27 locations. You need 2 digits, not 1.
Ah right, so the problem is equivalent to converting a number from base-3 to base-10, eg
loc_id(lx, ly, lz) = loc_id(lx) + 3 * loc_id(ly) + 9 * loc_id(lz)
This counts from 0 ("000") up to 26 ("222")
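The base-3 encoding above can be checked with a minimal sketch. The `Face` and `Center` structs below are stand-ins so the snippet runs on its own (Oceananigans defines its own location types), the helper name `digit` is hypothetical, and the digit assignment (Face = 0, Center = 1, Nothing = 2) is an assumption that simply follows the loop order in the quoted code.

```julia
# Stand-in location types so the example is self-contained.
struct Face end
struct Center end

# One base-3 digit per direction (assumed ordering).
digit(::Face)    = 0
digit(::Center)  = 1
digit(::Nothing) = 2

# Weighted sum: x is the least significant digit, z the most significant.
loc_id(lx, ly, lz) = digit(lx) + 3 * digit(ly) + 9 * digit(lz)

locations = (Face(), Center(), nothing)
ids = [loc_id(lx, ly, lz) for lx in locations, ly in locations, lz in locations]

# All 27 combinations map to distinct values in 0:26.
@assert sort(vec(ids)) == collect(0:26)
```

With the unweighted sum `digit(lx) + digit(ly) + digit(lz)`, different combinations such as (Face, Center, Face) and (Center, Face, Face) would collide, which is why the weights 1, 3, and 9 are needed.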
You're gonna want to implement a struct that's something like
struct HaloFillingEvent
    location
    z_indices
    from_side
    to_side
    field_id
end
and then a function
mpi_tag(hfe::HaloFillingEvent) = # number |
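As a rough sketch of how this separation could look, the snippet below fills in the suggested struct with concrete field types and one possible `mpi_tag`. Only the struct name, the field names, and the function name come from the suggestion; the field types, the two lookup tables, and the tag layout are assumptions made for illustration, and `z_indices` is deliberately ignored by the tag here.

```julia
# Human-readable description of one halo-filling event.
struct HaloFillingEvent
    location  :: NTuple{3, Symbol}   # e.g. (:Center, :Center, :Face)
    z_indices :: UnitRange{Int}
    from_side :: Symbol
    to_side   :: Symbol
    field_id  :: Int
end

# Hypothetical lookup tables used only to build the integer tag.
const LOCATION_DIGIT = Dict(:Face => 0, :Center => 1, :Nothing => 2)
const SIDE_DIGIT = Dict(:west => 0, :east => 1, :south => 2,
                        :north => 3, :bottom => 4, :top => 5)

# Assumed layout: side + 6 * (location + 27 * field_id).
function mpi_tag(hfe::HaloFillingEvent)
    lx, ly, lz = hfe.location
    loc = LOCATION_DIGIT[lx] + 3 * LOCATION_DIGIT[ly] + 9 * LOCATION_DIGIT[lz]
    return SIDE_DIGIT[hfe.from_side] + 6 * (loc + 27 * hfe.field_id)
end

event = HaloFillingEvent((:Center, :Center, :Face), 1:16, :west, :east, 2)
tag = mpi_tag(event)   # the integer handed to MPI
```

The event can be printed and inspected at the REPL with its symbolic fields, while the integer only appears at the point where MPI needs it.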
I am not sure about this solution. The tag is used immediately (and only) where it is created, is not recorded, and is automatically destroyed by MPI after the communication is complete, so I do not see the immediate utility of the extra steps, or of saving something in memory. A function that, given architecture, location, and side, spits out a unique tag seems sufficient for interpretability without having to record the output somewhere (it's a bit like a hash function: if you have the function and the inputs, you have everything you need). |
Maybe (for debugging purposes) a function that reverses the hash, i.e. given the tag spits out the inputs, could be useful. |
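A reversible tag could be sketched as a mixed-radix number, so that `divrem` recovers the inputs. The function names `encode_tag` and `decode_tag`, the layout, and the use of a plain integer location id in 0:26 are all hypothetical; the side ordering follows the list given earlier in the thread. The MPI standard only guarantees tags up to 32767, so the counter would have to stay small with a layout like this.

```julia
# Side ordering taken from the mapping listed in the thread.
const SIDES = [:west_to_east, :east_to_west, :south_to_north,
               :north_to_south, :top_to_bottom, :bottom_to_top]

# Assumed layout: side + 6 * (location + 27 * counter), with location in 0:26.
function encode_tag(counter::Int, location::Int, side::Symbol)
    side_id = findfirst(==(side), SIDES) - 1   # 0-based side index
    return side_id + 6 * (location + 27 * counter)
end

# Reverse the encoding by peeling off one radix at a time.
function decode_tag(tag::Int)
    rest, side_id = divrem(tag, 6)
    counter, location = divrem(rest, 27)
    return (counter = counter, location = location, side = SIDES[side_id + 1])
end

tag = encode_tag(3, 14, :south_to_north)
decoded = decode_tag(tag)
```

Here `decoded` is a named tuple, so `decoded.side` and `decoded.location` can be read directly while debugging a stalled communication.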
There are a few purposes:
with an actual list of the active requests (rather than simply counting them --- eg The objective is not to write the minimal code that will work, but to create a system that is human understandable. While a minimal functionality can be debugged and made to work once, it will be very brittle because if it breaks, it could shut down the whole system and prevent future development. Also |
Yes but since the object I'm suggesting is even smaller in memory than an integer (almost 0 in size except for |
Here's an example of future development that could be needed. Right now the tag involves the
we want something that is more like
and then the Now the code is more organized. PS why are |
From how I understand it, you are suggesting this
struct HaloFillingEvent
    location
    z_indices
    from_side
    to_side
    field_id
end
and then a function
mpi_tag(hfe::HaloFillingEvent) = # number
At the moment, the step is just this
mpi_side_tag(arch, location) = # number
The number is going to be the same, so the interpretability of that tag is not going to improve with the above solution. |
If you can demonstrate that this code can be implemented in a way that encourages debugging, inspection at the REPL, etc, without the abstraction, then that can be accepted. I just think that creating the abstraction is not only helpful for future people but will also help you organize your ideas. I would start with that and in the end if you find it's not useful, eliminate it. But I wouldn't start by designing code without it and "seeing what happens". The objective here is not simply to "make things work". |
The fact of the matter is the code is hard to interpret right now, so you have not demonstrated that the abstractions don't need a bit more work. I'm suggesting one solution but I agree, other strategies might succeed too. |
Why would you not simply keep track of the communications themselves, rather than just the number? There's no price to keeping a list of the events versus the number of them active so I don't understand why you throw away that info. |
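A minimal sketch of the difference, with no MPI involved: the `ActiveEvent` type and its fields are hypothetical and only illustrate that a list of events carries the count for free.

```julia
# Hypothetical record of one in-flight communication.
struct ActiveEvent
    field_name :: Symbol
    side       :: Symbol
end

active_events = ActiveEvent[]

# Two halo fills are launched.
push!(active_events, ActiveEvent(:u, :west))
push!(active_events, ActiveEvent(:T, :north))

n_active = length(active_events)                         # what a bare counter would store
pending_fields = [e.field_name for e in active_events]   # extra information the list keeps

# "Synchronization": all requests have completed, so the list is cleared.
empty!(active_events)
```

The counter is recovered as `length(active_events)`, and the list additionally says which fields and sides are still pending.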
We could do that, in fact the requests are stored in the What about changing the |
I don't follow. You have all of that information if you simply store the You can also store the pointer to the field itself rather than an "id" / number. |
What's the status of this? Is this the cause of what @kburns saw? |
I still find the distributed algorithm hard to understand, but we should merge this if it's a bugfix
This ensures that in fringe cases where only one rank updates the halos of a field, like
if rank == 0
    fill_halo_regions!(f; only_local_halos=true)
end
rank 0 does not end up being out of sync with other ranks. This is not something that would happen in a simulation, but it is still a protection against these types of situations. |
To asynchronously fill the halos of distributed fields, the code uses an incremental counter to track how many MPI requests are live and update the MPI send and receive tag. The counter is reset when communication is synchronized.
As it is defined right now, the counter is always incremented at the end of a fill_halo_regions! on a distributed grid, irrespective of what happened in the fill_halo_regions!, with the assumption that all cores participate in the fill_halo_regions! so the counters are correctly synchronized.
Oceananigans.jl/src/DistributedComputations/halo_communication.jl Lines 117 to 121 in 315e66b
Unfortunately, I experienced a situation where this was not the case.
In this case, I wanted to do different things on different cores, which is allowed when using the only_local_halos = true keyword argument (a very rare occurrence, but a possibility nonetheless).
For example, if we execute this code on the main branch
The mpi_tag will be 1 on rank 0 and 0 on other ranks. This means that in subsequent halo passes, MPI will stall because it cannot match the tag between the send and receive operations of cores that communicate with rank 0.
This PR fixes this issue by incrementing the counter only if we have actually launched an MPI send or receive operation, which happens when at least one of the bcs is a distributed boundary condition and only_local_halos == false.
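The desynchronization can be reproduced with a toy model that involves no MPI at all. `ToyRank`, `toy_fill_halos!`, and `run_scenario` are hypothetical names for illustration and are not part of Oceananigans; each toy rank simply carries its own tag counter.

```julia
# Each toy rank holds its own copy of the tag counter.
mutable struct ToyRank
    mpi_tag :: Int
end

# Stand-in for a halo fill. `communicates` is true only when a send or
# receive would actually be launched.
function toy_fill_halos!(rank::ToyRank; communicates::Bool, always_increment::Bool)
    if always_increment || communicates
        rank.mpi_tag += 1
    end
    return rank.mpi_tag
end

function run_scenario(always_increment::Bool)
    ranks = [ToyRank(0) for _ in 1:4]
    # Only the first rank performs a local-only fill; nothing is communicated.
    toy_fill_halos!(ranks[1]; communicates = false, always_increment = always_increment)
    return [r.mpi_tag for r in ranks]
end

old_tags = run_scenario(true)    # counter always incremented: [1, 0, 0, 0]
new_tags = run_scenario(false)   # incremented only on communication: [0, 0, 0, 0]
```

In the first scenario the first rank's next send would carry a tag that no neighbor expects, which is the stall described above; in the second all counters stay aligned.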