Bugfix in distributed fill halos #3714

Open
wants to merge 36 commits into main
Conversation

@simone-silvestri
Collaborator

commented Aug 16, 2024

To asynchronously fill the halos of distributed fields, the code uses an incremental counter to track how many MPI requests are live and to update the MPI send and receive tags. The counter is reset when communication is synchronized.

As currently defined, the counter is always incremented at the end of a fill_halo_regions! on a distributed grid, irrespective of what happened inside that fill_halo_regions!, on the assumption that all ranks participate in the call and the counters therefore stay synchronized.

# Switch to the next field to send
arch.mpi_tag[] += 1
return nothing
end

Unfortunately, I ran into a situation where this assumption does not hold.
I wanted to do different things on different ranks, which is allowed when using the only_local_halos = true keyword argument (a very rare occurrence, but a possibility nonetheless).

For example, if we execute this code on the main branch

using Oceananigans
using Oceananigans.DistributedComputations
using Oceananigans.BoundaryConditions: fill_halo_regions!

# run under MPI with more than one rank, e.g. `mpiexec -n 2 julia script.jl`
arch = Distributed(CPU())
grid = RectilinearGrid(arch, size = (2, 2, 1), extent = (1, 1, 1))
c = CenterField(grid)

if arch.local_rank == 0
    fill_halo_regions!(c; only_local_halos = true)
end

The mpi_tag will then be 1 on rank 0 and 0 on the other ranks. This means that in subsequent halo passes MPI will stall, because the tags of matching send and receive operations between rank 0 and its neighbors no longer agree.

This PR fixes the issue by incrementing the counter only if we have actually launched an MPI send or receive operation, which happens when at least one of the boundary conditions is a distributed boundary condition and only_local_halos == false.
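
In pseudocode, the guard looks roughly like this (a sketch, not the literal diff; requires_communication is a hypothetical stand-in for "at least one boundary condition is a distributed communication boundary condition"):

# Advance the tag counter only when MPI operations were actually posted
# (`requires_communication` is a hypothetical helper; see the description above)
if !only_local_halos && requires_communication(boundary_conditions)
    arch.mpi_tag[] += 1
end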

@navidcy added the bug 🐞 Even a perfect program still has bugs label on Aug 16, 2024
@glwagner
Member

Why is a counter the best solution? This seems implicit and indirect. Can we write the code to be more obvious and direct? Where is mpi_tag used?

@glwagner
Member

Maybe more specifically, I don't understand the motivation for this:

# Define functions that return unique send and recv MPI tags for each side.
# It's an integer where
# digit 1-2: an identifier for the field that is reset each timestep
# digit 3: an identifier for the field's Z-location
# digit 4: the side we send to/receive from

This is a kind of abstraction for a send/recv event, but it's super implicit, consisting just of integers rather than recording this information as symbols or strings (is a number the only way to generate a unique ID for a field?). I can't figure out why we would record the Z-location; this seems arbitrary. Why not record the 3D location directly (not as a digit)? Otherwise this code is seemingly more specific than it needs to be. It would also be annoying to debug, since you'd constantly have to be computing these codes.

@simone-silvestri
Collaborator Author

simone-silvestri commented Aug 19, 2024

The MPI tag must be an integer. The maximum value is vendor dependent, but it is quite strict: the smallest guaranteed maximum is 32767.

We could probably encode the whole location without running into the integer size limit. We cannot do
field_id * 100000 + loc_x * 10000 + loc_y * 1000 + loc_z * 100 + from_side * 10 + to_side because the tag could exceed 32767.
However, we could map the three location components to a single number with a dictionary:

(Center, Center, Center) -> 0
(Center, Center, Face) -> 1
(Center, Face, Center) -> 2
(Face, Center, Center) -> 3
(Face, Face, Center) -> 4
(Center, Face, Face) -> 5
(Face, Center, Face) -> 6
(Center, Center, Nothing) -> 7
(Face, Center, Nothing) -> 8
(Center, Face, Nothing) -> 9
...

If all the permutations fit into 99 values, we consume only two digits, which would probably fit within the limits.

We can also compress the from_side and to_side into only one digit because we have

west to east -> 0
east to west -> 1
south to north -> 2
north to south -> 3
top to bottom -> 4
bottom to top -> 5
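
Concretely, the tag could then be assembled along these lines (a sketch; location_id and side_pair_id are hypothetical lookup tables implementing the mappings above):

# With two digits for the location (0-26) and one for the side pair (0-5),
# tags stay below 32767 for up to 32 simultaneously communicating fields.
mpi_tag(field_counter, loc, side_pair) =
    field_counter * 1000 + location_id[loc] * 10 + side_pair_id[side_pair]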

@glwagner
Member

glwagner commented Aug 19, 2024

The MPI tag must be an integer.

But why does the ID for the process in Julia have to be an integer? It's only MPI that needs an integer --- not us. You can distinguish between the problem of creating an abstraction for a process and the problem of generating an integer tag from that abstraction. Here we skip the intermediate step of a human-understandable abstraction.

I'm not suggesting we use a different process for creating an MPI tag. The point is that we have two distinct problems that are being conflated into one:

  1. Create a schema for identifying fill halo events
  2. Generate MPI tags from fill halo events

We should do 1 separately from 2, rather than combining them into one thing that is pretty hard to understand.

@glwagner
Member

(Center, Center, Center) -> 0
(Center, Center, Face) -> 1
(Center, Face, Center) -> 2
(Face, Center, Center) -> 3
(Face, Face, Center) -> 4
(Center, Face, Face) -> 5
(Face, Center, Face) -> 6
(Center, Center, Nothing) -> 7
(Face, Center, Nothing) -> 8
(Center, Face, Nothing) -> 9
...

If all the permutations fit into 99 values, we consume only two digits, which would probably fit within the limits.

We can also compress the from_side and to_side into only one digit because we have

west to east -> 0
east to west -> 1
south to north -> 2
north to south -> 3
top to bottom -> 4
bottom to top -> 5

I think this is fine and could be encapsulated in some function generate_MPI_tag_from_fill_halo_ID.

But first we need to create the concept of the "ID" for a fill halo event. The rest is detail.

Comment on lines 46 to 54
location_counter = 0
for LX in (:Face, :Center, :Nothing)
    for LY in (:Face, :Center, :Nothing)
        for LZ in (:Face, :Center, :Nothing)
            @eval loc_id(::$LX, ::$LY, ::$LZ) = $location_counter
            global location_counter += 1
        end
    end
end
Member

Suggested change
location_counter = 0
for LX in (:Face, :Center, :Nothing)
    for LY in (:Face, :Center, :Nothing)
        for LZ in (:Face, :Center, :Nothing)
            @eval loc_id(::$LX, ::$LY, ::$LZ) = $location_counter
            global location_counter += 1
        end
    end
end
loc_id(lx, ly, lz) = loc_id(lx) + loc_id(ly) + loc_id(lz)

Member

But why don't you reserve a digit for each location in xyz rather than adding them together? like

Nothing, Nothing, Nothing = "000"
Center, Nothing, Nothing = "100"
Center, Nothing, Face = "102"

etc

Member

Oh I see, the integer has to be small

Collaborator Author

This does not lead to a unique identifier for each location combination

Member

Wait, there are 27 locations. You need 2 digits, not 1.

Member

Ah right, so the problem is equivalent to converting a number from base-3 to base-10, eg

loc_id(lx, ly, lz) = loc_id(lx) + 3 * loc_id(ly) + 9 * loc_id(lz)

This counts from 0 ("000") up to 26 ("222")
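
For illustration, a hypothetical inverse recovering the three per-dimension ids:

# Inverse of the base-3 encoding above (each id is 0, 1, or 2)
decode_loc_id(id) = (id % 3, (id ÷ 3) % 3, id ÷ 9)

# Round trip: ids (2, 1, 0) -> 2 + 3*1 + 9*0 = 5 -> (2, 1, 0)
@assert decode_loc_id(5) == (2, 1, 0)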

@glwagner
Member

glwagner commented Aug 20, 2024

You're gonna want to implement a struct that's something like

struct HaloFillingEvent
    location
    z_indices
    from_side
    to_side
    field_id
end

and then a function

mpi_tag(hfe::HaloFillingEvent) = # number

@simone-silvestri
Collaborator Author

You're gonna want to implement a struct that's something like

struct HaloFillingEvent
    location
    z_indices
    from_side
    to_side
    field_id
end

and then a function

mpi_tag(hfe::HaloFillingEvent) = # number

I am not sure about this solution. The tag is used immediately (and only) where it is created, never recorded, and it is automatically discarded by MPI once the communication completes, so I do not see the utility of the extra step, or of saving something in memory. A function that, given architecture, location, and side, spits out a unique tag seems sufficient for interpretability, without having to record the output anywhere (it's a bit like a hash function: if you have the function and its inputs, you have everything you need).

@simone-silvestri
Collaborator Author

simone-silvestri commented Aug 20, 2024

Maybe (for debugging purposes) a function that reverses the hash, i.e. given the tag spits out the inputs, could be useful.
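
Something along these lines (a sketch that assumes the digit layout discussed above; the names are hypothetical):

# Hypothetical inverse of a tag assembled as
# tag = field_id * 1000 + loc_id * 10 + side_id
function untag(tag)
    side_id  = tag % 10
    loc_id   = (tag ÷ 10) % 100
    field_id = tag ÷ 1000
    return (; field_id, loc_id, side_id)
end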

@glwagner
Member

glwagner commented Aug 20, 2024

You're gonna want to implement a struct that's something like

struct HaloFillingEvent
    location
    z_indices
    from_side
    to_side
    field_id
end

and then a function

mpi_tag(hfe::HaloFillingEvent) = # number

I am not sure about this solution. The tag is used immediately (and only) where it is created, never recorded, and it is automatically discarded by MPI once the communication completes, so I do not see the utility of the extra step, or of saving something in memory. A function that, given architecture, location, and side, spits out a unique tag seems sufficient for interpretability, without having to record the output anywhere (it's a bit like a hash function: if you have the function and its inputs, you have everything you need).

There are a few purposes:

  • Make it easier for a future developer to understand the code by using words rather than a digit code
  • Make debugging possible
  • Replace this:
arch.active_requests[] += 1

with an actual list of the active requests rather than simply counting them --- e.g. push!(active_events, new_event) and pop!(active_events, finished_event), as sketched below.
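
A minimal sketch of that bookkeeping, reusing the HaloFillingEvent struct above (the field values are made up):

# Track the live events themselves instead of a bare counter
active_events = Set{Any}()

event = HaloFillingEvent((:Center, :Center, :Center), 1:16, :west, :east, 1)
push!(active_events, event)    # when a send/recv is posted
pop!(active_events, event)     # when the request completes
@assert isempty(active_events) # nothing left in flight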

The objective is not to write the minimal code that will work, but to create a system that is human understandable. Minimal functionality can be debugged and made to work once, but it will be very brittle: if it breaks, it could shut down the whole system and prevent future development.

Also active_requests used to be called mpi_tag and I don't understand the relationship between those two concepts.

@glwagner
Member

glwagner commented Aug 20, 2024

Maybe (for debugging purposes) a function that reverses the hash, i.e. given the tag spits out the inputs, could be useful.

Yes, but since the object I'm suggesting is even smaller in memory than an integer (almost zero in size, except for field_id and possibly z_indices), why would you skip the intermediate abstraction? You're thinking that an integer is "simple", but it's not.

@glwagner
Member

Here's an example of future development that could be needed. Right now the tag involves the z_indices for some reason. Maybe in the future someone will want to include x_indices and y_indices. Right now, the code that creates the tag is ad hoc. To change it, we have to dive into this metaprogrammed function, which is not nice:

function $send_tag_fn_name(arch, grid, location)
    field_id   = string(arch.active_requests[], pad=ID_DIGITS)
    loc_digit  = string(loc_id(location...), pad=ID_DIGITS)
    side_digit = string(side_id[Symbol($side_str)])
    return parse(Int, field_id * loc_digit * side_digit)
end

we want something that is more like

event = HaloFillingEvent(field, side)
id = mpi_tag(event)

and then the event is used internally / in Julia, while id only has to do with MPI.

Now the code is more organized.

PS why are arch and grid passed into send_tag_fn_name?

@simone-silvestri
Collaborator Author

From what I understand, you are suggesting this

struct HaloFillingEvent
    location
    z_indices
    from_side
    to_side
    field_id
end

and then a function

mpi_tag(hfe::HaloFillingEvent) = # number

at the moment, the step is just this

mpi_side_tag(arch, location) = # number

The number is going to be the same, so the interpretability of that tag is not going to improve with the above solution.
The mpi_tag was indeed a misnomer, and I was about to write here that we should change its name because it can create confusion (let me revert it to what it was so we can decide what to call it). That mpi_tag is what you called field_id: a counter that keeps track of how many communications are active and is reset after communication is complete.

@glwagner
Member

If you can demonstrate that this code can be implemented in a way that encourages debugging, inspection at the REPL, etc., without the abstraction, then that can be accepted. I just think that creating the abstraction is not only helpful for future people but will also help you organize your ideas. I would start with it and, if in the end you find it's not useful, eliminate it. But I wouldn't start by designing the code without it and "seeing what happens". The objective here is not simply to "make things work".

@glwagner
Member

glwagner commented Aug 20, 2024

From what I understand, you are suggesting this

struct HaloFillingEvent
    location
    z_indices
    from_side
    to_side
    field_id
end

and then a function

mpi_tag(hfe::HaloFillingEvent) = # number

at the moment, the step is just this

mpi_side_tag(arch, location) = # number

The number is going to be the same, so the interpretability of that tag is not going to improve with the above solution. The mpi_tag was indeed a misnomer, and I was about to write here that we should change its name because it can create confusion (let me revert it to what it was so we can decide what to call it). That mpi_tag is what you called field_id: a counter that keeps track of how many communications are active and is reset after communication is complete.

The fact of the matter is that the code is hard to interpret right now, so you have not demonstrated that the abstractions don't need a bit more work. I'm suggesting one solution, but I agree that other strategies might succeed too.

@glwagner
Member

an active counter that keeps track of how many communications are active

Why would you not simply keep track of the communications themselves, rather than just their number? There's no cost to keeping a list of the events versus only a count of the active ones, so I don't understand why you throw away that information.

@simone-silvestri
Collaborator Author

We could do that; in fact, the requests are stored in the arch.mpi_requests vector. The problem with using the length of that vector to infer a field_id is that it is not unique across ranks. For example, a slab decomposition on a bounded domain leads to 2 requests on a rank adjacent to a boundary (RightConnected or LeftConnected) and 4 on ranks connected on both sides. With the current scheme, the counter is incremented globally when one field is sent, the side is inferred from the fill_halo_side! routine, and this, combined with the location, leads to a unique tag.

What about changing the mpi_tag name to active_comm_counter and adding an inverse function to extract the inputs from the tag? Then it is quite easy to inspect the tag and understand what communication we are looking at.

@glwagner
Member

glwagner commented Aug 20, 2024

We could do that; in fact, the requests are stored in the arch.mpi_requests vector. The problem with using the length of that vector to infer a field_id is that it is not unique across ranks. For example, a slab decomposition on a bounded domain leads to 2 requests on a rank adjacent to a boundary (RightConnected or LeftConnected) and 4 on ranks connected on both sides. With the current scheme, the counter is incremented globally when one field is sent, the side is inferred from the fill_halo_side! routine, and this, combined with the location, leads to a unique tag.

What about changing the mpi_tag name to active_comm_counter and adding an inverse function to extract the inputs from the tag? Then it is quite easy to inspect the tag and understand what communication we are looking at.

I don't follow. You have all of that information if you simply store the HaloFillingEvents in a vector. Maybe you can't use the length, but surely you can compute the number needed, which is currently stored as a "counter".

You can also store the pointer to the field itself rather than an "id" / number.

@glwagner
Member

What's the status of this? Is this the cause of what @kburns saw?

Member

@glwagner left a comment

I still find the distributed algorithm hard to understand, but we should merge this if it's a bugfix.

@simone-silvestri
Collaborator Author

What's the status of this? Is this the cause of what @kburns saw?

This ensures that, in fringe cases where only one rank updates the halos of a field, like

if rank == 0
    fill_halo_regions!(f; only_local_halos=true)
end

rank 0 does not end up being out of sync with the other ranks. This is not something that would happen in a simulation, but it is still a protection against these kinds of situations.
