Bugfix in distributed fill halos #3714

Open
wants to merge 36 commits into main
Conversation

@simone-silvestri
Collaborator

commented Aug 16, 2024

To asynchronously fill the halos of distributed fields, the code uses an incremental counter to track how many MPI requests are live and to update the MPI send and receive tags. The counter is reset when communication is synchronized.

As currently defined, the counter is always incremented at the end of a fill_halo_regions! on a distributed grid, irrespective of what happened inside that fill_halo_regions!, on the assumption that all ranks participate in the call and the counters therefore stay synchronized.

# Switch to the next field to send
arch.mpi_tag[] += 1
return nothing
end

Unfortunately, I ran into a situation where this assumption does not hold.
I wanted to do different things on different ranks, which is allowed when using the only_local_halos = true keyword argument (a very rare occurrence, but a possibility nonetheless).

For example, if we execute this code on the main branch

using Oceananigans
using Oceananigans.DistributedComputations
using Oceananigans.BoundaryConditions: fill_halo_regions!

# run under MPI with more than one rank, e.g. `mpiexec -n 2 julia script.jl`
arch = Distributed(CPU())
grid = RectilinearGrid(arch, size = (2, 2, 1), extent = (1, 1, 1))
c = CenterField(grid)

if arch.local_rank == 0
    fill_halo_regions!(c; only_local_halos = true)
end

The mpi_tag will then be 1 on rank 0 and 0 on the other ranks. This means that in subsequent halo passes MPI will stall, because the tags of matching send and receive operations between rank 0 and its neighbors no longer agree.

This PR fixes the issue by incrementing the counter only if we have actually launched an MPI send or receive operation, which happens when at least one of the boundary conditions is a distributed boundary condition and only_local_halos == false.
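
In pseudocode, the guard looks roughly like this (a sketch, not the literal diff; requires_communication is a hypothetical stand-in for "at least one boundary condition is a distributed communication boundary condition"):

# Advance the tag counter only when MPI operations were actually posted
# (`requires_communication` is a hypothetical helper; see the description above)
if !only_local_halos && requires_communication(boundary_conditions)
    arch.mpi_tag[] += 1
end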

@navidcy added the bug 🐞 Even a perfect program still has bugs label on Aug 16, 2024
@glwagner
Member

Why is a counter the best solution? This seems implicit and indirect. Can we write the code to be more obvious and direct? Where is mpi_tag used?

@glwagner
Member

Maybe more specifically, I don't understand the motivation for this:

# Define functions that return unique send and recv MPI tags for each side.
# It's an integer where
# digit 1-2: an identifier for the field that is reset each timestep
# digit 3: an identifier for the field's Z-location
# digit 4: the side we send to/receive from

This is a kind of abstraction for a send/recv event, but it's super implicit, consisting just of integers rather than recording this information as symbols or strings (is a number the only way to generate a unique ID for a field?). I can't figure out why we would record the Z-location; this seems arbitrary. Why not record the 3D location directly (not as a digit)? Otherwise this code is seemingly more specific than it needs to be. It would also be annoying to debug, since you'd constantly have to be computing these codes.

@simone-silvestri
Collaborator Author

simone-silvestri commented Aug 19, 2024

The MPI tag must be an integer. The maximum value is vendor dependent, but it is quite strict: the smallest guaranteed maximum is 32767.

We could probably encode the whole location without running into the integer size limit. We cannot do
field_id * 100000 + loc_x * 10000 + loc_y * 1000 + loc_z * 100 + from_side * 10 + to_side because the tag could exceed 32767.
However, we could map the three location components to a single number with a dictionary:

(Center, Center, Center) -> 0
(Center, Center, Face) -> 1
(Center, Face, Center) -> 2
(Face, Center, Center) -> 3
(Face, Face, Center) -> 4
(Center, Face, Face) -> 5
(Face, Center, Face) -> 6
(Center, Center, Nothing) -> 7
(Face, Center, Nothing) -> 8
(Center, Face, Nothing) -> 9
...

If all the permutations fit into 99 values, we consume only two digits, which would probably fit within the limits.

We can also compress the from_side and to_side into only one digit because we have

west to east -> 0
east to west -> 1
south to north -> 2
north to south -> 3
top to bottom -> 4
bottom to top -> 5
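
Concretely, the tag could then be assembled along these lines (a sketch; location_id and side_pair_id are hypothetical lookup tables implementing the mappings above):

# With two digits for the location (0-26) and one for the side pair (0-5),
# tags stay below 32767 for up to 32 simultaneously communicating fields.
mpi_tag(field_counter, loc, side_pair) =
    field_counter * 1000 + location_id[loc] * 10 + side_pair_id[side_pair]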

@glwagner
Member

glwagner commented Aug 19, 2024

The MPI tag must be an integer.

But why does the ID for the process in Julia have to be an integer? It's only MPI that needs an integer --- not us. You can distinguish between the problem of creating an abstraction for a process and the problem of generating an integer tag from that abstraction. Here we skip the intermediate step of a human-understandable abstraction.

I'm not suggesting we use a different process for creating an MPI tag. The point is that we have two distinct problems that are being conflated into one:

  1. Create a schema for identifying fill halo events
  2. Generate MPI tags from fill halo events

We should do 1 separately from 2, rather than combining them into one thing that is pretty hard to understand.

@glwagner
Member

(Center, Center, Center) -> 0
(Center, Center, Face) -> 1
(Center, Face, Center) -> 2
(Face, Center, Center) -> 3
(Face, Face, Center) -> 4
(Center, Face, Face) -> 5
(Face, Center, Face) -> 6
(Center, Center, Nothing) -> 7
(Face, Center, Nothing) -> 8
(Center, Face, Nothing) -> 9
...

If all the permutations fit into 99 values, we consume only two digits, which would probably fit within the limits.

We can also compress the from_side and to_side into only one digit because we have

west to east -> 0
east to west -> 1
south to north -> 2
north to south -> 3
top to bottom -> 4
bottom to top -> 5

I think this is fine and could be encapsulated in some function generate_MPI_tag_from_fill_halo_ID.

But first we need to create the concept of the "ID" for a fill halo event. The rest is detail.

Comment on lines 46 to 54
location_counter = 0
for LX in (:Face, :Center, :Nothing)
    for LY in (:Face, :Center, :Nothing)
        for LZ in (:Face, :Center, :Nothing)
            @eval loc_id(::$LX, ::$LY, ::$LZ) = $location_counter
            global location_counter += 1
        end
    end
end
Member

Suggested change
location_counter = 0
for LX in (:Face, :Center, :Nothing)
    for LY in (:Face, :Center, :Nothing)
        for LZ in (:Face, :Center, :Nothing)
            @eval loc_id(::$LX, ::$LY, ::$LZ) = $location_counter
            global location_counter += 1
        end
    end
end
loc_id(lx, ly, lz) = loc_id(lx) + loc_id(ly) + loc_id(lz)

Member

But why don't you reserve a digit for each location in xyz rather than adding them together? like

Nothing, Nothing, Nothing = "000"
Center, Nothing, Nothing = "100"
Center, Nothing, Face = "102"

etc

Member

Oh I see, the integer has to be small

Collaborator Author

This does not lead to a unique identifier for each location combination

Member

Wait, there are 27 locations. You need 2 digits, not 1.

Member

Ah right, so the problem is equivalent to converting a number from base-3 to base-10, eg

loc_id(lx, ly, lz) = loc_id(lx) + 3 * loc_id(ly) + 9 * loc_id(lz)

This counts from 0 ("000") up to 26 ("222")
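
For illustration, a hypothetical inverse recovering the three per-dimension ids:

# Inverse of the base-3 encoding above (each id is 0, 1, or 2)
decode_loc_id(id) = (id % 3, (id ÷ 3) % 3, id ÷ 9)

# Round trip: ids (2, 1, 0) -> 2 + 3*1 + 9*0 = 5 -> (2, 1, 0)
@assert decode_loc_id(5) == (2, 1, 0)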

@glwagner
Member

glwagner commented Aug 20, 2024

You're gonna want to implement a struct that's something like

struct HaloFillingEvent
    location
    z_indices
    from_side
    to_side
    field_id
end

and then a function

mpi_tag(hfe::HaloFillingEvent) = # number

@simone-silvestri
Collaborator Author

You're gonna want to implement a struct that's something like

struct HaloFillingEvent
    location
    z_indices
    from_side
    to_side
    field_id
end

and then a function

mpi_tag(hfe::HaloFillingEvent) = # number

I am not sure about this solution. The tag is used immediately (and only) where it is created, never recorded, and it is automatically discarded by MPI once the communication completes, so I do not see the utility of the extra step, or of saving something in memory. A function that, given architecture, location, and side, spits out a unique tag seems sufficient for interpretability, without having to record the output anywhere (it's a bit like a hash function: if you have the function and its inputs, you have everything you need).

@simone-silvestri
Collaborator Author

simone-silvestri commented Aug 20, 2024

Maybe (for debugging purposes) a function that reverses the hash, i.e. given the tag spits out the inputs, could be useful.
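
Something along these lines (a sketch that assumes the digit layout discussed above; the names are hypothetical):

# Hypothetical inverse of a tag assembled as
# tag = field_id * 1000 + loc_id * 10 + side_id
function untag(tag)
    side_id  = tag % 10
    loc_id   = (tag ÷ 10) % 100
    field_id = tag ÷ 1000
    return (; field_id, loc_id, side_id)
end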

@glwagner
Member

glwagner commented Aug 20, 2024

You're gonna want to implement a struct that's something like

struct HaloFillingEvent
    location
    z_indices
    from_side
    to_side
    field_id
end

and then a function

mpi_tag(hfe::HaloFillingEvent) = # number

I am not sure about this solution. The tag is used immediately (and only) where it is created, never recorded, and it is automatically discarded by MPI once the communication completes, so I do not see the utility of the extra step, or of saving something in memory. A function that, given architecture, location, and side, spits out a unique tag seems sufficient for interpretability, without having to record the output anywhere (it's a bit like a hash function: if you have the function and its inputs, you have everything you need).

There are a few purposes:

  • Make it easier for a future developer to understand the code by using words rather than a digit code
  • Make debugging possible
  • Replace this:
arch.active_requests[] += 1

with an actual list of the active requests rather than simply counting them --- e.g. push!(active_events, new_event) and pop!(active_events, finished_event), as sketched below.
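
A minimal sketch of that bookkeeping, reusing the HaloFillingEvent struct above (the field values are made up):

# Track the live events themselves instead of a bare counter
active_events = Set{Any}()

event = HaloFillingEvent((:Center, :Center, :Center), 1:16, :west, :east, 1)
push!(active_events, event)    # when a send/recv is posted
pop!(active_events, event)     # when the request completes
@assert isempty(active_events) # nothing left in flight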

The objective is not to write the minimal code that will work, but to create a system that is human understandable. Minimal functionality can be debugged and made to work once, but it will be very brittle: if it breaks, it could shut down the whole system and prevent future development.

Also active_requests used to be called mpi_tag and I don't understand the relationship between those two concepts.

@glwagner
Member

glwagner commented Aug 20, 2024

Maybe (for debugging purposes) a function that reverses the hash, i.e. given the tag spits out the inputs, could be useful.

Yes, but since the object I'm suggesting is even smaller in memory than an integer (almost zero in size, except for field_id and possibly z_indices), why would you skip the intermediate abstraction? You're thinking that an integer is "simple", but it's not.

@glwagner
Member

Here's an example of future development that could be needed. Right now the tag involves the z_indices for some reason. Maybe in the future someone will want to include x_indices and y_indices. Right now, the code that creates the tag is ad hoc. To change it, we have to dive into this metaprogrammed function, which is not nice:

function $send_tag_fn_name(arch, grid, location)
    field_id   = string(arch.active_requests[], pad=ID_DIGITS)
    loc_digit  = string(loc_id(location...), pad=ID_DIGITS)
    side_digit = string(side_id[Symbol($side_str)])
    return parse(Int, field_id * loc_digit * side_digit)
end

we want something that is more like

event = HaloFillingEvent(field, side)
id = mpi_tag(event)

and then the event is used internally / in Julia, while id only has to do with MPI.

Now the code is more organized.

PS why are arch and grid passed into send_tag_fn_name?

@simone-silvestri
Collaborator Author

From what I understand, you are suggesting this

struct HaloFillingEvent
    location
    z_indices
    from_side
    to_side
    field_id
end

and then a function

mpi_tag(hfe::HaloFillingEvent) = # number

at the moment, the step is just this

mpi_side_tag(arch, location) = # number

The number is going to be the same, so the interpretability of that tag is not going to improve with the above solution.
The mpi_tag was indeed a misnomer, and I was about to write here that we should change its name because it can create confusion (let me revert it to what it was so we can decide what to call it). That mpi_tag is what you called field_id: a counter that keeps track of how many communications are active and is reset after communication is complete.

@glwagner
Member

If you can demonstrate that this code can be implemented in a way that encourages debugging, inspection at the REPL, etc., without the abstraction, then that can be accepted. I just think that creating the abstraction is not only helpful for future people but will also help you organize your ideas. I would start with it and, if in the end you find it's not useful, eliminate it. But I wouldn't start by designing the code without it and "seeing what happens". The objective here is not simply to "make things work".

@glwagner
Member

glwagner commented Aug 20, 2024

From what I understand, you are suggesting this

struct HaloFillingEvent
    location
    z_indices
    from_side
    to_side
    field_id
end

and then a function

mpi_tag(hfe::HaloFillingEvent) = # number

at the moment, the step is just this

mpi_side_tag(arch, location) = # number

The number is going to be the same, so the interpretability of that tag is not going to improve with the above solution. The mpi_tag was indeed a misnomer, and I was about to write here that we should change its name because it can create confusion (let me revert it to what it was so we can decide what to call it). That mpi_tag is what you called field_id: a counter that keeps track of how many communications are active and is reset after communication is complete.

The fact of the matter is that the code is hard to interpret right now, so you have not demonstrated that the abstractions don't need a bit more work. I'm suggesting one solution, but I agree that other strategies might succeed too.

@glwagner
Member

an active counter that keeps track of how many communications are active

Why would you not simply keep track of the communications themselves, rather than just their number? There's no cost to keeping a list of the events versus only a count of the active ones, so I don't understand why you throw away that information.

@simone-silvestri
Collaborator Author

We could do that; in fact, the requests are stored in the arch.mpi_requests vector. The problem with using the length of that vector to infer a field_id is that it is not unique across ranks. For example, a slab decomposition on a bounded domain leads to 2 requests on a rank adjacent to a boundary (RightConnected or LeftConnected) and 4 on ranks connected on both sides. With the current scheme, the counter is incremented globally when one field is sent, the side is inferred from the fill_halo_side! routine, and this, combined with the location, leads to a unique tag.

What about changing the mpi_tag name to active_comm_counter and adding an inverse function to extract the inputs from the tag? Then it is quite easy to inspect the tag and understand what communication we are looking at.

@glwagner
Member

glwagner commented Aug 20, 2024

We could do that; in fact, the requests are stored in the arch.mpi_requests vector. The problem with using the length of that vector to infer a field_id is that it is not unique across ranks. For example, a slab decomposition on a bounded domain leads to 2 requests on a rank adjacent to a boundary (RightConnected or LeftConnected) and 4 on ranks connected on both sides. With the current scheme, the counter is incremented globally when one field is sent, the side is inferred from the fill_halo_side! routine, and this, combined with the location, leads to a unique tag.

What about changing the mpi_tag name to active_comm_counter and adding an inverse function to extract the inputs from the tag? Then it is quite easy to inspect the tag and understand what communication we are looking at.

I don't follow. You have all of that information if you simply store the HaloFillingEvents in a vector. Maybe you can't use the length, but surely you can compute the number needed, which is currently stored as a "counter".

You can also store the pointer to the field itself rather than an "id" / number.

@glwagner
Member

What's the status of this? Is this the cause of what @kburns saw?

Member

@glwagner left a comment

I still find the distributed algorithm hard to understand, but we should merge this if it's a bugfix.

@simone-silvestri
Collaborator Author

What's the status of this? Is this the cause of what @kburns saw?

This ensures that, in fringe cases where only one rank updates the halos of a field, like

if rank == 0
    fill_halo_regions!(f; only_local_halos=true)
end

rank 0 does not end up being out of sync with the other ranks. This is not something that would happen in a simulation, but it is still a protection against these kinds of situations.
