thread = True() as default for the time integration? #2283

Open
efaulhaber opened this issue Feb 13, 2025 · 3 comments
Labels: discussion, parallelization, performance

Comments

@efaulhaber
Member

Following our Slack discussion:
thread = True() makes the broadcasting in the time integration significantly faster on multiple threads, but it's not used in the elixirs.
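For context, "used in the elixirs" means passing the keyword to the time integration algorithm explicitly, roughly like this (a sketch; the concrete solver, step size, and callbacks are just the usual elixir placeholders, and ode/callbacks are assumed to be set up earlier in the elixir):

# Sketch: enabling threaded broadcasting in the time integration explicitly.
sol = solve(ode,
            CarpenterKennedy2N54(williamson_condition = false,
                                 thread = OrdinaryDiffEq.True());
            dt = 1.0, # overwritten by the stepsize callback
            ode_default_options()..., callback = callbacks)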
Without multithreaded time integration:

julia> trixi_include("examples/tree_2d_dgsem/elixir_euler_source_terms.jl", initial_refinement_level=6, save_solution=nothing)

───────────────────────────────────────────────────────────────────────────────────
            Trixi.jl                      Time                    Allocations      
                                 ───────────────────────   ────────────────────────
        Tot / % measured:             4.68s /  94.0%           12.6MiB /  82.4%    

Section                  ncalls     time    %tot     avg     alloc    %tot      avg
───────────────────────────────────────────────────────────────────────────────────
perform step                914    4.26s   96.6%  4.66ms   7.61MiB   73.4%  8.52KiB
  ~perform step~            914    2.71s   61.5%  2.96ms      672B    0.0%    0.74B
  rhs!                    4.57k    1.55s   35.1%   339μs   7.61MiB   73.4%  1.70KiB
    reset ∂u/∂t           4.57k    489ms   11.1%   107μs     0.00B    0.0%    0.00B
    volume integral       4.57k    441ms   10.0%  96.5μs   1.32MiB   12.8%     304B
    source terms          4.57k    212ms    4.8%  46.5μs   1.26MiB   12.1%     288B
    interface flux        4.57k    120ms    2.7%  26.3μs   1.46MiB   14.1%     336B
    prolong2interfaces    4.57k    107ms    2.4%  23.4μs   1.32MiB   12.8%     304B
    surface integral      4.57k   72.0ms    1.6%  15.8μs   1.19MiB   11.4%     272B
    ~rhs!~                4.57k   70.0ms    1.6%  15.3μs   4.78KiB    0.0%    1.07B
    Jacobian              4.57k   34.6ms    0.8%  7.58μs   1.05MiB   10.1%     240B
    prolong2mortars       4.57k    327μs    0.0%  71.6ns     0.00B    0.0%    0.00B
    prolong2boundaries    4.57k    258μs    0.0%  56.4ns     0.00B    0.0%    0.00B
    mortar flux           4.57k    250μs    0.0%  54.7ns     0.00B    0.0%    0.00B
    boundary flux         4.57k    139μs    0.0%  30.4ns     0.00B    0.0%    0.00B
calculate dt                915   97.5ms    2.2%   107μs    257KiB    2.4%     288B
analyze solution             11   50.1ms    1.1%  4.55ms   2.50MiB   24.1%   233KiB

With thread = True():

julia> trixi_include("examples/tree_2d_dgsem/elixir_euler_source_terms.jl", initial_refinement_level=6, save_solution=nothing)

────────────────────────────────────────────────────────────────────────────────────────────────────
Trixi.jl simulation finished.  Final time: 2.0  Time steps: 914 (accepted), 914 (total)
────────────────────────────────────────────────────────────────────────────────────────────────────

───────────────────────────────────────────────────────────────────────────────────
            Trixi.jl                      Time                    Allocations      
                                 ───────────────────────   ────────────────────────
        Tot / % measured:             1.34s /  77.7%           12.6MiB /  82.4%    

Section                  ncalls     time    %tot     avg     alloc    %tot      avg
───────────────────────────────────────────────────────────────────────────────────
perform step                914    986ms   94.5%  1.08ms   7.61MiB   73.4%  8.52KiB
  rhs!                    4.57k    793ms   76.0%   174μs   7.61MiB   73.4%  1.70KiB
    volume integral       4.57k    206ms   19.8%  45.2μs   1.32MiB   12.8%     304B
    source terms          4.57k    186ms   17.8%  40.8μs   1.26MiB   12.1%     288B
    interface flux        4.57k    134ms   12.9%  29.4μs   1.46MiB   14.1%     336B
    prolong2interfaces    4.57k   86.2ms    8.3%  18.9μs   1.32MiB   12.8%     304B
    ~rhs!~                4.57k   69.1ms    6.6%  15.1μs   4.78KiB    0.0%    1.07B
    surface integral      4.57k   62.2ms    6.0%  13.6μs   1.19MiB   11.4%     272B
    reset ∂u/∂t           4.57k   24.9ms    2.4%  5.44μs     0.00B    0.0%    0.00B
    Jacobian              4.57k   22.7ms    2.2%  4.96μs   1.05MiB   10.1%     240B
    mortar flux           4.57k    222μs    0.0%  48.5ns     0.00B    0.0%    0.00B
    prolong2mortars       4.57k    205μs    0.0%  44.9ns     0.00B    0.0%    0.00B
    prolong2boundaries    4.57k    197μs    0.0%  43.1ns     0.00B    0.0%    0.00B
    boundary flux         4.57k    140μs    0.0%  30.6ns     0.00B    0.0%    0.00B
  ~perform step~            914    193ms   18.5%   211μs      672B    0.0%    0.74B
analyze solution             11   48.4ms    4.6%  4.40ms   2.50MiB   24.1%   233KiB
calculate dt                915   9.15ms    0.9%  10.0μs    257KiB    2.4%     288B

As @JoshuaLampert pointed out:

If we decide to use thread = True() by default, we could also add it to the ode_default_options instead of putting them manually into each elixir

As @ranocha pointed out:

And [there could be] issues when switching to Base threads, since OrdinaryDiffEq.jl will still use Polyester.jl.
We should check these aspects before switching everything.

For my multi-GPU prototype of TrixiParticles, I built a custom wrapper data type with a custom broadcasting style to make sure that all broadcasting in the time integration itself is also done on multiple GPUs with @threaded.
I can easily adapt this to have a ThreadedBroadcastArray data type that defines broadcasting with Trixi.@threaded. This way, all time integration schemes are automatically multithreaded in the same way as the rest of Trixi, even if (when?) we move away from Polyester.jl.
We only need to change

u0_ode = compute_coefficients(first(tspan), semi)

to

u0_ode_ = compute_coefficients(first(tspan), semi)
u0_ode = ThreadedBroadcastArray(u0_ode_)
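
A minimal standalone sketch of what such a wrapper could look like (names and the exact method set are illustrative, not the final design; Threads.@threads stands in for Trixi.@threaded so the snippet runs without Trixi):

using Base.Broadcast: Broadcasted, ArrayStyle, flatten

# Illustrative sketch only: a wrapper array whose broadcasts are materialized
# with a threaded loop. A real implementation needs a few more methods
# (copy, view handling, etc.) to cover everything OrdinaryDiffEq.jl does
# with the state vector.
struct ThreadedBroadcastArray{T, N, A <: AbstractArray{T, N}} <: AbstractArray{T, N}
    parent::A
end

Base.parent(u::ThreadedBroadcastArray) = u.parent
Base.size(u::ThreadedBroadcastArray) = size(parent(u))
Base.getindex(u::ThreadedBroadcastArray, i...) = getindex(parent(u), i...)
Base.setindex!(u::ThreadedBroadcastArray, x, i...) = setindex!(parent(u), x, i...)

# Keep the wrapper type when the integrator allocates caches via `similar`.
function Base.similar(u::ThreadedBroadcastArray, ::Type{T}, dims::Dims) where {T}
    ThreadedBroadcastArray(similar(parent(u), T, dims))
end

# Custom broadcast style so that any broadcast involving the wrapper
# dispatches to the threaded `copyto!` below.
Base.BroadcastStyle(::Type{<:ThreadedBroadcastArray}) = ArrayStyle{ThreadedBroadcastArray}()

function Base.similar(bc::Broadcasted{ArrayStyle{ThreadedBroadcastArray}}, ::Type{T}) where {T}
    ThreadedBroadcastArray(similar(Array{T}, axes(bc)))
end

# Materialize broadcasts with a parallel loop instead of the serial fallback.
function Base.copyto!(dest::ThreadedBroadcastArray,
                      bc::Broadcasted{ArrayStyle{ThreadedBroadcastArray}})
    bc = flatten(bc)
    Threads.@threads for I in CartesianIndices(parent(dest))
        @inbounds dest[I] = bc[I]
    end
    return dest
end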

Would this be desired in Trixi.jl? @ranocha @sloede

@efaulhaber added the discussion, parallelization, and performance labels on Feb 13, 2025
@JoshuaLampert
Member

From an old discussion: #1108 (comment).

@efaulhaber
Member Author

That's another advantage of the custom array type. It works even with methods that don't have a kwarg for multithreading.
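
As a quick illustration of that point (continuing the hypothetical sketch above; note that no thread kwarg appears anywhere):

# Broadcast updates of the kind integrators perform internally
# (e.g. u .= uprev .+ dt .* k) dispatch to the threaded copyto!
# from the sketch above, independent of the algorithm's keyword arguments.
u  = ThreadedBroadcastArray(rand(10^6))
k  = ThreadedBroadcastArray(rand(10^6))
du = similar(u)        # stays a ThreadedBroadcastArray
@. du = u + 0.5 * k    # materialized by the threaded copyto!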

@efaulhaber
Member Author

I implemented the custom array type in #2284.
