thread = True() as default for the time integration? #2283

Open
efaulhaber opened this issue Feb 13, 2025 · 3 comments
Labels: discussion, parallelization, performance

Comments

@efaulhaber
Member

Following our Slack discussion:
thread = True() makes the broadcasting in the time integration significantly faster on multiple threads, but it's not used in the elixirs.
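For context, "used in the elixirs" means passing the keyword to the time integration algorithm explicitly, roughly like this (a sketch; the concrete solver, step size, and callbacks are just the usual elixir placeholders, and ode/callbacks are assumed to be set up earlier in the elixir):

# Sketch: enabling threaded broadcasting in the time integration explicitly.
sol = solve(ode,
            CarpenterKennedy2N54(williamson_condition = false,
                                 thread = OrdinaryDiffEq.True());
            dt = 1.0, # overwritten by the stepsize callback
            ode_default_options()..., callback = callbacks)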
Without multithreaded time integration:

julia> trixi_include("examples/tree_2d_dgsem/elixir_euler_source_terms.jl", initial_refinement_level=6, save_solution=nothing)

───────────────────────────────────────────────────────────────────────────────────
            Trixi.jl                      Time                    Allocations      
                                 ───────────────────────   ────────────────────────
        Tot / % measured:             4.68s /  94.0%           12.6MiB /  82.4%    

Section                  ncalls     time    %tot     avg     alloc    %tot      avg
───────────────────────────────────────────────────────────────────────────────────
perform step                914    4.26s   96.6%  4.66ms   7.61MiB   73.4%  8.52KiB
  ~perform step~            914    2.71s   61.5%  2.96ms      672B    0.0%    0.74B
  rhs!                    4.57k    1.55s   35.1%   339μs   7.61MiB   73.4%  1.70KiB
    reset ∂u/∂t           4.57k    489ms   11.1%   107μs     0.00B    0.0%    0.00B
    volume integral       4.57k    441ms   10.0%  96.5μs   1.32MiB   12.8%     304B
    source terms          4.57k    212ms    4.8%  46.5μs   1.26MiB   12.1%     288B
    interface flux        4.57k    120ms    2.7%  26.3μs   1.46MiB   14.1%     336B
    prolong2interfaces    4.57k    107ms    2.4%  23.4μs   1.32MiB   12.8%     304B
    surface integral      4.57k   72.0ms    1.6%  15.8μs   1.19MiB   11.4%     272B
    ~rhs!~                4.57k   70.0ms    1.6%  15.3μs   4.78KiB    0.0%    1.07B
    Jacobian              4.57k   34.6ms    0.8%  7.58μs   1.05MiB   10.1%     240B
    prolong2mortars       4.57k    327μs    0.0%  71.6ns     0.00B    0.0%    0.00B
    prolong2boundaries    4.57k    258μs    0.0%  56.4ns     0.00B    0.0%    0.00B
    mortar flux           4.57k    250μs    0.0%  54.7ns     0.00B    0.0%    0.00B
    boundary flux         4.57k    139μs    0.0%  30.4ns     0.00B    0.0%    0.00B
calculate dt                915   97.5ms    2.2%   107μs    257KiB    2.4%     288B
analyze solution             11   50.1ms    1.1%  4.55ms   2.50MiB   24.1%   233KiB

With thread = True():

julia> trixi_include("examples/tree_2d_dgsem/elixir_euler_source_terms.jl", initial_refinement_level=6, save_solution=nothing)

────────────────────────────────────────────────────────────────────────────────────────────────────
Trixi.jl simulation finished.  Final time: 2.0  Time steps: 914 (accepted), 914 (total)
────────────────────────────────────────────────────────────────────────────────────────────────────

───────────────────────────────────────────────────────────────────────────────────
            Trixi.jl                      Time                    Allocations      
                                 ───────────────────────   ────────────────────────
        Tot / % measured:             1.34s /  77.7%           12.6MiB /  82.4%    

Section                  ncalls     time    %tot     avg     alloc    %tot      avg
───────────────────────────────────────────────────────────────────────────────────
perform step                914    986ms   94.5%  1.08ms   7.61MiB   73.4%  8.52KiB
  rhs!                    4.57k    793ms   76.0%   174μs   7.61MiB   73.4%  1.70KiB
    volume integral       4.57k    206ms   19.8%  45.2μs   1.32MiB   12.8%     304B
    source terms          4.57k    186ms   17.8%  40.8μs   1.26MiB   12.1%     288B
    interface flux        4.57k    134ms   12.9%  29.4μs   1.46MiB   14.1%     336B
    prolong2interfaces    4.57k   86.2ms    8.3%  18.9μs   1.32MiB   12.8%     304B
    ~rhs!~                4.57k   69.1ms    6.6%  15.1μs   4.78KiB    0.0%    1.07B
    surface integral      4.57k   62.2ms    6.0%  13.6μs   1.19MiB   11.4%     272B
    reset ∂u/∂t           4.57k   24.9ms    2.4%  5.44μs     0.00B    0.0%    0.00B
    Jacobian              4.57k   22.7ms    2.2%  4.96μs   1.05MiB   10.1%     240B
    mortar flux           4.57k    222μs    0.0%  48.5ns     0.00B    0.0%    0.00B
    prolong2mortars       4.57k    205μs    0.0%  44.9ns     0.00B    0.0%    0.00B
    prolong2boundaries    4.57k    197μs    0.0%  43.1ns     0.00B    0.0%    0.00B
    boundary flux         4.57k    140μs    0.0%  30.6ns     0.00B    0.0%    0.00B
  ~perform step~            914    193ms   18.5%   211μs      672B    0.0%    0.74B
analyze solution             11   48.4ms    4.6%  4.40ms   2.50MiB   24.1%   233KiB
calculate dt                915   9.15ms    0.9%  10.0μs    257KiB    2.4%     288B

As @JoshuaLampert pointed out:

If we decide to use thread = True() by default, we could also add it to the ode_default_options instead of putting them manually into each elixir

As @ranocha pointed out:

And [there could be] issues when switching to Base threads, since OrdinaryDiffEq.jl will still use Polyester.jl.
We should check these aspects before switching everything.

For my multi-GPU prototype of TrixiParticles, I built a custom wrapper data type with a custom broadcasting style to make sure that all broadcasting in the time integration itself is also done on multiple GPUs with @threaded.
I can easily adapt this to have a ThreadedBroadcastArray data type that defines broadcasting with Trixi.@threaded. This way, all time integration schemes are automatically multithreaded in the same way as the rest of Trixi, even if (when?) we move away from Polyester.jl.
We only need to change

u0_ode = compute_coefficients(first(tspan), semi)

to

u0_ode_ = compute_coefficients(first(tspan), semi)
u0_ode = ThreadedBroadcastArray(u0_ode_)
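
A minimal standalone sketch of what such a wrapper could look like (names and the exact method set are illustrative, not the final design; Threads.@threads stands in for Trixi.@threaded so the snippet runs without Trixi):

using Base.Broadcast: Broadcasted, ArrayStyle, flatten

# Illustrative sketch only: a wrapper array whose broadcasts are materialized
# with a threaded loop. A real implementation needs a few more methods
# (copy, view handling, etc.) to cover everything OrdinaryDiffEq.jl does
# with the state vector.
struct ThreadedBroadcastArray{T, N, A <: AbstractArray{T, N}} <: AbstractArray{T, N}
    parent::A
end

Base.parent(u::ThreadedBroadcastArray) = u.parent
Base.size(u::ThreadedBroadcastArray) = size(parent(u))
Base.getindex(u::ThreadedBroadcastArray, i...) = getindex(parent(u), i...)
Base.setindex!(u::ThreadedBroadcastArray, x, i...) = setindex!(parent(u), x, i...)

# Keep the wrapper type when the integrator allocates caches via `similar`.
function Base.similar(u::ThreadedBroadcastArray, ::Type{T}, dims::Dims) where {T}
    ThreadedBroadcastArray(similar(parent(u), T, dims))
end

# Custom broadcast style so that any broadcast involving the wrapper
# dispatches to the threaded `copyto!` below.
Base.BroadcastStyle(::Type{<:ThreadedBroadcastArray}) = ArrayStyle{ThreadedBroadcastArray}()

function Base.similar(bc::Broadcasted{ArrayStyle{ThreadedBroadcastArray}}, ::Type{T}) where {T}
    ThreadedBroadcastArray(similar(Array{T}, axes(bc)))
end

# Materialize broadcasts with a parallel loop instead of the serial fallback.
function Base.copyto!(dest::ThreadedBroadcastArray,
                      bc::Broadcasted{ArrayStyle{ThreadedBroadcastArray}})
    bc = flatten(bc)
    Threads.@threads for I in CartesianIndices(parent(dest))
        @inbounds dest[I] = bc[I]
    end
    return dest
end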

Would this be desired in Trixi.jl? @ranocha @sloede

@efaulhaber added the discussion, parallelization, and performance labels on Feb 13, 2025
@JoshuaLampert
Member

From an old discussion: #1108 (comment).

@efaulhaber
Member Author

That's another advantage of the custom array type. It works even with methods that don't have a kwarg for multithreading.
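
As a quick illustration of that point (continuing the hypothetical sketch above; note that no thread kwarg appears anywhere):

# Broadcast updates of the kind integrators perform internally
# (e.g. u .= uprev .+ dt .* k) dispatch to the threaded copyto!
# from the sketch above, independent of the algorithm's keyword arguments.
u  = ThreadedBroadcastArray(rand(10^6))
k  = ThreadedBroadcastArray(rand(10^6))
du = similar(u)        # stays a ThreadedBroadcastArray
@. du = u + 0.5 * k    # materialized by the threaded copyto!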

@efaulhaber
Member Author

I implemented the custom array type in #2284.
