Optimizations method weighted #23

CarlosPenaDePedro · 2024-11-22T11:35:44Z

This pull request proposes two changes that improve the performance of the code. The first iterates over sparse matrices by index instead of through iterators, which gives the compiler more room to optimize and generally suits structures of known size, such as vectors.

The second introduces OpenMP parallelization for independent sections of the code, such as per-row processing, to improve performance further.
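As a rough illustration of both ideas, here is a minimal sketch of index-based traversal over a compressed sparse row (CSR) layout, with the row loop optionally parallelized. `CsrMatrix` and `multiply` are hypothetical stand-ins written for this example, not the eckit or MIR classes; only the `outer` offsets idea is taken from the diff in this pull request.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical CSR container for illustration only (not eckit's sparse matrix).
struct CsrMatrix {
    std::size_t rows;
    std::vector<std::size_t> outer;  // size rows + 1; row r owns entries [outer[r], outer[r + 1])
    std::vector<std::size_t> inner;  // column index of each stored entry
    std::vector<double> data;        // value of each stored entry
};

// Computes y = A * x by walking each row through its index range.
std::vector<double> multiply(const CsrMatrix& A, const std::vector<double>& x) {
    std::vector<double> y(A.rows, 0.0);

    // Rows are independent (each iteration writes only y[r]), so the loop
    // can be shared among threads when OpenMP is enabled.
#ifdef _OPENMP
#pragma omp parallel for
#endif
    for (std::size_t r = 0; r < A.rows; ++r) {
        double sum = 0.0;
        for (std::size_t k = A.outer[r]; k < A.outer[r + 1]; ++k) {
            sum += A.data[k] * x[A.inner[k]];
        }
        y[r] = sum;
    }
    return y;
}
```

The loop bounds are plain integers known up front, which is the property the description above relies on; an empty row simply has `outer[r] == outer[r + 1]` and contributes zero.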

We applied these changes only to the parts of the code relevant to our current work. If they are accepted, standardizing their usage across all code that relies on the iterator structure would require a major refactor and a long-term development effort.

We also acknowledge that the iterator structure may serve a specific purpose and may not be replaceable. Our goal is to show the improvements observed on MN5 and explain how we achieved them.

We applied these changes to the ForceMissing calculation (Section 3) and the MissingValueTreatment, specifically the MissingIfHeaviestMissing case (Section 9).

Other relevant sections include:

Section 8, which corresponds to the matrix copy. It remains the major bottleneck, as we could not modify it from within MIR. Perhaps the eckit copy constructor could be revised to optimize this process.
Section 10, which covers the interpolation calculation itself. This section is already parallelized with OpenMP because ECKIT_OMP and MIR_OMP are enabled.

This table presents the average burst time for the MIR Method Weighted sections across different servers during an output step of Tco1279-eORCA12 simulations using the DE340 output plans. The data focuses exclusively on the ocean servers, showcasing only eORCA12 data with missing values.

This is the base case.

Using indexes instead of iterators

Section 3: 5.4x average burst time speedup relative to the base case
Section 9: 2.85x average burst time speedup relative to the base case

Using indexes and 16 OMP threads

Section 3: 29.42x average burst time speedup relative to the base case
Section 9: 18.97x average burst time speedup relative to the base case

FussyDuck · 2024-11-22T11:35:50Z

All committers have signed the CLA.

CarlosPenaDePedro · 2024-11-22T12:08:07Z

src/mir/method/MethodWeighted.cc

    std::vector<size_t> forceMissing;  // reserving size unnecessary (not the general case)
-    {
-        auto begin = W.begin(0);
-        auto end(begin);
-        for (size_t r = 0; r < W.rows(); r++) {
-            if (begin == (end = W.end(r))) {
-                forceMissing.push_back(r);
-            }
-            begin = end;
+    #pragma omp parallel for reduction(vec_merge_sorted:forceMissing)
+    for (size_t r = 0; r < W.rows(); ++r) {
+        if (W.outer()[r] == W.outer()[r + 1]) {
+            forceMissing.push_back(r);
        }
    }


This part corresponds to Section 3. If I am not mistaken, this calculation could be performed during matrix creation, as the original empty rows in the weight matrix are part of the geometry. A vector containing the indices of these empty rows could be stored as a private member and cached. This approach could save significant time, especially with larger matrices, where traversing all rows is costly. For example, in the base case profiling with eORCA12, each MIR API call takes approximately 40 ms.
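To make the caching suggestion concrete, here is a minimal sketch that computes the empty-row indices once at construction and hands them out by reference afterwards. `WeightsWithEmptyRows` and `emptyRows()` are hypothetical names invented for this example, not part of MIR or eckit, and the sketch assumes the sparsity pattern does not change after construction.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical wrapper for illustration; not an existing MIR or eckit class.
class WeightsWithEmptyRows {
public:
    // outer holds the CSR row offsets (size rows + 1).
    explicit WeightsWithEmptyRows(std::vector<std::size_t> outer) : outer_(std::move(outer)) {
        // One pass at creation time: row r is empty when its offset range is empty.
        for (std::size_t r = 0; r + 1 < outer_.size(); ++r) {
            if (outer_[r] == outer_[r + 1]) {
                emptyRows_.push_back(r);
            }
        }
    }

    // Cached result; later calls do not traverse the rows again.
    const std::vector<std::size_t>& emptyRows() const { return emptyRows_; }

private:
    std::vector<std::size_t> outer_;
    std::vector<std::size_t> emptyRows_;
};
```

With something like this, each interpolation call would read the cached vector instead of scanning every row, at the cost of keeping the cache consistent if the matrix is ever modified.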

pmaciel · 2025-03-05T15:09:09Z

The OMP header should check the conditional preprocessor define (defined in #include "mir/api/mir_config.h"):
#if mir_HAVE_OMP

If OpenMP is not found, the pragma (or the defined symbol) should do nothing.

wdeconinck · 2025-03-13T15:50:04Z

In atlas we have something just for that:

#if ATLAS_HAVE_OMP
#define ATLAS_OMP_STR(x) #x
#define ATLAS_OMP_STRINGIFY(x) ATLAS_OMP_STR(x)
#define atlas_omp_pragma(x) _Pragma(ATLAS_OMP_STRINGIFY(x))
#else
#define atlas_omp_pragma(x)
#endif

With something similar you can then use:

mir_omp_pragma( omp parallel for reduction(vec_merge_sorted:forceMissing) )

Or define the reduction:

#define mir_omp_parallel_for_reduction(x) mir_omp_pragma( omp parallel for reduction(x) )

and then use

mir_omp_parallel_for_reduction(vec_merge_sorted:forceMissing)
for( ... ) {...}
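For reference, a self-contained sketch of how such a wrapper could be combined with a user-declared reduction so that the same source builds with and without OpenMP. The header actually added in this pull request is not shown here, so everything below is an assumption: `EXAMPLE_HAVE_OMP` stands in for the project's configuration symbol, and `example_omp_pragma`, `merge_sorted` and `empty_rows` are hypothetical names. Only the reduction identifier `vec_merge_sorted` is taken from the diff.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <vector>

// Assumption: a real project would take this from its configuration header.
#ifdef _OPENMP
#define EXAMPLE_HAVE_OMP 1
#else
#define EXAMPLE_HAVE_OMP 0
#endif

#if EXAMPLE_HAVE_OMP
#define EXAMPLE_OMP_STR(x) #x
#define EXAMPLE_OMP_STRINGIFY(x) EXAMPLE_OMP_STR(x)
#define example_omp_pragma(x) _Pragma(EXAMPLE_OMP_STRINGIFY(x))
#else
#define example_omp_pragma(x)  // expands to nothing without OpenMP
#endif

// Merges the sorted vector 'in' into the sorted vector 'out', keeping order.
inline void merge_sorted(std::vector<std::size_t>& out, const std::vector<std::size_t>& in) {
    std::vector<std::size_t> merged;
    merged.reserve(out.size() + in.size());
    std::merge(out.begin(), out.end(), in.begin(), in.end(), std::back_inserter(merged));
    out.swap(merged);
}

#if EXAMPLE_HAVE_OMP
// Each thread's private vector starts default-constructed (empty).
#pragma omp declare reduction(vec_merge_sorted : std::vector<std::size_t> : merge_sorted(omp_out, omp_in))
#endif

// Returns the indices of rows whose CSR offset range is empty.
std::vector<std::size_t> empty_rows(const std::vector<std::size_t>& outer) {
    std::vector<std::size_t> result;
    const std::size_t rows = outer.empty() ? 0 : outer.size() - 1;

    example_omp_pragma(omp parallel for reduction(vec_merge_sorted : result))
    for (std::size_t r = 0; r < rows; ++r) {
        if (outer[r] == outer[r + 1]) {
            result.push_back(r);
        }
    }
    return result;
}
```

Each thread appends row indices in increasing order, so the per-thread vectors are sorted and merging them yields a sorted result. Without OpenMP the macro expands to nothing and the loop runs serially with the same output.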

CarlosPenaDePedro added 6 commits November 20, 2024 13:15

Avoid iterator creation in forceMissing creation

f23b471

Added OMP to forceMissing creation

7503bdd

Modified MissingifHeaviestMissing to improve access to sparse matrix

0267e16

Added OpenMP to compute rows of MissingIfHeaviestMissing

d7931c8

Rebased file to follow develop

61866b6

Simplified and fixed OMP pragma

d08fa5f

github-actions bot added the contributor label Nov 22, 2024

CarlosPenaDePedro marked this pull request as ready for review November 22, 2024 12:08

CarlosPenaDePedro and others added 4 commits March 4, 2025 08:49

Merge branch 'ecmwf:develop' into optimizations_MethodWeighted

763c1dd

Updated outer instances to use UIndex new getter

8cc0a33

Fixed implicit conversion

9863411

Moved OMP reduction definition to a header file

d6980b5
