Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CI: Add runner with Ubuntu on ARM64 #634

Merged
merged 2 commits into from
Jan 28, 2025
Merged

Conversation

mmuetzel
Copy link
Contributor

GitHub started hosting runners with Ubuntu on ARM64 processors for open source projects for free:
https://github.blog/changelog/2025-01-16-linux-arm64-hosted-runners-now-available-for-free-in-public-repositories-public-preview/

Add one configuration that is using these runners to the build matrix.

According to their blog post, the arm64 runners are more efficient and potentially faster than the x86_64 runners. (But it is still in a preview phase and maybe it will take some time for them to better scale and balance the load.) If that turns out to be true, it should be easy to switch more configurations in that workflow to the arm64 runners (and maybe keep only one or two running on x86_64).

@mmuetzel
Copy link
Contributor Author

Oof. More tests are failing on that runner than I would have hoped.
At least some of the failing tests (H1BasisEvaluation, SD_H1BasisEvaluation, pointload2) also fail on macOS (Apple Silicon). Maybe, there are some assumptions somewhere that only hold for Intel/AMD processors?

I don't have physical access to ARM64 hardware. Not sure if I could help track down any of that.
Maybe, valgrind or some sanitizers would be able to find something that is odd also on Intel/AMD?

@juharu
Copy link
Contributor

juharu commented Jan 23, 2025 via email

@juharu
Copy link
Contributor

juharu commented Jan 23, 2025 via email

@mmuetzel
Copy link
Contributor Author

Have you checked it's not a stack size problem ?

I might be wrong. But wouldn't that crash the program?
It looks like the numerics don't behave as expected. E.g., the last time steps before the test ConstantParamFunc failed:

  MAIN: 
  MAIN: -------------------------------------
  MAIN: Time: 34/50:   6.800E+00
  MAIN: -------------------------------------
  MAIN: 
  HeatSolver: Solving the energy equation for temperature
  HeatSolve: 
  HeatSolve: 
  HeatSolve: -------------------------------------
  HeatSolve:  TEMPERATURE ITERATION           1
  HeatSolve: -------------------------------------
  HeatSolve: 
  HeatSolve: Starting Assembly...
  HeatSolve: Assembly:
  HeatSolve: Assembly done
  ComputeChange: NS (ITER=1) (NRM,RELC): ( 0.16009565     0.62252546E-01 ) :: heat equation
  HeatSolve: iter:    1 Assembly: (s)    0.00    0.00
  HeatSolve: iter:    1 Solve:    (s)    0.00    0.00
  HeatSolve:  Result Norm   :   0.16009565005742657
  HeatSolve:  Relative Change :    6.2252546152800785E-002
  MAIN: 
  MAIN: -------------------------------------
  MAIN: Time: 35/50:   7.000E+00
  MAIN: -------------------------------------
  MAIN: 
  HeatSolver: Solving the energy equation for temperature
  HeatSolve: 
  HeatSolve: 
  HeatSolve: -------------------------------------
  HeatSolve:  TEMPERATURE ITERATION           1
  HeatSolve: -------------------------------------
  HeatSolve: 
  HeatSolve: Starting Assembly...
  HeatSolve: Assembly:
  HeatSolve: Assembly done
  ERROR:: IterSolve: Numerical Error: System diverged over maximum tolerance.

I don't know what could be causing that though...

@juharu
Copy link
Contributor

juharu commented Jan 23, 2025 via email

@juharu
Copy link
Contributor

juharu commented Jan 23, 2025 via email

@juharu
Copy link
Contributor

juharu commented Jan 23, 2025 via email

@mmuetzel
Copy link
Contributor Author

I switched the iterator on the ConstParamFunc, ConvergenceControl, HeatControlExplicit, TransientCostFourHeater and fsi_box tests (in devel now).

Thank you for looking into this.

I rebased this PR on top of your latest changes. Let's see if this will make a difference.

@mmuetzel
Copy link
Contributor Author

mmuetzel commented Jan 23, 2025

That doesn't seem to have made much difference.
Still the following test failures:

The following tests FAILED:
	126 - ConstantParamFunc (Failed)                        serial
	155 - ConvergenceControl (Failed)                       serial transient
	292 - H1BasisEvaluation (Failed)                        benchmark serial
	301 - HeatControlExplicit (Failed)                      control quick serial
	433 - OptimizeSimplexFourHeaters (Failed)               control serial
	434 - OptimizeSimplexFourHeatersInt (Failed)            control serial
	540 - SD_H1BasisEvaluation (Failed)                     benchmark serendipity serial
	680 - TransientCostFourHeaters (Failed)                 serial
	801 - fsi_box (Failed)                                  elasticsolve fsi serial transient
	920 - pointload2 (Failed)                               serial

@juharu
Copy link
Contributor

juharu commented Jan 23, 2025 via email

@juharu
Copy link
Contributor

juharu commented Jan 23, 2025 via email

@mmuetzel
Copy link
Contributor Author

Can I somehow misuse the git "push" triggered "Action" to do debugging (without waiting for too long) ?

Debugging using only GitHub actions is pretty tedious unfortunately. Unfortunately, I'm not aware of any option to log into the runners and run commands manually.
I would be very interested if someone finds out how to do that.

The next "best" thing (but still pretty bad) what I mostly resort to is:

  • Fork the repository to my own user account.
  • Enable running actions on that fork.
  • Optionally, disable any workflows that I'm currently not interested in. (This can be done on the "three-dot-menu" after selecting the respective workflow on the "Actions" tab.)
  • Optionally, modify the "configure" step of the workflow file in that fork to disable some build options (e.g., the GUI) to reduce the build time.
  • Optionally, modify the "check" step of the workflow file in that fork to reduce the number of tests that are being run.
  • Optionally, modify the source files in that fork to output more intermediate results.
  • Then iterate until there is some useful information.

That is still quite tedious and time consuming. It would be much easier if I knew how to stop in a debugger or anything like that in a GitHub action.

@juharu
Copy link
Contributor

juharu commented Jan 24, 2025 via email

@juharu
Copy link
Contributor

juharu commented Jan 24, 2025 via email

@mmuetzel
Copy link
Contributor Author

mmuetzel commented Jan 24, 2025

An observation about the linear system iterator fails on ARM64 platform: It seems that the DNRM2() - function used by all iterative methods fails at random intervals. I don't know the reason. I think this is linked in from the openblas() - library here ?

Good finding!

Afaict, Ubuntu 24.04 distributes OpenBLAS version 0.3.26.

I found the following commit for upstream OpenBLAS that could be related:
OpenMathLib/OpenBLAS@3345007

I.e., they changed the implementation of DNRM2 on the Neoverse N2 due to inaccuracies. That processor is used on the GitHub runners.
IIUC, the first version with that change is OpenBLAS 0.3.29.

I'll try to switch to using the reference BLAS implementation on the ARM64 runner. Maybe, that will make a difference.

@mmuetzel
Copy link
Contributor Author

mmuetzel commented Jan 24, 2025

Switching to the reference BLAS implementation reduced the number of failing tests from 10 to four:

The following tests FAILED:
	292 - H1BasisEvaluation (Failed)                        benchmark serial
	540 - SD_H1BasisEvaluation (Failed)                     benchmark serendipity serial
	549 - SD_P2ndDerivatives (Failed)                       quick serial
	920 - pointload2 (Failed)                               serial
Error: Process completed with exit code 8.

That's a huge step forward. 🎉

The remaining test failures are mainly the same ones that are also failing on macOS. (Not sure if that is coincidental.)

The only other (and new) one might be because numerical results are slightly different using the reference implementation compared to an optimized implementation (like OpenBLAS). I haven't checked though...

@mmuetzel
Copy link
Contributor Author

Afaict, SD_P2ndDerivatives (the only new one with respect to the previously failing tests) is failing with an accuracy issue:

 Testing 2nd derivatives with p(8)...
(202)...PASSED (303)...PASSED (404)...PASSED (504)... dfyz:  0.85232585644089298       0.85232542765237440        4.2878851858052514E-007 >   4.2616292834281920E-007

@juharu
Copy link
Contributor

juharu commented Jan 24, 2025 via email

@juharu
Copy link
Contributor

juharu commented Jan 25, 2025 via email

@juharu
Copy link
Contributor

juharu commented Jan 25, 2025 via email

@mmuetzel
Copy link
Contributor Author

mmuetzel commented Jan 26, 2025

Good finding.
Maybe, a miscompilation by gfortran of some OpenMP constructs for that CPU (aarch64) and/or for that OS (macOS)? It's probably not an issue in the OpenMP implementation itself because Ubuntu uses OpenMP as implemented by GCC (libopenmp) while macOS uses the OpenMP implementation by LLVM (libomp).
It's probably also possible that there is an issue with how OpenMP is used somewhere in the sources (and it just "happens" to work correctly for some combinations of operating system and used CPU).

Do you know which OpenMP constructs are causing the test errors? It's hard to find a bug report or potential workarounds with only using these broad keywords...

Is a combination of OpenMP and -O3 needed to reproduce the issue? Or is one of them enough?

Anyway: If I understand correctly, OpenMP is only used in parts of elmerfem. And for most use cases, it is not crucial whether OpenMP is used or not. OpenMP is deactivated by default in the build rules.
Would it be ok to just stop testing OpenMP on platforms/CPUs where it turned out that it is causing errors?

While looking through the code, I found a couple of places in NavierStokes.F90 and SParIterSolver.F90 where !omp is used instead of the (documented) !$omp.
I don't know if those are just typos or whether those instructions are meant to be disabled.
I'll open a separate PR with fixes for these potential typos. (Not sure if that helps with the issues here though.)
Edit: See #636

@juharu
Copy link
Contributor

juharu commented Jan 27, 2025 via email

mmuetzel added a commit to mmuetzel/elmerfem that referenced this pull request Jan 27, 2025
Building with OpenMP causes some test failures on macOS,
see ElmerCSC#634 (comment).
The affected tests currently are: `H1BasisEvaluation`,
`SD_H1BasisEvaluation`, and `pointload2`.

For the time being, disable building with OpenMP in CI.

The flags that are necessary to build with OpenMP using Homebrew on macOS
are left intact. So, it should be easy to enable runners with OpenMP in
case this issue will be fixed.

Note that the tests with OpenMP fail for optimization level `-O3` but
succeed with `-O2`.
mmuetzel added a commit to mmuetzel/elmerfem that referenced this pull request Jan 27, 2025
Building with OpenMP causes some test failures on macOS,
see ElmerCSC#634 (comment).
The affected tests currently are: `H1BasisEvaluation`,
`SD_H1BasisEvaluation`, and `pointload2`.

For the time being, disable building with OpenMP in CI.

The flags that are necessary to build with OpenMP using Homebrew on macOS
are still needed because OpenMP is a transitional build dependency of
SuiteSparse.

Note that the tests with OpenMP fail for optimization level `-O3` but
succeed with `-O2`.
mmuetzel added a commit to mmuetzel/elmerfem that referenced this pull request Jan 27, 2025
Building with OpenMP causes some test failures on macOS,
see ElmerCSC#634 (comment).
The affected tests currently are: `H1BasisEvaluation`,
`SD_H1BasisEvaluation`, and `pointload2`.

For the time being, disable building with OpenMP in CI.

OpenMP is needed as a transitional build dependency of the SuiteSparse
package from Homebrew. So, also disable building with CHOLMOD.

Note that the tests with OpenMP fail for optimization level `-O3` but
succeed with `-O2`.
mmuetzel added a commit to mmuetzel/elmerfem that referenced this pull request Jan 27, 2025
Building with OpenMP causes some test failures on macOS,
see ElmerCSC#634 (comment).
The affected tests currently are: `H1BasisEvaluation`,
`SD_H1BasisEvaluation`, and `pointload2`.

For the time being, disable building with OpenMP in CI.

OpenMP is needed as a transitional build dependency of the SuiteSparse
package from Homebrew. So, also disable building with CHOLMOD.

Note that the tests with OpenMP fail for optimization level `-O3` but
succeed with `-O2`.
mmuetzel added a commit to mmuetzel/elmerfem that referenced this pull request Jan 27, 2025
Building with OpenMP causes some test failures on macOS,
see ElmerCSC#634 (comment).
The affected tests currently are: `H1BasisEvaluation`,
`SD_H1BasisEvaluation`, and `pointload2`.

For the time being, disable building with OpenMP in CI.

OpenMP is needed as a transitional build dependency of the SuiteSparse
package from Homebrew. So, also disable building with CHOLMOD.

Note that the tests with OpenMP fail for optimization level `-O3` but
succeed with `-O2`.
mmuetzel added a commit to mmuetzel/elmerfem that referenced this pull request Jan 27, 2025
Building with OpenMP causes some test failures on macOS,
see ElmerCSC#634 (comment).
The affected tests currently are: `H1BasisEvaluation` and
`SD_H1BasisEvaluation`.

For the time being, disable building with OpenMP on macOS in CI.

OpenMP is needed as a transitional build dependency of the SuiteSparse
package from Homebrew. So, also disable building with CHOLMOD.

The logic to build with OpenMP is left in place. So, it should be easy to
re-activate building with it in case the underlying issues will be fixed.

Note that the tests with OpenMP fail for optimization level `-O3` but
succeed with `-O2`.
@mmuetzel
Copy link
Contributor Author

Both should be active at the same time to trigger the fail. I'd second the deactivation of OpenMP for testing on these platforms for the time being.

I opened #637 which disables building with OpenMP on the macOS CI runners.

@mmuetzel
Copy link
Contributor Author

Switched to a different configuration (without OpenMP) for which an Ubuntu on ARM runner is added in this PR.

mmuetzel added a commit to mmuetzel/elmerfem that referenced this pull request Jan 27, 2025
Building with OpenMP causes some test failures on macOS on Apple Silicon,
see ElmerCSC#634 (comment).
The affected tests currently are: `H1BasisEvaluation` and
`SD_H1BasisEvaluation`.

For the time being, disable building with OpenMP on Apple Silicon in CI.

OpenMP is needed as a transitional build dependency of the SuiteSparse
package from Homebrew. So, also disable building with CHOLMOD.

Note that the tests with OpenMP fail for optimization level `-O3` but
succeed with `-O2`.
@mmuetzel
Copy link
Contributor Author

When building without OpenMP, the only remaining test failure for Ubuntu on ARM is pointload2:

The following tests FAILED:
Errors while running CTest
	920 - pointload2 (Failed)                               serial
Error: Process completed with exit code 8.

That is the same test that is the only remaining test error for macOS on Apple Silicon when building without OpenMP.
That looks like a pattern.

mmuetzel added a commit to mmuetzel/elmerfem that referenced this pull request Jan 27, 2025
Building with OpenMP causes some test failures for macOS on Apple
Silicon, see:
ElmerCSC#634 (comment)
The affected tests currently are: `H1BasisEvaluation` and
`SD_H1BasisEvaluation`.

For the time being, disable building with OpenMP on Apple Silicon in CI.

OpenMP is needed as a transitional build dependency of the SuiteSparse
package from Homebrew. So, also disable building with CHOLMOD.

Note that the tests with OpenMP fail for optimization level `-O3` but
succeed with `-O2`.
mmuetzel added a commit to mmuetzel/elmerfem that referenced this pull request Jan 27, 2025
Building with OpenMP causes some test failures for macOS on Apple
Silicon, see:
ElmerCSC#634 (comment)
The affected tests currently are: `H1BasisEvaluation` and
`SD_H1BasisEvaluation`.

For the time being, disable building with OpenMP on Apple Silicon in CI.

OpenMP is needed as a transitional build dependency of the SuiteSparse
package from Homebrew. So, also disable building with CHOLMOD.

Note that the tests with OpenMP fail for optimization level `-O3` but
succeed with `-O2`.
mmuetzel added a commit to mmuetzel/elmerfem that referenced this pull request Jan 27, 2025
Building with OpenMP causes some test failures for macOS on Apple
Silicon, see:
ElmerCSC#634 (comment)
The affected tests currently are: `H1BasisEvaluation` and
`SD_H1BasisEvaluation`.

For the time being, disable building with OpenMP on Apple Silicon in CI.

OpenMP is needed as a transitional build dependency of the SuiteSparse
package from Homebrew. So, also disable building with CHOLMOD.

Note that the tests with OpenMP fail for optimization level `-O3` but
succeed with `-O2`.
mmuetzel added a commit to mmuetzel/elmerfem that referenced this pull request Jan 27, 2025
Building with OpenMP causes some test failures for macOS on Apple
Silicon, see:
ElmerCSC#634 (comment)
The affected tests currently are: `H1BasisEvaluation` and
`SD_H1BasisEvaluation`.

For the time being, disable building with OpenMP on Apple Silicon in CI.

OpenMP is needed as a transitional build dependency of the SuiteSparse
package from Homebrew. So, also disable building with CHOLMOD.

Note that the tests with OpenMP fail for optimization level `-O3` but
succeed with `-O2`.
mmuetzel added a commit to mmuetzel/elmerfem that referenced this pull request Jan 27, 2025
Building with OpenMP causes some test failures for macOS on Apple
Silicon, see:
ElmerCSC#634 (comment)
The affected tests currently are: `H1BasisEvaluation` and
`SD_H1BasisEvaluation`.

For the time being, disable building with OpenMP on Apple Silicon in CI.

OpenMP is needed as a transitional build dependency of the SuiteSparse
package from Homebrew. So, also disable building with CHOLMOD.

Note that the tests with OpenMP fail for optimization level `-O3` but
succeed with `-O2`.
mmuetzel added a commit to mmuetzel/elmerfem that referenced this pull request Jan 27, 2025
Building with OpenMP causes some test failures for macOS on Apple
Silicon, see:
ElmerCSC#634 (comment)
The affected tests currently are: `H1BasisEvaluation` and
`SD_H1BasisEvaluation`.

For the time being, disable building with OpenMP on Apple Silicon in CI.

OpenMP is needed as a transitional build dependency of the SuiteSparse
package from Homebrew. So, also disable building with CHOLMOD.

Note that the tests with OpenMP fail for optimization level `-O3` but
succeed with `-O2`.
mmuetzel added a commit to mmuetzel/elmerfem that referenced this pull request Jan 28, 2025
Building with OpenMP causes some test failures for macOS on Apple
Silicon, see:
ElmerCSC#634 (comment)
The affected tests currently are: `H1BasisEvaluation` and
`SD_H1BasisEvaluation`.

For the time being, disable building with OpenMP on Apple Silicon in CI.

OpenMP is needed as a transitional build dependency of the SuiteSparse
package from Homebrew. So, also disable building with CHOLMOD.
GitHub started hosting runners with Ubuntu on ARM64 processors for open
source projects for free:
https://github.blog/changelog/2025-01-16-linux-arm64-hosted-runners-now-available-for-free-in-public-repositories-public-preview/

Add one configuration that is using these runners to the build matrix.

Avoid using a configuration with OpenMP because there seem to be issues
with OpenMP on that platform.
mmuetzel added a commit to mmuetzel/elmerfem that referenced this pull request Jan 28, 2025
Building with OpenMP causes some test failures for macOS on Apple
Silicon, see:
ElmerCSC#634 (comment)
The affected tests currently are: `H1BasisEvaluation` and
`SD_H1BasisEvaluation`.

For the time being, disable building with OpenMP on Apple Silicon in CI.

OpenMP is needed as a transitional build dependency of the SuiteSparse
package from Homebrew. So, also disable building with CHOLMOD.
@juharu juharu merged commit a4d2120 into ElmerCSC:devel Jan 28, 2025
9 of 13 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants