tests on multiple mpi ranks #24

Merged
merged 42 commits into from
Nov 29, 2023

Conversation

bjpalmer
Collaborator

@bjpalmer bjpalmer commented Oct 2, 2023

The func-test-mpi branch has been rebased against develop and appears to be ready to merge. This PR closes issue #27.

@bjpalmer bjpalmer self-assigned this Oct 2, 2023
@cameronrutherford
Contributor

All the commits were from @jaelynlitz - is this her previous work?

I would also appreciate it if you could bring over the relevant issue from GitLab, close that issue, and then ensure that this PR closes that issue upon merge.

@bjpalmer
Collaborator Author

bjpalmer commented Oct 2, 2023

There doesn't appear to be an issue in Gitlab, unless I'm being obtuse. @jaelynlitz do you see it? I may have picked this up by scanning through the branches.

@cameronrutherford
Contributor

cameronrutherford commented Oct 2, 2023

There doesn't appear to be an issue in Gitlab, unless I'm being obtuse. @jaelynlitz do you see it? I may have picked this up by scanning through the branches.

https://gitlab.pnnl.gov/exasgd/frameworks/exago/-/issues/279

https://gitlab.pnnl.gov/exasgd/frameworks/exago/-/merge_requests/269

https://gitlab.pnnl.gov/exasgd/frameworks/exago/-/merge_requests/472/diffs

https://gitlab.pnnl.gov/exasgd/frameworks/exago/-/merge_requests/293

https://gitlab.pnnl.gov/exasgd/frameworks/exago/-/issues/284

Does this resolve the SCOPFLOW tests printing duplicate output perhaps? I am confused as there are two existing PRs which have been merged targeting pflow, and this would be the third.

It would be great if you could consolidate the open issue into individual components, and clarify what in particular this PR is addressing.

@bjpalmer bjpalmer changed the title from "test pflow on multiple mpi ranks" to "tests on multiple mpi ranks" Oct 4, 2023
@bjpalmer bjpalmer linked an issue Oct 4, 2023 that may be closed by this pull request
4 tasks
@cameronrutherford
Contributor

Tentatively marking for 1.6.1 release unless this is about to be merged

@bjpalmer
Collaborator Author

If you are otherwise ready to go, I'd skip this one. There still seem to be some issues.

@bjpalmer
Collaborator Author

Attached is a log file from Valgrind when running the FUNCTIONALITY_TEST_SCOPFLOW_HIOP_SERIAL_TESTSUITE test in the test suite. This corresponds to the hiop_serial.toml file under tests/functionality/scopflow.

log.2203724.txt

@bjpalmer
Collaborator Author

I looked through the code to see if I could track down some of the errors showing up in Valgrind. I can make the errors coming from PetscMemcpy go away by using a char* buffer and strncpy, even though I'm not completely sure why the original error is showing up.
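A minimal sketch of the kind of change described above: copying a C string into a fixed-size char buffer with strncpy and an explicit terminator. The thread does not establish why PetscMemcpy triggered the Valgrind report; one possibility, stated here only as an assumption, is a fixed-length copy reading past the end of a shorter source string. The helper name `CopyOptionString` is hypothetical and is not part of ExaGO or PETSc.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical helper (not ExaGO or PETSc API): copy a C string into a
// fixed-size buffer. strncpy stops reading the source at its terminating
// '\0', so it never reads beyond the end of a short source string.
inline void CopyOptionString(char *dst, std::size_t dst_size, const char *src) {
  if (dst == nullptr || dst_size == 0) {
    return; // nothing we can safely write to
  }
  if (src == nullptr) {
    dst[0] = '\0'; // treat a missing source as an empty string
    return;
  }
  std::strncpy(dst, src, dst_size - 1);
  dst[dst_size - 1] = '\0'; // strncpy does not terminate on truncation
}
```

With an 8-byte destination, a longer source is truncated to its first 7 characters and is still null-terminated.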

Most of the remaining errors are complaints that the following three variables in OPFLOW are not initialized:

ignore_lineflow_constraints
include_powerimbalance_variables
solver.default_value

@abhyshr, what are appropriate default values for these variables and where should they be set?

@bjpalmer
Collaborator Author

Test 34 is now running correctly on Newell after getting rid of the error reported from PetscMemcpy, but I think it would still be a good idea to make sure the variables above are initialized to something.

@bjpalmer
Collaborator Author

I looked at the code in the opflow.h file and it appears that both ignore_lineflow_constraints and include_powerimbalance_variables should be initialized to false. Maybe Valgrind doesn't like the ExaGOBoolOption function?
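One way to guarantee that such flags always hold a defined value is to give them default member initializers, so they are set even if an initialization routine is skipped. The sketch below is only an illustration: `OpflowFlagsSketch` is a hypothetical stand-in, not the real OPFLOW struct, and the `false` defaults are taken from the comment above rather than confirmed against the source.

```cpp
// Hypothetical stand-in for the two OPFLOW flags named in this thread.
// The real ExaGO struct has many more members and may set its defaults
// elsewhere (for example, in an initialization routine).
struct OpflowFlagsSketch {
  bool ignore_lineflow_constraints = false;      // assumed default
  bool include_powerimbalance_variables = false; // assumed default
};
```

An object declared as `OpflowFlagsSketch flags;` then has both members set to false without any further code, so a tool such as Valgrind has nothing uninitialized to flag when they are read.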

@bjpalmer
Collaborator Author

I was able to track down some of the uninitialized variables. The variables are initialized in the SCOPFLOW and OPFLOW Initialization routines but then are overwritten in the selfcheck.cpp file with variables that may not have been initialized. The failure of test 31 on Newell and Deception disappeared but new failures for tests 34 and 35 showed up. @abhyshr, can you take a look at these failures? The test 34 failure is a largish change in the objective function and the number of iterations going from 1 to 2. Is it possible that this may actually be an improved answer?
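The overwrite pattern described above can be avoided by only applying a test-case value when the test case actually set it. The following is a sketch under stated assumptions, not the fix that was committed: all type and function names are hypothetical, and it relies on `std::optional` from C++17.

```cpp
#include <optional>

// Hypothetical stand-ins; these are not the actual ExaGO types.
struct SolverFlagsSketch {
  bool ignore_lineflow_constraints = false;
  bool include_powerimbalance_variables = false;
};

// Values read from a test case. An empty optional means the test file did
// not specify the setting, so the solver's own default is left untouched.
struct SelfcheckOverridesSketch {
  std::optional<bool> ignore_lineflow_constraints;
  std::optional<bool> include_powerimbalance_variables;
};

// Copy over only the settings that the test case explicitly provided.
inline void ApplySelfcheckOverrides(SolverFlagsSketch &flags,
                                    const SelfcheckOverridesSketch &tc) {
  if (tc.ignore_lineflow_constraints.has_value()) {
    flags.ignore_lineflow_constraints = *tc.ignore_lineflow_constraints;
  }
  if (tc.include_powerimbalance_variables.has_value()) {
    flags.include_powerimbalance_variables =
        *tc.include_powerimbalance_variables;
  }
}
```

Because an unset override is an empty optional rather than an indeterminate bool, there is no uninitialized read to report, and the values set by the initialization routines survive unless a test deliberately changes them.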

@cameronrutherford
Contributor

I was able to track down some of the uninitialized variables. The variables are initialized in the SCOPFLOW and OPFLOW Initialization routines but then are overwritten in the selfcheck.cpp file with variables that may not have been initialized. The failure of test 31 on Newell and Deception disappeared but new failures for tests 34 and 35 showed up. @abhyshr, can you take a look at these failures? The test 34 failure is a largish change in the objective function and the number of iterations going from 1 to 2. Is it possible that this may actually be an improved answer?

Are the test failures in CI? Would be easiest to link logs there...

@bjpalmer
Collaborator Author

The test 34 failure is showing up on both Newell and Deception in CI.

@abhyshr
Collaborator

abhyshr commented Oct 24, 2023

I am not sure if we have changed anything in SOPFLOW that will affect the solution. I don't recall anything at this point. I'm not sure if it has to do with any of the dependency libraries. Did something change in the input files? I am not sure. It might be best to update the reference solution for the two tests for now and see if we get any more failures in the future.

@bjpalmer
Collaborator Author

I tried rebasing against develop and got a conflict with .github/workflows/spack_cpu_build.yaml but there is no .github directory listed in the top level exago directory. How am I getting a conflict here?

@cameronrutherford
Contributor

cameronrutherford commented Oct 25, 2023

I tried rebasing against develop and got a conflict with .github/workflows/spack_cpu_build.yaml but there is no .github directory listed in the top level exago directory. How am I getting a conflict here?

I assume that you are not able to view the .github directory in your terminal when doing this, as that folder is in good health in exago@develop. Make sure to use ls -al when looking for hidden files.

It appears as though your git history is mangled, so I suggest the following generic approach, assuming you are in a checked out local version of func-test-mpi, and no existing develop branch has been checked out locally. Make sure you pick the correct commits during git rebase -i:

$ git fetch --all
$ git checkout -b develop --track origin/develop
$ git reset --hard origin/develop
$ git checkout -
$ git rebase -i develop
$ git push -f origin func-test-mpi

@bjpalmer
Collaborator Author

Is that second line correct? The develop branch already exists.

@cameronrutherford
Contributor

Is that second line correct? The develop branch already exists.

assuming you are in a checked out local version of func-test-mpi, and no existing develop branch has been checked out locally

@bjpalmer
Collaborator Author

Sorry, I meant the second git instruction:

$ git checkout -b develop --track origin/develop

I'm getting a complaint about the -b.

@bjpalmer
Collaborator Author

bjpalmer commented Nov 9, 2023

I just rebased against develop again this morning. If everything works, I'd merge before anything else changes.

Bruce J Palmer and others added 26 commits November 28, 2023 13:47
…nd SCOPFLOW

to adapt tests for running on multiple MPI ranks.
* Minor fix for Summit build system

* Fix '--nnodes'-->'-nodes' on Summit

* Attempt to update Summit modules

* Reinstall Ginkgo and python dependencies on Summit

* Enforce cuda@11.4.2 on Summit

* Specify RelWithDebInfo for ExaGO and HiOp on Summit

* Update Spack

* Relax constraints on exago dependencies on Summit

* Add constraints on HiOp in the spack config. Part of the ExaGO package was conflicting with building HiOp in release mode.

* Cleaner module install on Summit

* Update spack_cpu_build.yaml to work without fork

* Update .github/workflows/spack_cpu_build.yaml

* Update Spack

* Try updating pybind11 submodule to see if it fixes errors with exago+python builds

---------

Co-authored-by: Cameron Rutherford <robert.rutherford@pnnl.gov>
* OPFLOW: initial implementation of RAJA/HiOp sparse GPU-based solver

WIP - HIOP Sparse solver with GPU model

OPFLOW: Started work on support for HIOP sparse solver interface for GPUs.

Added a copy of hiop sparse solver interface.

OPFLOW: Added model skeleton for GPU sparse version (copying from pbpolrajahiop)

Fixed build

Did some copy paste to add a test for HIOPSPARSE. This test is not actually
functional yet.

Started updating the hiopsparse model and solver code.

More work on updating the solver and model

Added scalar and vector unit tests for model to be used with HIOP sparse solver on GPU

Apply cmake lint

Fix unit tests.

Set the size of array when using Umpire memset.

Code formatting

Some minor changes to get PBPOLRAJAHIOPSPARSE model code to compile

Separate BUS/LINE/GEN/.../Param structs into reusable module

Minor edit

Rename files

Fix typo

Use BUS/LINE/GEN/.../Param structs in Raja HiOp Sparse model (compiles)

Updating HIOP sparse solver GPU API

Completed bounds kernels

Completed scalar and vector functions

WIP - HIOP Sparse solver with GPU model

OPFLOW: Started work on support for HIOP sparse solver interface for GPUs.

Added a copy of hiop sparse solver interface.

OPFLOW: Added model skeleton for GPU sparse version (copying from pbpolrajahiop)

Fixed build

Did some copy paste to add a test for HIOPSPARSE. This test is not actually
functional yet.

Started updating the hiopsparse model and solver code.

More work on updating the solver and model

Added scalar and vector unit tests for model to be used with HIOP sparse solver on GPU

Apply cmake lint

Fix unit tests.

Set the size of array when using Umpire memset.

Code formatting

Rename files

Use BUS/LINE/GEN/.../Param structs in Raja HiOp Sparse model (compiles)

Updating HIOP sparse solver GPU API

Completed bounds kernels

Jacobian and Hessian for sparse model (CPU --> GPU copy)

Use correct array lengths in Eq. Jacobian

Fix bug in Jacobian.

Fix unused variable/parameter errors

OPFLOW: rework solution callback for RAJA/HIOP GPU-based solver

Formatting changes

* Add unit test for RAJA/HiOp Sparse GPU model (9-bus only)

* Apply pre-commmit fixes

* Add test for 200-bus case

* Apply pre-commmit fixes

---------

Co-authored-by: Abhyankar, Shrirang G <shrirang.abhyankar@pnnl.gov>
* Boilerplate scripts to install modules on Ascent via submodule Spack

* Fix '--nnodes'-->'-nodes' on Ascent

* Improve Ascent env.sh

* magma@2.6.2 on Ascent

* Apply pre-commmit fixes

* Relax constraints on exago dependencies on Ascent and build ~python

* concretizer: reuse was causing several packages to be duplicated in the environment. Require clean concretizations on  Ascent.

* Minor module update on Ascent

* Add LAPACK_LIBRARIES to Ascent base script. CMAKE was picking up python's openblas otherwise.

* Error with unzip.

* Apply pre-commmit fixes

* Add working build on ascent.

* Add working gcc11.2.0 spack spec.

* Add Ascent Spack pipeline. [ascent-rebuild]

* Update gcc version to 11.2.0 in base.sh [skip-ci]

* Fix stages of Ascent pipeline [ascent-rebuild]

* Add working ascent spack build.

* Add hiop@develop force rebuild to PNNL CI [ascent-rebuild] [newell-rebuild] [deception-rebuild] [incline-rebuild].

* Update Ascent spack built tcl modules

* Only test ascent on tcl module update [ci-skip]

* Update base.sh to disable python on ascent [skip ci]

* Remove LAPACK_LIBRARIES spec [ascent-test]

* Update ascent.gitlab-ci.yml to fix needs/dependencies [ascent-test]

* Update deception spack built tcl modules - [deception-test]

* Try again with Python, but have Spack build it instead of using the external module [ascent-rebuild]

* Force python rebuild on ascent and use hiop@0.7.2 on incline [ascent-rebuild] [newell-rebuild] [incline-rebuild]

* Pin hiop@1.0.0 on all CI platforms [decetpion-rebuild] [ascent-rebuild] [newell-rebuild] [incline-rebuild]

* Fix false positive/negative in Ascent pipelines [deception-rebuild] [ascent-test]

* Update incline spack built tcl modules - [incline-test]

* Update newell spack built tcl modules - [newell-test]

* Fix HiOp spec on Ascent [ascent-rebuild].

* Update deception spack built tcl modules - [deception-test]

* Update CPU Spack build with issue for each failing build [ci skip]

* Update Ascent spack built tcl modules [ascent-test]

* Add 1.0.0 dep into CHANGELOG.

* Add ascent-skip to CI to get tests passing [ascent-test]

---------

Co-authored-by: nkoukpaizan <nkoukpaizan@users.noreply.github.com>
Co-authored-by: Cameron Rutherford <robert.rutherford@pnnl.gov>
Co-authored-by: cameronrutherford <cameronrutherford@users.noreply.github.com>
Co-authored-by: spack-auto-module <spack.bot@no-reply.com>
* Add CPU build with hiop+sparse and exago~ipopt+hiop+raja

* Update .github/workflows/spack_cpu_build.yaml

* `+mpi` to `+raja` CPU build

* Add HIOPRAJASPARSE model if sparse and raja enabled

* Fix other HIOPRAJASPARSE ifdef
…nd SCOPFLOW

to adapt tests for running on multiple MPI ranks.
…variables

errors in Valgrind and modified a few test values so that tests pass.
@cameronrutherford
Contributor

Your tests were failing because I killed GitLab CI for a bit there in #85. My bad - I have kicked off fresh pipelines and once tests pass we can merge.

Thanks again for this, Bruce - it makes our debugging much quicker.

@cameronrutherford cameronrutherford merged commit f19b82f into develop Nov 29, 2023
7 checks passed
Development

Successfully merging this pull request may close these issues.

Add parallel testing to SOPFLOW toml infrastructure
6 participants