plm/pals: change to terminate prteds handling
Something changed in PR 1907 that led to prteds not exiting
when using the pals PLM. Somehow the terminate_orteds code was corrupting
the DVM state, so the HNP never actually got the prteds
to exit.

When running regression tests on Aurora systems, this left
tens of orphaned prted and palsd daemons, which apparently leaks
SS11 Cassini resources; eventually CXI emits error messages
and does not allow more jobs in the allocation to run.

related to openpmix#1907

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
hppritcha authored and rhc54 committed Feb 7, 2024
1 parent 129571d commit 781e5d2
Showing 1 changed file with 7 additions and 10 deletions.
17 changes: 7 additions & 10 deletions src/mca/plm/pals/plm_pals_module.c
@@ -17,7 +17,7 @@
  * Copyright (c) 2017-2019 Research Organization for Information Science
  * and Technology (RIST). All rights reserved.
  * Copyright (c) 2021-2023 Nanook Consulting. All rights reserved.
- * Copyright (c) 2023 Triad National Security, LLC. All rights
+ * Copyright (c) 2023-2024 Triad National Security, LLC. All rights
  * reserved.
  * $COPYRIGHT$
  *
@@ -84,7 +84,7 @@
  */
 static int plm_pals_init(void);
 static int plm_pals_launch_job(prte_job_t *jdata);
-static int plm_pals_terminate_orteds(void);
+static int plm_pals_terminate_prteds(void);
 static int plm_pals_signal_job(pmix_nspace_t jobid, int32_t signal);
 static int plm_pals_finalize(void);

@@ -98,7 +98,7 @@ prte_plm_base_module_t prte_plm_pals_module = {
     .set_hnp_name = prte_plm_base_set_hnp_name,
     .spawn = plm_pals_launch_job,
     .terminate_job = prte_plm_base_prted_terminate_job,
-    .terminate_orteds = plm_pals_terminate_orteds,
+    .terminate_orteds = plm_pals_terminate_prteds,
     .terminate_procs = prte_plm_base_prted_kill_local_procs,
     .signal_job = plm_pals_signal_job,
     .finalize = plm_pals_finalize
@@ -429,15 +429,15 @@ static void launch_daemons(int fd, short args, void *cbdata)
 }

 /**
- * Terminate the orteds for a given job
+ * Terminate the prteds for a given job
  */
-static int plm_pals_terminate_orteds(void)
+static int plm_pals_terminate_prteds(void)
 {
     int rc;
     prte_job_t *jdata;

     PMIX_OUTPUT_VERBOSE((10, prte_plm_base_framework.framework_output,
-                         "%s plm:pals: terminating orteds", PRTE_NAME_PRINT(PRTE_PROC_MY_NAME)));
+                         "%s plm:pals: terminating prteds", PRTE_NAME_PRINT(PRTE_PROC_MY_NAME)));

     /* deregister the waitpid callback to ensure we don't make it look like
      * pals failed when it didn't. Since the pals may have already completed,
@@ -453,11 +453,8 @@ static int plm_pals_terminate_orteds(void)
         PRTE_ERROR_LOG(rc);
     }

-    jdata = prte_get_job_data_object(PRTE_PROC_MY_NAME->nspace);
-    PRTE_ACTIVATE_JOB_STATE(jdata, PRTE_JOB_STATE_DAEMONS_TERMINATED);
-
     PMIX_OUTPUT_VERBOSE((10, prte_plm_base_framework.framework_output,
-                         "%s plm:pals: terminated orteds", PRTE_NAME_PRINT(PRTE_PROC_MY_NAME)));
+                         "%s plm:pals: terminated orteds %d", PRTE_NAME_PRINT(PRTE_PROC_MY_NAME),rc));
     return rc;
 }
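
The commit message only says the old code "somehow" upset the DVM state, so the exact mechanism is not established here. As a rough illustration of the general hazard behind the removed `PRTE_ACTIVATE_JOB_STATE(jdata, PRTE_JOB_STATE_DAEMONS_TERMINATED)` call, the toy model below contrasts a terminate routine that declares the daemons terminated immediately after requesting exit with one that only sends the request and lets exit reports advance the state. Every name in it (`struct dvm`, `terminate_eager`, `terminate_deferred`, `daemon_exited`, `controller_may_stop`) is hypothetical; none of this is PRRTE code or a PRRTE API.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy model of a launcher and its daemons. Not PRRTE code. */
enum dvm_state { DVM_RUNNING, DVM_EXIT_REQUESTED, DVM_DAEMONS_TERMINATED };

struct dvm {
    enum dvm_state state;
    int daemons_alive; /* daemons that have not yet reported their exit */
};

/* The controller shuts itself down once it believes all daemons are gone. */
static bool controller_may_stop(const struct dvm *d)
{
    return d->state == DVM_DAEMONS_TERMINATED;
}

/* Called once per daemon when that daemon reports it has exited. */
static void daemon_exited(struct dvm *d)
{
    assert(d->daemons_alive > 0);
    d->daemons_alive--;
    if (d->daemons_alive == 0 && d->state == DVM_EXIT_REQUESTED) {
        d->state = DVM_DAEMONS_TERMINATED;
    }
}

/* Eager variant: request exit, then mark termination complete right away,
 * before any daemon has reported back. */
static void terminate_eager(struct dvm *d)
{
    d->state = DVM_EXIT_REQUESTED;
    d->state = DVM_DAEMONS_TERMINATED; /* premature */
}

/* Deferred variant: only request exit; daemon_exited() advances the state. */
static void terminate_deferred(struct dvm *d)
{
    d->state = DVM_EXIT_REQUESTED;
}
```

In this model, calling `terminate_eager` on a DVM with three live daemons makes `controller_may_stop` true while `daemons_alive` is still 3, which is the shape of the orphaned-daemon symptom described in the commit message. With `terminate_deferred`, the controller may stop only after `daemon_exited` has been called for every daemon.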
