lifecycle: add poststop hook #8194
Conversation
Having tests would be great. I'm also unclear on the semantics of poststop and what use cases we want to target - when should these tasks run?
- Should they run if the main tasks fail? Cleanup tasks should probably run all the time.
- Should they run if `nomad job stop` is invoked? I think the current implementation would probably not run them.
- Should sidecars run concurrently with poststop tasks? Having sidecars run until the very end makes sense.

Also, we probably need to make the poststop semantics clear to users in our documentation. For example, if a main task succeeds but a poststop task fails (or the host dies during a poststop task), the allocation might be rescheduled and rerun.
@@ -63,15 +77,29 @@ func (c *taskHookCoordinator) setTasks(tasks []*structs.Task) {
	if !c.hasPrestartTasks() {
		c.mainTaskCtxCancel()
	}

	if !c.hasMainTasks() {
Do we want to allow jobs to not have main tasks? That seems like an invalid TaskGroup.
Also, we can close the channel if there aren't any poststart tasks - this will make taskStateUpdated basically a no-op in the common case.
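The channel-closing trick suggested here can be sketched in isolation. This is a minimal, illustrative model, not Nomad's actual coordinator API: the `coordinator` type, `newCoordinator`, and `taskStateUpdated` names below are invented for the example.

```go
package main

import "fmt"

// coordinator gates downstream tasks on a channel; closing the channel
// up front (when there is nothing to gate) makes every later state
// update a no-op.
type coordinator struct {
	poststartCh chan struct{} // closed once poststart tasks may run
}

func newCoordinator(hasPoststart bool) *coordinator {
	c := &coordinator{poststartCh: make(chan struct{})}
	if !hasPoststart {
		close(c.poststartCh) // nothing to coordinate: open the gate now
	}
	return c
}

// taskStateUpdated only does work while the gate is still open; it
// returns whether it actually did anything.
func (c *coordinator) taskStateUpdated() bool {
	select {
	case <-c.poststartCh:
		return false // gate already open: cheap no-op
	default:
		// a real coordinator would inspect task states before unblocking
		close(c.poststartCh)
		return true
	}
}

func main() {
	c := newCoordinator(false)
	fmt.Println(c.taskStateUpdated()) // false: channel pre-closed, update is a no-op
}
```

Closing a channel is the idiomatic Go broadcast: every current and future receiver unblocks at once, and re-checking a closed channel costs almost nothing.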
for task := range c.poststopTasks {
	st := states[task]
	if st == nil || !st.Successful() {
		continue
	}

	delete(c.poststopTasks, task)
}
Do we need to track poststop tasks? We don't depend on their state currently, as nothing is blocked by them.
for _, task := range ar.tasks {
	if task.IsPoststopTask() {
		<-task.WaitCh()
	}
This logic is responsible for starting poststop tasks after the other tasks have finished.
This covers the `nomad job stop` case:
- wait until existing tasks have finished running
- start poststop tasks
- wait until poststop tasks finish
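The three steps above can be sketched as a self-contained program. The `task` type and `runAll` function are stand-ins invented for this example, not Nomad's real task runner; the only property borrowed from the source is a WaitCh that closes when a task exits.

```go
package main

import (
	"fmt"
	"sync"
)

// task is a minimal stand-in for a task runner whose WaitCh closes on exit.
type task struct {
	name     string
	poststop bool
	waitCh   chan struct{}
}

// runAll mirrors the stop sequence: run and wait for the non-poststop
// tasks, then start the poststop tasks and block on each WaitCh.
// It returns the completion order.
func runAll(tasks []*task) []string {
	var mu sync.Mutex
	var order []string
	finish := func(t *task) {
		mu.Lock()
		order = append(order, t.name)
		mu.Unlock()
		close(t.waitCh) // WaitCh closes when the task exits
	}

	var wg sync.WaitGroup
	for _, t := range tasks {
		if !t.poststop {
			wg.Add(1)
			go func(t *task) { defer wg.Done(); finish(t) }(t)
		}
	}
	wg.Wait() // wait until existing tasks have finished running

	for _, t := range tasks {
		if t.poststop {
			go finish(t) // start poststop task
			<-t.waitCh   // wait until it finishes
		}
	}
	return order
}

func main() {
	tasks := []*task{
		{name: "main", waitCh: make(chan struct{})},
		{name: "cleanup", poststop: true, waitCh: make(chan struct{})},
	}
	fmt.Println(runAll(tasks)) // cleanup always completes after main
}
```

The ordering guarantee comes entirely from `wg.Wait()` sitting between the two loops: no poststop task is even started until every non-poststop task has closed its WaitCh.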
How does this compare to having the coordinator tracking main task completion? The coordinator logic now already tracks task completions (so main tasks start after pre-start) - can we reuse the same state machine rather than add the logic here?
This was discussed offline, and the goal is to do a refactor in another pull request that moves the poststop-specific code out of the alloc runner and back into the task hook coordinator. It was avoided this time around because I didn't know how to block the alloc runner from exiting during a service job stop while still signaling the poststop tasks to start.
if tr.IsPoststopTask() {
	continue
}
Poststop tasks have special kill behavior: i.e. we don't want to kill them if we receive a kill signal from `nomad job stop`. We want to wait until everything else has been killed, then we want to run the poststop tasks.
Skipping the rest of this loop essentially says: don't track poststop tasks along with the main group of tasks (in the set of liveRunners); that way we won't log them later as being killed when there is a kill event.
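A toy illustration of that filtering. The `runner` type and `liveRunners` helper here are invented stand-ins, not the real alloc runner; only the `IsPoststopTask()` method name is taken from the source.

```go
package main

import "fmt"

// runner is a minimal stand-in for a task runner.
type runner struct {
	name     string
	poststop bool
}

func (r *runner) IsPoststopTask() bool { return r.poststop }

// liveRunners builds the set of tasks tracked for kill events.
// Poststop tasks are skipped, so a later kill event neither
// signals nor logs them.
func liveRunners(all []*runner) []*runner {
	var live []*runner
	for _, tr := range all {
		if tr.IsPoststopTask() {
			continue // not tracked with the main group of tasks
		}
		live = append(live, tr)
	}
	return live
}

func main() {
	all := []*runner{{name: "web"}, {name: "cleanup", poststop: true}}
	for _, tr := range liveRunners(all) {
		fmt.Println(tr.name) // only "web": cleanup is excluded from kill handling
	}
}
```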
@@ -574,7 +593,8 @@ func (ar *allocRunner) killTasks() map[string]*structs.TaskState {
	// Kill the rest concurrently
	wg := sync.WaitGroup{}
	for name, tr := range ar.tasks {
		// Filter out poststop tasks so they run after all the other tasks are killed
		if tr.IsLeader() || tr.IsPoststopTask() {
This is the line that actually prevents poststop tasks from being killed in a kill event.
The checks for IsPoststopTask() spread throughout the code, especially when they're part of a multi-clause boolean check, strike me as a bit low-level and specific to this task type, vs. capturing a general property of the tasks.
Would it be reasonable to explicitly define interface methods, e.g. tr.isReadyToStart() and tr.isReadyToKill() (or similar), that hide the property checks behind the task runner interface? (OTOH, this could easily be future-proofing something we don't care about yet, or run contrary to our standard practices around expanding interface footprint.)
	delete(c.mainTasksRunning, task)
}
If a main task is dead, remove it from the set (when all tasks are removed from the set, poststop tasks may proceed with execution).
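The set-draining behavior can be sketched like this. The `mainTasksRunning` field name approximates the real coordinator's; the `poststopStarted` flag stands in for the `poststopTaskCtxCancel()` call, and `taskDied` is an invented helper.

```go
package main

import "fmt"

// coordinator tracks which main tasks are still running; once the set
// drains to empty, the poststop gate opens.
type coordinator struct {
	mainTasksRunning map[string]struct{}
	poststopStarted  bool // stands in for cancelling the poststop context
}

// taskDied removes a dead main task from the set and, when the set is
// empty, releases the poststop tasks.
func (c *coordinator) taskDied(name string) {
	delete(c.mainTasksRunning, name)
	if len(c.mainTasksRunning) == 0 {
		c.poststopStarted = true
	}
}

func main() {
	c := &coordinator{mainTasksRunning: map[string]struct{}{"a": {}, "b": {}}}
	c.taskDied("a")
	fmt.Println(c.poststopStarted) // false: "b" is still running
	c.taskDied("b")
	fmt.Println(c.poststopStarted) // true: set drained, poststop may start
}
```

Note that `delete` on a key that was already removed is a harmless no-op in Go, so duplicate state updates for the same task don't need special handling here.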
}

func (c *taskHookCoordinator) StartPoststopTasks() {
	c.poststopTaskCtxCancel()
Needs a comment: this is a helper function for starting poststop tasks outside of the handleTaskStateUpdate() infinite loop.
I don't feel qualified to verify the correctness of this approach, though the broad shape seems fine. My comments are mostly around naming + interface conventions, and a bit about testing and clocks.
It does seem like a test at least documenting the current behavior when a poststop task gets explicitly killed would be useful. Similarly, the current e2e tests only appear to check poststop task behavior for batch jobs. Is it our intent to only support hooking into batch jobs, or could we add some additional tests for what happens with other job types? (NMD-017 doesn't appear to present a spec or even opinion on this, so I'm open to whatever answer makes sense right now, as long as we have some documentation and tests to explain it.)
@@ -523,6 +541,7 @@ func (ar *allocRunner) handleTaskStateUpdates() {
	// prevent looping before TaskRunners have transitioned
	// to Dead.
	for _, tr := range liveRunners {
		ar.logger.Info("killing task: ", tr.Task().Name)
Should this actually happen at the info level? My expectation is that those messages will actually hit the console in a default config, and adding a new sort of tracing/debug message here (w/o putting it at debug log level) could create log noise that most operators won't know what to do with.
ephemeralTask := alloc.Job.TaskGroups[0].Tasks[1]
ephemeralTask.Name = "quit"
ephemeralTask.Lifecycle.Hook = structs.TaskLifecycleHookPoststop
ephemeralTask.Config["run_for"] = "10s"
Are these numbers wall-clock time? (I.e., will this test take a minimum of 10 seconds to run?) It may not be a big deal in isolation, but actual blocking sleep calls scattered throughout a test suite can pile up and make tests slooooowwww.
A couple of questions - we can follow up in sync.
// TestAllocRunner_Lifecycle_Poststop asserts that a service job with 1
// poststop lifecycle hook starts all 3 tasks, only
// the ephemeral one finishes, and the other 2 exit when the alloc is stopped.
func TestAllocRunner_Lifecycle_Poststop(t *testing.T) {
I would suggest changing the name of this test - since it tests stop behavior - to TestAllocRunner_Lifecycle_Poststop_IfStopped?
Also, we should add a test for when main tasks complete naturally, for both batch and service jobs.
Also, should add a test for when
Let's merge this to be included in the beta release. We will follow up with additional tests and refactoring.
For #8193

Overview

A `poststop` hook has been added to the `lifecycle` stanza to allow users to run tasks after all other tasks in an allocation have completed.

Behavior

A `poststop` task will run after the main tasks have finished running. It will run no matter the final status of the main tasks, whether they completed, failed, or were killed.
- `poststop` tasks wait until the main tasks have finished before starting
- poststop tasks will only run after the allocation receives a kill signal or a main task fails

Future Development

I plan to make these improvements in separate pull requests