Stop Heartbeat monitor jobs on cancelation #20570

jsoriano · 2020-08-12T11:26:50Z

If a monitor is stopped, for example when using autodiscover, its
scheduled tasks should be stopped too. The scheduler was rescheduling tasks
forever once started, though these tasks were not being executed because
they are also aware of the context.

This change avoids the execution and rescheduling of tasks once their job
context is done.

Checklist

My code follows the style guidelines of this project
~~I have commented my code, particularly in hard-to-understand areas~~
~~I have made corresponding changes to the documentation~~
~~I have made corresponding change to the default configuration files~~
~~I have added tests that prove my fix is effective or that my feature works~~
I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

How to test this PR locally

Start heartbeat with autodiscover enabled and -d scheduler.
Start some container/pod.
Wait for the monitor to be configured and executed.
Stop the container/pod.
Messages about job execution like the following should eventually stop appearing:

2020-08-12T13:13:43.705+0200	DEBUG	[scheduler]	scheduler/scheduler.go:201	Job 'auto-http-0X2C5537D51C1B9524' returned at 2020-08-12 13:13:43.70518639 +0200 CEST m=+66.010433861

Related issues

Closes Autodiscover based on services doesn't stop monitors when service is deleted #20544

elasticmachine · 2020-08-12T11:26:52Z

Pinging @elastic/uptime (Team:Uptime)


elasticmachine · 2020-08-12T12:19:23Z

💚 Build Succeeded


Build stats

Build Cause: [Pull request #20570 updated]
Start Time: 2020-08-12T11:29:50.990+0000
Duration: 49 min 28 sec

Test stats 🧪

Test	Results
Failed	0
Passed	1200
Skipped	28
Total	1228

andrewvc · 2020-08-13T01:25:00Z

Really interesting and good find @jsoriano ! I tried to add a test here, but I couldn't, because there was no externally visible thing to test. We are however missing cancellation tests. Would you mind adding this one below to this patch? This should be added to scheduler_test.go.

I guess it's kind of weird to add a test that doesn't really test the patch, but given that there's no good test to write, I think this will have to do. I suppose we could add more insight into the scheduler internals, but that feels like a weird design choice.

func TestCancellingJobs(t *testing.T) {
	s := NewWithLocation(10, monitoring.NewRegistry(), tarawaTime())

	require.NoError(t, s.Start())

	// Mutex to guard removeFn
	taskInitMtx := sync.Mutex{}
	// Let the job run once, then cancel it immediately
	taskInitMtx.Lock()
	var removeFn func()
	timesRan := batomic.MakeInt(0)
	removeFn, err := s.Add(testSchedule{delay: 0}, "testCancel", func(ctx context.Context) []TaskFunc {
		timesRan.Inc()
		taskInitMtx.Lock()
		removeFn()
		taskInitMtx.Unlock()
		return nil
	})
	require.NoError(t, err)
	taskInitMtx.Unlock()

	// It's hard to tell if the job still exists since
	// we just recursively requeue them, but we should know after a second
	time.Sleep(time.Second)
	require.Equal(t, 1, timesRan.Load())

	require.NoError(t, s.Stop())
}

jsoriano · 2020-08-13T09:39:19Z

@andrewvc yeah, I also couldn't find a way to test this. As you said, we would need to expose scheduler or timer queue internals, which can be weird. I thought one way to do it could be to expose the length of the timer queue, but this is not reliable because the task is not in the queue while it is being executed.

Regarding the test for cancelation, wdyt about discussing it in a separate PR? I think it is always complicated to automatically test for things that are not expected to happen. In this case I am concerned about the sleep; I would prefer not having to add it.

andrewvc

LGTM. We can discuss the added test in a separate PR


jsoriano added the review, needs_backport, Team:obs-ds-hosted-services, and v7.10.0 labels Aug 12, 2020

jsoriano requested a review from andrewvc August 12, 2020 11:26

jsoriano requested a review from a team as a code owner August 12, 2020 11:26

jsoriano self-assigned this Aug 12, 2020

botelastic bot added and removed the needs_team label Aug 12, 2020

jsoriano force-pushed the heartbeat-stop-job branch from 75b3e62 to cab5285 Compare August 12, 2020 11:28

andrewvc approved these changes Aug 13, 2020


jsoriano merged commit a6d98d6 into elastic:master Aug 13, 2020

jsoriano deleted the heartbeat-stop-job branch August 13, 2020 13:58

jsoriano mentioned this pull request Aug 13, 2020

Cherry-pick #20570 to 7.x: Stop Heartbeat monitor jobs on cancelation #20587

Merged

6 tasks

jsoriano removed the needs_backport label Aug 13, 2020

jsoriano mentioned this pull request Aug 13, 2020

Cherry-pick #20570 to 7.9: Stop Heartbeat monitor jobs on cancelation #20588

Merged

6 tasks

jsoriano added the v7.9.0 label Aug 13, 2020
