Move first ES cluster state observation out of go routine #5783

jiuker · 2022-06-16T08:16:31Z

Start launches the first observation in a goroutine, so the point at which it actually runs is not deterministic. As a result, lastHealth in the Observer struct can still be unknown at the first reconciliation after an operator restart, and pkg/controller/elasticsearch/pdb/reconcile.go:196 returns 0. Restarting the operator can therefore update *v1beta1.PodDisruptionBudgetStatus MinAvailable from 1 (the correct value before the restart) to 2. Once observe has run, lastHealth is green and MinAvailable is updated from 2 back to 1. In short, restarting the operator causes *v1beta1.PodDisruptionBudgetStatus to be updated twice.

Repro steps:

1. Add fmt.Println("update!") at line 182 of pkg/controller/common/reconciler/reconciler.go.
2. Start the operator.
3. kubectl apply -f config/samples/elasticsearch/elasticsearch.yaml
4. Wait for all pods to be ready.
5. Restart the operator.

Output:

update!
...
update!

Resolves #5812.
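
To make the 1 -> 2 -> 1 sequence described above concrete, here is a minimal sketch. It assumes a two-pod cluster and a simplified rule where one disruption is allowed only when the health is known to be green. The names Health, allowedDisruptions and minAvailable are hypothetical stand-ins, not the operator's actual code in pkg/controller/elasticsearch/pdb.

```go
package main

import "fmt"

// Health is a hypothetical stand-in for the cluster health reported by the observer.
type Health string

const (
	HealthUnknown Health = "unknown"
	HealthGreen   Health = "green"
)

// allowedDisruptions is an assumed simplification: a single disruption is
// allowed only when the cluster is known to be green, otherwise none.
func allowedDisruptions(h Health) int32 {
	if h == HealthGreen {
		return 1
	}
	return 0
}

// minAvailable derives the PDB value from the pod count and the allowed disruptions.
func minAvailable(pods int32, h Health) int32 {
	return pods - allowedDisruptions(h)
}

func main() {
	const pods = 2
	// First reconciliation after a restart: the observer has not reported yet.
	fmt.Println(minAvailable(pods, HealthUnknown)) // 2
	// Next reconciliation: the first observation has completed.
	fmt.Println(minAvailable(pods, HealthGreen)) // 1
}
```

Under these assumptions the PDB is written once with 2 and once more with 1, which matches the two "update!" lines in the repro output.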

elasticmachine · 2022-06-16T08:16:34Z

Since this is a community submitted pull request, a Jenkins build has not been kicked off automatically. Can an Elastic organization member please verify the contents of this patch and then kick off a build manually?

thbkrkr

Nice catch!

If you look at runPeriodically(), the intention was to trigger a first observation before observing periodically, but that first observation runs inside the goroutine rather than before the goroutine starts. So I would rather we fix runPeriodically() / Start(). I think we can merge the two, because the indirection provided by runPeriodically() makes the flow a bit harder to read and let this bug slip in, and then move the first observation out of the goroutine, like this:

diff:

diff --git a/pkg/controller/elasticsearch/observer/observer.go b/pkg/controller/elasticsearch/observer/observer.go
index 1fb8bc8aa..31fd5d415 100644
--- a/pkg/controller/elasticsearch/observer/observer.go
+++ b/pkg/controller/elasticsearch/observer/observer.go
@@ -65,9 +65,26 @@ func NewObserver(cluster types.NamespacedName, esClient client.Client, settings
        return &observer
 }
 
-// Start the observer in a separate goroutine
+// Start starts the Observer in a separate goroutine.
+// The cluster state is observed periodically until the observer stop channel is closed.
 func (o *Observer) Start() {
-       go o.runPeriodically()
+       // initial synchronous observation
+       o.observe()
+       // periodic asynchronous observations
+       go func() {
+               ticker := time.NewTicker(o.settings.ObservationInterval)
+               defer ticker.Stop()
+
+               for {
+                       select {
+                       case <-ticker.C:
+                               o.observe()
+                       case <-o.stopChan:
+                               log.Info("Stopping observer for cluster", "namespace", o.cluster.Namespace, "es_name", o.cluster.Name)
+                               return
+                       }
+               }
+       }()
 }
 
 // Stop the observer loop
@@ -85,25 +102,6 @@ func (o *Observer) LastHealth() esv1.ElasticsearchHealth {
        return o.lastHealth
 }
 
-// runPeriodically triggers a state retrieval every tick,
-// until the given context is cancelled
-func (o *Observer) runPeriodically() {
-       o.observe()
-
-       ticker := time.NewTicker(o.settings.ObservationInterval)
-       defer ticker.Stop()
-
-       for {
-               select {
-               case <-ticker.C:
-                       o.observe()
-               case <-o.stopChan:
-                       log.Info("Stopping observer for cluster", "namespace", o.cluster.Namespace, "es_name", o.cluster.Name)
-                       return
-               }
-       }
-}
-

cla-checker-service · 2022-06-20T15:10:04Z

💚 CLA has been signed

thbkrkr · 2022-06-20T17:12:13Z

Jenkins test this please

This breaks the unit tests.

jiuker · 2022-06-21T01:39:40Z

I have made the event channel asynchronous (buffered), but I don't know whether this is consistent with the original logic. @thbkrkr
In particular, I want to make clear that this commit modifies non-test code. Please review it carefully.

pkg/controller/elasticsearch/observer/watch.go
@@ -17,8 +17,9 @@ import (
 // whose health has changed between 2 observations.
 // Aimed to be used for triggering a reconciliation.
 func WatchClusterHealthChange(m *Manager) *source.Channel {
- 	evtChan := make(chan event.GenericEvent)
+	var evtChan chan event.GenericEvent
	m.AddObservationListener(healthChangeListener(evtChan))
+ 	evtChan = make(chan event.GenericEvent, len(m.listeners))
 	return &source.Channel{
 		// Each event in Source will be consumed and turned into
 		// a reconciliation request.
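
As background for the change above, this small sketch shows the general Go behavior at play: a send on an unbuffered channel only succeeds when a receiver is already waiting, while a buffered channel accepts sends until its buffer is full. The trySend helper is hypothetical and uses a non-blocking select so the example terminates; a plain send on the unbuffered channel would block instead.

```go
package main

import "fmt"

// trySend attempts a non-blocking send and reports whether it succeeded.
func trySend(ch chan string, v string) bool {
	select {
	case ch <- v:
		return true
	default:
		return false
	}
}

func main() {
	unbuffered := make(chan string)
	buffered := make(chan string, 1)

	fmt.Println(trySend(unbuffered, "es1")) // false: no receiver is waiting
	fmt.Println(trySend(buffered, "es1"))   // true: the buffer absorbs the event
	fmt.Println(trySend(buffered, "es2"))   // false: the buffer is full
}
```

With a synchronous first observation, a listener that sends on an unbuffered channel before any receiver exists would block the caller, which is one plausible reason the unit tests started failing.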

thbkrkr · 2022-06-21T10:30:54Z

I opened #5812 to discuss it because it is more involved than I thought.

jiuker · 2022-06-22T01:02:58Z

I modified the E2E test. Our usage scenario does not match what the E2E test does, so the problem is in the test, and the program itself runs without problems. @thbkrkr
The receiver is asynchronous, so the E2E test should also receive asynchronously. You can see the details in
$GOPATH/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.1/pkg/source/source.go:273

code

// Start implements Source and should only be called by the Controller.
func (cs *Channel) Start(
	ctx context.Context,
	handler handler.EventHandler,
	queue workqueue.RateLimitingInterface,
	prct ...predicate.Predicate) error {
	// Source should have been specified by the user.
	if cs.Source == nil {
		return fmt.Errorf("must specify Channel.Source")
	}

	// stop should have been injected before Start was called
	if cs.stop == nil {
		return fmt.Errorf("must call InjectStop on Channel before calling Start")
	}

	// use default value if DestBufferSize not specified
	if cs.DestBufferSize == 0 {
		cs.DestBufferSize = defaultBufferSize
	}

	dst := make(chan event.GenericEvent, cs.DestBufferSize)

	cs.destLock.Lock()
	cs.dest = append(cs.dest, dst)
	cs.destLock.Unlock()

	cs.once.Do(func() {
		// Distribute GenericEvents to all EventHandler / Queue pairs Watching this source
		go cs.syncLoop(ctx)
	})

	go func() {
		for evt := range dst {
			shouldHandle := true
			for _, p := range prct {
				if !p.Generic(evt) {
					shouldHandle = false
					break
				}
			}

			if shouldHandle {
				handler.Generic(evt, queue)
			}
		}
	}()

	return nil
}

E2E diff

			eventsCluster2 <- cluster
 		}
 	})

+ 	doneCh := make(chan struct{})
+ 	go func() {
+ 		// events should be propagated to both listeners
+ 		<-eventsCluster1
+ 		<-eventsCluster2
+ 		<-eventsCluster1
+ 		<-eventsCluster2
+ 		doneCh <- struct{}{}
+ 	}()
 	// observe 2 clusters
 	obs1 := m.Observe(cluster1, fakeEsClient200(client.BasicAuth{}))
 	defer obs1.Stop()
 	obs2 := m.Observe(cluster2, fakeEsClient200(client.BasicAuth{}))
 	defer obs2.Stop()

- 	// events should be propagated to both listeners
- 	<-eventsCluster1
- 	<-eventsCluster2
- 	<-eventsCluster1
- 	<-eventsCluster2
+	<-doneCh
 }

 func esObject(n types.NamespacedName) esv1.Elasticsearch {
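
The test change above boils down to starting the receivers before the code that sends synchronously. A minimal sketch of that ordering, with hypothetical produce and collect helpers standing in for the observer and the test listeners:

```go
package main

import "fmt"

// produce synchronously sends n events on the given channel.
func produce(events chan<- string, n int) {
	for i := 0; i < n; i++ {
		events <- fmt.Sprintf("event-%d", i)
	}
}

// collect starts receiving in a goroutine and returns a channel that
// delivers the received events once n of them have arrived.
func collect(events <-chan string, n int) <-chan []string {
	done := make(chan []string, 1)
	go func() {
		got := make([]string, 0, n)
		for i := 0; i < n; i++ {
			got = append(got, <-events)
		}
		done <- got
	}()
	return done
}

func main() {
	events := make(chan string) // unbuffered, as in the original test
	done := collect(events, 2)  // receiver is ready first
	produce(events, 2)          // synchronous sends no longer deadlock
	fmt.Println(<-done)         // [event-0 event-1]
}
```

If produce were called before collect here, the first send would block forever, which mirrors what happens when Observe triggers a synchronous observation before the test starts reading.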

jiuker · 2022-06-23T14:48:03Z

It is possible that other test cases are affected. I should do more tests.

jiuker · 2022-06-23T15:58:19Z

I have run all the test functions in package observer. Please run the Jenkins tests again. @thbkrkr

pkg/controller/elasticsearch/observer/observer.go

pkg/controller/elasticsearch/observer/observer_test.go

pkg/controller/elasticsearch/observer/manager_test.go

jiuker · 2022-06-28T00:58:06Z

review again @barkbay

pkg/controller/elasticsearch/observer/observer.go

godoc Co-authored-by: Thibault Richard <thbkrkr@users.noreply.github.com>

jiuker · 2022-06-28T08:55:40Z

review again @thbkrkr

thbkrkr · 2022-06-28T09:09:00Z

Jenkins test this please

barkbay

LGTM

barkbay · 2022-10-10T07:17:22Z

I think this PR has actually introduced some performance issues described here: #6078

Couple of improvements on the observer:
* The first synchronous observation implemented in #5783 is no longer invoked if the Elasticsearch Service is not "ready"
* observer.Start() is invoked outside of any lock-protected section

An APM span added around observer.observe(...)
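
To illustrate why invoking Start() outside of a lock-protected section matters, here is a hedged sketch. The manager, observer, getOrCreate and Observe names below are hypothetical and only loosely modeled on the discussion; this is not the code from #6084. The lock protects only the map lookup, so a slow first observation for one cluster does not block workers handling other clusters.

```go
package main

import (
	"fmt"
	"sync"
)

// observer is a hypothetical observer whose start function may be slow.
type observer struct {
	once  sync.Once
	start func()
}

// Start runs the (possibly slow) start function at most once.
func (o *observer) Start() { o.once.Do(o.start) }

// manager is a hypothetical registry of observers keyed by cluster name.
type manager struct {
	mu        sync.Mutex
	observers map[string]*observer
}

// getOrCreate holds the lock only for the map access.
func (m *manager) getOrCreate(name string, start func()) *observer {
	m.mu.Lock()
	defer m.mu.Unlock()
	o, ok := m.observers[name]
	if !ok {
		o = &observer{start: start}
		m.observers[name] = o
	}
	return o
}

// Observe starts the observer after the lock has been released.
func (m *manager) Observe(name string, start func()) *observer {
	o := m.getOrCreate(name, start)
	o.Start() // outside of the lock-protected section
	return o
}

func main() {
	m := &manager{observers: map[string]*observer{}}
	release := make(chan struct{})
	slowDone := make(chan struct{})

	go func() {
		// This start blocks until release is closed.
		m.Observe("slow", func() { <-release })
		close(slowDone)
	}()

	// Does not have to wait for the slow cluster to finish starting.
	m.Observe("fast", func() {})
	fmt.Println("fast cluster observed")

	close(release)
	<-slowDone
	fmt.Println("slow cluster observed")
}
```

Had Start been called while holding m.mu, the "fast" call could be stuck behind the "slow" one for as long as its first observation takes.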

Move the first observation out of the goroutine to make it synchronous and avoid an unnecessary PDB update when the operator restarts and we reconcile Elasticsearch with an 'unknown' health. This occurred at the second reconciliation loop because we couldn't get the health at the first reconciliation loop due to completely asynchronous observers.

Co-authored-by: guozhi.li <guozhi.li@daocloud.io>
Co-authored-by: Thibault Richard <thbkrkr@users.noreply.github.com>

guozhi.li added 2 commits June 16, 2022 16:03

botelastic bot added the triage label Jun 16, 2022

jiuker mentioned this pull request Jun 16, 2022

How to become a member #5784

Closed

thbkrkr self-assigned this Jun 20, 2022

thbkrkr added the >bug Something isn't working label Jun 20, 2022

botelastic bot removed the triage label Jun 20, 2022

thbkrkr requested changes Jun 20, 2022

thbkrkr added the v2.4.0 label Jun 20, 2022

thbkrkr changed the title ~~Fix update twice~~ Move first ES cluster state observation out of go routine Jun 20, 2022

Move first ES cluster state observation out of go routine

5e7543a

jiuker force-pushed the fix_update_twice branch from dcd5c12 to 5e7543a Compare June 20, 2022 15:30

thbkrkr previously approved these changes Jun 20, 2022

make chan asynchronous？

85a9a26

thbkrkr mentioned this pull request Jun 21, 2022

Unnecessary PodDisruptionBudget update on operator restart #5812

Closed

elastic deleted a comment from elasticmachine Jun 21, 2022

guozhi.li added 2 commits June 22, 2022 08:49

Merge remote-tracking branch 'upstream/main' into fix_update_twice

af9ff65

make e2e fun

1170143

jiuker requested a review from thbkrkr June 23, 2022 07:24

guozhi.li added 2 commits June 23, 2022 23:52

fix test case

7f38922

Merge branch 'main' into fix_update_twice

4afb6ba

barkbay reviewed Jun 27, 2022

close chan and set godoc

57597ba

thbkrkr reviewed Jun 28, 2022

Update pkg/controller/elasticsearch/observer/observer.go

0a232a5

godoc Co-authored-by: Thibault Richard <thbkrkr@users.noreply.github.com>

thbkrkr approved these changes Jun 29, 2022

jiuker requested a review from barkbay July 4, 2022 09:54

barkbay approved these changes Jul 4, 2022

thbkrkr merged commit 8eaeef1 into elastic:main Jul 4, 2022

barkbay mentioned this pull request Oct 10, 2022

Synchronous observation side effects #6078

Closed

barkbay mentioned this pull request Oct 12, 2022

Elasticsearch observer improvements to avoid blocking between workers. #6084

Merged
