Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move first ES cluster state observation out of go routine #5783

Merged
merged 10 commits into from
Jul 4, 2022

Conversation

jiuker
Copy link
Contributor

@jiuker jiuker commented Jun 16, 2022

There is a process to execute in Start, but the execution point is uncertain. As a result, the lastHealth in the observer structure is unknown, and pkg/controller/elasticsearch/pdb/reconcile.go:196 returns 0. The probability of restarting the operator will cause *v1beta1.PodDisruptionBudgetStatus MinAvailable to be updated from 1 (it is the correct 1 before operator restart) to 2. After the observe is executed, the lastHealth is green, and the MinAvailable is updated from 2 to 1. Therefore, restart the operator *v1beta1.PodDisruptionBudgetStatus is updated twice.

Repro steps:

  1. Add fmt.Println("update!") line 182 of pkg/controller/common/reconciler/reconciler.go`:
  2. start operator
  3. kubectl apply -f config/samples/elasticsearch/elasticsearch.yaml。
  4. wait all pod ready
  5. restart operator
  6. output:
update!
...
update!

Resolves #5812.

guozhi.li added 2 commits June 16, 2022 16:03
…certain. As a result, the lasthealth in the observer structure is unknown, and pkg/controller/elasticsearch/pdb/reconcile.go:196 returns 0. The probability of restarting the operator will cause *v1beta1.PodDisruptionBudgetStatus minavailable is updated from 1 (if it is the correct 1 before operator is restarted) to 2. After the observe is executed, the lasthealth is green, and the minavailable is updated from 2 to 1. Therefore, restart the operator *v1beta1.PodDisruptionBudgetStatus is updated twice.
…certain. As a result, the lasthealth in the observer structure is unknown, and pkg/controller/elasticsearch/pdb/reconcile.go:196 returns 0. The probability of restarting the operator will cause *v1beta1.PodDisruptionBudgetStatus minavailable is updated from 1 (if it is the correct 1 before operator is restarted) to 2. After the observe is executed, the lasthealth is green, and the minavailable is updated from 2 to 1. Therefore, restart the operator *v1beta1.PodDisruptionBudgetStatus is updated twice.
@elasticmachine
Copy link

Since this is a community submitted pull request, a Jenkins build has not been kicked off automatically. Can an Elastic organization member please verify the contents of this patch and then kick off a build manually?

@botelastic botelastic bot added the triage label Jun 16, 2022
@jiuker jiuker mentioned this pull request Jun 16, 2022
@thbkrkr thbkrkr self-assigned this Jun 20, 2022
@thbkrkr thbkrkr added the >bug Something isn't working label Jun 20, 2022
@botelastic botelastic bot removed the triage label Jun 20, 2022
Copy link
Contributor

@thbkrkr thbkrkr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch!

If you look at runPeriodically(), there was an intention to trigger a first observation before doing it periodically but it has been done before the go routine starts. So I would rather we fix runPeriodically() / Start(). I think we can merge the two because the indirection provided by runPeriodically() makes reading the flow a bit harder and let us make this bug exist, and then move the first observation out of the go routine, like this:

diff:
diff --git a/pkg/controller/elasticsearch/observer/observer.go b/pkg/controller/elasticsearch/observer/observer.go
index 1fb8bc8aa..31fd5d415 100644
--- a/pkg/controller/elasticsearch/observer/observer.go
+++ b/pkg/controller/elasticsearch/observer/observer.go
@@ -65,9 +65,26 @@ func NewObserver(cluster types.NamespacedName, esClient client.Client, settings
        return &observer
 }
 
-// Start the observer in a separate goroutine
+// Start starts the Observer in a separate goroutine.
+// The cluster state is observed periodically until the observer stop channel is closed.
 func (o *Observer) Start() {
-       go o.runPeriodically()
+       // initial synchronous observation
+       o.observe()
+       // periodic asynchronous observations
+       go func() {
+               ticker := time.NewTicker(o.settings.ObservationInterval)
+               defer ticker.Stop()
+
+               for {
+                       select {
+                       case <-ticker.C:
+                               o.observe()
+                       case <-o.stopChan:
+                               log.Info("Stopping observer for cluster", "namespace", o.cluster.Namespace, "es_name", o.cluster.Name)
+                               return
+                       }
+               }
+       }()
 }
 
 // Stop the observer loop
@@ -85,25 +102,6 @@ func (o *Observer) LastHealth() esv1.ElasticsearchHealth {
        return o.lastHealth
 }
 
-// runPeriodically triggers a state retrieval every tick,
-// until the given context is cancelled
-func (o *Observer) runPeriodically() {
-       o.observe()
-
-       ticker := time.NewTicker(o.settings.ObservationInterval)
-       defer ticker.Stop()
-
-       for {
-               select {
-               case <-ticker.C:
-                       o.observe()
-               case <-o.stopChan:
-                       log.Info("Stopping observer for cluster", "namespace", o.cluster.Namespace, "es_name", o.cluster.Name)
-                       return
-               }
-       }
-}
-

@thbkrkr thbkrkr added the v2.4.0 label Jun 20, 2022
@thbkrkr thbkrkr changed the title Fix update twice Move first ES cluster state observation out of go routine Jun 20, 2022
@cla-checker-service
Copy link

cla-checker-service bot commented Jun 20, 2022

💚 CLA has been signed

@jiuker jiuker force-pushed the fix_update_twice branch from dcd5c12 to 5e7543a Compare June 20, 2022 15:30
@jiuker

This comment was marked as resolved.

@thbkrkr
Copy link
Contributor

thbkrkr commented Jun 20, 2022

Jenkins test this please

thbkrkr
thbkrkr previously approved these changes Jun 20, 2022
@thbkrkr thbkrkr dismissed their stale review June 20, 2022 18:03

This breaks the unit tests.

@jiuker
Copy link
Contributor Author

jiuker commented Jun 21, 2022

I have make event chan asynchronous。But I don't know if it's consistent with the original logic。 @thbkrkr
In particular, I want to make it clear that this submission has modified the non test code. Please confirm carefully.

pkg/controller/elasticsearch/observer/watch.go
@@ -17,8 +17,9 @@ import (
 // whose health has changed between 2 observations.
 // Aimed to be used for triggering a reconciliation.
 func WatchClusterHealthChange(m *Manager) *source.Channel {
- 	evtChan := make(chan event.GenericEvent)
+	var evtChan chan event.GenericEvent
	m.AddObservationListener(healthChangeListener(evtChan))
+ 	evtChan = make(chan event.GenericEvent, len(m.listeners))
 	return &source.Channel{
 		// Each event in Source will be consumed and turned into
 		// a reconciliation request.

@thbkrkr
Copy link
Contributor

thbkrkr commented Jun 21, 2022

I opened #5812 to discuss it because it is more involved than I thought.

@elastic elastic deleted a comment from elasticmachine Jun 21, 2022
@elastic elastic deleted a comment from elasticmachine Jun 21, 2022
@elastic elastic deleted a comment from elasticmachine Jun 21, 2022
@jiuker
Copy link
Contributor Author

jiuker commented Jun 22, 2022

I modified the E2E. Becase our usage scenarios do not conform to the E2E. So there is a problem with the E2E test, and the program runs without problem. @thbkrkr
The receiver is asynchronous. Therefore, the E2E test should also be asynchronous reception. You can see it in detail
$GOPATH/go/pkg/mod/sigs. k8s. io/controller- runtime@v0.12.1/pkg/source/source. go:273

code
	// Start implements Source and should only be called by the Controller.
func (cs *Channel) Start(
	ctx context.Context,
	handler handler.EventHandler,
	queue workqueue.RateLimitingInterface,
	prct ...predicate.Predicate) error {
	// Source should have been specified by the user.
	if cs.Source == nil {
		return fmt.Errorf("must specify Channel.Source")
	}

	// stop should have been injected before Start was called
	if cs.stop == nil {
		return fmt.Errorf("must call InjectStop on Channel before calling Start")
	}

	// use default value if DestBufferSize not specified
	if cs.DestBufferSize == 0 {
		cs.DestBufferSize = defaultBufferSize
	}

	dst := make(chan event.GenericEvent, cs.DestBufferSize)

	cs.destLock.Lock()
	cs.dest = append(cs.dest, dst)
	cs.destLock.Unlock()

	cs.once.Do(func() {
		// Distribute GenericEvents to all EventHandler / Queue pairs Watching this source
		go cs.syncLoop(ctx)
	})

	go func() {
		for evt := range dst {
			shouldHandle := true
			for _, p := range prct {
				if !p.Generic(evt) {
					shouldHandle = false
					break
				}
			}

			if shouldHandle {
				handler.Generic(evt, queue)
			}
		}
	}()

	return nil
}
E2E diff
			eventsCluster2 <- cluster
 		}
 	})

+ 	doneCh := make(chan struct{})
+ 	go func() {
+ 		// events should be propagated to both listeners
+ 		<-eventsCluster1
+ 		<-eventsCluster2
+ 		<-eventsCluster1
+ 		<-eventsCluster2
+ 		doneCh <- struct{}{}
+ 	}()
 	// observe 2 clusters
 	obs1 := m.Observe(cluster1, fakeEsClient200(client.BasicAuth{}))
 	defer obs1.Stop()
 	obs2 := m.Observe(cluster2, fakeEsClient200(client.BasicAuth{}))
 	defer obs2.Stop()

- 	// events should be propagated to both listeners
- 	<-eventsCluster1
- 	<-eventsCluster2
- 	<-eventsCluster1
- 	<-eventsCluster2
+	<-doneCh
 }

 func esObject(n types.NamespacedName) esv1.Elasticsearch {

@jiuker jiuker requested a review from thbkrkr June 23, 2022 07:24
@jiuker
Copy link
Contributor Author

jiuker commented Jun 23, 2022

It is possible that other test cases are affected. I should do more tests.

@jiuker
Copy link
Contributor Author

jiuker commented Jun 23, 2022

I have tested package observer all test Fun。Make jenkins test again。 @thbkrkr

@jiuker
Copy link
Contributor Author

jiuker commented Jun 28, 2022

review again @barkbay

godoc

Co-authored-by: Thibault Richard <thbkrkr@users.noreply.github.com>
@jiuker
Copy link
Contributor Author

jiuker commented Jun 28, 2022

review again @thbkrkr

@thbkrkr
Copy link
Contributor

thbkrkr commented Jun 28, 2022

Jenkins test this please

@jiuker jiuker requested a review from barkbay July 4, 2022 09:54
Copy link
Contributor

@barkbay barkbay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@thbkrkr thbkrkr merged commit 8eaeef1 into elastic:main Jul 4, 2022
@barkbay
Copy link
Contributor

barkbay commented Oct 10, 2022

I think this PR has actually introduced some performance issues described here: #6078

barkbay added a commit that referenced this pull request Oct 14, 2022
Couple of improvements on the observer:
* The first synchronous observation implemented in #5783 is no longer invoked if the Elasticsearch Service is not "ready"
* observer.Start() is invoked outside of any lock-protected section
An APM span added around observer.observe(...)
fantapsody pushed a commit to fantapsody/cloud-on-k8s that referenced this pull request Feb 7, 2023
Move the first observation out of the goroutine to make it synchronous to avoid
an unnecessary PDB update caused when the operator restarts and we reconcile
Elasticsearch with an 'unknown' health. This occured at the second reconciliation loop
because we couldn't get the health at the first reconcile loop due to completely asynchronous
observers.

Co-authored-by: guozhi.li <guozhi.li@daocloud.io>
Co-authored-by: Thibault Richard <thbkrkr@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>bug Something isn't working v2.4.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Unnecessary PodDisruptionBudget update on operator restart
4 participants