cluster: make sure we don't miss the first pushPull #1456

iksaif · 2018-07-04T14:11:25Z

During the join, memberlist initiates a pushPull to get initial data.
Unfortunately, at this point the nflog and silence listeners have not
been registered yet, so the first data arrives only after one pushPull
cycle (1 min by default!).
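
To illustrate the ordering problem described above, here is a minimal sketch. It does not use the real memberlist or Alertmanager APIs: `toyPeer`, `AddState`, `Join`, and the state keys `nflog` and `silences` are hypothetical names chosen for the example. The assumption is only that state arriving for a key with no registered listener is dropped.

```go
package main

import "fmt"

// toyPeer is a hypothetical stand-in for a gossip peer. It is not the
// Alertmanager cluster.Peer type.
type toyPeer struct {
	listeners map[string]func([]byte)
}

func newToyPeer() *toyPeer {
	return &toyPeer{listeners: map[string]func([]byte){}}
}

// AddState registers a listener for one state key.
func (p *toyPeer) AddState(key string, merge func([]byte)) {
	p.listeners[key] = merge
}

// Join simulates the initial push/pull: remote state is delivered exactly
// once, and parts whose key has no listener yet are dropped.
func (p *toyPeer) Join(remote map[string][]byte) (dropped int) {
	for key, data := range remote {
		merge, ok := p.listeners[key]
		if !ok {
			dropped++
			continue
		}
		merge(data)
	}
	return dropped
}

// receivedAfterJoin returns how many state parts were merged, depending on
// whether the listeners are registered before or after the join.
func receivedAfterJoin(registerFirst bool) int {
	remote := map[string][]byte{
		"nflog":    []byte("a"), // hypothetical key
		"silences": []byte("b"), // hypothetical key
	}
	p := newToyPeer()
	received := 0
	register := func() {
		for key := range remote {
			p.AddState(key, func([]byte) { received++ })
		}
	}
	if registerFirst {
		register()
		p.Join(remote)
	} else {
		p.Join(remote)
		register()
	}
	return received
}

func main() {
	fmt.Println("join, then register:", receivedAfterJoin(false))
	fmt.Println("register, then join:", receivedAfterJoin(true))
}
```

In this toy model, joining first loses the initial state until the next push/pull cycle, while registering first receives it immediately.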

clems4ever · 2018-07-04T15:14:08Z

LGTM

mxinden · 2018-07-05T12:19:11Z

cmd/alertmanager/main.go

@@ -206,7 +204,7 @@ func main() {
 			cancel()
 			peer.Leave(10 * time.Second)
 		}()
-		go peer.Settle(ctx, *gossipInterval*10)
+		go peer.Settle(ctx, *pushPullInterval)


If I understand the Settle() function correctly, it initially waits for interval before starting to check whether the cluster is settled. By increasing this interval (gossipInterval -> pushPullInterval), marking a cluster as settled is delayed for every setup, even though it might already be settled.

Why not go back to a low interval and move peer.Settle below the peer.Join logic?

Yes, that would work too; I just thought pushPullInterval made more sense here (but you are right, it might be a bit too long).

I'll be happy to change to whatever you think makes more sense.

I would also expect peer.Settle() to be called after peer.Join().

Moved, and changed the interval value back.
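
The concern about the interval in this thread can be sketched as follows. This is a hypothetical simplification and not Alertmanager's `Peer.Settle`: the `settle` function, its `stablePolls` parameter, and the `members` callback are assumptions made for the example. The only behavior taken from the discussion is that the function waits for one interval before each check, so a larger interval delays the "settled" signal even for a cluster that is already stable.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// settle is a hypothetical simplification, not Alertmanager's Peer.Settle.
// It waits for interval before every check and returns once the member
// count has been unchanged for stablePolls consecutive checks.
func settle(ctx context.Context, interval time.Duration, stablePolls int, members func() int) (polls int, ok bool) {
	last, stable := -1, 0
	for stable < stablePolls {
		select {
		case <-ctx.Done():
			return polls, false
		case <-time.After(interval): // the wait comes first, even if already settled
		}
		polls++
		if n := members(); n == last {
			stable++
		} else {
			last, stable = n, 0
		}
	}
	return polls, true
}

// demoPolls runs settle against a cluster whose size never changes.
func demoPolls() (int, bool) {
	return settle(context.Background(), time.Millisecond, 3, func() int { return 3 })
}

func main() {
	polls, ok := demoPolls()
	// The total wait is polls * interval, so it grows linearly with interval.
	fmt.Println("polls:", polls, "settled:", ok)
}
```

Under these assumptions, an already stable cluster still needs four polls before it is reported as settled, which is why a short interval plus calling Settle after Join keeps startup fast.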

simonpasquier

A few comments but looks great overall.

simonpasquier · 2018-07-05T12:29:37Z

cmd/alertmanager/main.go

+			*peerReconnectTimeout,
+		)
+		if err != nil {
+			level.Error(logger).Log("msg", "Unable to initialize gossip mesh", "err", err)


s/initialize/join/

simonpasquier · 2018-07-05T12:30:00Z

cmd/alertmanager/main.go

@@ -263,6 +261,18 @@ func main() {
 		wg.Wait()
 	}()

+	// Peer state listener have been registered, now we can join and get the initial state.


s/listener/listeners/

simonpasquier · 2018-07-05T12:32:12Z

cluster/delegate.go

@@ -135,6 +135,7 @@ func (d *delegate) NotifyMsg(b []byte) {
 		level.Warn(d.logger).Log("msg", "decode broadcast", "err", err)
 		return
 	}
+	level.Debug(d.logger).Log("received", "NotifyMsg", "len", len(b), "key", p.Key)


I'm not sure the extra logging is required.

simonpasquier · 2018-07-05T12:32:41Z

cluster/delegate.go

 	for _, p := range fs.Parts {
 		s, ok := d.states[p.Key]
 		if !ok {
+			level.Debug(d.logger).Log("received", "unknown state key", "len", len(buf), "key", p.Key)


Ditto, or it should be Warn().

simonpasquier · 2018-07-05T12:34:16Z

cmd/alertmanager/main.go

@@ -206,7 +204,7 @@ func main() {
 			cancel()
 			peer.Leave(10 * time.Second)
 		}()
-		go peer.Settle(ctx, *gossipInterval*10)
+		go peer.Settle(ctx, *pushPullInterval)


I would also expect peer.Settle() to be called after peer.Join().

Signed-off-by: Corentin Chary <c.chary@criteo.com>

mxinden · 2018-07-05T12:50:57Z

Closes #1457

mxinden

This looks good to me. Thanks for the quick adjustments.

Leaving last call to @stuartnelson3 and @simonpasquier.

mxinden · 2018-07-09T09:16:30Z

@iksaif Thanks for your help!

Alertmanager exits with a non-zero exit code if the initial cluster join fails. This behavior may be undesirable because:

- As Alertmanager is a critical component with an at-least-once guarantee, failing on joining the cluster is unnecessary, since Alertmanager still functions by itself.
- In an environment like Kubernetes that discovers peers via DNS, peers might roll out one by one, leaving the DNS entries unpopulated for the first peer of a set. Failing on the initial join prevents a roll-out.

Instead of failing on the initial join, this patch only logs the failure. The cluster can be joined later via `handleReconnect`.

This is a regression introduced in PR prometheus#1456 [1].

[1] prometheus#1456

Signed-off-by: Max Leonard Inden <IndenML@gmail.com>
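
The log-instead-of-exit behavior described in this commit message could look roughly like the sketch below. It is an illustration only: `flakyPeer`, `startup`, and the retry loop are hypothetical and stand in for the real peer and its `handleReconnect` logic, whose actual implementation is not shown in this thread.

```go
package main

import (
	"errors"
	"fmt"
)

// flakyPeer is a hypothetical peer whose first failuresLeft joins fail,
// e.g. because DNS entries for the other peers are not populated yet.
type flakyPeer struct {
	failuresLeft int
}

func (p *flakyPeer) Join() error {
	if p.failuresLeft > 0 {
		p.failuresLeft--
		return errors.New("no peers resolved yet")
	}
	return nil
}

// startup logs a failed initial join instead of exiting. The loop is a
// stand-in for a periodic reconnect mechanism that retries later.
func startup(p *flakyPeer, reconnects int) (running, joined bool, logs []string) {
	if err := p.Join(); err != nil {
		logs = append(logs, fmt.Sprintf("unable to join gossip mesh: %v", err))
	} else {
		joined = true
	}
	for i := 0; i < reconnects && !joined; i++ {
		joined = p.Join() == nil
	}
	// The process keeps running whether or not the join succeeded.
	return true, joined, logs
}

// demoStartup runs startup and reports how many failures were logged.
func demoStartup(failures, reconnects int) (running, joined bool, logged int) {
	r, j, logs := startup(&flakyPeer{failuresLeft: failures}, reconnects)
	return r, j, len(logs)
}

func main() {
	running, joined, logged := demoStartup(1, 3)
	fmt.Println("running:", running, "joined:", joined, "logged failures:", logged)
}
```

The design point is that a single instance stays useful on its own, so the first peer of a rolling deployment can start before any other peer is resolvable.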

iksaif force-pushed the master branch 3 times, most recently from cee9d3e to 58abd4c Compare July 4, 2018 14:26

iksaif mentioned this pull request Jul 5, 2018

gossip takes one pushPull interval to settle #1457

Closed

stuartnelson3 mentioned this pull request Jul 5, 2018

config: fix regression with Pager Duty #1455

Merged

mxinden reviewed Jul 5, 2018

simonpasquier requested changes Jul 5, 2018

iksaif force-pushed the master branch from 58abd4c to a28a413 Compare July 5, 2018 12:50

cluster: move Settle() after join and fix logging

5465c66

Signed-off-by: Corentin Chary <c.chary@criteo.com>

iksaif force-pushed the master branch from a28a413 to 5465c66 Compare July 5, 2018 12:50

mxinden approved these changes Jul 5, 2018

stuartnelson3 approved these changes Jul 6, 2018

simonpasquier approved these changes Jul 9, 2018

mxinden merged commit 42ea9a5 into prometheus:master Jul 9, 2018

mxinden mentioned this pull request Jul 10, 2018

*: Cut 0.15.1 #1462

Merged

mxinden mentioned this pull request Jul 11, 2018

cluster: Do not exit when failing to join cluster #1465

Merged

simonpasquier mentioned this pull request Jul 24, 2018

AlertManager times out when waiting for cluster being settled #1477

Closed
