Add cluster peers DNS refresh job #1428

Merged: 6 commits, Nov 23, 2018
Changes from 1 commit
76 changes: 70 additions & 6 deletions cluster/cluster.go
@@ -47,8 +47,13 @@ type Peer struct {
peers map[string]peer
failedPeers []peer

knownPeers []string
advertiseAddr string

failedReconnectionsCounter prometheus.Counter
reconnectionsCounter prometheus.Counter
failedRefreshCounter prometheus.Counter
refreshCounter prometheus.Counter
peerLeaveCounter prometheus.Counter
peerUpdateCounter prometheus.Counter
peerJoinCounter prometheus.Counter
@@ -95,6 +100,7 @@ const (
DefaultProbeInterval = 1 * time.Second
DefaultReconnectInterval = 10 * time.Second
DefaultReconnectTimeout = 6 * time.Hour
DefaultRefreshInterval = 30 * time.Second
Review comment (Member): This seems to be quite long to prevent a partition during deployment.

maxGossipPacketSize = 1400
)

@@ -112,6 +118,7 @@ func Join(
probeInterval time.Duration,
reconnectInterval time.Duration,
reconnectTimeout time.Duration,
refreshInterval time.Duration,
Review comment (Member): As this is not configurable via a command-line flag anymore, there is no reason for the parameter, right?

) (*Peer, error) {
bindHost, bindPortStr, err := net.SplitHostPort(bindAddr)
if err != nil {
@@ -164,11 +171,12 @@ func Join(
}

p := &Peer{
-		states: map[string]State{},
-		stopc:  make(chan struct{}),
-		readyc: make(chan struct{}),
-		logger: l,
-		peers:  map[string]peer{},
+		states:     map[string]State{},
+		stopc:      make(chan struct{}),
+		readyc:     make(chan struct{}),
+		logger:     l,
+		peers:      map[string]peer{},
+		knownPeers: knownPeers,
}

p.register(reg)
@@ -221,6 +229,9 @@ func Join(
if reconnectTimeout != 0 {
go p.handleReconnectTimeout(5*time.Minute, reconnectTimeout)
}
if refreshInterval != 0 {
go p.handleRefresh(refreshInterval)
}

return p, nil
}
@@ -298,6 +309,15 @@ func (p *Peer) register(reg prometheus.Registerer) {
Help: "A counter of the number of cluster peer reconnections.",
})

p.failedRefreshCounter = prometheus.NewCounter(prometheus.CounterOpts{
Name: "alertmanager_cluster_refresh_failed_total",
Help: "A counter of the number of failed cluster peer refresh attempts.",
})
p.refreshCounter = prometheus.NewCounter(prometheus.CounterOpts{
Name: "alertmanager_cluster_refresh_total",
Review comment (Member): Metric name and metric description seem diverged to me. alertmanager_cluster_refresh_total sounds like the number of times a refresh happened, not the number of times a peer joined the cluster due to a refresh. How about alertmanager_cluster_refresh_join_total? (Not quite perfect either.)

Reply (Contributor Author): I agree. Renamed to alertmanager_cluster_refresh_join_total and alertmanager_cluster_refresh_join_failed_total to indicate that these counters track joins.

Help: "A counter of the number of cluster peers joined via refresh.",
})

p.peerLeaveCounter = prometheus.NewCounter(prometheus.CounterOpts{
Name: "alertmanager_cluster_peers_left_total",
Help: "A counter of the number of peers that have left.",
@@ -312,7 +332,7 @@ func (p *Peer) register(reg prometheus.Registerer) {
})

reg.MustRegister(clusterFailedPeers, p.failedReconnectionsCounter, p.reconnectionsCounter,
-		p.peerLeaveCounter, p.peerUpdateCounter, p.peerJoinCounter)
+		p.peerLeaveCounter, p.peerUpdateCounter, p.peerJoinCounter, p.refreshCounter, p.failedRefreshCounter)
}

func (p *Peer) handleReconnectTimeout(d time.Duration, timeout time.Duration) {
@@ -382,6 +402,50 @@ func (p *Peer) reconnect() {
}
}

func (p *Peer) handleRefresh(d time.Duration) {
Review comment (Member): If we're talking about proper DNS support for Alertmanager, it would be better to respect the TTL of the record as advertised by the authority.

Review comment (Member): Would you shorten the interval to 10s or 15s? Even though I sometimes run into this problem of members not finding each other, I still think that 30s is enough for the cluster to heal. Every time this happened I got some duplicate alerts, which is something I can live with if it is fixed within 30s.

Reply (Contributor Author): @grobie It looks like the Go net package doesn't expose TTL values (https://golang.org/pkg/net/#IPAddr). I guess the only way to get the TTL value is to use something like https://stackoverflow.com/a/48997354, which would look quite cumbersome. I personally prefer this approach, as it's a simpler solution, but I can prepare a change if you want.

Reply (Contributor): This ends up relying on the internal resolver, which should be properly caching/refreshing responses. I would say we can shorten the time.

Reply (Contributor Author): 👍 Shortened to 15s.

tick := time.NewTicker(d)
defer tick.Stop()

for {
select {
case <-p.stopc:
return
case <-tick.C:
p.refresh()
}
}
}

func (p *Peer) refresh() {
logger := log.With(p.logger, "msg", "refresh")

resolvedPeers, err := resolvePeers(context.Background(), p.knownPeers, p.advertiseAddr, net.Resolver{}, false)
if err != nil {
level.Debug(logger).Log("peers", p.knownPeers, "err", err)
}

members := p.mlist.Members()
for _, peer := range resolvedPeers {
var isPeerFound bool
for _, member := range members {
if member.Address() == peer {
isPeerFound = true
break
}
}

if !isPeerFound {
if _, err := p.mlist.Join([]string{peer}); err != nil {
p.failedRefreshCounter.Inc()
level.Debug(logger).Log("result", "failure", "addr", peer)
Review comment (Member): I would argue that this could also be an Info or Warn.

Reply (povilasv, Contributor Author, Nov 6, 2018): 👍 Changed to Warn.
} else {
p.refreshCounter.Inc()
level.Debug(logger).Log("result", "success", "addr", peer)
}
}
}
}

func (p *Peer) peerJoin(n *memberlist.Node) {
p.peerLock.Lock()
defer p.peerLock.Unlock()
6 changes: 6 additions & 0 deletions cluster/cluster_test.go
@@ -40,6 +40,7 @@ func TestJoinLeave(t *testing.T) {
DefaultProbeInterval,
DefaultReconnectInterval,
DefaultReconnectTimeout,
DefaultRefreshInterval,
)
require.NoError(t, err)
require.NotNil(t, p)
@@ -64,6 +65,7 @@ func TestJoinLeave(t *testing.T) {
DefaultProbeInterval,
DefaultReconnectInterval,
DefaultReconnectTimeout,
DefaultRefreshInterval,
)
require.NoError(t, err)
require.NotNil(t, p2)
@@ -93,6 +95,7 @@ func TestReconnect(t *testing.T) {
DefaultProbeInterval,
DefaultReconnectInterval,
DefaultReconnectTimeout,
DefaultRefreshInterval,
)
require.NoError(t, err)
require.NotNil(t, p)
@@ -113,6 +116,7 @@ func TestReconnect(t *testing.T) {
DefaultProbeInterval,
DefaultReconnectInterval,
DefaultReconnectTimeout,
DefaultRefreshInterval,
)
require.NoError(t, err)
require.NotNil(t, p2)
@@ -148,6 +152,7 @@ func TestRemoveFailedPeers(t *testing.T) {
DefaultProbeInterval,
DefaultReconnectInterval,
DefaultReconnectTimeout,
DefaultRefreshInterval,
)
require.NoError(t, err)
require.NotNil(t, p)
@@ -194,6 +199,7 @@ func TestInitiallyFailingPeers(t *testing.T) {
DefaultProbeInterval,
DefaultReconnectInterval,
DefaultReconnectTimeout,
DefaultRefreshInterval,
)
require.NoError(t, err)
require.NotNil(t, p)
2 changes: 2 additions & 0 deletions cmd/alertmanager/main.go
@@ -162,6 +162,7 @@ func main() {
settleTimeout = kingpin.Flag("cluster.settle-timeout", "Maximum time to wait for cluster connections to settle before evaluating notifications.").Default(cluster.DefaultPushPullInterval.String()).Duration()
reconnectInterval = kingpin.Flag("cluster.reconnect-interval", "Interval between attempting to reconnect to lost peers.").Default(cluster.DefaultReconnectInterval.String()).Duration()
peerReconnectTimeout = kingpin.Flag("cluster.reconnect-timeout", "Length of time to attempt to reconnect to a lost peer.").Default(cluster.DefaultReconnectTimeout.String()).Duration()
refreshInterval = kingpin.Flag("cluster.refresh-interval", "Interval between attempting to refresh cluster.peers DNS records.").Default(cluster.DefaultRefreshInterval.String()).Duration()
Review comment (Member): Would a good default value be enough for now, or is a custom configuration necessary for most environments?

Reply (povilasv, Contributor Author, Jun 27, 2018): 👍 I think 30s should be fine for most environments. Typically Alertmanager is quick to start, so anything longer than that would slow down startup/gossip settling.

)

kingpin.Version(version.Print("alertmanager"))
@@ -196,6 +197,7 @@ func main() {
*probeInterval,
*reconnectInterval,
*peerReconnectTimeout,
*refreshInterval,
)
if err != nil {
level.Error(logger).Log("msg", "Unable to initialize gossip mesh", "err", err)