fix cluster view not update on recover state #93

Closed
shyimo opened this issue Feb 28, 2021 · 0 comments · Fixed by #92
Labels
bug Something isn't working

Comments

@shyimo
Contributor

shyimo commented Feb 28, 2021

When the reconcile loop starts and enters recovery mode, it may see a cluster state that differs from the real state because of network latency.
For example, if we delete Redis nodes 0, 4 & 5, the operator may at first only see that node-0 is down, so the reconcile loop will fail on its first attempt.

example log:

{"level":"info","ts":1614507150.0866873,"logger":"controllers.RedisCluster","msg":"Reconciling RedisCluster"}
{"level":"info","ts":1614507150.100218,"logger":"controllers.RedisCluster","msg":"Cluster is healthy"}
{"level":"info","ts":1614507189.9657013,"logger":"controllers.RedisCluster","msg":"Reconciling RedisCluster"}
{"level":"info","ts":1614507189.9765944,"logger":"controllers.RedisCluster","msg":"Found terminating leader: 0"}
{"level":"info","ts":1614507189.9938948,"logger":"controllers.RedisCluster","msg":"Updated state to: [Recovering]"}
{"level":"info","ts":1614507189.9942093,"logger":"controllers.RedisCluster","msg":"Reconciling RedisCluster"}
{"level":"info","ts":1614507189.9942489,"logger":"controllers.RedisCluster","msg":"Handling cluster recovery..."}
{"level":"info","ts":1614507190.005573,"logger":"controllers.RedisCluster","msg":"Leader: 0(terminating,fail)-[3(up,ok)]Leader: 1(up,ok)-[4(up,ok)]Leader: 2(up,ok)-[5(up,ok)]"}
{"level":"info","ts":1614507190.0055935,"logger":"controllers.RedisCluster","msg":"Waiting for pod delete: redis-node-0"}
{"level":"info","ts":1614507228.0057752,"logger":"controllers.RedisCluster","msg":"Waiting for leader [0] failover"}
{"level":"info","ts":1614507228.0148532,"logger":"controllers.RedisCluster","msg":"Forgetting lost nodes..."}
{"level":"info","ts":1614507228.0263112,"logger":"controllers.RedisCluster","msg":"Running cluster FORGET with: 10.28.25.229 48d18716ed18aaec43b85b22600eba29abac838f"}
{"level":"info","ts":1614507228.0263402,"logger":"controllers.RedisCluster","msg":"Running cluster FORGET with: 10.28.9.250 48d18716ed18aaec43b85b22600eba29abac838f"}
{"level":"info","ts":1614507228.026317,"logger":"controllers.RedisCluster","msg":"Running cluster FORGET with: 10.28.9.158 48d18716ed18aaec43b85b22600eba29abac838f"}
{"level":"info","ts":1614507228.0293198,"logger":"controllers.RedisCluster","msg":"Running cluster FORGET with: 10.28.25.229 cec3505fdb091f2a03da1c599e1d978090228576"}
{"level":"info","ts":1614507228.0293357,"logger":"controllers.RedisCluster","msg":"Running cluster FORGET with: 10.28.9.250 cec3505fdb091f2a03da1c599e1d978090228576"}
{"level":"info","ts":1614507228.0293546,"logger":"controllers.RedisCluster","msg":"Running cluster FORGET with: 10.28.9.158 cec3505fdb091f2a03da1c599e1d978090228576"}
{"level":"info","ts":1614507228.0323966,"logger":"controllers.RedisCluster","msg":"Running cluster FORGET with: 10.28.25.229 31e133a80c6458ae0df87595355224a0fc955903"}
{"level":"info","ts":1614507228.0324514,"logger":"controllers.RedisCluster","msg":"Running cluster FORGET with: 10.28.9.158 31e133a80c6458ae0df87595355224a0fc955903"}
{"level":"info","ts":1614507228.0325422,"logger":"controllers.RedisCluster","msg":"Running cluster FORGET with: 10.28.9.250 31e133a80c6458ae0df87595355224a0fc955903"}
{"level":"info","ts":1614507228.0357723,"logger":"controllers.RedisCluster","msg":"Recreating leader [0] using node [3]"}
{"level":"info","ts":1614507228.0743792,"logger":"controllers.RedisCluster","msg":"Waiting for pod network interfaces..."}
{"level":"info","ts":1614507230.0746157,"logger":"controllers.RedisCluster","msg":"New leader pods created: [0] "}
{"level":"info","ts":1614507230.0746427,"logger":"controllers.RedisCluster","msg":"Waiting for pod ready: redis-node-0(10.28.20.66)"}
{"level":"info","ts":1614507242.074819,"logger":"controllers.RedisCluster","msg":"Waiting for Redis on 10.28.20.66"}
{"level":"info","ts":1614507242.0781474,"logger":"controllers.RedisCluster","msg":"Replicating leader: 10.28.20.66->10.28.9.158"}
{"level":"info","ts":1614507243.0947092,"logger":"controllers.RedisCluster","msg":"Waiting for CLUSTER MEET (10.28.9.158, 10.28.20.66)"}
{"level":"info","ts":1614507243.09646,"logger":"controllers.RedisCluster","msg":"Replication successful"}
{"level":"info","ts":1614507243.0964847,"logger":"controllers.RedisCluster","msg":"Waiting for CLUSTER REPLICATION (10.28.9.158, 71883d6b57dba4d9c7e2843a17e164f2632e8c95)"}
{"level":"info","ts":1614507245.101405,"logger":"controllers.RedisCluster","msg":"Waiting for SYNC to start on 10.28.20.66"}
{"level":"info","ts":1614507245.1086235,"logger":"controllers.RedisCluster","msg":"Node 10.28.20.66 SYNC status: -0.00"}
{"level":"info","ts":1614507247.111961,"logger":"controllers.RedisCluster","msg":"Node 10.28.20.66 SYNC status: -0.00"}
{"level":"info","ts":1614507249.2477214,"logger":"controllers.RedisCluster","msg":"Node 10.28.20.66 LOAD ETA: 1"}
{"level":"info","ts":1614507251.1518047,"logger":"controllers.RedisCluster","msg":"Node 10.28.20.66 LOAD ETA: 2"}
{"level":"info","ts":1614507253.2528505,"logger":"controllers.RedisCluster","msg":"Node 10.28.20.66 LOAD ETA: 1"}
{"level":"info","ts":1614507255.1121004,"logger":"controllers.RedisCluster","msg":"Node 10.28.20.66 is synced"}
{"level":"info","ts":1614507255.1121376,"logger":"controllers.RedisCluster","msg":"Leader replication successful"}
{"level":"info","ts":1614507255.1136394,"logger":"controllers.RedisCluster","msg":"Starting manual failover on leader: 10.28.9.158(cc06c25d9940df63a10faffb1643621afbe678f6)"}
{"level":"info","ts":1614507255.116044,"logger":"controllers.RedisCluster","msg":"Running 'cluster failover ' on 10.28.20.66"}
{"level":"info","ts":1614507255.11838,"logger":"controllers.RedisCluster","msg":"Waiting for [10.28.20.66] to become leader"}
{"level":"info","ts":1614507255.1226153,"logger":"controllers.RedisCluster","msg":"[OK] Leader failover successful for (10.28.9.158). New leader: (10.28.20.66)"}
{"level":"info","ts":1614507255.1226304,"logger":"controllers.RedisCluster","msg":"[OK] Leader [0] recreated successfully; new IP: [10.28.20.66]"}
{"level":"info","ts":1614507255.1317248,"logger":"controllers.RedisCluster","msg":"Found failed follower: 4"}
{"level":"info","ts":1614507255.1317399,"logger":"controllers.RedisCluster","msg":"Cluster recovery failed"}
{"level":"error","ts":1614507255.1317434,"logger":"controllers.RedisCluster","msg":"Handling error","error":"Cluster recovery not complete","errorVerbose":"Cluster recovery not complete\ngithub.com/PayU/Redis-Operator/controllers.(*RedisClusterReconciler).recoverCluster\n\t/workspace/controllers/rediscluster.go:707\ngithub.com/PayU/Redis-Operator/controllers.(*RedisClusterReconciler).handleRecoveringState\n\t/workspace/controllers/helpers.go:97\ngithub.com/PayU/Redis-Operator/controllers.(*RedisClusterReconciler).Reconcile\n\t/workspace/controllers/rediscluster_controller.go:79\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.6.3/pkg/internal/controller/controller.go:244\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.6.3/pkg/internal/controller/controller.go:218\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.6.3/pkg/internal/controller/controller.go:197\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\t/go/pkg/mod/k8s.io/apimachinery@v0.18.6/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\t/go/pkg/mod/k8s.io/apimachinery@v0.18.6/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/pkg/mod/k8s.io/apimachinery@v0.18.6/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/go/pkg/mod/k8s.io/apimachinery@v0.18.6/pkg/util/wait/wait.go:90\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1374","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/pkg/mod/github.com/go-logr/zapr@v0.1.0/zapr.go:128\ngithub.com/PayU/Redis-Operator/controllers.(*RedisClusterReconciler).Reconcile\n\t/workspace/controllers/rediscluster_controller.go:87\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.6.3/pkg/internal/controller/controller.go:244\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.6.3/pkg/internal/controller/controller.go:218\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.6.3/pkg/internal/controller/controller.go:197\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\t/go/pkg/mod/k8s.io/apimachinery@v0.18.6/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\t/go/pkg/mod/k8s.io/apimachinery@v0.18.6/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/pkg/mod/k8s.io/apimachinery@v0.18.6/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/go/pkg/mod/k8s.io/apimachinery@v0.18.6/pkg/util/wait/wait.go:90"}

The solution is to fetch the cluster view again, right after we finish handling the failed leader nodes (if any).
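
For reference, a minimal sketch of the proposed flow (not the actual operator code; the type and method names `RedisClusterView`, `getClusterView`, `recoverLeaders` and `recoverFollowers` are hypothetical placeholders for illustration, the real logic lives in `controllers/rediscluster.go`):

```go
// Hypothetical sketch of the proposed recovery flow.
package controllers

import "fmt"

// RedisClusterView is a hypothetical snapshot of the cluster as seen by the operator.
type RedisClusterView struct {
	FailedLeaders   []string
	FailedFollowers []string
}

// Reconciler is a hypothetical stand-in for the real reconciler type.
type Reconciler struct {
	getClusterView   func() (*RedisClusterView, error) // re-queries the live cluster state
	recoverLeaders   func(leaders []string) error      // recreates failed leader nodes
	recoverFollowers func(followers []string) error    // recreates failed follower nodes
}

// recoverCluster illustrates the fix: after the failed leaders are handled,
// the cluster view is fetched again so that followers that failed in the
// meantime (e.g. nodes 4 and 5 in the example above) are picked up before the
// follower recovery step, instead of aborting the reconcile attempt.
func (r *Reconciler) recoverCluster() error {
	view, err := r.getClusterView()
	if err != nil {
		return err
	}

	if len(view.FailedLeaders) > 0 {
		if err := r.recoverLeaders(view.FailedLeaders); err != nil {
			return err
		}
		// The fix: refresh the view after leader recovery, since the first
		// view can be stale because of network latency.
		if view, err = r.getClusterView(); err != nil {
			return err
		}
	}

	if len(view.FailedFollowers) > 0 {
		if err := r.recoverFollowers(view.FailedFollowers); err != nil {
			return err
		}
	}

	fmt.Println("Cluster recovery complete")
	return nil
}
```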

@shyimo shyimo added the bug Something isn't working label Feb 28, 2021