When the reconcile loop starts and enters recovery mode, it may see a cluster state that differs from the real state because of network latency.
For example, if we delete Redis nodes 0, 4 & 5, the operator may only see that node-0 is down at first, and the reconcile loop will fail on the first attempt.
Example log:
{"level":"info","ts":1614507150.0866873,"logger":"controllers.RedisCluster","msg":"Reconciling RedisCluster"}
{"level":"info","ts":1614507150.100218,"logger":"controllers.RedisCluster","msg":"Cluster is healthy"}
{"level":"info","ts":1614507189.9657013,"logger":"controllers.RedisCluster","msg":"Reconciling RedisCluster"}
{"level":"info","ts":1614507189.9765944,"logger":"controllers.RedisCluster","msg":"Found terminating leader: 0"}
{"level":"info","ts":1614507189.9938948,"logger":"controllers.RedisCluster","msg":"Updated state to: [Recovering]"}
{"level":"info","ts":1614507189.9942093,"logger":"controllers.RedisCluster","msg":"Reconciling RedisCluster"}
{"level":"info","ts":1614507189.9942489,"logger":"controllers.RedisCluster","msg":"Handling cluster recovery..."}
{"level":"info","ts":1614507190.005573,"logger":"controllers.RedisCluster","msg":"Leader: 0(terminating,fail)-[3(up,ok)]Leader: 1(up,ok)-[4(up,ok)]Leader: 2(up,ok)-[5(up,ok)]"}
{"level":"info","ts":1614507190.0055935,"logger":"controllers.RedisCluster","msg":"Waiting for pod delete: redis-node-0"}
{"level":"info","ts":1614507228.0057752,"logger":"controllers.RedisCluster","msg":"Waiting for leader [0] failover"}
{"level":"info","ts":1614507228.0148532,"logger":"controllers.RedisCluster","msg":"Forgetting lost nodes..."}
{"level":"info","ts":1614507228.0263112,"logger":"controllers.RedisCluster","msg":"Running cluster FORGET with: 10.28.25.229 48d18716ed18aaec43b85b22600eba29abac838f"}
{"level":"info","ts":1614507228.0263402,"logger":"controllers.RedisCluster","msg":"Running cluster FORGET with: 10.28.9.250 48d18716ed18aaec43b85b22600eba29abac838f"}
{"level":"info","ts":1614507228.026317,"logger":"controllers.RedisCluster","msg":"Running cluster FORGET with: 10.28.9.158 48d18716ed18aaec43b85b22600eba29abac838f"}
{"level":"info","ts":1614507228.0293198,"logger":"controllers.RedisCluster","msg":"Running cluster FORGET with: 10.28.25.229 cec3505fdb091f2a03da1c599e1d978090228576"}
{"level":"info","ts":1614507228.0293357,"logger":"controllers.RedisCluster","msg":"Running cluster FORGET with: 10.28.9.250 cec3505fdb091f2a03da1c599e1d978090228576"}
{"level":"info","ts":1614507228.0293546,"logger":"controllers.RedisCluster","msg":"Running cluster FORGET with: 10.28.9.158 cec3505fdb091f2a03da1c599e1d978090228576"}
{"level":"info","ts":1614507228.0323966,"logger":"controllers.RedisCluster","msg":"Running cluster FORGET with: 10.28.25.229 31e133a80c6458ae0df87595355224a0fc955903"}
{"level":"info","ts":1614507228.0324514,"logger":"controllers.RedisCluster","msg":"Running cluster FORGET with: 10.28.9.158 31e133a80c6458ae0df87595355224a0fc955903"}
{"level":"info","ts":1614507228.0325422,"logger":"controllers.RedisCluster","msg":"Running cluster FORGET with: 10.28.9.250 31e133a80c6458ae0df87595355224a0fc955903"}
{"level":"info","ts":1614507228.0357723,"logger":"controllers.RedisCluster","msg":"Recreating leader [0] using node [3]"}
{"level":"info","ts":1614507228.0743792,"logger":"controllers.RedisCluster","msg":"Waiting for pod network interfaces..."}
{"level":"info","ts":1614507230.0746157,"logger":"controllers.RedisCluster","msg":"New leader pods created: [0] "}
{"level":"info","ts":1614507230.0746427,"logger":"controllers.RedisCluster","msg":"Waiting for pod ready: redis-node-0(10.28.20.66)"}
{"level":"info","ts":1614507242.074819,"logger":"controllers.RedisCluster","msg":"Waiting for Redis on 10.28.20.66"}
{"level":"info","ts":1614507242.0781474,"logger":"controllers.RedisCluster","msg":"Replicating leader: 10.28.20.66->10.28.9.158"}
{"level":"info","ts":1614507243.0947092,"logger":"controllers.RedisCluster","msg":"Waiting for CLUSTER MEET (10.28.9.158, 10.28.20.66)"}
{"level":"info","ts":1614507243.09646,"logger":"controllers.RedisCluster","msg":"Replication successful"}
{"level":"info","ts":1614507243.0964847,"logger":"controllers.RedisCluster","msg":"Waiting for CLUSTER REPLICATION (10.28.9.158, 71883d6b57dba4d9c7e2843a17e164f2632e8c95)"}
{"level":"info","ts":1614507245.101405,"logger":"controllers.RedisCluster","msg":"Waiting for SYNC to start on 10.28.20.66"}
{"level":"info","ts":1614507245.1086235,"logger":"controllers.RedisCluster","msg":"Node 10.28.20.66 SYNC status: -0.00"}
{"level":"info","ts":1614507247.111961,"logger":"controllers.RedisCluster","msg":"Node 10.28.20.66 SYNC status: -0.00"}
{"level":"info","ts":1614507249.2477214,"logger":"controllers.RedisCluster","msg":"Node 10.28.20.66 LOAD ETA: 1"}
{"level":"info","ts":1614507251.1518047,"logger":"controllers.RedisCluster","msg":"Node 10.28.20.66 LOAD ETA: 2"}
{"level":"info","ts":1614507253.2528505,"logger":"controllers.RedisCluster","msg":"Node 10.28.20.66 LOAD ETA: 1"}
{"level":"info","ts":1614507255.1121004,"logger":"controllers.RedisCluster","msg":"Node 10.28.20.66 is synced"}
{"level":"info","ts":1614507255.1121376,"logger":"controllers.RedisCluster","msg":"Leader replication successful"}
{"level":"info","ts":1614507255.1136394,"logger":"controllers.RedisCluster","msg":"Starting manual failover on leader: 10.28.9.158(cc06c25d9940df63a10faffb1643621afbe678f6)"}
{"level":"info","ts":1614507255.116044,"logger":"controllers.RedisCluster","msg":"Running 'cluster failover ' on 10.28.20.66"}
{"level":"info","ts":1614507255.11838,"logger":"controllers.RedisCluster","msg":"Waiting for [10.28.20.66] to become leader"}
{"level":"info","ts":1614507255.1226153,"logger":"controllers.RedisCluster","msg":"[OK] Leader failover successful for (10.28.9.158). New leader: (10.28.20.66)"}
{"level":"info","ts":1614507255.1226304,"logger":"controllers.RedisCluster","msg":"[OK] Leader [0] recreated successfully; new IP: [10.28.20.66]"}
{"level":"info","ts":1614507255.1317248,"logger":"controllers.RedisCluster","msg":"Found failed follower: 4"}
{"level":"info","ts":1614507255.1317399,"logger":"controllers.RedisCluster","msg":"Cluster recovery failed"}
{"level":"error","ts":1614507255.1317434,"logger":"controllers.RedisCluster","msg":"Handling error","error":"Cluster recovery not complete","errorVerbose":"Cluster recovery not complete\ngithub.com/PayU/Redis-Operator/controllers.(*RedisClusterReconciler).recoverCluster\n\t/workspace/controllers/rediscluster.go:707\ngithub.com/PayU/Redis-Operator/controllers.(*RedisClusterReconciler).handleRecoveringState\n\t/workspace/controllers/helpers.go:97\ngithub.com/PayU/Redis-Operator/controllers.(*RedisClusterReconciler).Reconcile\n\t/workspace/controllers/rediscluster_controller.go:79\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.6.3/pkg/internal/controller/controller.go:244\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.6.3/pkg/internal/controller/controller.go:218\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.6.3/pkg/internal/controller/controller.go:197\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\t/go/pkg/mod/k8s.io/apimachinery@v0.18.6/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\t/go/pkg/mod/k8s.io/apimachinery@v0.18.6/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/pkg/mod/k8s.io/apimachinery@v0.18.6/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/go/pkg/mod/k8s.io/apimachinery@v0.18.6/pkg/util/wait/wait.go:90\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1374","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/pkg/mod/github.com/go-logr/zapr@v0.1.0/zapr.go:128\ngithub.com/PayU/Redis-Operator/controllers.(*RedisClusterReconciler).Reconcile\n\t/workspace/controllers/rediscluster_controller.go:87\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.6.3/pkg/internal/controller/controller.go:244\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.6.3/pkg/internal/controller/controller.go:218\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.6.3/pkg/internal/controller/controller.go:197\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\t/go/pkg/mod/k8s.io/apimachinery@v0.18.6/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\t/go/pkg/mod/k8s.io/apimachinery@v0.18.6/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/pkg/mod/k8s.io/apimachinery@v0.18.6/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/go/pkg/mod/k8s.io/apimachinery@v0.18.6/pkg/util/wait/wait.go:90"}
The solution is to fetch the cluster view again right after we finish handling the failed leader nodes (if any), so that followers that failed in the meantime (4 and 5 in the example above) are picked up in the same reconcile attempt.
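A minimal sketch of how that could look inside recoverCluster; the helper names (getClusterView, recoverLeaders, recoverFollowers) and the dbv1 alias are placeholders for illustration, not the operator's actual API:

```go
// Sketch only: helper names below are assumptions, not the real implementation.
func (r *RedisClusterReconciler) recoverCluster(redisCluster *dbv1.RedisCluster) error {
	// Initial view, taken when the reconcile loop enters recovery mode.
	view, err := r.getClusterView(redisCluster)
	if err != nil {
		return err
	}

	// First pass: recreate any terminating/failed leaders (node-0 in the log above).
	if err := r.recoverLeaders(redisCluster, view); err != nil {
		return err
	}

	// Re-fetch the cluster view: by the time leader recovery finishes, followers
	// 4 and 5 may finally be reported as failed, so the view captured at the start
	// of the loop no longer matches reality.
	view, err = r.getClusterView(redisCluster)
	if err != nil {
		return err
	}

	// Second pass: recover failed followers against the fresh view instead of
	// aborting the whole reconcile attempt with "Cluster recovery not complete".
	return r.recoverFollowers(redisCluster, view)
}
```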