
[BUG]Restart redis cluster ops is always in running state #8924

Closed
tianyue86 opened this issue Feb 14, 2025 · 2 comments
tianyue86 commented Feb 14, 2025

Describe the bug

Kubernetes: v1.31.1-aliyun.1
KubeBlocks: 1.0.0-beta.28
kbcli: 1.0.0-beta.13

To Reproduce
Steps to reproduce the behavior:

  1. Create a Redis cluster in Running state
apiVersion: apps.kubeblocks.io/v1
kind: Cluster
metadata:
  name: redis-lphhul
  namespace: default
spec:
  clusterDef: redis
  topology: replication
  terminationPolicy: Delete
  componentSpecs:
  - name: redis
    serviceVersion: 7.0.6
    replicas: 2
    disableExporter: false
    resources:
      limits:
        cpu: 100m
        memory: 0.5Gi
      requests:
        cpu: 100m
        memory: 0.5Gi
    volumeClaimTemplates:
      - name: data
        spec:
          accessModes:
            - ReadWriteOnce
          resources:
            requests:
              storage: 20Gi
  - name: redis-sentinel
    serviceVersion: 7.0.6
    replicas: 3
    resources:
      limits:
        cpu: 100m
        memory: 0.5Gi
      requests:
        cpu: 100m
        memory: 0.5Gi
    volumeClaimTemplates:
      - name: data
        spec:
          accessModes:
            - ReadWriteOnce
          resources:
            requests:
              storage: 20Gi

k get cluster -A|grep redis
default     redis-lphhul                 redis                Delete               Running   59m
  2. Restart it
kbcli cluster restart redis-lphhul --auto-approve --force=true  --namespace default
OpsRequest redis-lphhul-restart-phmlp created successfully, you can view the progress:
	kbcli cluster describe-ops redis-lphhul-restart-phmlp -n default
  3. Check cluster status
k get cluster -A|grep redis
default     redis-lphhul                 redis                Delete               Updating   64m

k get pod|grep redis
redis-lphhul-redis-0                    3/3     Running                 0             63m
redis-lphhul-redis-1                    3/3     Running                 0             63m
redis-lphhul-redis-sentinel-0           2/2     Running                 0             21m
redis-lphhul-redis-sentinel-1           2/2     Running                 0             22m
redis-lphhul-redis-sentinel-2           2/2     Running                 0             22m
  4. Check ops status
kbcli cluster list-ops redis-lphhul --status all  --namespace default
NAME                         NAMESPACE   TYPE      CLUSTER        COMPONENT              STATUS     PROGRESS   CREATED-TIME                 
redis-lphhul-expose-sdkfn    default     Expose    redis-lphhul                          Creating   -/-        Feb 14,2025 11:17 UTC+0800   
redis-lphhul-restart-phmlp   default     Restart   redis-lphhul   redis,redis-sentinel   Running    3/5        Feb 14,2025 11:56 UTC+0800
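The stuck OpsRequest can also be monitored without kbcli by reading phase and progress straight from the API object. A minimal sketch (resource names are the ones from this report; the kubectl calls are commented out, and `ops_state` is a hypothetical helper, not a KubeBlocks tool):

```shell
# Live values would come from the cluster, e.g. (assumed standard kubectl):
#   phase=$(kubectl -n default get opsrequest redis-lphhul-restart-phmlp \
#     -o jsonpath='{.status.phase}')
#   progress=$(kubectl -n default get opsrequest redis-lphhul-restart-phmlp \
#     -o jsonpath='{.status.progress}')

# Hypothetical helper: summarize an OpsRequest from its phase and
# "done/total" progress string.
ops_state() {
  local phase="$1" progress="$2"
  local done="${progress%%/*}" total="${progress##*/}"
  if [ "$phase" = "Running" ] && [ "$done" != "$total" ]; then
    echo "still running ($done of $total pods restarted)"
  else
    echo "$phase ($progress)"
  fi
}

ops_state Running 3/5   # the state observed in this issue
```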

k describe opsrequest redis-lphhul-restart-phmlp
Name:         redis-lphhul-restart-phmlp
Namespace:    default
Labels:       app.kubernetes.io/instance=redis-lphhul
              app.kubernetes.io/managed-by=kubeblocks
              operations.kubeblocks.io/ops-type=Restart
Annotations:  <none>
API Version:  operations.kubeblocks.io/v1alpha1
Kind:         OpsRequest
Metadata:
  Creation Timestamp:  2025-02-14T03:56:06Z
  Finalizers:
    opsrequest.kubeblocks.io/finalizer
  Generate Name:  redis-lphhul-restart-
  Generation:     2
  Owner References:
    API Version:     apps.kubeblocks.io/v1
    Kind:            Cluster
    Name:            redis-lphhul
    UID:             f7fe0ab7-2d4f-4a65-b28b-403c498849ed
  Resource Version:  54403050
  UID:               18cfc5dd-17d9-4247-b7f4-6f5f162dbde0
Spec:
  Cluster Name:                    redis-lphhul
  Enqueue On Force:                false
  Force:                           true
  Pre Condition Deadline Seconds:  0
  Restart:
    Component Name:  redis
    Component Name:  redis-sentinel
  Type:              Restart
Status:
  Cluster Generation:  3
  Components:
    Redis:
      Phase:  Updating
      Progress Details:
        Object Key:  Pod/redis-lphhul-redis-1
        Status:      Pending
        Object Key:  Pod/redis-lphhul-redis-0
        Status:      Pending
    Redis - Sentinel:
      Phase:  Running
      Progress Details:
        End Time:    2025-02-14T03:57:11Z
        Message:     Successfully restart: Pod/redis-lphhul-redis-sentinel-0 in Component: redis-sentinel
        Object Key:  Pod/redis-lphhul-redis-sentinel-0
        Start Time:  2025-02-14T03:56:49Z
        Status:      Succeed
        End Time:    2025-02-14T03:56:28Z
        Message:     Successfully restart: Pod/redis-lphhul-redis-sentinel-2 in Component: redis-sentinel
        Object Key:  Pod/redis-lphhul-redis-sentinel-2
        Start Time:  2025-02-14T03:56:07Z
        Status:      Succeed
        End Time:    2025-02-14T03:56:49Z
        Message:     Successfully restart: Pod/redis-lphhul-redis-sentinel-1 in Component: redis-sentinel
        Object Key:  Pod/redis-lphhul-redis-sentinel-1
        Start Time:  2025-02-14T03:56:28Z
        Status:      Succeed
  Conditions:
    Last Transition Time:  2025-02-14T03:56:06Z
    Message:               wait for the controller to process the OpsRequest: redis-lphhul-restart-phmlp in Cluster: redis-lphhul
    Reason:                WaitForProgressing
    Status:                True
    Type:                  WaitForProgressing
    Last Transition Time:  2025-02-14T03:56:06Z
    Message:               OpsRequest: redis-lphhul-restart-phmlp is validated
    Reason:                ValidateOpsRequestPassed
    Status:                True
    Type:                  Validated
    Last Transition Time:  2025-02-14T03:56:06Z
    Message:               Start to restart database in Cluster: redis-lphhul
    Reason:                RestartStarted
    Status:                True
    Type:                  Restarting
  Phase:                   Running
  Progress:                3/5
  Start Timestamp:         2025-02-14T03:56:06Z
Events:                    <none>



@tianyue86 tianyue86 added the kind/bug Something isn't working label Feb 14, 2025
@tianyue86 tianyue86 added this to the Release 1.0.0 milestone Feb 14, 2025
shanshanying commented Feb 14, 2025

The status of component redis is Updating:

    Redis:
      Phase:  Updating
      Progress Details:
        Object Key:  Pod/redis-lphhul-redis-1
        Status:      Pending
        Object Key:  Pod/redis-lphhul-redis-0
        Status:      Pending

And the redis InstanceSet (its) has generation: 2 but observedGeneration: 1:

k get its redis-lphhul-redis -oyaml | yq '.status'
availableReplicas: 2
conditions:
  - lastTransitionTime: "2025-02-14T07:50:47Z"
    message: ""
    observedGeneration: 1
    reason: Ready
    status: "True"
    type: InstanceReady
  - lastTransitionTime: "2025-02-14T07:50:47Z"
    message: ""
    observedGeneration: 1
    reason: Available
    status: "True"
    type: InstanceAvailable
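The generation lag can be checked mechanically: when metadata.generation is ahead of status.observedGeneration, the controller has not yet reconciled the latest spec. A sketch (kubectl calls commented out, assuming the resource names above; `its_sync_state` is an illustrative helper only):

```shell
# Live values, e.g. (assumed standard kubectl):
#   gen=$(kubectl get its redis-lphhul-redis -o jsonpath='{.metadata.generation}')
#   obs=$(kubectl get its redis-lphhul-redis \
#     -o jsonpath='{.status.observedGeneration}')

# Hypothetical helper: report whether the controller has caught up
# with the latest spec generation.
its_sync_state() {
  # $1 = metadata.generation, $2 = status.observedGeneration
  if [ "$1" -gt "$2" ]; then
    echo "stale: spec generation $1 not yet observed (last observed $2)"
  else
    echo "in-sync"
  fi
}

its_sync_state 2 1   # the values seen in this issue
```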

And we see this error in the KubeBlocks logs:

2025-02-14T09:42:59.075Z	INFO	handle role change event	{"controller": "event", "controllerGroup": "", "controllerKind": "Event", "Event": {"name":"redis-lphhul-redis-1.45f44902","namespace":"default"}, "namespace": "default", "name": "redis-lphhul-redis-1.45f44902", "reconcileID": "4c0ed092-c01f-449a-ab7b-e76e1a07f4fd", "event": {"name":"redis-lphhul-redis-1.45f44902","namespace":"default"}, "pod": "redis-lphhul-redis-1", "role": "primary", "originalRole": ""}
2025-02-14T09:43:10.089Z	INFO	handle role change event	{"controller": "event", "controllerGroup": "", "controllerKind": "Event", "Event": {"name":"redis-lphhul-redis-0.12fe8a4f","namespace":"default"}, "namespace": "default", "name": "redis-lphhul-redis-0.12fe8a4f", "reconcileID": "eeb24168-5498-49dd-ae2e-5b7b9bcb1106", "event": {"name":"redis-lphhul-redis-0.12fe8a4f","namespace":"default"}, "pod": "redis-lphhul-redis-0", "role": "secondary", "originalRole": ""}
2025-02-14T09:43:13.732Z	INFO	successfully call switchover action for pod	{"controller": "instanceset", "controllerGroup": "workloads.kubeblocks.io", "controllerKind": "InstanceSet", "InstanceSet": {"name":"redis-lphhul-redis","namespace":"default"}, "namespace": "default", "name": "redis-lphhul-redis", "reconcileID": "fb9b48f5-cd13-4740-b60f-21cce90fa801", "InstanceSet": {"name":"redis-lphhul-redis","namespace":"default"}, "pod": "redis-lphhul-redis-1"}
2025-02-14T09:43:13.750Z	ERROR	delete *v1.Pod error: redis-lphhul-redis-1	{"controller": "instanceset", "controllerGroup": "workloads.kubeblocks.io", "controllerKind": "InstanceSet", "InstanceSet": {"name":"redis-lphhul-redis","namespace":"default"}, "namespace": "default", "name": "redis-lphhul-redis", "reconcileID": "fb9b48f5-cd13-4740-b60f-21cce90fa801", "InstanceSet": {"name":"redis-lphhul-redis","namespace":"default"}, "error": "Operation cannot be fulfilled on pods \"redis-lphhul-redis-1\": the object has been modified; please apply your changes to the latest version and try again"}
github.com/apecloud/kubeblocks/pkg/controller/kubebuilderx.(*PlanBuilder).deleteObject
	/src/pkg/controller/kubebuilderx/plan_builder.go:292
github.com/apecloud/kubeblocks/pkg/controller/kubebuilderx.(*PlanBuilder).defaultWalkFunc
	/src/pkg/controller/kubebuilderx/plan_builder.go:252
github.com/apecloud/kubeblocks/pkg/controller/kubebuilderx.(*Plan).Execute
	/src/pkg/controller/kubebuilderx/plan_builder.go:226
github.com/apecloud/kubeblocks/pkg/controller/kubebuilderx.(*controller).Commit
	/src/pkg/controller/kubebuilderx/controller.go:123
github.com/apecloud/kubeblocks/controllers/workloads.(*InstanceSetReconciler).Reconcile
	/src/controllers/workloads/instanceset_controller.go:92
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.2/pkg/internal/controller/controller.go:119
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.2/pkg/internal/controller/controller.go:316
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.2/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.2/pkg/internal/controller/controller.go:227
2025-02-14T09:43:14.118Z	INFO	handle role change event	{"controller": "event", "controllerGroup": "", "controllerKind": "Event", "Event": {"name":"redis-lphhul-redis-0.f8381609","namespace":"default"}, "namespace": "default", "name": "redis-lphhul-redis-0.f8381609", "reconcileID": "cb26d55c-c6da-45e4-a3be-ec636fdd55a4", "event": {"name":"redis-lphhul-redis-0.f8381609","namespace":"default"}, "pod": "redis-lphhul-redis-0", "role": "primary", "originalRole": ""}

shanshanying commented:

In the KubeBlocks logs, we see pod deletion errors for several minutes, followed by switchover errors:

(screenshot: KubeBlocks logs showing repeated pod deletion and switchover errors)

The kbagent log of redis pod-0 is:

2025-02-14T10:02:47Z	INFO	Action Executed	{"action": "postProvision", "result": "exit code: 1, stderr: Warning: Using a password with '-a' or '-u' option on the command line interface may not be safe.\nWarning: Using a password with '-a' or '-u' option on the command line interface may not be safe.\nWarning: Using a password with '-a' or '-u' option on the command line interface may not be safe.\nCommand failed with redis-lphhul-redis-0.redis-lphhul-redis-headless.default.svc.cluster.local\n6379\nFunction 'get_master_addr_by_name' failed in 1 times. Retrying in 5 seconds...\nWarning: Using a password with '-a' or '-u' option on the command line interface may not be safe.\nCommand failed with redis-lphhul-redis-0.redis-lphhul-redis-headless.default.svc.cluster.local\n6379\nFunction 'get_master_addr_by_name' failed in 2 times. Retrying in 5 seconds...\nWarning: Using a password with '-a' or '-u' option on the command line interface may not be safe.\nCommand failed with redis-lphhul-redis-0.redis-lphhul-redis-headless.default.svc.cluster.local\n6379\nFunction 'get_master_addr_by_name' failed after 3 retries.\nFailed to get master address after maximum retries.\n: failed"}
2025-02-14T10:02:47Z	INFO	HTTP API Called	{"user-agent": "Go-http-client/1.1", "method": "POST", "path": "/v1.0/action", "status code": 200, "cost": 10027}
2025-02-14T10:03:25Z	INFO	send probe event	{"probe": "roleProbe", "probe": "roleProbe", "code": 0, "output": "secondary", "message": ""}
2025-02-14T10:03:29Z	INFO	send probe event	{"probe": "roleProbe", "probe": "roleProbe", "code": 0, "output": "primary", "message": ""}
2025-02-14T10:03:43Z	INFO	Action Executed	{"action": "switchover", "result": "switchover not triggered for primary, nothing to do\nexecute_sub_command output: OK\nCommand executed successfully.\nSentinel failover started with redis-lphhul-redis-sentinel-0.redis-lphhul-redis-sentinel-headless.default.svc.cluster.local\nSwitchover successful: new master is redis-lphhul-redis-0.redis-lphhul-redis-headless.default.svc.cluster.local\n"}
2025-02-14T10:03:43Z	INFO	HTTP API Called	{"user-agent": "Go-http-client/1.1", "method": "POST", "path": "/v1.0/action", "status code": 200, "cost": 15080}
2025-02-14T10:03:56Z	INFO	send probe event	{"probe": "roleProbe", "probe": "roleProbe", "code": 0, "output": "secondary", "message": ""}
2025-02-14T10:04:00Z	INFO	send probe event	{"probe": "roleProbe", "probe": "roleProbe", "code": 0, "output": "primary", "message": ""}
2025-02-14T10:04:14Z	INFO	Action Executed	{"action": "switchover", "result": "switchover not triggered for primary, nothing to do\nexecute_sub_command output: OK\nCommand executed successfully.\nSentinel failover started with redis-lphhul-redis-sentinel-0.redis-lphhul-redis-sentinel-headless.default.svc.cluster.local\nSwitchover successful: new master is redis-lphhul-redis-0.redis-lphhul-redis-headless.default.svc.cluster.local\n"}
2025-02-14T10:04:14Z	INFO	HTTP API Called	{"user-agent": "Go-http-client/1.1", "method": "POST", "path": "/v1.0/action", "status code": 200, "cost": 15068}
2025-02-14T10:04:26Z	INFO	send probe event	{"probe": "roleProbe", "probe": "roleProbe", "code": 0, "output": "secondary", "message": ""}
2025-02-14T10:04:30Z	INFO	send probe event	{"probe": "roleProbe", "probe": "roleProbe", "code": 0, "output": "primary", "message": ""}
2025-02-14T10:04:44Z	INFO	Action Executed	{"action": "switchover", "result": "switchover not triggered for primary, nothing to do\nexecute_sub_command output: OK\nCommand executed successfully.\nSentinel failover started with redis-lphhul-redis-sentinel-0.redis-lphhul-redis-sentinel-headless.default.svc.cluster.local\nSwitchover successful: new master is redis-lphhul-redis-0.redis-lphhul-redis-headless.default.svc.cluster.local\n"}
2025-02-14T10:04:44Z	INFO	HTTP API Called	{"user-agent": "Go-http-client/1.1", "method": "POST", "path": "/v1.0/action", "status code": 200, "cost": 15049}
2025-02-14T10:04:56Z	INFO	send probe event	{"probe": "roleProbe", "probe": "roleProbe", "code": 0, "output": "secondary", "message": ""}
2025-02-14T10:05:00Z	INFO	send probe event	{"probe": "roleProbe", "probe": "roleProbe", "code": 0, "output": "primary", "message": ""}
2025-02-14T10:05:14Z	INFO	Action Executed	{"action": "switchover", "result": "switchover not triggered for primary, nothing to do\nexecute_sub_command output: OK\nCommand executed successfully.\nSentinel failover started with redis-lphhul-redis-sentinel-0.redis-lphhul-redis-sentinel-headless.default.svc.cluster.local\nSwitchover successful: new master is redis-lphhul-redis-0.redis-lphhul-redis-headless.default.svc.cluster.local\n"}
2025-02-14T10:05:14Z	INFO	HTTP API Called	{"user-agent": "Go-http-client/1.1", "method": "POST", "path": "/v1.0/action", "status code": 200, "cost": 15069}
2025-02-14T10:05:26Z	INFO	send probe event	{"probe": "roleProbe", "probe": "roleProbe", "code": 0, "output": "secondary", "message": ""}
2025-02-14T10:05:30Z	INFO	send probe event	{"probe": "roleProbe", "probe": "roleProbe", "code": 0, "output": "primary", "message": ""}
2025-02-14T10:05:44Z	INFO	Action Executed	{"action": "switchover", "result": "switchover not triggered for primary, nothing to do\nexecute_sub_command output: OK\nCommand executed successfully.\nSentinel failover started with redis-lphhul-redis-sentinel-0.redis-lphhul-redis-sentinel-headless.default.svc.cluster.local\nSwitchover successful: new master is redis-lphhul-redis-0.redis-lphhul-redis-headless.default.svc.cluster.local\n"}
2025-02-14T10:05:44Z	INFO	HTTP API Called	{"user-agent": "Go-http-client/1.1", "method": "POST", "path": "/v1.0/action", "status code": 200, "cost": 15068}
2025-02-14T10:05:56Z	INFO	send probe event	{"probe": "roleProbe", "probe": "roleProbe", "code": 0, "output": "secondary", "message": ""}
2025-02-14T10:06:00Z	INFO	send probe event	{"probe": "roleProbe", "probe": "roleProbe", "code": 0, "output": "primary", "message": ""}
2025-02-14T10:06:14Z	INFO	Action Executed	{"action": "switchover", "result": "switchover not triggered for primary, nothing to do\nexecute_sub_command output: OK\nCommand executed successfully.\nSentinel failover started with redis-lphhul-redis-sentinel-0.redis-lphhul-redis-sentinel-headless.default.svc.cluster.local\nSwitchover successful: new master is redis-lphhul-redis-0.redis-lphhul-redis-headless.default.svc.cluster.local\n"}
2025-02-14T10:06:14Z	INFO	HTTP API Called	{"user-agent": "Go-http-client/1.1", "method": "POST", "path": "/v1.0/action", "status code": 200, "cost": 15076}
2025-02-14T10:06:26Z	INFO	send probe event	{"probe": "roleProbe", "probe": "roleProbe", "code": 0, "output": "secondary", "message": ""}
2025/02/14 10:06:29 The incoming connection cannot be served, because 8 concurrent connections are served. Try increasing Server.Concurrency
2025/02/14 10:08:12 The incoming connection cannot be served, because 8 concurrent connections are served. Try increasing Server.Concurrency
2025/02/14 10:10:56 The incoming connection cannot be served, because 8 concurrent connections are served. Try increasing Server.Concurrency
