Changefeed with sink-uri=kafka enters the failed state after all PD nodes are restarted #2389

Closed
Tammyxia opened this issue Jul 27, 2021 · 8 comments
Labels
area/ticdc Issues or PRs related to TiCDC. bug-from-internal-test Bugs found by internal testing. component/sink Sink component. difficulty/medium Medium task. severity/major type/bug The issue is confirmed as a bug.

Comments

@Tammyxia

Bug Report

Please answer these questions before submitting your issue. Thanks!

  1. What did you do? If possible, provide a recipe for reproducing the error.
  • Two captures and two changefeeds:
    Starting component cdc: /root/.tiup/components/cdc/v4.0.14/cdc cli changefeed list --pd=http://172.16.6.28:2379
    [
      {
        "id": "kafka-task-11",
        "summary": {
          "state": "normal",
          "tso": 426608082704400445,
          "checkpoint": "2021-07-27 18:11:26.586",
          "error": null
        }
      },
      {
        "id": "replication-task-11",
        "summary": {
          "state": "normal",
          "tso": 426608707681910785,
          "checkpoint": "2021-07-27 18:51:10.686",
          "error": null
        }
      }
    ]

  • Restart all PD nodes: $ tiup cluster restart 360UP -R pd

  2. What did you expect to see?
  • No errors.
  3. What did you see instead?
  • The changefeed with sink-uri=kafka enters the failed state after all PD nodes are restarted:
    Starting component cdc: /root/.tiup/components/cdc/v4.0.14/cdc cli changefeed list --pd=http://172.16.6.28:2379
    [
      {
        "id": "replication-task-11",
        "summary": {
          "state": "normal",
          "tso": 426608859290271791,
          "checkpoint": "2021-07-27 19:00:49.026",
          "error": null
        }
      },
      {
        "id": "kafka-task-11",
        "summary": {
          "state": "failed",
          "tso": 426608874219372599,
          "checkpoint": "2021-07-27 19:01:45.976",
          "error": {
            "addr": "172.16.6.32:8300",
            "code": "CDC-owner-1001",
            "message": "rpc error: code = Unknown desc = rpc error: code = Unavailable desc = not leader"
          }
        }
      }
    ]

  • Checking the changefeed status again shows that the checkpoint of the failed kafka-task-11 is still advancing:
    Starting component cdc: /root/.tiup/components/cdc/v4.0.14/cdc cli changefeed list --pd=http://172.16.6.28:2379
    [
      {
        "id": "kafka-task-11",
        "summary": {
          "state": "failed",
          "tso": 426609068572934145,
          "checkpoint": "2021-07-27 19:14:07.376",
          "error": {
            "addr": "172.16.6.32:8300",
            "code": "CDC-owner-1001",
            "message": "rpc error: code = Unknown desc = rpc error: code = Unavailable desc = not leader"
          }
        }
      },
      {
        "id": "replication-task-11",
        "summary": {
          "state": "normal",
          "tso": 426609068677791745,
          "checkpoint": "2021-07-27 19:14:07.776",
          "error": null
        }
      }
    ]

  4. Versions of the cluster

    • Upstream TiDB cluster version (execute SELECT tidb_version(); in a MySQL client):

      4.0.14
      
    • TiCDC version (execute cdc version):

      ["Welcome to Change Data Capture (CDC)"] [release-version=v4.0.14] [git-hash=5a7851967f686da896b45acd3f3e968bfe53d6bd] [git-branch=heads/refs/tags/v4.0.14]
      
@Tammyxia Tammyxia added type/bug The issue is confirmed as a bug. severity/major labels Jul 27, 2021
@Tammyxia
Author

Tammyxia commented Jul 27, 2021

  • What's more, cdc cli changefeed list no longer works:

$ tiup cdc:v4.0.14 cli changefeed list --pd=http://172.16.6.24:237
Starting component cdc: /root/.tiup/components/cdc/v4.0.14/cdc cli changefeed list --pd=http://172.16.6.24:237
Error: fail to open PD etcd client, pd="http://172.16.6.24:237": context deadline exceeded
Usage:
  cdc cli changefeed list [flags]

Flags:
  -a, --all    List all replication tasks(including removed and finished)
  -h, --help   help for list

Global Flags:
      --ca string          CA certificate path for TLS connection
      --cert string        Certificate path for TLS connection
  -i, --interact           Run cdc cli with readline
      --key string         Private key path for TLS connection
      --log-level string   log level (etc: debug|info|warn|error) (default "warn")
      --pd string          PD address, use ',' to separate multiple PDs (default "http://127.0.0.1:2379")

fail to open PD etcd client, pd="http://172.16.6.24:237": context deadline exceeded

@asddongmen asddongmen added bug-from-internal-test Bugs found by internal testing. component/sink Sink component. difficulty/medium Medium task. labels Jul 28, 2021
@3AceShowHand
Contributor

{
  "id": "kafka-task-11",
  "summary": {
    "state": "failed",
    "tso": 426608874219372599,
    "checkpoint": "2021-07-27 19:01:45.976",
    "error": {
      "addr": "172.16.6.32:8300",
      "code": "CDC-owner-1001",
      "message": "rpc error: code = Unknown desc = rpc error: code = Unavailable desc = not leader"
    }
  }
}

@3AceShowHand
Contributor

The checkpoint keeps advancing even after the changefeed is marked as failed.
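The symptom can be checked mechanically by comparing two listings. Below is a minimal Go sketch, assuming the JSON shape printed by cdc cli changefeed list in this report; the function name advancingWhileFailed and the constant names are made up for illustration, and the sample data keeps only the id, state, and tso fields from the 19:01 and 19:14 listings above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Changefeed mirrors the fields of one list entry that this check needs.
type Changefeed struct {
	ID      string `json:"id"`
	Summary struct {
		State string `json:"state"`
		// uint64 on purpose: these values exceed 2^53, so decoding
		// them as float64 would silently lose precision.
		TSO uint64 `json:"tso"`
	} `json:"summary"`
}

// advancingWhileFailed (hypothetical helper) returns the IDs of changefeeds
// that are "failed" in the later listing although their tso is larger than
// in the earlier listing.
func advancingWhileFailed(earlier, later []byte) ([]string, error) {
	var before, after []Changefeed
	if err := json.Unmarshal(earlier, &before); err != nil {
		return nil, fmt.Errorf("decode earlier listing: %w", err)
	}
	if err := json.Unmarshal(later, &after); err != nil {
		return nil, fmt.Errorf("decode later listing: %w", err)
	}
	prev := make(map[string]uint64, len(before))
	for _, cf := range before {
		prev[cf.ID] = cf.Summary.TSO
	}
	var ids []string
	for _, cf := range after {
		old, seen := prev[cf.ID]
		if seen && cf.Summary.State == "failed" && cf.Summary.TSO > old {
			ids = append(ids, cf.ID)
		}
	}
	return ids, nil
}

// Trimmed copies of the two listings taken after the PD restart.
const listingAt1901 = `[
  {"id": "replication-task-11", "summary": {"state": "normal", "tso": 426608859290271791}},
  {"id": "kafka-task-11", "summary": {"state": "failed", "tso": 426608874219372599}}
]`

const listingAt1914 = `[
  {"id": "kafka-task-11", "summary": {"state": "failed", "tso": 426609068572934145}},
  {"id": "replication-task-11", "summary": {"state": "normal", "tso": 426609068677791745}}
]`

func main() {
	ids, err := advancingWhileFailed([]byte(listingAt1901), []byte(listingAt1914))
	if err != nil {
		panic(err)
	}
	fmt.Println(ids) // prints [kafka-task-11]
}
```

Only kafka-task-11 is reported: replication-task-11 also advances, but its state is normal, which is the expected behavior.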

@3AceShowHand
Contributor

The error may occur while trying to create the changefeed (newChangefeed).

@3AceShowHand
Contributor

#942 may still be unresolved: resultErr != nil is always false, so ddlHandler and primarySink are not closed if creating the changefeed fails.
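For readers unfamiliar with this class of bug, here is a minimal Go sketch of one way a deferred `resultErr != nil` check can end up always false. It is not the TiCDC code: the resource type, buggyNew, fixedNew, and initStep are hypothetical stand-ins, and the assumption is that the error variable checked by the deferred cleanup is never the one the function actually returns.

```go
package main

import (
	"errors"
	"fmt"
)

// resource is a hypothetical stand-in for something like ddlHandler or primarySink.
type resource struct{ closed bool }

func (r *resource) Close() { r.closed = true }

// initStep simulates a setup step that fails.
func initStep() error { return errors.New("init failed") }

// buggyNew checks a local resultErr that no return statement ever assigns,
// so the deferred cleanup is skipped even when an error is returned.
func buggyNew(r *resource) (string, error) {
	var resultErr error // read by the deferred func, never written
	defer func() {
		if resultErr != nil { // always false
			r.Close()
		}
	}()
	if err := initStep(); err != nil {
		return "", err // resultErr is still nil here
	}
	return "changefeed", nil
}

// fixedNew makes resultErr a named result. Every return statement assigns
// the named results before deferred functions run, so the check sees the error.
func fixedNew(r *resource) (cf string, resultErr error) {
	defer func() {
		if resultErr != nil {
			r.Close()
		}
	}()
	if err := initStep(); err != nil {
		return "", err
	}
	return "changefeed", nil
}

func main() {
	a, b := &resource{}, &resource{}
	_, errA := buggyNew(a)
	_, errB := fixedNew(b)
	fmt.Println(errA != nil, a.closed) // prints: true false (resource leaked)
	fmt.Println(errB != nil, b.closed) // prints: true true (resource closed)
}
```

Both functions return the same error; only the named-result version closes the resource on failure.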

@3AceShowHand
Contributor

"message": "rpc error: code = Unknown desc = rpc error: code = Unavailable desc = not leader" This may happen during a call to pdClient or etcdClient.

@3AceShowHand
Contributor

The problem should already be fixed by #2370.

Replay the scenario on release-4.0 before trying to fix it.

@3AceShowHand
Contributor

3AceShowHand commented Aug 25, 2021

[image]
The problem is already fixed in the release-4.0 branch.

I have manually tested it several times: test-2 was created and run on release-4.0, and it works fine.

@AkiraXie AkiraXie added the area/ticdc Issues or PRs related to TiCDC. label Mar 9, 2022