
The CDC client is still using the old PD address #9584

Closed
jacktd9 opened this issue Aug 15, 2023 · 10 comments · Fixed by #9713
Assignees
Labels
affects-6.5 This bug affects the 6.5.x(LTS) versions. affects-7.1 This bug affects the 7.1.x(LTS) versions. area/ticdc Issues or PRs related to TiCDC. severity/minor type/bug The issue is confirmed as a bug.

Comments

jacktd9 commented Aug 15, 2023

What did you do?

  1. Initially, there were 3 old PD nodes (pd1, pd2, pd3).
  2. A scale-out operation added 3 new PD nodes (pd4, pd5, pd6).
  3. We waited 5 minutes.
  4. A scale-in operation removed the 3 old PD nodes (pd1, pd2, pd3).
  5. We resumed the CDC changefeed (a hedged sketch of the equivalent commands is shown below).
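
For reference, a hedged sketch of roughly what these steps look like on a tiup cluster deployment; the cluster name, topology file, host names, and changefeed ID are placeholders, not taken from the report:

# assumed tiup cluster deployment; all names and addresses below are placeholders
tiup cluster scale-out my-cluster scale-out-pd.yaml                                 # add pd4, pd5, pd6
sleep 300                                                                           # wait ~5 minutes
tiup cluster scale-in my-cluster --node pd1-host:2379,pd2-host:2379,pd3-host:2379   # remove pd1, pd2, pd3
cdc cli changefeed resume --server 127.0.0.1:8300 -c my-changefeed                  # resume the changefeed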

The command failed, and the CDC logs showed that TiCDC was still accessing the old PD nodes.
(screenshot: CDC error log)

current PD address is... 2479
(screenshot)

What did you expect to see?

no error

What did you see instead?

connect pd failed

Versions of the cluster

Cluster version: v6.5.3

@jacktd9 jacktd9 added area/ticdc Issues or PRs related to TiCDC. type/bug The issue is confirmed as a bug. labels Aug 15, 2023
@nongfushanquan
Contributor

/assign @asddongmen

@asddongmen
Contributor

@jacktd9 May I ask if the changefeed has resumed normal synchronization? In other words, was the error log you found temporary or has it not been resolved yet?


jacktd9 commented Aug 17, 2023

  1. When we reloaded CDC and then executed the same 'resume' command again, it succeeded.

  2. Similarly, after CDC had been reloaded as in step 1, we attempted to update the changefeed configuration, but it returned a 500 error. This seemed to be caused by the old PD addresses still stored in the upstream info. After scaling one of the previously removed PD nodes back in and running the same 'update' command again, it succeeded.
(screenshots)


asddongmen commented Aug 18, 2023

> 1. When we reloaded CDC and then executed the same 'resume' command again, it succeeded.
>
> 2. Similarly, after CDC had been reloaded as in step 1, we attempted to update the changefeed configuration, but it returned a 500 error. This seemed to be caused by the old PD addresses still stored in the upstream info. After scaling one of the previously removed PD nodes back in and running the same 'update' command again, it succeeded. (screenshots)

So, if I understand correctly, TiCDC's pdClient is still using the old address and cannot update to the new one?

Based on our discussion and my comprehension, here is a summary:

  1. There are warning logs indicating that pdClient cannot connect to the old PD addresses, but the changefeed can still advance.
  2. After restarting the cdc process, you can successfully execute cdc cli changefeed resume, but you are unable to execute cdc cli changefeed update.
  3. When you scale one of the old PD nodes back into the PD cluster, TiCDC can correctly handle changefeed update requests.

Please correct me if I misunderstood anything. @jacktd9
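
For reference, a hedged sketch of the two cli calls being compared; the server address and changefeed ID are placeholders:

# both commands target the restarted TiCDC node; names are placeholders
cdc cli changefeed resume --server 127.0.0.1:8300 -c my-changefeed     # succeeds after the cdc process is restarted
cdc cli changefeed pause  --server 127.0.0.1:8300 -c my-changefeed     # a changefeed must be paused before it can be updated
cdc cli changefeed update --server 127.0.0.1:8300 -c my-changefeed --sink-uri 'blackhole://'   # returns HTTP 500 while the old PD addresses remain in the upstream info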

@asddongmen asddongmen added severity/minor affects-6.5 This bug affects the 6.5.x(LTS) versions. affects-7.1 This bug affects the 7.1.x(LTS) versions. labels Aug 31, 2023
ti-chi-bot bot pushed a commit that referenced this issue Nov 16, 2023
@asddongmen
Contributor

Fixed in v6.5.6.


kennytm commented Jan 9, 2025

We have found that it is still not possible to create changefeeds after a scale-out -> scale-in operation. Reproduction:

tiup playground v7.1.5 --db 1 --kv 1 --pd 3 --ticdc 3 --tiflash 0 --without-monitor
# sanity check, ensure that tidb & cdc works
mysql -u root -h 127.1 -P 4000 -e 'select * from mysql.tidb'
tiup cdc:v7.1.5 cli --server 127.0.0.1:8300 changefeed create --sink-uri 'blackhole://' -c test0
tiup cdc:v7.1.5 cli --server 127.0.0.1:8300 changefeed remove -c test0
# perform scale-out (do not scale out all 3 simultaneously!)
tiup playground scale-out --pd 1
tiup playground scale-out --pd 1
tiup playground scale-out --pd 1
# note the PIDs of the first 3 PDs 
tiup playground display 
# perform scale-in
tiup playground scale-in --pid 23397,23398,23399
# sanity check, tidb should still work
mysql -u root -h 127.1 -P 4000 -e 'select * from mysql.tidb'
mysql -u root -h 127.1 -P 4000 -e 'show config where type = "pd" and name = "cluster-version"'
# try to create changefeed again
tiup cdc:v7.1.5 cli --server 127.0.0.1:8300 changefeed create --sink-uri 'blackhole://' -c test1
# ^ the program above is now stuck.
#   cdc log shows a lot of warnings like:
#
#   [2025/01/09 15:24:37.834 +08:00] [WARN] [pd_service_discovery.go:370] ["[pd] failed to get cluster id"] [url=http://127.0.0.1:2382] [error="[PD:client:ErrClientGetMember]error:rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp 127.0.0.1:2382: connect: connection refused\" target:127.0.0.1:2382 status:TRANSIENT_FAILURE: error:rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp 127.0.0.1:2382: connect: connection refused\" target:127.0.0.1:2382 status:TRANSIENT_FAILURE"]
#
#   doing Ctrl+C may also give us
#
#   Error: [CDC:ErrNewStore]new store failed: [pd] failed to get cluster id

This is reproduced on v7.1.4, v7.1.6, v8.5.0. I'm going to reopen the issue.
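
As an additional sanity check (a hedged sketch, not part of the original reproduction), pd-ctl can be used to confirm that PD itself now lists only the new members while the cdc cli call above still hangs; the port is a placeholder for the client port of one of the surviving PD nodes shown by tiup playground display:

# pd-ctl via tiup; <surviving-pd-port> is a placeholder
tiup ctl:v7.1.5 pd -u http://127.0.0.1:<surviving-pd-port> member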

@kennytm kennytm reopened this Jan 9, 2025
@github-project-automation github-project-automation bot moved this from Done to In Progress in Question and Bug Reports Jan 9, 2025

lidezhu commented Jan 10, 2025

It seems cdc gets the PD address from command-line arguments.
(screenshot)
So this behavior matches the current expectation.
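
For context, a hedged sketch of how the PD endpoints typically end up on the TiCDC command line; the addresses are placeholders:

# cdc server is started with a fixed list of PD endpoints;
# this list is not refreshed when the PD cluster membership later changes
cdc server \
  --pd=http://pd1-host:2379,http://pd2-host:2379,http://pd3-host:2379 \
  --addr=0.0.0.0:8300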


kennytm commented Jan 10, 2025

@lidezhu No, the current behavior can be explained by the fact that the CDC owner handles the API using the --pd addresses it was initialized with, but that is not the behavior customers expect.
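
For completeness, the workaround implied earlier in this thread is to reload/restart the TiCDC nodes so they start with the current PD endpoint list, e.g. on a tiup cluster deployment (a hedged sketch; the cluster name is a placeholder):

# restarts the cdc nodes with refreshed configuration so they pick up the current PD list
tiup cluster reload my-cluster -R cdc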


lidezhu commented Jan 13, 2025

This appears to be a PD problem: tikv/pd#8993

@lidezhu lidezhu added affects-7.5 This bug affects the 7.5.x(LTS) versions. affects-8.1 This bug affects the 8.1.x(LTS) versions. affects-8.5 This bug affects the 8.5.x(LTS) versions. and removed affects-7.5 This bug affects the 7.5.x(LTS) versions. affects-8.1 This bug affects the 8.1.x(LTS) versions. affects-8.5 This bug affects the 8.5.x(LTS) versions. labels Jan 15, 2025

lidezhu commented Jan 15, 2025

Created a new issue, #12004, to make managing later cherry-picks easier. Closing this one.

@lidezhu lidezhu closed this as completed Jan 15, 2025
@github-project-automation github-project-automation bot moved this from In Progress to Done in Question and Bug Reports Jan 15, 2025