
ticdc redo apply shouldn't depend on meta #6189

Closed
4 tasks done
hicqu opened this issue Jul 5, 2022 · 21 comments · Fixed by #6207
Labels
affects-5.3 · affects-5.4 (This bug affects the 5.4.x (LTS) versions.) · affects-6.1 (This bug affects the 6.1.x (LTS) versions.) · area/ticdc (Issues or PRs related to TiCDC.) · bug-from-internal-test (Bugs found by internal testing.) · question (Further information is requested.) · severity/critical · type/bug (The issue is confirmed as a bug.)

Comments

@hicqu
Contributor

hicqu commented Jul 5, 2022


What is your question?

In the current implementation, when applying redo logs to a given downstream sink, the event timestamp range is taken from the redo meta as (meta.checkpointTs, meta.resolvedTs].

The redo-log meta file is flushed only periodically. However, events can be flushed to downstream sinks as soon as a redo log file is flushed, without waiting for the redo-log meta. As a result, cdc redo apply cannot guarantee downstream consistency. For example (a sketch of the violated requirement follows this list):

  1. a table's redo log is flushed with timestamp ts1;
  2. however, the current redo-meta resolved timestamp is still ts0, which is less than ts1;
  3. events before ts1 have been flushed into the downstream sinks;
  4. cdc is killed, and cdc redo apply is run with the current redo meta;
  5. after recovery, the downstream can contain events after the point ts0, which breaks the required consistency.
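
To make the broken requirement concrete, here is a minimal, self-contained Go sketch. The function and variable names (checkRedoInvariant, metaResolvedTs, sinkFlushTs) are hypothetical, not TiCDC's actual APIs; the check only illustrates that the downstream sink should never be flushed past the resolved ts already persisted in the redo meta, which is exactly what steps 1-3 above violate:

```go
// Hypothetical check (not TiCDC's actual API) for the requirement described
// above: the downstream sink must never be flushed past the resolved ts that
// is already persisted in the redo meta, otherwise redo apply, which trusts
// (meta.checkpointTs, meta.resolvedTs], cannot restore a consistent snapshot.
package main

import "log"

func checkRedoInvariant(metaResolvedTs, sinkFlushTs uint64) {
	if sinkFlushTs > metaResolvedTs {
		// Steps 1-3 above: the downstream advanced toward ts1 while the
		// persisted meta still says ts0 < ts1.
		log.Fatalf("downstream flushed up to %d but redo meta only covers %d; "+
			"after redo apply the downstream may still contain events in (%d, %d]",
			sinkFlushTs, metaResolvedTs, metaResolvedTs, sinkFlushTs)
	}
}

func main() {
	// ts0 = resolved ts persisted in the redo meta, ts1 = ts flushed downstream.
	checkRedoInvariant(100, 120) // fires, mirroring the fatal log shown further below
}
```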

I have prepared a branch; please note the fatal log it adds. The panic does indeed happen when I run that branch:

[2022/07/05 16:00:27.117 +08:00] [FATAL] [mysql.go:237] ["[QP] mysqlSink.FlushRowChangedEvents is called with ts less than LogWriterResolved"] [stack=
  github.com/pingcap/tiflow/cdc/sink/mysql.(*mysqlSink).FlushRowChangedEvents
      github.com/pingcap/tiflow/cdc/sink/mysql/mysql.go:237
  github.com/pingcap/tiflow/cdc/sink.(*tableSink).flushResolvedTs
      github.com/pingcap/tiflow/cdc/sink/table_sink.go:109
  github.com/pingcap/tiflow/cdc/sink.(*tableSink).FlushRowChangedEvents
      github.com/pingcap/tiflow/cdc/sink/table_sink.go:86
  github.com/pingcap/tiflow/cdc/processor/pipeline.(*sinkNode).flushSink
      github.com/pingcap/tiflow/cdc/processor/pipeline/sink.go:155
  github.com/pingcap/tiflow/cdc/processor/pipeline.(*sinkNode).updateBarrierTs
      github.com/pingcap/tiflow/cdc/processor/pipeline/sink.go:338
  github.com/pingcap/tiflow/cdc/processor/pipeline.(*sinkNode).HandleMessage
      github.com/pingcap/tiflow/cdc/processor/pipeline/sink.go:329
  github.com/pingcap/tiflow/cdc/processor/pipeline.(*sinkNode).Receive
      github.com/pingcap/tiflow/cdc/processor/pipeline/sink.go:289
  github.com/pingcap/tiflow/pkg/pipeline.(*nodeRunner).run
      github.com/pingcap/tiflow/pkg/pipeline/runner.go:65
  github.com/pingcap/tiflow/pkg/pipeline.(*Pipeline).driveRunner
      github.com/pingcap/tiflow/pkg/pipeline/pipeline.go:97
]

The test confirms that the above analysis is correct.

To resolve this critical bug, we could either

  • always flush the log meta whenever log files are flushed (see the sketch after this list), or
  • remove the log meta from the design and put checkpointTs and resolvedTs into the redo-log entries themselves.
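
For the first option, a rough sketch of coupling the meta flush to the log-file flush is below. The redoWriter/redoMeta types and the JSON meta-file layout are assumptions for illustration, not the real tiflow redo package:

```go
// Illustrative sketch: sync the log file first, then atomically rewrite the
// meta, so the persisted meta can never lag behind the persisted logs.
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// redoMeta mirrors the idea of the redo-log meta: redo apply replays the
// range (CheckpointTs, ResolvedTs].
type redoMeta struct {
	CheckpointTs uint64 `json:"checkpoint-ts"`
	ResolvedTs   uint64 `json:"resolved-ts"`
}

type redoWriter struct {
	logFile  *os.File
	metaPath string
	meta     redoMeta
}

// flush syncs the buffered log entries first, then atomically rewrites the
// meta file. Only after flush returns may the caller let the downstream sink
// advance up to resolvedTs.
func (w *redoWriter) flush(resolvedTs uint64) error {
	if err := w.logFile.Sync(); err != nil {
		return err
	}
	w.meta.ResolvedTs = resolvedTs
	data, err := json.Marshal(w.meta)
	if err != nil {
		return err
	}
	tmp := w.metaPath + ".tmp"
	if err := os.WriteFile(tmp, data, 0o600); err != nil {
		return err
	}
	return os.Rename(tmp, w.metaPath) // atomic replace on POSIX filesystems
}

func main() {
	f, err := os.CreateTemp("", "redo-*.log")
	if err != nil {
		panic(err)
	}
	defer os.Remove(f.Name())
	defer f.Close()

	w := &redoWriter{logFile: f, metaPath: f.Name() + ".meta"}
	defer os.Remove(w.metaPath)

	if err := w.flush(120); err != nil {
		panic(err)
	}
	fmt.Println("log and meta flushed together, resolved ts =", w.meta.ResolvedTs)
}
```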
@hicqu added the question label on Jul 5, 2022
@hicqu
Contributor Author

hicqu commented Jul 5, 2022

/label bug-from-internal-test

@ti-chi-bot added the bug-from-internal-test label on Jul 5, 2022
@amyangfei
Contributor

amyangfei commented Jul 7, 2022

The timing should be

  1. flush the redo log meta,
  2. then update the global resolved ts,
  3. then flush row changes to the backend sink according to the global resolved ts.

Which of these rules is broken?
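
A minimal sketch of that ordering as code, with stand-in names (flushRedoMeta, updateGlobalResolvedTs, flushRowsToBackendSink are illustrative, not TiCDC's real functions):

```go
package main

import "fmt"

// Stand-ins for the three rules above; not TiCDC's real functions.
func flushRedoMeta(ts uint64) error          { fmt.Println("1. redo meta flushed at", ts); return nil }
func updateGlobalResolvedTs(ts uint64)       { fmt.Println("2. global resolved ts =", ts) }
func flushRowsToBackendSink(ts uint64) error { fmt.Println("3. rows flushed up to", ts); return nil }

// advance enforces the rule: rows may reach the backend sink only up to a ts
// that the redo meta has already persisted.
func advance(ts uint64) error {
	if err := flushRedoMeta(ts); err != nil {
		return err
	}
	updateGlobalResolvedTs(ts)
	return flushRowsToBackendSink(ts)
}

func main() {
	if err := advance(100); err != nil {
		fmt.Println("advance failed:", err)
	}
}
```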

@amyangfei
Contributor

/cc @CharlesCheung96

@hicqu
Contributor Author

hicqu commented Jul 7, 2022

/label

@CharlesCheung96
Contributor

/type bug

@ti-chi-bot added the type/bug label on Jul 7, 2022
@CharlesCheung96
Contributor

/label affects-5.3
/label affects-5.4
/label affects-6.1

@ti-chi-bot added the affects-5.3, affects-5.4, and affects-6.1 labels on Jul 7, 2022
@hicqu
Contributor Author

hicqu commented Jul 7, 2022

/label severity/critical

@ti-chi-bot
Member

@hicqu: The label(s) severity/critical cannot be applied. These labels are supported: duplicate, bug-from-internal-test, bug-from-user, affects-4.0, affects-5.0, affects-5.1, affects-5.2, affects-5.3, affects-5.4, affects-6.0, affects-6.1, may-affects-4.0, may-affects-5.0, may-affects-5.1, may-affects-5.2, may-affects-5.3, may-affects-5.4, may-affects-6.0, may-affects-6.1, needs-cherry-pick-release-4.0, needs-cherry-pick-release-5.0, needs-cherry-pick-release-5.1, needs-cherry-pick-release-5.2, needs-cherry-pick-release-5.3, needs-cherry-pick-release-5.4, needs-cherry-pick-release-6.0, needs-cherry-pick-release-6.1, question, release-blocker, wontfix, require-LGT1, require-LGT3.

In response to this:

/label severity/critical

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

@zhaoxinyu
Contributor

/severity critical

@zhaoxinyu
Contributor

/remove may-affects-4.0

@zhaoxinyu
Contributor

/remove-label may-affects-4.0

@zhaoxinyu
Contributor

/remove-label may-affects-5.0
/remove-label may-affects-5.1
/remove-label may-affects-5.2

@zhaoxinyu
Contributor

/remove-label may-affects-6.0

@nongfushanquan
Contributor

/label affects-6.1

@nongfushanquan
Contributor

/open

@nongfushanquan
Contributor

/reopen

@ti-chi-bot
Member

@nongfushanquan: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@nongfushanquan
Contributor

After PR #6207, issue #6277 will track the new problem in this situation.

@nongfushanquan
Contributor

/close

@ti-chi-bot
Member

@nongfushanquan: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@nongfushanquan
Contributor

/area ticdc
