Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(stream): create cdc table reader and source data stream with retry #19467

Merged
merged 24 commits into from
Dec 3, 2024

Conversation

StrikeW
Copy link
Contributor

@StrikeW StrikeW commented Nov 20, 2024

I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

What's changed and what's your intention?

Previously we try to create the table reader in src/stream/src/from_proto/stream_cdc_scan.rs, if upstream database is down, the executor cannot be built so that the cluster will enter recovery loop. This pr try to create the table reader after we forward the initial barrier, so that it won't block recovery if cannot connect to upstream db.

We also add retry logic to the source executor when we create the data stream at first time.

close #17807

Checklist

  • I have written necessary rustdoc comments
  • I have added necessary unit tests and integration tests
  • I have added test labels as necessary. See details.
  • I have added fuzzing tests or opened an issue to track them. (Optional, recommended for new SQL features Sqlsmith: Sql feature generation #7934).
  • My PR contains breaking changes. (If it deprecates some features, please create a tracking issue to remove them in the future).
  • All checks passed in ./risedev check (or alias, ./risedev c)
  • My PR changes performance-critical code. (Please run macro/micro-benchmarks and show the results.)
  • My PR contains critical fixes that are necessary to be merged into the latest release. (Please check out the details)

Documentation

  • My PR needs documentation updates. (Please use the Release note section below to summarize the impact on users)

Release note

If this PR includes changes that directly affect users or other significant modifications relevant to the community, kindly draft a release note to provide a concise summary of these changes. Please prioritize highlighting the impact these changes will have on users.

Copy link
Member

@fuyufjh fuyufjh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM!

src/stream/src/executor/source/source_executor.rs Outdated Show resolved Hide resolved
let mut need_resume_after_build = false;
// loop to create source stream until success
loop {
if let Some(barrier) = build_source_stream_and_poll_barrier(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI. Link a Slack conversation here.

@tabVersion: It also affects the process of creating table, auto retry here means the create mv / create table command will always succeed, even if meeting some runtime error, such as Kafka group auth issue. It prevents rw from consuming data and requires the user setting up the ACL. We can only find this issue when building an actor.

So does this issue get noticed here? What happen if I create a source with some bad settings (such the permission issue mentioned above)?

Copy link
Contributor Author

@StrikeW StrikeW Dec 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It prevents rw from consuming data and requires the user setting up the ACL. We can only find this issue when building an actor.
What happen if I create a source with some bad settings (such the permission issue mentioned above)?

I think MQ source enumerator should employ a validation logic to check these prerequisites, so that prerequisite for connectivity should be no problem when we are creating executors. But I am not sure whether we have the check for MQ sources. cc @tabVersion

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for relying on the validation logic.

(Just saying, if someone is too lazy to write a full set of validation, they may just build a real Consumer and see if anything goes well.)

@StrikeW StrikeW added this pull request to the merge queue Dec 3, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Dec 3, 2024
@StrikeW StrikeW added this pull request to the merge queue Dec 3, 2024
Copy link
Member

@xxchan xxchan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SourceExecutor generally LGTM. It would be better to have some e2e tests though.

Comment on lines +540 to +541
loop {
if let Some(barrier) = build_source_stream_and_poll_barrier(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: while let might be better than loop { if let

@@ -736,6 +828,29 @@ impl<S: StateStore> SourceExecutor<S> {
}
}

async fn build_source_stream_and_poll_barrier(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: maybe this should be called

Suggested change
async fn build_source_stream_and_poll_barrier(
async fn build_source_stream_or_receive_barrier(

Merged via the queue into main with commit 404998e Dec 3, 2024
28 of 29 checks passed
@StrikeW StrikeW deleted the siyuan/cdc-retry branch December 3, 2024 13:39
StrikeW added a commit that referenced this pull request Dec 4, 2024
StrikeW added a commit that referenced this pull request Dec 5, 2024
lmatz added a commit that referenced this pull request Dec 5, 2024
xxchan added a commit that referenced this pull request Dec 9, 2024
xxchan added a commit that referenced this pull request Dec 12, 2024
xxchan added a commit that referenced this pull request Dec 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

bug(cdc source): should retry on upstream failure instead of blocking recovery
6 participants