
fix: lost data due to mysql two-phase-commit #1054

Closed · wants to merge 2 commits

Conversation

@shaohk (Contributor) commented Nov 15, 2021

fix: lost data due to mysql two-phase-commit

A Pull Request should be associated with an Issue.

We wish to have discussions in Issues. A single issue may be targeted by multiple PRs.
If you're offering a new feature or fixing anything, we'd like to know beforehand in Issues,
and potentially we'll be able to point development in a particular direction.

Related issue: #1030

Further notes in https://github.com/github/gh-ost/blob/master/.github/CONTRIBUTING.md
Thank you! We are open to PRs, but please understand if for technical reasons we are unable to accept each and any PR

Description

This PR fixes lost data due to MySQL two-phase commit.

In case this PR introduced Go code changes:

  • contributed code is using same conventions as original code
  • script/cibuild returns with no formatting errors, build errors or unit test errors.

@Fanduzi (Contributor) commented Nov 16, 2021

Very clever!

@timvaillancourt (Collaborator)

@wangzihuacool are you able to test if this change fixes the issue?

I think this approach is a bit cleaner than using a lock in #1040, assuming it all works

@wangzihuacool (Contributor)

Both using a shared lock in #1040 and triggering a WriteChangelog event can fix the issue; I have tested both.
I think using a shared lock is a direct and effective fix with better compatibility, just like the way rowcopy does it. By contrast, triggering a WriteChangelog event seems more clever, but it is a bit confusing and has a small side effect: the gh-ost migration may exit abnormally if the WriteChangelog fails or times out for some reason.

@shaohk (Contributor, Author) commented Nov 18, 2021

> Both using a shared lock in #1040 and triggering a WriteChangelog event can fix the issue; I have tested both. I think using a shared lock is a direct and effective fix with better compatibility, just like the way rowcopy does it. By contrast, triggering a WriteChangelog event seems more clever, but it is a bit confusing and has a small side effect: the gh-ost migration may exit abnormally if the WriteChangelog fails or times out for some reason.

Yes, using a shared lock is also a very effective approach. But we were worried about unexpected problems caused by the shared lock, so I added a heartbeat write here instead. Even if gh-ost fails because the heartbeat write times out, that is a low-probability event and acceptable.

@Fanduzi (Contributor) commented Jan 6, 2022

Any progress?

@Fanduzi (Contributor) commented Apr 30, 2022

Hi @dm-2 @timvaillancourt
#1040 and #1054 have each had a PR up to fix the semi-sync data-loss bug for some time now; are there any plans to merge?

@timvaillancourt timvaillancourt self-requested a review June 16, 2022 19:24
@timvaillancourt timvaillancourt added this to the v1.1.5 milestone Jun 16, 2022
@timvaillancourt (Collaborator) left a comment


LGTM. Updating the heartbeat is an extremely safe operation with very low risk 👍

cc @dm-2 / @rashiq for review/merge

@@ -388,6 +388,12 @@ func (this *Migrator) Migrate() (err error) {
if err := this.addDMLEventsListener(); err != nil {
return err
}

// write a heartbeat data into the ghc table, and avoid lost data due to MySQL two-phase-commit.
Collaborator

@shaohk could you add a bit more background in this comment on "why" writing the heartbeat causes us to avoid lost data due to MySQL two-phase-commit? That will be useful to those without context

Contributor

I'll try to describe the scenario:

When semi-sync is enabled with rpl_semi_sync_master_wait_point=AFTER_SYNC, an INSERT that is mid-commit can block waiting for the required number of replica acks. The row it inserted is not yet visible to ReadMigrationRangeValues, so the copy of the table's existing data excludes the new row. However, the transaction's binlog event has already been written to the binlog, so addDMLEventsListener starts streaming from a position after that event; the event is never captured, and the row is lost.

With the heartbeat, if a transaction's commit is blocked waiting for acks, the heartbeat write blocks on the same commit path, so ReadMigrationRangeValues does not run yet. Once the heartbeat commits successfully, ReadMigrationRangeValues will see the newly inserted row, avoiding the data loss described above.

Collaborator

@Fanduzi that's my understanding too 👍

I'm hoping we can add some of this context into the code-comment on line 392, so that this is easier to read for those without context

@timvaillancourt (Collaborator) commented Jun 18, 2022

@shaohk two more questions:

  1. Would it make sense to move this to be within the .ReadMigrationRangeValues() func? The goal with the .Migrate() func is to leave a lot of the details to other functions
  2. What do you think about updating a ChangeLogState instead of the heartbeat? .WriteChangelogState() in the applier is used for this. Perhaps a new ChangeLogState like ReadMigrationRangeValues would work here 🤔. Example PR

@shaohk (Contributor, Author)

> @shaohk two more questions:
>
>   1. Would it make sense to move this to be within the .ReadMigrationRangeValues() func? The goal with the .Migrate() func is to leave a lot of the details to other functions
>   2. What do you think about updating a ChangeLogState instead of the heartbeat? .WriteChangelogState() in the applier is used for this. Perhaps a new ChangeLogState like ReadMigrationRangeValues would work here 🤔. Example PR

Sorry, I just saw these messages.

  1. I have completed the explanatory comment in the code.
  2. Both suggestions are very good; I will revise the PR and submit it again.

@shaohk (Contributor, Author) commented Jun 24, 2022

> @shaohk two more questions:
>
>   1. Would it make sense to move this to be within the .ReadMigrationRangeValues() func? The goal with the .Migrate() func is to leave a lot of the details to other functions
>   2. What do you think about updating a ChangeLogState instead of the heartbeat? .WriteChangelogState() in the applier is used for this. Perhaps a new ChangeLogState like ReadMigrationRangeValues would work here 🤔. Example PR

@timvaillancourt
Use changelog state instead of the heartbeat PR

@shaohk (Contributor, Author) commented Jun 24, 2022

> I'll try to describe the scenario:
>
> When semi-sync is enabled with rpl_semi_sync_master_wait_point=AFTER_SYNC, an INSERT that is mid-commit can block waiting for the required number of replica acks. The row it inserted is not yet visible to ReadMigrationRangeValues, so the copy of the table's existing data excludes the new row. However, the transaction's binlog event has already been written to the binlog, so addDMLEventsListener starts streaming from a position after that event; the event is never captured, and the row is lost.
>
> With the heartbeat, if a transaction's commit is blocked waiting for acks, the heartbeat write blocks on the same commit path, so ReadMigrationRangeValues does not run yet. Once the heartbeat commits successfully, ReadMigrationRangeValues will see the newly inserted row, avoiding the data loss described above.

@Fanduzi
Thanks, I added your description to the code comments. #1141

@timvaillancourt timvaillancourt modified the milestones: v1.1.5, v1.1.6 Jun 23, 2022
@timvaillancourt (Collaborator)

Closing in favour of #1141 👍

@timvaillancourt timvaillancourt modified the milestones: v1.1.6, v1.1.5 Jul 6, 2022
@dm-2 dm-2 removed this from the v1.1.5 milestone Jul 7, 2022