Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Dumping upgrade info interrupted by cosmovisor #9384

Merged
merged 4 commits into from
Jun 21, 2021

Conversation

yihuang
Copy link
Collaborator

@yihuang yihuang commented May 24, 2021

Solution:

  • dumping upgrade info before emit UPGRADED NEEDED log which will
    cause cosmovisor to kill chain process

Description

The problematic procedure:

  1. chain process output UPGRADE NEEDED log
  2. cosmovisor see the message and kill the chain binary
  3. chain process dump upgrade info and panic itself

the step 2 and 3 runs concurrently, so the dumping process can be interrupted by cosmovisor's terminate signal. The proposed solution is to dump upgrade info before emitting the log.
there are two problematic situation:

  1. the upgrade info file is created, but content is not written or flushed before killed, when the chain process restart, it'll panic because of json parsing error.
  2. the upgrade info file is not created at all, when the chain process restart, the store upgrades are not activated, will cause app hash mismatch error later on.

Before we can merge this PR, please make sure that all the following items have been
checked off. If any of the checklist items are not applicable, please leave them but
write a little note why.

  • Targeted PR against correct branch (see CONTRIBUTING.md)
  • Linked to Github issue with discussion and accepted design OR link to spec that describes this work.
  • Code follows the module structure standards.
  • Wrote unit and integration tests
  • Updated relevant documentation (docs/) or specification (x/<module>/spec/)
  • Added relevant godoc comments.
  • Added a relevant changelog entry to the Unreleased section in CHANGELOG.md
  • Re-reviewed Files changed in the Github PR explorer
  • Review Codecov Report in the comment section below once CI passes

Solution:
- dumping upgrade info before emit `UPGRADED NEEDED` log which will
  cause cosmovisor to kill chain process
@codecov
Copy link

codecov bot commented May 24, 2021

Codecov Report

Merging #9384 (e6c706b) into master (304d0ff) will decrease coverage by 0.15%.
The diff coverage is 100.00%.

❗ Current head e6c706b differs from pull request most recent head bce3a42. Consider uploading reports for the commit bce3a42 to get more accurate results
Impacted file tree graph

@@            Coverage Diff             @@
##           master    #9384      +/-   ##
==========================================
- Coverage   60.59%   60.43%   -0.16%     
==========================================
  Files         589      590       +1     
  Lines       37227    37243      +16     
==========================================
- Hits        22556    22508      -48     
- Misses      12727    12754      +27     
- Partials     1944     1981      +37     
Impacted Files Coverage Δ
x/upgrade/abci.go 92.30% <100.00%> (ø)
types/tx_msg.go 0.00% <0.00%> (-100.00%) ⬇️
x/distribution/simulation/operations.go 82.24% <0.00%> (-9.48%) ⬇️
x/params/simulation/operations.go 77.77% <0.00%> (-8.59%) ⬇️
x/staking/simulation/operations.go 73.46% <0.00%> (-6.47%) ⬇️
x/feegrant/simulation/operations.go 78.84% <0.00%> (-6.11%) ⬇️
x/staking/genesis.go 55.38% <0.00%> (-4.62%) ⬇️
x/gov/simulation/operations.go 80.88% <0.00%> (-4.57%) ⬇️
client/grpc/tmservice/service.go 70.09% <0.00%> (-0.28%) ⬇️
x/authz/client/testutil/tx.go 97.63% <0.00%> (-0.16%) ⬇️
... and 22 more

@cyberbono3
Copy link
Contributor

@yihuang what issue number does this PR resolve?

@yihuang
Copy link
Collaborator Author

yihuang commented Jun 3, 2021

@yihuang what issue number does this PR resolve?

I didn't open an issue for it, I described the bug in PR description directly. We find the issue in our testing.

@cyberbono3 cyberbono3 self-requested a review June 3, 2021 14:47
@cyberbono3
Copy link
Contributor

@yihuang How can i reproduce this bug?

@aaronc
Copy link
Member

aaronc commented Jun 3, 2021

Thanks @yihuang 🙏

@yihuang
Copy link
Collaborator Author

yihuang commented Jun 3, 2021

@yihuang How can i reproduce this bug?

Add code to read upgrade info like this, run a devnet with cosmovisor and do an upgrade, there are chances that it run to this panic.

@aaronc aaronc changed the title [upgrade] Dumping upgrade info interrupted by cosmovisor fix: Dumping upgrade info interrupted by cosmovisor Jun 3, 2021
@aaronc
Copy link
Member

aaronc commented Jun 3, 2021

@cyberbono3 I don't think we need to actually reproduce this. The patch is pretty straightforward and looks correct. Cosmovisor will currently kill the node once that message hits stdout and that could interrupt writing the file. So changing the order will fix that. What we really need to do for 0.42 nodes is get #8590 finished ASAP.

@aaronc aaronc added the A:automerge Automatically merge PR once all prerequisites pass. label Jun 21, 2021
@tac0turtle
Copy link
Member

@Mergifyio backport release/v0.42.x

mergify bot pushed a commit that referenced this pull request Jun 29, 2021
Solution:
- dumping upgrade info before emit `UPGRADED NEEDED` log which will
  cause cosmovisor to kill chain process

<!-- < < < < < < < < < < < < < < < < < < < < < < < < < < < < < < < < < ☺
v                               ✰  Thanks for creating a PR! ✰
v    Before smashing the submit button please review the checkboxes.
v    If a checkbox is n/a - please still include it but + a little note why
☺ > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >  -->

## Description

The problematic procedure:

1. chain process output UPGRADE NEEDED log
2. cosmovisor see the message and kill the chain binary
3. chain process dump upgrade info and panic itself

the step 2 and 3 runs concurrently, so the dumping process can be interrupted by cosmovisor's terminate signal. The proposed solution is to dump upgrade info before emitting the log.
there are two problematic situation:
1. the upgrade info file is created, but content is not written or flushed before killed, when the chain process restart, it'll panic because of json parsing error.
2. the upgrade info file is not created at all, when the chain process restart, the [store upgrades](https://github.com/crypto-org-chain/chain-main/blob/master/app/app.go#L436) are not activated, will cause app hash mismatch error later on.

---

Before we can merge this PR, please make sure that all the following items have been
checked off. If any of the checklist items are not applicable, please leave them but
write a little note why.

- [x] Targeted PR against correct branch (see [CONTRIBUTING.md](https://github.com/cosmos/cosmos-sdk/blob/master/CONTRIBUTING.md#pr-targeting))
- [ ] Linked to Github issue with discussion and accepted design OR link to spec that describes this work.
- [x] Code follows the [module structure standards](https://github.com/cosmos/cosmos-sdk/blob/master/docs/building-modules/structure.md).
- [ ] Wrote unit and integration [tests](https://github.com/cosmos/cosmos-sdk/blob/master/CONTRIBUTING.md#testing)
- [ ] Updated relevant documentation (`docs/`) or specification (`x/<module>/spec/`)
- [ ] Added relevant `godoc` [comments](https://blog.golang.org/godoc-documenting-go-code).
- [ ] Added a relevant changelog entry to the `Unreleased` section in `CHANGELOG.md`
- [x] Re-reviewed `Files changed` in the Github PR explorer
- [x] Review `Codecov Report` in the comment section below once CI passes

(cherry picked from commit 0540ed2)
@mergify
Copy link
Contributor

mergify bot commented Jun 29, 2021

Command backport release/v0.42.x: success

Backports have been created

amaury1093 pushed a commit that referenced this pull request Jun 29, 2021
Solution:
- dumping upgrade info before emit `UPGRADED NEEDED` log which will
  cause cosmovisor to kill chain process

<!-- < < < < < < < < < < < < < < < < < < < < < < < < < < < < < < < < < ☺
v                               ✰  Thanks for creating a PR! ✰
v    Before smashing the submit button please review the checkboxes.
v    If a checkbox is n/a - please still include it but + a little note why
☺ > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >  -->

## Description

The problematic procedure:

1. chain process output UPGRADE NEEDED log
2. cosmovisor see the message and kill the chain binary
3. chain process dump upgrade info and panic itself

the step 2 and 3 runs concurrently, so the dumping process can be interrupted by cosmovisor's terminate signal. The proposed solution is to dump upgrade info before emitting the log.
there are two problematic situation:
1. the upgrade info file is created, but content is not written or flushed before killed, when the chain process restart, it'll panic because of json parsing error.
2. the upgrade info file is not created at all, when the chain process restart, the [store upgrades](https://github.com/crypto-org-chain/chain-main/blob/master/app/app.go#L436) are not activated, will cause app hash mismatch error later on.

---

Before we can merge this PR, please make sure that all the following items have been
checked off. If any of the checklist items are not applicable, please leave them but
write a little note why.

- [x] Targeted PR against correct branch (see [CONTRIBUTING.md](https://github.com/cosmos/cosmos-sdk/blob/master/CONTRIBUTING.md#pr-targeting))
- [ ] Linked to Github issue with discussion and accepted design OR link to spec that describes this work.
- [x] Code follows the [module structure standards](https://github.com/cosmos/cosmos-sdk/blob/master/docs/building-modules/structure.md).
- [ ] Wrote unit and integration [tests](https://github.com/cosmos/cosmos-sdk/blob/master/CONTRIBUTING.md#testing)
- [ ] Updated relevant documentation (`docs/`) or specification (`x/<module>/spec/`)
- [ ] Added relevant `godoc` [comments](https://blog.golang.org/godoc-documenting-go-code).
- [ ] Added a relevant changelog entry to the `Unreleased` section in `CHANGELOG.md`
- [x] Re-reviewed `Files changed` in the Github PR explorer
- [x] Review `Codecov Report` in the comment section below once CI passes

(cherry picked from commit 0540ed2)

Co-authored-by: yihuang <huang@crypto.com>
@amaury1093
Copy link
Contributor

@Mergifyio backport release/v0.43.x

mergify bot pushed a commit that referenced this pull request Jun 29, 2021
Solution:
- dumping upgrade info before emit `UPGRADED NEEDED` log which will
  cause cosmovisor to kill chain process

<!-- < < < < < < < < < < < < < < < < < < < < < < < < < < < < < < < < < ☺
v                               ✰  Thanks for creating a PR! ✰
v    Before smashing the submit button please review the checkboxes.
v    If a checkbox is n/a - please still include it but + a little note why
☺ > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >  -->

## Description

The problematic procedure:

1. chain process output UPGRADE NEEDED log
2. cosmovisor see the message and kill the chain binary
3. chain process dump upgrade info and panic itself

the step 2 and 3 runs concurrently, so the dumping process can be interrupted by cosmovisor's terminate signal. The proposed solution is to dump upgrade info before emitting the log.
there are two problematic situation:
1. the upgrade info file is created, but content is not written or flushed before killed, when the chain process restart, it'll panic because of json parsing error.
2. the upgrade info file is not created at all, when the chain process restart, the [store upgrades](https://github.com/crypto-org-chain/chain-main/blob/master/app/app.go#L436) are not activated, will cause app hash mismatch error later on.

---

Before we can merge this PR, please make sure that all the following items have been
checked off. If any of the checklist items are not applicable, please leave them but
write a little note why.

- [x] Targeted PR against correct branch (see [CONTRIBUTING.md](https://github.com/cosmos/cosmos-sdk/blob/master/CONTRIBUTING.md#pr-targeting))
- [ ] Linked to Github issue with discussion and accepted design OR link to spec that describes this work.
- [x] Code follows the [module structure standards](https://github.com/cosmos/cosmos-sdk/blob/master/docs/building-modules/structure.md).
- [ ] Wrote unit and integration [tests](https://github.com/cosmos/cosmos-sdk/blob/master/CONTRIBUTING.md#testing)
- [ ] Updated relevant documentation (`docs/`) or specification (`x/<module>/spec/`)
- [ ] Added relevant `godoc` [comments](https://blog.golang.org/godoc-documenting-go-code).
- [ ] Added a relevant changelog entry to the `Unreleased` section in `CHANGELOG.md`
- [x] Re-reviewed `Files changed` in the Github PR explorer
- [x] Review `Codecov Report` in the comment section below once CI passes

(cherry picked from commit 0540ed2)
@mergify
Copy link
Contributor

mergify bot commented Jun 29, 2021

Command backport release/v0.43.x: success

Backports have been created

@lampkin-diet
Copy link

Hello all. I don't know is it the same issue or maybe another one should be created.
I have the next issue on f479b515a8ce2353ab583525a029f7e68dad4e5f commit master branch.
After proposing software-upgrade and getting ready-to-upgrade status (all needed deposits and votes was set up) I have the next errors:

node0_1  | 7:31PM ERR UPGRADE "test_upgrade" NEEDED at height: 100: 
node1_1  | 7:31PM ERR UPGRADE "test_upgrade" NEEDED at height: 100: 
node2_1  | 7:31PM ERR UPGRADE "test_upgrade" NEEDED at height: 100: 
node2_1  | unexpected end of JSON input
node1_1  | starting to take backup of data directory at time 2021-08-10 19:31:43.193928884 +0000 UTC m=+541.849044875backup saved at location: /home/some/.somenode/data-backup-2021-8-10, completed at time: 2021-08-10 19:31:43.198799961 +0000 UTC m=+541.853915927
node0_1  | starting to take backup of data directory at time 2021-08-10 19:31:43.189057128 +0000 UTC m=+543.920612246backup saved at location: /home/some/.somenode/data-backup-2021-8-10, completed at time: 2021-08-10 19:31:43.198389991 +0000 UTC m=+543.929945105
....
node1_1  | 7:31PM ERR UPGRADE "test_upgrade" NEEDED at height: 100: 
node0_1  | 7:31PM ERR UPGRADE "test_upgrade" NEEDED at height: 100: 
node0_1  | starting to take backup of data directory at time 2021-08-10 19:31:43.416337948 +0000 UTC m=+544.147893062backup saved at location: /home/some/.somenode/data-backup-2021-8-10, completed at time: 2021-08-10 19:31:43.423575189 +0000 UTC m=+544.155130323
node1_1  | starting to take backup of data directory at time 2021-08-10 19:31:43.416321727 +0000 UTC m=+542.071437691backup saved at location: /home/some/.somenode/data-backup-2021-8-10, completed at time: 2021-08-10 19:31:43.423369993 +0000 UTC m=+542.078485960
node1_1  | time taken to complete the backup: 7.048269ms7:31PM INF starting ABCI with Tendermint
node0_1  | time taken to complete the backup: 7.237261ms7:31PM INF starting ABCI with Tendermint
...
....
node1_1  | 7:31PM ERR UPGRADE "test_upgrade" NEEDED at height: 100: 
node1_1  | starting to take backup of data directory at time 2021-08-10 19:31:43.564978705 +0000 UTC m=+542.220094651backup saved at location: /home/some/.somenode/data-backup-2021-8-10, completed at time: 2021-08-10 19:31:43.573258244 +0000 UTC m=+542.228374216
....
node0_1  | 7:31PM ERR UPGRADE "test_upgrade" NEEDED at height: 100: 
node0_1  | starting to take backup of data directory at time 2021-08-10 19:31:43.742330351 +0000 UTC m=+544.473885449backup saved at location: /home/some/.somenode/data-backup-2021-8-10, completed at time: 2021-08-10 19:31:43.752006195 +0000 UTC m=+544.483561322
node1_1  | time taken to complete the backup: 8.877305ms7:31PM INF starting ABCI with Tendermint
node0_1  | time taken to complete the backup: 9.675873ms7:31PM INF starting ABCI with Tendermint
....
....
node1_1  | 7:31PM ERR UPGRADE "test_upgrade" NEEDED at height: 100: 
node1_1  | unexpected end of JSON input

node0_1, node1_1, node2_1, node3_1 are docker containers
upgrade parameters are:

ENV DAEMON_HOME=/home/some/.somenode
ENV DAEMON_NAME=some-noded
ENV DAEMON_RESTART_AFTER_UPGRADE=true

Is it the same issue or maybe I can create another one?

@robert-zaremba
Copy link
Collaborator

With #8590 we don't have the race condition any more, because the "new" cosmovisor doesn't can log and only monitors the upgrade info file.

Raumo0 pushed a commit to mapofzones/cosmos-sdk that referenced this pull request Feb 13, 2022
* Fixed parse key issue (backport cosmos#9299) (cosmos#9561)

* Fixed parse key issue (cosmos#9299)

* Fixed parse key issue

* Added getconfig in root command

* uncommented changes in parse.go

(cherry picked from commit d7dd1d7)

# Conflicts:
#	simapp/simd/cmd/root.go

* Add changelog

Co-authored-by: Prathyusha Lakkireddy <prathyusha@vitwit.com>
Co-authored-by: Amaury M <1293565+amaurym@users.noreply.github.com>

* feat: return trace value from baseapp (backport cosmos#9578) (cosmos#9580)

* feat: return trace value from baseapp (cosmos#9578)

## Description

Closes: #XXXX

<!-- Add a description of the changes that this PR introduces and the files that
are the most critical to review. -->

* fix: Dumping upgrade info interrupted by cosmovisor (cosmos#9384) (cosmos#9608)

Solution:
- dumping upgrade info before emit `UPGRADED NEEDED` log which will
  cause cosmovisor to kill chain process

<!-- < < < < < < < < < < < < < < < < < < < < < < < < < < < < < < < < < ☺
v                               ✰  Thanks for creating a PR! ✰
v    Before smashing the submit button please review the checkboxes.
v    If a checkbox is n/a - please still include it but + a little note why
☺ > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >  -->

## Description

The problematic procedure:

1. chain process output UPGRADE NEEDED log
2. cosmovisor see the message and kill the chain binary
3. chain process dump upgrade info and panic itself

the step 2 and 3 runs concurrently, so the dumping process can be interrupted by cosmovisor's terminate signal. The proposed solution is to dump upgrade info before emitting the log.
there are two problematic situation:
1. the upgrade info file is created, but content is not written or flushed before killed, when the chain process restart, it'll panic because of json parsing error.
2. the upgrade info file is not created at all, when the chain process restart, the [store upgrades](https://github.com/crypto-org-chain/chain-main/blob/master/app/app.go#L436) are not activated, will cause app hash mismatch error later on.

* feat: Error on blank chain-id in multisign command (backport cosmos#9593) (cosmos#9605)

* feat: Error on blank chain-id in multisign command (cosmos#9593)

Error on `tx multisign` command if chain-id is blank. This is a common cause of signature verification failures when combining signatures and the error message doesn't provide any clues to this common cause.

I have...

- [x] included the correct [type prefix](https://github.com/commitizen/conventional-commit-types/blob/v3.0.0/index.json) in the PR title
- [x] added `!` to the type prefix if API or client breaking change
- [x] targeted the correct branch (see [PR Targeting](https://github.com/cosmos/cosmos-sdk/blob/master/CONTRIBUTING.md#pr-targeting))
- [] provided a link to the relevant issue or specification
- [ ] followed the guidelines for [building modules](https://github.com/cosmos/cosmos-sdk/blob/master/docs/building-modules)
- [ ] included the necessary unit and integration [tests](https://github.com/cosmos/cosmos-sdk/blob/master/CONTRIBUTING.md#testing)
- [ ] added a changelog entry to `CHANGELOG.md`
- [ ] included comments for [documenting Go code](https://blog.golang.org/godoc)
- [ ] updated the relevant documentation or specification
- [ ] reviewed "Files changed" and left comments if necessary
- [ ] confirmed all CI checks have passed

### Reviewers Checklist

*All items are required. Please add a note if the item is not applicable and please add
your handle next to the items reviewed if you only reviewed selected items.*

I have...

- [ ] confirmed the correct [type prefix](https://github.com/commitizen/conventional-commit-types/blob/v3.0.0/index.json) in the PR title
- [ ] confirmed `!` in the type prefix if API or client breaking change
- [ ] confirmed all author checklist items have been addressed
- [ ] reviewed state machine logic
- [ ] reviewed API design and naming
- [ ] reviewed documentation is accurate
- [ ] reviewed tests and test coverage
- [ ] manually tested (if applicable)

(cherry picked from commit f65b6c9)

# Conflicts:
#	CHANGELOG.md

* fix conflicts

* less change diff

Co-authored-by: Zaki Manian <zaki@manian.org>
Co-authored-by: Amaury M <1293565+amaurym@users.noreply.github.com>

* x/capability: Cap Initialization Fix (cosmos#9392)

* fix: correct ibc metric labels (cosmos#9645)

* backport cosmos/ibc-go#223

* add changelog

* fix unnecessary changes

* fix: Fix IBC Transfer Event (cosmos#9640)

* fix event type

* CHANGELOG

* Update CHANGELOG.md

Co-authored-by: Amaury M <1293565+amaurym@users.noreply.github.com>

* chore: v0.42.7 release notes & changelog (cosmos#9661)

* chore: v0.42.7 release notes & changelog

* Add ibc

* fix(keyring): update keyring for kwallet fix (backport cosmos#9563) (cosmos#9579)

* fix(keyring): update keyring for kwallet fix (cosmos#9563)

## Description

Closes: cosmos#9562

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: Prathyusha Lakkireddy <prathyusha@vitwit.com>
Co-authored-by: Amaury M <1293565+amaurym@users.noreply.github.com>
Co-authored-by: Zaki Manian <zaki@manian.org>
Co-authored-by: Aditya <adityasripal@gmail.com>
Co-authored-by: colin axnér <25233464+colin-axner@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A:automerge Automatically merge PR once all prerequisites pass. C:x/upgrade
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants