Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Post upgrade with x/upgrade queries not working #7385

Closed
4 tasks
RiccardoM opened this issue Sep 24, 2020 · 32 comments · Fixed by #7415
Closed
4 tasks

Post upgrade with x/upgrade queries not working #7385

RiccardoM opened this issue Sep 24, 2020 · 32 comments · Fixed by #7415
Assignees

Comments

@RiccardoM
Copy link
Contributor

RiccardoM commented Sep 24, 2020

Summary of Bug

Today we ran an automatic upgrade using the x/upgrade module.
After a first error in the upgrade during the de-serialization of the data from the old state, we pushed a fix and every validator was able to upgrade correctly.

After the upgrade, the chain continued to create blocks correctly. You can see it running from here.

Unluckily, we later found out that all the queries are now returning the same error:

From the REST APIs:

{
"error": "invalid request: failed to load state at height 516938; version does not exist (latest height: 516938)"
}

From the CLI:

invalid request: failed to load state at height 516938; version does not exist (latest height: 516938)

This happened on all the nodes that underwent the upgrade procedure, no matter what their pruning strategies was. We tested it with ones having pruning = "nothing, pruning = "everything" and pruning = "default". All with the same result.

Currently we're testing if restarting a node with pruning = "nothing", and making sure the update works at the first try, solves the problem.

Version

0.39.1

Steps to Reproduce

  1. Create an upgrade proposal
  2. Have the upgrade proposal run
  3. Try query some data

For Admin Use

  • Not duplicate issue
  • Appropriate labels applied
  • Appropriate contributors tagged
  • Contributor assigned/self-assigned
@alexanderbez alexanderbez self-assigned this Sep 24, 2020
@RiccardoM
Copy link
Contributor Author

Small update. After checking these are the details of the update:

  • last block committed before the upgrade: 513,573
  • first block committed after the upgrade: 513,574

The time in between was due to the initial wrong migration process and its fix release.

If you try to query using the REST APIs for any height after the upgrade, the same error shows up:
test

{
  "error": "invalid request: failed to load state at height 513575; version does not exist (latest height: 517249)"
}

@anilcse
Copy link
Collaborator

anilcse commented Sep 24, 2020

@RiccardoM I've tested couple of upgrades locally, everything went smooth and I am able to query LCD without any issues. I didn't cover jailing events and withdraw commissions yet, but tried to cover other possible transactions. Do you mind sharing some info about your fix for the de-serialization issue you mentioned?

@anilcse
Copy link
Collaborator

anilcse commented Sep 24, 2020

Akash Network is going mainnet with 0.39.1, this could be really problematic for them if there's something like this.
/cc @jackzampolin

@RiccardoM
Copy link
Contributor Author

RiccardoM commented Sep 24, 2020

@RiccardoM I've tested couple of upgrades locally, everything went smooth and I am able to query LCD without any issues. I didn't cover jailing events and withdraw commissions yet, but tried to cover other possible transactions. Do you mind sharing some info about your fix for the de-serialization issue you mentioned?

You can see it here.

It was just a small addition to make sure that when reading the old state, Amino could de-serialize maps properly. It was just some application-layer bug. I think this has no impact on this bug.

@erikgrinaker
Copy link
Contributor

I'd be curious to see which specific IAVL substore is failing to load. For example, there might be a substore that is no longer in use after the upgrade, such that new versions aren't written to it - the failing code path tries to load all substores.

Might be an idea to write a script that inspects the latest version root of all IAVL substores in the application state.

@anilcse
Copy link
Collaborator

anilcse commented Sep 24, 2020

You can see it here.

It was just a small addition to make sure that when reading the old state, Amino could de-serialize maps properly. It was just some application-layer bug. I think this has no impact on this bug.

Okay. Is this only for REST or are you facing issues with CLI too?

@RiccardoM
Copy link
Contributor Author

You can see it here.
It was just a small addition to make sure that when reading the old state, Amino could de-serialize maps properly. It was just some application-layer bug. I think this has no impact on this bug.

Okay. Is this only for REST or are you facing issues with CLI too?

The same error is returned also for all CLI queries.

@RiccardoM
Copy link
Contributor Author

Update

I have now finished re-syncing a new node from scratch using prune = "nothing" and this did not solve the problem.

@erikgrinaker
Copy link
Contributor

Ok, so it's not related to pruning. I'm not really on the SDK team, so I don't have a lot of insight into what might cause this, but if you could stop a node, create a tarball of the entire application state, and make it available somewhere I can have a look at the data stores.

@RiccardoM
Copy link
Contributor Author

RiccardoM commented Sep 25, 2020

@erikgrinaker Suppose the application lives inside ~/.desmosd. What subfolders do you need?

@erikgrinaker
Copy link
Contributor

At least with the standard SDK layout I'd need data/application.db.

@anilcse
Copy link
Collaborator

anilcse commented Sep 25, 2020

On a quick look, I believe we are facing store issues due to these changes: https://github.com/desmos-labs/desmos/blob/v0.12.0/x/posts/keeper/v0120migration.go#L10-L56
I haven't dig through it well, may be I could be wrong. Sharing it here for ref

@erikgrinaker
Copy link
Contributor

erikgrinaker commented Sep 25, 2020

Oh, yeah, I think that migration is wrong. It fetches the post value into postKey, unmarshals it (so it's actually the value), and then it updates using postKey as the key:

postKey := iterator.Value()
[...]
err := k.Cdc.UnmarshalBinaryBare(postKey, &v0100Post)
[...]
store.Set(postKey, bz)

It should be using iterator.Key() for the key, and unmarshal iterator.Value(). Not sure how that would relate to this problem, but it doesn't seem right, and would cause it to write to lots of keys that it shouldn't.

@erikgrinaker
Copy link
Contributor

I should also mention that it's not safe to write to a key domain while iterating over it. This is documented for tm-db, but doesn't seem to be documented for IAVL or the SDK:

https://github.com/tendermint/tm-db/blob/17536ad9315e5c54dd327d4d49c77cad4fbfbb08/types.go#L48

@erikgrinaker
Copy link
Contributor

I should also mention that it's not safe to write to a key domain while iterating over it.

Turns out that IAVL doesn't actually use DB.Iterator(), since it traverses the tree instead (obviously), so that may not apply here. I'll need to have a closer look at how IAVL does iteration to say for sure.

@RiccardoM
Copy link
Contributor Author

RiccardoM commented Sep 25, 2020

@erikgrinaker Here is the tar bal for the application database: application.tar.gz.

EDIT: This link does not download the correct status. I'm removing it.

@anilcse @erikgrinaker You're both right. That migration is not correct and I should use the iterator.Key() when storing the data. I will fix this in a later version and ask everyone to re-do the update by syncing from scratch.

@RiccardoM
Copy link
Contributor Author

I should also mention that it's not safe to write to a key domain while iterating over it.

Turns out that IAVL doesn't actually use DB.Iterator(), since it traverses the tree instead (obviously), so that may not apply here. I'll need to have a closer look at how IAVL does iteration to say for sure.

Even if it wasn't the case, which is the appropriate way of doing this? I was thinking about

  1. Iterating once to get the values and migrate them
  2. Iterate over the new values and store them

Are there better alternatives?

@erikgrinaker
Copy link
Contributor

Modifying while iterating is generally a bit iffy - I'll have to have a closer look at IAVL to see how it behaves when this happens.

The simplest approach is to just collect all the keys in a slice, close the iterator, then iterate over the slice to read and update them - at least this way you don't need to keep all values in memory, just the keys. This is assuming that you have enough memory to hold all keys.

@erikgrinaker
Copy link
Contributor

EDIT: This link does not download the correct status. I'm removing it.

Ok, I'll wait for a new one then.

@RiccardoM
Copy link
Contributor Author

The simplest approach is to just collect all the keys in a slice, close the iterator, then iterate over the slice to read and update them - at least this way you don't need to keep all values in memory, just the keys. This is assuming that you have enough memory to hold all keys.

I think I got what you mean. Could you please check if this is better? https://github.com/desmos-labs/desmos/blob/version/v0.12.2/x/posts/keeper/v0120migration.go#L15

Ok, I'll wait for a new one then.

It might take a while. It's actually ~20GB of database

@erikgrinaker
Copy link
Contributor

Could you please check if this is better?

You need to store iterator.Key() in the slice, not iterator.Value(). You're also not checking for iteration errors, you'll need to check iterator.Error() after the for-loop.

@erikgrinaker
Copy link
Contributor

erikgrinaker commented Sep 25, 2020

Ok, I ran a script on the database which inspected the latest IAVL version of each substore in the database, it found this:

s/k:params 532167
s/k:upgrade 532167
s/k:evidence 532167
s/k:magpie 532167
s/k:reports 532167
s/k:posts 532167
s/k:relationships 18596
s/k:slashing 532167
s/k:staking 532167
s/k:gov 532167
s/k:profiles 532167
s/k:main 532167
s/k:supply 532167
s/k:acc 532167
s/k:distribution 532167

Notice the s/k:relationships store which is only at version 18596. When loading the stores, it will fail on this since it doesn't have version 532167.

Notice that 18596 is almost exactly the number of heights since the upgrade (at 513573). This is not a coincidence. This is probably a new store that was registered during the upgrade, but it starts from version 1 (as all IAVL stores do).

Either the SDK needs to handle this case during upgrades by using the initial height API currently included in the IAVL 0.15.x prereleases, or it should just use a single IAVL store for all modules (which would also fix the partial commit problem in #6370) - I'm heavily in favor of the latter.

@alexanderbez
Copy link
Contributor

Great debugging @erikgrinaker. The issue totally makes sense now. However, for the solution I'm not totally convinced this responsibility should be placed on the SDK.

Because either way, the fix is to call to app.cms.SetInitialVersion(H), where H=upgrade height (+1?). However, this needs to happen in app.go (application constructor). Being that it's application-specific, there's really no clear way to have this happen automatically. I think we just need to:

  1. Backport SetInitialVersion to 0.39
  2. Have clear documentation on upgrade handlers

@erikgrinaker
Copy link
Contributor

Sure, that's your call, I don't know the details here - I'll backport the IAVL initial height stuff for 0.14.

@erikgrinaker
Copy link
Contributor

Occurred to me that SetInitialVersion won't fully fix the problem on its own. If we set the initial version to e.g. 1000 for one specific store, then loading any version before 1000 will fail since one of the IAVL stores does not have that version. Also note that setting an initial version will cause the tree to error on load if it finds a version before the given initial version, so it can't be set to the same for all IAVL stores in this scenario.

@erikgrinaker
Copy link
Contributor

erikgrinaker commented Sep 25, 2020

I've backported InitialVersion to 0.14 on this branch: https://github.com/cosmos/iavl/tree/0.14.x

Let me know when you've tried it out, and I'll cut 0.14.1.

@alexanderbez
Copy link
Contributor

That's unfortunate. How do you recommend we proceed here?

@erikgrinaker
Copy link
Contributor

Off the top of my head, maybe pad out the new store with empty versions corresponding to the versions in the other stores (which will depend on historical pruning settings and such), such that all stores have the same versions. Other options would be to check which stores have which ranges via e.g. AvailableVersions or new methods, or return specific error types that the SDK can handle appropriately. Longer term, use a single IAVL store.

@RiccardoM
Copy link
Contributor Author

Hey @alexanderbez, @erikgrinaker are there any news on this? Is there something I can do to help tackle this?

@alexanderbez
Copy link
Contributor

No news, not clear on how to handle loading new stores on upgrades. I think padding or doing anything fancy in the app construction is likely to be buggy and hacky.

I'm liking the idea of where @erikgrinaker was heading with modifying Store#CacheMultiStoreWithVersion:

(Current logic)

func (rs *Store) CacheMultiStoreWithVersion(version int64) (types.CacheMultiStore, error) {
	cachedStores := make(map[types.StoreKey]types.CacheWrapper)
	for key, store := range rs.stores {
		switch store.GetStoreType() {
		case types.StoreTypeIAVL:
			// If the store is wrapped with an inter-block cache, we must first unwrap
			// it to get the underlying IAVL store.
			store = rs.GetCommitKVStore(key)

			// Attempt to lazy-load an already saved IAVL store version. If the
			// version does not exist or is pruned, an error should be returned.
			iavlStore, err := store.(*iavl.Store).GetImmutable(version)
			if err != nil {
				return nil, err // this fails because store N is NEW and obviously doesn't exist at 'version'
			}

			cachedStores[key] = iavlStore

		default:
			cachedStores[key] = store
		}
	}

	return cachemulti.NewStore(rs.db, cachedStores, rs.keysByName, rs.traceWriter, rs.traceContext), nil
}

So I think we need to look into modifying this method s.t. it doesn't error possibly when it can't load that version. But, we have to be sure that the query is handled safely still and

  1. doesn't panic
  2. is still able to handle queries for valid store/version combinations

Do you feel like this is something you can look into? To start, it could be as simple as just not returning an error.

@alexanderbez
Copy link
Contributor

But I'm not sure that works either, because the new store needs to be at the same version as the others...

@RiccardoM
Copy link
Contributor Author

RiccardoM commented Sep 30, 2020

@alexanderbez I've tried implementing a solution inside #7415. Please let me know what you think about it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants