-
Notifications
You must be signed in to change notification settings - Fork 771
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Kusama parachains block production slowed down #3314
Comments
Below is an extract from our data for the Kusama relay chain: --- number of blocks that took more than 6001 ms to produce (limited to start of February 2024 until today)
SELECT
CAST(block_time as DATE) as day,
count(number) as slow_blocks
FROM `parity-data-infra-evaluation.dotlake_playground.kusama_feb_24_blocktimes`
WHERE time_to_produce_ms > 6002
GROUP BY day
ORDER BY day ASC;
For reference, here's the same thing but by day and since January 2024: Here are the top 15, ordered by the time it took for them to be produced: SELECT
number,
block_time,
time_to_produce_ms
FROM `parity-data-infra-evaluation.dotlake_playground.kusama_feb_24_blocktimes`
ORDER BY time_to_produce_ms DESC
LIMIT 15; (Subscan linked for convenience)
And the list of slow blocks (that took longer than 6002 ms, February 2024) for reference: https://storage.googleapis.com/dashboards.data.paritytech.io/debug/slow_blocks.csv Anything else we can provide on the parachain side too please let us know. |
InvestigationWe did some initial analysis looking at the dashboard it seems that around Looking the dashboards around the same time we also see some errors in pending connection metrics on our kusama validators, not sure if it is related to the degradation, but I'm putting it here in case it rings a bell. Potential root-cause(Behaviour observed after 2023-02-12 18.00 PM UTC)Looking at few parachains we observed that there are 1 minute periods where no blocks gets backed for the parachains and after further investigation we observed that it is during the same group of the session that the parachains don't produce blocks: For example for session Parachain 1000: https://polkadot.js.org/apps/?rpc=wss%3A%2F%2Frpc.dotters.network%2Fkusama#/explorer/query/21870979 Parachains 10002: https://polkadot.js.org/apps/?rpc=wss%3A%2F%2Frpc.dotters.network%2Fkusama#/explorer/query/21870956 Same things happens during session The potential root-cause of this is that we have a bunch of validators either in a bad state or they are running a version that doesn't not support the logic for async-backing, so they can't participate in the backing process, so if 2 out of 3 validators in the backing group are bad the parachain won't produce blocks for the entire duration(1m) it is allocated to the backing group. Looking at https://apps.turboflakes.io/?chain=kusama#/insights at session 36926, there seems to be around 17 out 300 validators have 0 backing votes, so those might be the possible culprits for why we have a bad backing group every session. The list of validators for session Note! This is just a snapshot for session Next steps
|
You can try to query their version RPC. |
I hope that you don't find anyone that has opened their rpc port to the outside 🙃 |
What we should do is write some tool that connects to all validators on the p2p layer. We can query all validator addresses using the DHT and the initial message contains the version information about the node. Using this we could write some tool to generate a list of validators that run on an old version. |
I've heard someone already built this tool: https://forum.polkadot.network/t/experiment-p2p-network-visualization/4583/8 |
Perfect! This is exactly what we need. |
Would it be possible to implement a better incentive for validators to upgrade ? One idea is to remove any validator from the active set if their version is lagging 2+ releases behind, but I am not sure how feasible is to implement. |
Sounds like you are proposing to do hardforks :P Maybe a little bit softer as ETH hard forks, but yeah :P Apparently people should already get less rewards and they don't care. |
Here what I was able to extract from DHT with nebula:
Here script which I used:
|
There shouldn't be any forks :P
I think they do care more if they get kicked out of active set. Also we won't waste valuable time investigating issues caused by ancient validators, has happened in a few instances. |
I guess it is parachain with an embedded relaychain node, they are usually not configured and have a random name, Also p2p port is not open and Nebula is not able to connect and check version. The nodes I listed all have p2p port open to the internet. I forgot to add report summary:
7239 node has p2p port closed and may have any version. FYI, there are multiple telemetry: https://telemetry.w3f.community, https://telemetry.polkadot.io/ |
Any idea which one tell us the right picture ? |
A combination of them and also none of them; the w3f telemetry is for the 1kv program, and the polkadot.io one is for validators who are connected to telemetry. Telemetry is on by default, but plenty of validators do not run with public telemetry enabled, so it's not possible to get a complete picture this way. |
As already discussed in Element, we should change the rewards. Decreasing the rewards per block build to a minimum and put most of the rewards in doing useful work for the parachain consensus. The tool for finding out what version a validator is running, could also be used as data source for informing nominators on what kind of validators to nominate. |
There's definitely some of this going on. For example one of my parachain collators shows up on telemetry as "Parity Polkadot v0.9.28-something". That one definitely isn't validating. (And, yes, an upgrade is long overdue, and I worry about the chain stalling every time Kusama upgrades...) |
Generally we are binary compatible back to Kusama launch. Also most of the networking protocols are still in place, so you don't need to worry that much ;) |
It's more a nagging concern at the back of my mind than a full-on worry. I do hope to find the time / funding to upgrade it though. :-) [Edit: sorry BulatSaif already said this.] Oh, and I think @alexggh might have been asking why there are far fewer nodes on telemetry than in the P2P-derived stats. Being on telemetry is optional so many nodes aren't on it (including 3 of mine that I thought were). There's also the 1KV program telemetry that probably has a list of nodes, only some of which are on the main telemetry. |
Unfortunately, the situation seems to have drastically deterioarated overnight around 2023-02-15 22:00 UTC, the dashboards look even worse now. Runtime api requests jumped 20% The only things I see in the logs is that around that time we've got some disputes, but they seem to have concluded, around that time, this error was outputed during the dispute, then stopped.
Looking here: https://apps.turboflakes.io/?chain=kusama#/insights?mode=history, it seems that around that time a bunch of validators started getting very low reward rates(0) and they continue getting them for follow up sessions, so it feels like something is getting them in a bad state. |
Do you know which release we included this fix to prevent disputes being raised for candidates in finalized blocks? |
Don't know which fix you are referring to. |
Yeah this one and yeah it got included in 1.1. The log lines of state being discarded for the dispute reminded me of this. We should probably still figure out why the state wasn't available anymore as this should not happen. |
This issue has been mentioned on Polkadot Forum. There might be relevant details there: https://forum.polkadot.network/t/polkadot-digest-15-feb-2024/6218/1 |
@BulatSaif What is nebula and where can I find it? |
It is a DHT crawler: https://github.com/dennis-tra/nebula |
…ng (#3419) Add more debug logs to understand if statement-distribution is in a bad state, should be useful for debugging #3314 on production networks. Additionally, increase the number of parallel requests should make, since I notice that requests take around 100ms on kusama, and the 5 parallel request was picked mostly random, no reason why we can do more than that. --------- Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io> Co-authored-by: ordian <write@reusable.software> (cherry picked from commit b9c792a) Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
…ng (#3476) cherry picked from master commit b9c792a --------- Add more debug logs to understand if statement-distribution is in a bad state, should be useful for debugging #3314 on production networks. Additionally, increase the number of parallel requests should make, since I notice that requests take around 100ms on kusama, and the 5 parallel request was picked mostly random, no reason why we can do more than that. Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io> Co-authored-by: ordian <write@reusable.software>
We should really drop outdated hosts before updates that required host updates. In principle, we could just record minimum required and recommended host versions on-chain, make validators warn when behind the recommended, and make them shut down when behind the required. This doesn't solve the immediate problem however, so.. We could add host version into their NPoS candidacy, so then other validators would drop them in NPoS. I doubt anything more complex would be warranted. It's perfectly reasonable byzantine behavior to run an old host and lie about version numbers though, so probably not this whole problem. |
This is fairly similar to what I had in mind and been discussing offline, but @bkchr didn't fully support it at the time. These ancient hosts clearly not getting rewards in current situation so they still have incentive to upgrade. The only downside in this specific case is that they can degrade the network performance, which sounds to me that we need some form of punishment to prevent it rather than rely only on positive rewards. |
Yeah, generally I'm not full on board yet with reporting random versions to the runtime. I mean the node can just report whatever the runtime expects it to report. We would also need to standardize these versions as other node implementors would need to report the same version when they have implemented the same features etc. I opened an issue for reworking the reward points distribution: polkadot-fellows/runtimes#214 If this isn't working, I'm open for what you both proposed here. |
However you do implement this (if you ever do) please don't forget about us collators, running potentially pretty old versions on the Kusama side... :-). How old a version does the CI process test can still sync? |
Yes, but then if they change the code like this then it's their fault if they get slashed somehow. We could easily design this so they must look at the code before reporting the false version, at which point they're doing more work than not upgrading. Imho, that's overkill since we're trying to help them really, and an unecessary development burden (each version has an associated random u64). Arguably, we should do this on polakdot, but not immediately on kusama, since we want some miss behavior on kusama.
We're discussing validators mostly, but yes we do need to take host upgrades into consideration for collators too. It'll be off-chain messaging (XCMP) that really beings the chaos here. |
This will enable async-backing subsystems on polkadot, the relaychain would work in legacy mode where only candiddates built on top of pervious relay chain are activated. - Had to bring an unrelated change as well `minimum_backing_votes` because that was the v6, we can't skip versions, the value of that configuration on polkadot is 2, so that's what is going to be used after this runtime update. Changes have been running on kusama, so I would. Async backing subsytems is a major change in the way collator-protocol and the backing work, so there are still some unknowns if this is completely bug free. It has been running on kusama for a month already, but not without issues: - Validators that did not upgrade to compatible versions will not be able to participate in backing, so if enough of those are randomly picked that group won't be able to back anything. With backing_group_size = 5 and minimum_backing_votes = 2, 10% validator not upgraded that chance is about 2.5%. - Additionally, same un-upgraded groups won't be able to include the backing votes on chain when they author blocks, so 10% of the blocks won't back any candidates. - We are still not sure if item 2) from here paritytech/polkadot-sdk#3314 (comment) is caused by async backing, the observable issue is sometimes after restart/upgrade some validators are getting 0 rewards and apparently they are not backing anything. Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
Not sure if this will be helpful on this issue, but it might so I'm posting my logs. |
https://apps.turboflakes.io/?chain=kusama#/validator/JKhBBSWkr8BJKh5eFBtRux4hsDq4sAxvvmMU426qUA9aqEQ
|
I remember we had issues with Docker in the past. Are other affected validators also running on Docker? |
These are the startup logs:
|
I'm not running on Docker 🤷🏻♂️ |
One interesting metric would be: "polkadot_parachain_peer_count" for the validation peerset. To be queried like this:
How do these numbers behave with regards to era points? Once you become a para validator this should be around 1k. |
We found the cause and are working on a fix. What operators can do right away: Make sure the node can persist all data, otherwise it will generate a new PeerId on each restart and this is causing issues. |
This issue has been mentioned on Polkadot Forum. There might be relevant details there: https://forum.polkadot.network/t/async-backing-development-updates/6176/5 |
ConclusionsThere are 3 failure-modes going-on kusama, all create validators with 0 backing rewards:
Also thank you @xxbbxb for providing logs and helping me root cause item number 2. Next priorities in order:
|
This will enable async-backing subsystems on polkadot, the relaychain would work in legacy mode where only candiddates built on top of pervious relay chain are activated. ## Notes - Had to bring an unrelated change as well `minimum_backing_votes` because that was the v6, we can't skip versions, the value of that configuration on polkadot is 2, so that's what is going to be used after this runtime update. Changes have been running on kusama. - `disabled_validators`, `node_features`, `approval_voting_params` are not related with async-backing, but given they are low risk, we decided to include them as well, see comments. ## Known risks Async backing subsytems is a major change in the way collator-protocol and the backing work, so there are still some unknowns if this is completely bug free. It has been running on kusama for a month already, but not without issues: - Validators that did not upgrade to compatible versions will not be able to participate in backing, so if enough of those are randomly picked that group won't be able to back anything. With backing_group_size = 5 and minimum_backing_votes = 2, 10% validator not upgraded, that chance is about 2.5%. - Additionally, same un-upgraded groups won't be able to include the backing votes on chain when they author blocks, so 10% of the blocks won't back any candidates. - We are still not sure if item 2) from here paritytech/polkadot-sdk#3314 (comment) is caused by async backing, the observable issue is sometimes after restart/upgrade some validators are getting 0 rewards and apparently they are not backing anything. <!-- Remember that you can run `/merge` to enable auto-merge in the PR --> <!-- Remember to modify the changelog. If you don't need to modify it, you can check the following box. Instead, if you have already modified it, simply delete the following line. --> --------- Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io> Co-authored-by: fellowship-merge-bot[bot] <151052383+fellowship-merge-bot[bot]@users.noreply.github.com>
Ok, I think I figure it out why this happens for new nodes entering the active set, long story-short, the node will start advertising its AuthorthyId on the DHT only after it becomes active, so the other nodes won't be able to detect it at the beginning of the session. More details about this in: https://github.com/paritytech/polkadot-sdk/pull/3722/files |
As discovered during investigation of #3314 and #3673 there are active validators which accidentally might change their network key during restart, that's not a safe operation when you are in the active set because of distributed nature of DHT, so the old records would still exist in the network until they expire 36h, so unless they have a good reason validators should avoid changing their key when they restart their nodes. There is an effort in parallel to improve this situation #3786, but those changes are way more intrusive and will need more rigorous testing, additionally they will reduce the time to less than 36h, but the propagation won't be instant anyway, so not changing your network during restart should be the safest way to run your node, unless you have a really good reason to change it. ## Proposal 1. Do not auto-generate the network if the network file does not exist in the provided path. Nodes where the key file does not exist will get the following error: ``` Error: 0: Starting an authorithy without network key in /home/alexggh/.local/share/polkadot/chains/ksmcc3/network/secret_ed25519. This is not a safe operation because the old identity still lives in the dht for 36 hours. Because of it your node might suffer from not being properly connected to other nodes for validation purposes. If it is the first time running your node you could use one of the following methods. 1. Pass --unsafe-force-node-key-generation and make sure you remove it for subsequent node restarts 2. Separetly generate the key with: polkadot key generate-node-key --file <YOUR_PATH_TO_NODE_KEY> ``` 2. Add an explicit parameters for nodes that do want to change their network despite the warnings or if they run the node for the first time. `--unsafe-force-node-key-generation` 3. For `polkadot key generate-node-key` add two new mutually exclusive parameters `base_path` and `default_base_path` to help with the key generation in the same path the polkadot main command would expect it. 4. Modify the installation scripts to auto-generate a key in default path if one was not present already there, this should help with making the executable work out of the box after an instalation. ## Notes Nodes that do not have already the key persisted will fail to start after this change, however I do consider that better than the current situation where they start but they silently hide that they might not be properly connected to their peers. ## TODO - [x] Make sure only nodes that are authorities on producation chains will be affected by this restrictions. - [x] Proper PRDOC, to make sure node operators are aware this is coming. --------- Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io> Co-authored-by: Dmitry Markin <dmitry@markin.tech> Co-authored-by: s0me0ne-unkn0wn <48632512+s0me0ne-unkn0wn@users.noreply.github.com> Co-authored-by: Bastian Köcher <git@kchr.de>
Closing this is not happening anymore and the remaining open action item is being tracked in #3673 |
Is there an existing issue?
Experiencing problems? Have you tried our Stack Exchange first?
Description of bug
We're investigating, let me know how we can help. @eskimor would you mind adding the peer count screenshot here as well for reference please?
cc @lovelaced
Steps to reproduce
The text was updated successfully, but these errors were encountered: