Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Kusama parachains block production slowed down #3314

Closed
2 tasks done
KarimJedda opened this issue Feb 13, 2024 · 64 comments
Closed
2 tasks done

Kusama parachains block production slowed down #3314

KarimJedda opened this issue Feb 13, 2024 · 64 comments
Labels
I2-bug The node fails to follow expected behavior. I10-unconfirmed Issue might be valid, but it's not yet known.

Comments

@KarimJedda
Copy link

KarimJedda commented Feb 13, 2024

Is there an existing issue?

  • I have searched the existing issues

Experiencing problems? Have you tried our Stack Exchange first?

  • This is not a support question.

Description of bug

  • Block production has slowed down on Kusama parachains
  • The issue was raised by Khala & Basilisk, with reports of 130s blocks on Basilisk
  • Kusama seems to be impacted as well

We're investigating, let me know how we can help. @eskimor would you mind adding the peer count screenshot here as well for reference please?

cc @lovelaced

Steps to reproduce

@KarimJedda KarimJedda added I2-bug The node fails to follow expected behavior. I10-unconfirmed Issue might be valid, but it's not yet known. labels Feb 13, 2024
@KarimJedda
Copy link
Author

Below is an extract from our data for the Kusama relay chain:

--- number of blocks that took more than 6001 ms to produce (limited to start of February 2024 until today)
SELECT 
  CAST(block_time as DATE) as day, 
  count(number) as slow_blocks
FROM `parity-data-infra-evaluation.dotlake_playground.kusama_feb_24_blocktimes` 
WHERE time_to_produce_ms > 6002 
GROUP BY day
ORDER BY day ASC;
day slow_blocks
2024-02-01 148
2024-02-02 142
2024-02-03 106
2024-02-04 136
2024-02-05 137
2024-02-06 118
2024-02-07 162
2024-02-08 139
2024-02-09 109
2024-02-10 112
2024-02-11 106
2024-02-12 106
2024-02-13 (17:30 CET) 75

For reference, here's the same thing but by day and since January 2024:

Screenshot from 2024-02-13 21-02-38

Here are the top 15, ordered by the time it took for them to be produced:

SELECT 
  number,
  block_time,
  time_to_produce_ms
FROM `parity-data-infra-evaluation.dotlake_playground.kusama_feb_24_blocktimes` 
ORDER BY time_to_produce_ms DESC
LIMIT 15;

(Subscan linked for convenience)

number block_time time_to_produce_ms
21806210 2024-02-09 19:31:54.000000 UTC 42000
21824720 2024-02-11 02:30:06.000000 UTC 41999
21758223 2024-02-06 11:12:48.001000 UTC 18001
21768615 2024-02-07 04:36:00.001000 UTC 18001
21818664 2024-02-10 16:20:54.000000 UTC 18000
21754427 2024-02-06 04:52:30.000000 UTC 18000
21727213 2024-02-04 07:21:00.000000 UTC 18000
21851077 2024-02-12 22:32:18.000000 UTC 17999
21766166 2024-02-07 00:29:30.000000 UTC 17999
21686302 2024-02-01 10:58:12.000000 UTC 17999
21778337 2024-02-07 20:53:48.000000 UTC 17992
21745818 2024-02-05 14:28:43.863000 UTC 13862
21711425 2024-02-03 04:58:12.481000 UTC 12480
21811860 2024-02-10 04:58:06.168000 UTC 12167
21700641 2024-02-02 10:58:06.137000 UTC 12137

And the list of slow blocks (that took longer than 6002 ms, February 2024) for reference: https://storage.googleapis.com/dashboards.data.paritytech.io/debug/slow_blocks.csv

Anything else we can provide on the parachain side too please let us know.

@alexggh
Copy link
Contributor

alexggh commented Feb 14, 2024

Investigation

We did some initial analysis looking at the dashboard it seems that around 2023-02-12 18.00 PM UTC the average block times seems to have jumped slightly see the graph bellow.

Screenshot 2024-02-14 at 13 18 04

Looking the dashboards around the same time we also see some errors in pending connection metrics on our kusama validators, not sure if it is related to the degradation, but I'm putting it here in case it rings a bell.

Screenshot 2024-02-14 at 09 40 51

Potential root-cause(Behaviour observed after 2023-02-12 18.00 PM UTC)

Looking at few parachains we observed that there are 1 minute periods where no blocks gets backed for the parachains and after further investigation we observed that it is during the same group of the session that the parachains don't produce blocks:

For example for session 36926 parachains AssetHub and Bridge don't produce blocks while they should be allocated to group 20, this can be double-checked by looking at CandidateIncluded events and see that the parachains produce blocks while on group 19, then both have a gap while they should be on 20 and after that they start producing blocks when they assignment moves on group 21.

Parachain 1000: https://polkadot.js.org/apps/?rpc=wss%3A%2F%2Frpc.dotters.network%2Fkusama#/explorer/query/21870979

Parachains 10002: https://polkadot.js.org/apps/?rpc=wss%3A%2F%2Frpc.dotters.network%2Fkusama#/explorer/query/21870956

Same things happens during session 36927 and this time group 44 is the bad one.

The potential root-cause of this is that we have a bunch of validators either in a bad state or they are running a version that doesn't not support the logic for async-backing, so they can't participate in the backing process, so if 2 out of 3 validators in the backing group are bad the parachain won't produce blocks for the entire duration(1m) it is allocated to the backing group.

Looking at https://apps.turboflakes.io/?chain=kusama#/insights at session 36926, there seems to be around 17 out 300 validators have 0 backing votes, so those might be the possible culprits for why we have a bad backing group every session.

The list of validators for session 36927 is here:

Screenshot 2024-02-14 at 13 37 15 Screenshot 2024-02-14 at 13 37 27

Note! This is just a snapshot for session 36927, so it is not an exhaustive list.

Next steps

  • Ping validators on Kusama lounge to see what versions the ones with 0 backing votes are running and what logs they see.

@ggwpez
Copy link
Member

ggwpez commented Feb 14, 2024

Ping validators on Kusama lounge to see what versions the ones with 0 backing votes are running.

You can try to query their version RPC.

@bkchr
Copy link
Member

bkchr commented Feb 14, 2024

Ping validators on Kusama lounge to see what versions the ones with 0 backing votes are running.

You can try to query their version RPC.

I hope that you don't find anyone that has opened their rpc port to the outside 🙃

@bkchr
Copy link
Member

bkchr commented Feb 14, 2024

What we should do is write some tool that connects to all validators on the p2p layer. We can query all validator addresses using the DHT and the initial message contains the version information about the node. Using this we could write some tool to generate a list of validators that run on an old version.

@ordian
Copy link
Member

ordian commented Feb 14, 2024

What we should do is write some tool that connects to all validators on the p2p layer. We can query all validator addresses using the DHT and the initial message contains the version information about the node. Using this we could write some tool to generate a list of validators that run on an old version.

I've heard someone already built this tool: https://forum.polkadot.network/t/experiment-p2p-network-visualization/4583/8

@bkchr
Copy link
Member

bkchr commented Feb 14, 2024

Perfect! This is exactly what we need.

@sandreim
Copy link
Contributor

Would it be possible to implement a better incentive for validators to upgrade ? One idea is to remove any validator from the active set if their version is lagging 2+ releases behind, but I am not sure how feasible is to implement.

@bkchr
Copy link
Member

bkchr commented Feb 14, 2024

Sounds like you are proposing to do hardforks :P Maybe a little bit softer as ETH hard forks, but yeah :P

Apparently people should already get less rewards and they don't care.

@BulatSaif
Copy link
Contributor

Here what I was able to extract from DHT with nebula:

    802 "Parity Polkadot/v1.7.0-97df9dd6554 
    405 "Parity Polkadot/v1.6.0-680adc78ab3 
    106 "Parity Polkadot/v1.7.0-2fe3145ab9b 
     53 "Parity Polkadot/v1.5.0-a3dc2f15f23 
     53 "Parity Polkadot/v1.4.0-a3e4fce9d3f 
     37 "Parity Polkadot/v1.6.0-481165d9229 
     36 "Parity Polkadot/v0.9.41-e203bfb396e 
     34 "Parity Polkadot/v1.3.0-7c9fd83805c 
     19 "Parity Polkadot/v1.0.0-1ed6e2e50a4 
     13 "Parity Polkadot/v1.6.0-85191e94b5e 
      9 "Parity Polkadot/v1.4.0-00395a23d12 
      7 "Parity Polkadot/v1.1.0-f60318f6868 
      7 "Parity Polkadot/v0.9.42-9b1fc27cec4 
      3 "Parity Polkadot/v1.0.0-c9ec8c5a159 
      3 "Parity Polkadot/v0.9.43-ba42b9ce51d 
      2 "Parity Polkadot/v1.7.0-unknown 
      2 "Parity Polkadot/v1.6.0-unknown 
      2 "Parity Polkadot/v1.6.0-ff2e7dbd355 
      2 "Parity Polkadot/v1.6.0-c415f8f0bd6 
      2 "Parity Polkadot/v0.9.40-a2b62fb872b 
      2 "Parity Polkadot/v0.9.32-c71e872afed 
      1 "Parity Polkadot/v1.6.0-b8f55d1b769 
      1 "Parity Polkadot/v1.6.0-481165d 
      1 "Parity Polkadot/v1.5.0-e5b2adac7ad 
      1 "Parity Polkadot/v1.5.0-64d52f225e1 
      1 "Parity Polkadot/v1.2.0-930cabbe271 
      1 "Parity Polkadot/v1.2.0-72c45356393 
      1 "Parity Polkadot/v0.9.43-52209dcfe54 
      1 "Parity Polkadot/v0.9.38-909567fe6da 
      1 "Parity Polkadot/v0.9.31-32dd0c9cfcd 
      1 "Parity Polkadot/v0.9.30-064536093f5 
      1 "Parity Polkadot/v0.9.25-5174e9ae75b 

Here script which I used:

nebula crawl --network KUSAMA  --json-out /tmp/out > kusama.txt
cat kusama.txt | grep Agent | cut -d = -f3- | grep "Parity Polkadot"| cut -d'(' -f1| sort |  uniq -c |  sort -bgr

@alexggh
Copy link
Contributor

alexggh commented Feb 14, 2024

Here what I was able to extract from DHT with nebula:

Thank you, the number seems a lot more different than what I see on telemetry any idea why ?
Screenshot 2024-02-14 at 15 28 26

I see a lot more 1.3.0 and 0.9.43 and 0.9.42, any idea where the difference is coming from ?

@sandreim
Copy link
Contributor

Sounds like you are proposing to do hardforks :P Maybe a little bit softer as ETH hard forks, but yeah :P

There shouldn't be any forks :P

Apparently people should already get less rewards and they don't care.

I think they do care more if they get kicked out of active set. Also we won't waste valuable time investigating issues caused by ancient validators, has happened in a few instances.

@BulatSaif
Copy link
Contributor

I see a lot more 1.3.0 and 0.9.43 and 0.9.42, any idea where the difference is coming from ?

I guess it is parachain with an embedded relaychain node, they are usually not configured and have a random name, Also p2p port is not open and Nebula is not able to connect and check version. The nodes I listed all have p2p port open to the internet.

I forgot to add report summary:

crawlDuration=6m57.330136797s crawledPeers=9500 dialablePeers=2261 undialablePeers=7239

7239 node has p2p port closed and may have any version.

FYI, there are multiple telemetry: https://telemetry.w3f.community, https://telemetry.polkadot.io/

@alexggh
Copy link
Contributor

alexggh commented Feb 14, 2024

Parachains block production starting to look like their pre 2023-02-12 18.00 PM UTC values:

The parachain block production average
Screenshot 2024-02-14 at 16 10 23

AssetHub
Screenshot 2024-02-14 at 16 06 44

@alexggh
Copy link
Contributor

alexggh commented Feb 14, 2024

FYI, there are multiple telemetry: https://telemetry.w3f.community/, https://telemetry.polkadot.io/

Any idea which one tell us the right picture ?

@lovelaced
Copy link

FYI, there are multiple telemetry: https://telemetry.w3f.community/, https://telemetry.polkadot.io/

Any idea which one tell us the right picture ?

A combination of them and also none of them; the w3f telemetry is for the 1kv program, and the polkadot.io one is for validators who are connected to telemetry. Telemetry is on by default, but plenty of validators do not run with public telemetry enabled, so it's not possible to get a complete picture this way.

@bkchr
Copy link
Member

bkchr commented Feb 14, 2024

I think they do care more if they get kicked out of active set. Also we won't waste valuable time investigating issues caused by ancient validators, has happened in a few instances.

As already discussed in Element, we should change the rewards. Decreasing the rewards per block build to a minimum and put most of the rewards in doing useful work for the parachain consensus.

The tool for finding out what version a validator is running, could also be used as data source for informing nominators on what kind of validators to nominate.

@JelliedOwl
Copy link

JelliedOwl commented Feb 14, 2024

I see a lot more 1.3.0 and 0.9.43 and 0.9.42, any idea where the difference is coming from ?

I guess it is parachain with an embedded relaychain node, they are usually not configured and have a random name, Also p2p port is not open and Nebula is not able to connect and check version. The nodes I listed all have p2p port open to the internet.

There's definitely some of this going on. For example one of my parachain collators shows up on telemetry as "Parity Polkadot v0.9.28-something". That one definitely isn't validating.

(And, yes, an upgrade is long overdue, and I worry about the chain stalling every time Kusama upgrades...)

@bkchr
Copy link
Member

bkchr commented Feb 14, 2024

(And, yes, an upgrade is long overdue, and I worry about the chain stalling every time Kusama upgrades...)

Generally we are binary compatible back to Kusama launch. Also most of the networking protocols are still in place, so you don't need to worry that much ;)

@JelliedOwl
Copy link

JelliedOwl commented Feb 14, 2024

Generally we are binary compatible back to Kusama launch. Also most of the networking protocols are still in place, so you don't need to worry that much ;)

It's more a nagging concern at the back of my mind than a full-on worry. I do hope to find the time / funding to upgrade it though. :-)

[Edit: sorry BulatSaif already said this.] Oh, and I think @alexggh might have been asking why there are far fewer nodes on telemetry than in the P2P-derived stats. Being on telemetry is optional so many nodes aren't on it (including 3 of mine that I thought were). There's also the 1KV program telemetry that probably has a list of nodes, only some of which are on the main telemetry.
https://telemetry.w3f.community/#list/0xb0a8d493285c2df73290dfb7e61f870f17b41801197a149ca93654499ea3dafe

@alexggh
Copy link
Contributor

alexggh commented Feb 15, 2024

Unfortunately, the situation seems to have drastically deterioarated overnight around 2023-02-15 22:00 UTC, the dashboards look even worse now.

Screenshot 2024-02-15 at 07 27 45

Runtime api requests jumped 20%
Screenshot 2024-02-15 at 07 28 13

Screenshot 2024-02-15 at 07 28 39

The only things I see in the logs is that around that time we've got some disputes, but they seem to have concluded, around that time, this error was outputed during the dispute, then stopped.

2024-02-14  cannot query the runtime API version: Api called for an unknown Block: State already discarded for 0xda8c113bdaf76c305020386f430bc49ff6edcd28bd38c84ff1f04d7b7e67c1cb

Looking here: https://apps.turboflakes.io/?chain=kusama#/insights?mode=history, it seems that around that time a bunch of validators started getting very low reward rates(0) and they continue getting them for follow up sessions, so it feels like something is getting them in a bad state.

@bkchr
Copy link
Member

bkchr commented Feb 15, 2024

but they seem to have concluded, around that time, this error was outputed during the dispute, then stopped.

Do you know which release we included this fix to prevent disputes being raised for candidates in finalized blocks?

@alexggh
Copy link
Contributor

alexggh commented Feb 15, 2024

but they seem to have concluded, around that time, this error was outputed during the dispute, then stopped.

Do you know which release we included this fix to prevent disputes being raised for candidates in finalized blocks?

Don't know which fix you are referring to.

@tdimitrov
Copy link
Contributor

@bkchr #1358 -> this one?
If I'm not mistaken it's included since 1.0.0

@bkchr
Copy link
Member

bkchr commented Feb 15, 2024

Yeah this one and yeah it got included in 1.1.

The log lines of state being discarded for the dispute reminded me of this. We should probably still figure out why the state wasn't available anymore as this should not happen.

@Polkadot-Forum
Copy link

This issue has been mentioned on Polkadot Forum. There might be relevant details there:

https://forum.polkadot.network/t/polkadot-digest-15-feb-2024/6218/1

@eskimor
Copy link
Member

eskimor commented Feb 15, 2024

@BulatSaif What is nebula and where can I find it?

@BulatSaif
Copy link
Contributor

@BulatSaif What is nebula and where can I find it?

It is a DHT crawler: https://github.com/dennis-tra/nebula

alexggh added a commit that referenced this issue Feb 26, 2024
…ng (#3419)

Add more debug logs to understand if statement-distribution is in a bad
state, should be useful for debugging
#3314 on production
networks.

Additionally, increase the number of parallel requests should make,
since I notice that requests take around 100ms on kusama, and the 5
parallel request was picked mostly random, no reason why we can do more
than that.

---------

Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
Co-authored-by: ordian <write@reusable.software>
(cherry picked from commit b9c792a)
Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
EgorPopelyaev pushed a commit that referenced this issue Feb 26, 2024
…ng (#3476)

cherry picked from master commit
b9c792a

---------

Add more debug logs to understand if statement-distribution is in a bad
state, should be useful for debugging
#3314 on production
networks.

Additionally, increase the number of parallel requests should make,
since I notice that requests take around 100ms on kusama, and the 5
parallel request was picked mostly random, no reason why we can do more
than that.

Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
Co-authored-by: ordian <write@reusable.software>
@burdges
Copy link

burdges commented Mar 4, 2024

We should really drop outdated hosts before updates that required host updates. In principle, we could just record minimum required and recommended host versions on-chain, make validators warn when behind the recommended, and make them shut down when behind the required. This doesn't solve the immediate problem however, so..

We could add host version into their NPoS candidacy, so then other validators would drop them in NPoS. I doubt anything more complex would be warranted.

It's perfectly reasonable byzantine behavior to run an old host and lie about version numbers though, so probably not this whole problem.

@sandreim
Copy link
Contributor

sandreim commented Mar 4, 2024

This is fairly similar to what I had in mind and been discussing offline, but @bkchr didn't fully support it at the time. These ancient hosts clearly not getting rewards in current situation so they still have incentive to upgrade. The only downside in this specific case is that they can degrade the network performance, which sounds to me that we need some form of punishment to prevent it rather than rely only on positive rewards.

@bkchr
Copy link
Member

bkchr commented Mar 4, 2024

Yeah, generally I'm not full on board yet with reporting random versions to the runtime. I mean the node can just report whatever the runtime expects it to report. We would also need to standardize these versions as other node implementors would need to report the same version when they have implemented the same features etc.

I opened an issue for reworking the reward points distribution: polkadot-fellows/runtimes#214

If this isn't working, I'm open for what you both proposed here.

@JelliedOwl
Copy link

However you do implement this (if you ever do) please don't forget about us collators, running potentially pretty old versions on the Kusama side... :-). How old a version does the CI process test can still sync?

@burdges
Copy link

burdges commented Mar 4, 2024

the node can just report whatever the runtime expects it to report

Yes, but then if they change the code like this then it's their fault if they get slashed somehow.

We could easily design this so they must look at the code before reporting the false version, at which point they're doing more work than not upgrading. Imho, that's overkill since we're trying to help them really, and an unecessary development burden (each version has an associated random u64).

Arguably, we should do this on polakdot, but not immediately on kusama, since we want some miss behavior on kusama.

please don't forget about us collators

We're discussing validators mostly, but yes we do need to take host upgrades into consideration for collators too. It'll be off-chain messaging (XCMP) that really beings the chaos here.

alexggh added a commit to alexggh/runtimes that referenced this issue Mar 7, 2024
This will enable async-backing subsystems on polkadot, the relaychain
would work in legacy mode where only candiddates built on top of
pervious relay chain are activated.

- Had to bring an unrelated change as well `minimum_backing_votes`
  because that was the v6, we can't skip versions, the value of that
  configuration on polkadot is 2, so that's what is going to be used
  after this runtime update. Changes have been running on kusama, so I
  would.

Async backing subsytems is a major change in the way collator-protocol
and  the backing work, so there are still some unknowns if this is
completely bug free. It has been running on kusama for a month
already, but not without issues:
 - Validators that did not upgrade to compatible versions will not be
   able to participate in backing, so if enough of those are randomly
   picked that group won't be able to back anything. With
   backing_group_size = 5 and minimum_backing_votes = 2, 10% validator
   not upgraded that chance is about 2.5%.

 - Additionally, same un-upgraded groups won't be able to include the
   backing votes on chain when they author blocks, so 10% of the blocks
   won't back any candidates.

 - We are still not sure if item 2) from here paritytech/polkadot-sdk#3314 (comment)
   is caused by async backing, the observable issue is sometimes after restart/upgrade
   some validators are getting 0 rewards and apparently they are not
   backing anything.

Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
@tunkas
Copy link

tunkas commented Mar 11, 2024

Not sure if this will be helpful on this issue, but it might so I'm posting my logs.
It happened on my Kusama validator, in the first session I paravalidated after getting into the active set.
I was missing all the votes in that session so 45 minutes into the active set restart was performed. Not sure if this had any influence on resolving the matter since I still missed all the votes within that session.
Next session went without missing significat amount of votes.
Here are the logs for that session, and subsequent one.
I'm running on the latest version, of course.
Kusama error report.txt

@dcolley
Copy link
Contributor

dcolley commented Mar 12, 2024

https://apps.turboflakes.io/?chain=kusama#/validator/JKhBBSWkr8BJKh5eFBtRux4hsDq4sAxvvmMU426qUA9aqEQ

  1. What version you are running.
    1.8.0

  2. Logs around that time, the more logs you can provide the better.


│2024-03-11 18:56:54 💤 Idle (12 peers), best: #22250622 (0x70e8…48f8), finalized #22250619 (0x0168…d6c3), ⬇ 2.9MiB/s ⬆ 2.6MiB/s                  │
│2024-03-11 18:56:59 💤 Idle (12 peers), best: #22250622 (0x70e8…48f8), finalized #22250620 (0x1248…9c9b), ⬇ 3.7MiB/s ⬆ 3.9MiB/s                  │
│2024-03-11 18:57:00 ✨ Imported #22250623 (0x44c9…5815)                                                                                          │
│2024-03-11 18:57:04 💤 Idle (12 peers), best: #22250623 (0x44c9…5815), finalized #22250620 (0x1248…9c9b), ⬇ 2.8MiB/s ⬆ 3.5MiB/s                  │
│2024-03-11 18:57:06 ✨ Imported #22250624 (0xf82d…5620)                                                                                          │
│2024-03-11 18:57:06 ✨ Imported #22250624 (0x6339…6a9e)                                                                                          │
│2024-03-11 18:57:09 💤 Idle (12 peers), best: #22250624 (0xf82d…5620), finalized #22250621 (0x1429…942a), ⬇ 2.2MiB/s ⬆ 3.0MiB/s                  │
│2024-03-11 18:57:12 ✨ Imported #22250625 (0xd906…805a)                                                                                          │
│2024-03-11 18:57:14 💤 Idle (12 peers), best: #22250625 (0xd906…805a), finalized #22250622 (0x70e8…48f8), ⬇ 1.6MiB/s ⬆ 2.1MiB/s                  │
│2024-03-11 18:57:18 ✨ Imported #22250626 (0x3d86…c5d6)                                                                                          │
│2024-03-11 18:57:18 ♻️  Reorg on #22250626,0x3d86…c5d6 to #22250626,0xa7af…d809, common ancestor #22250625,0xd906…805a                            │
│2024-03-11 18:57:18 ✨ Imported #22250626 (0xa7af…d809)                                                                                          │
│2024-03-11 18:57:19 💤 Idle (10 peers), best: #22250626 (0xa7af…d809), finalized #22250623 (0x44c9…5815), ⬇ 2.3MiB/s ⬆ 2.9MiB/s                  │
│2024-03-11 18:57:24 ✨ Imported #22250627 (0xffd6…0d62)                                                                                          │
│2024-03-11 18:57:24 ✨ Imported #22250627 (0x6aaf…cbca)                                                                                          │
│2024-03-11 18:57:24 ✨ Imported #22250627 (0x129f…5e18)                                                                                          │
│2024-03-11 18:57:24 💤 Idle (10 peers), best: #22250627 (0xffd6…0d62), finalized #22250624 (0xf82d…5620), ⬇ 2.1MiB/s ⬆ 2.8MiB/s                  │
│2024-03-11 18:57:29 💤 Idle (11 peers), best: #22250627 (0xffd6…0d62), finalized #22250625 (0xd906…805a), ⬇ 1.6MiB/s ⬆ 2.1MiB/s                  │
│2024-03-11 18:57:30 ✨ Imported #22250628 (0x41d1…dd49)                                                                                          │
│2024-03-11 18:57:30 ♻️  Reorg on #22250628,0x41d1…dd49 to #22250628,0xeb99…6bb5, common ancestor #22250627,0xffd6…0d62                            │
│2024-03-11 18:57:30 ✨ Imported #22250628 (0xeb99…6bb5)                                                                                          │
│2024-03-11 18:57:30 ✨ Imported #22250628 (0x1091…17dc)                                                                                          │
│2024-03-11 18:57:30 ✨ Imported #22250628 (0xd5e2…0382)                                                                                          │
│2024-03-11 18:57:34 💤 Idle (12 peers), best: #22250628 (0xeb99…6bb5), finalized #22250625 (0xd906…805a), ⬇ 2.1MiB/s ⬆ 3.5MiB/s     
...
2024-03-11 18:57:48 ✨ Imported #22250631 (0xf38f…2bce)                                                                                          │
│2024-03-11 18:57:48 Cluster has too many pending statements, something wrong with our connection to our group peers                              │
│                                                                                                                                                 │
│Restart might be needed if validator gets 0 backing rewards for more than 3-4 consecutive sessions pending_statements={ValidatorIndex(91): {(Vali│
datorIndex(91), CompactStatement::Seconded(0x5e3efe877a0496964281fe2fb92125aed001360d96d6521e0e0c9abc66ecdb6c))}, ValidatorIndex(92): {(ValidatorI│
ndex(91), CompactStatement::Seconded(0x5e3efe877a0496964281fe2fb92125aed001360d96d6521e0e0c9abc66ecdb6c))}, ValidatorIndex(90): {(ValidatorIndex(9│
1), CompactStatement::Seconded(0x5e3efe877a0496964281fe2fb92125aed001360d96d6521e0e0c9abc66ecdb6c))}} parent_hash=0xeb997bbb60ca75e90935d59c1b142f│
a57aac5402cd28b577fa2211cd7a026bb5                                                                                                                │
│2024-03-11 18:57:48 Cluster has too many pending statements, something wrong with our connection to our group peers                              │
│                                                                                                                                                 │
│Restart might be needed if validator gets 0 backing rewards for more than 3-4 consecutive sessions pending_statements={ValidatorIndex(90): {(Vali│
datorIndex(91), CompactStatement::Seconded(0x54ec5323d263a93d6fc973d7b5d7a426649ce7d3e87d33e447e09089f5e9b2bc))}, ValidatorIndex(91): {(ValidatorI│
ndex(91), CompactStatement::Seconded(0x54ec5323d263a93d6fc973d7b5d7a426649ce7d3e87d33e447e09089f5e9b2bc))}, ValidatorIndex(92): {(ValidatorIndex(9│
1), CompactStatement::Seconded(0x54ec5323d263a93d6fc973d7b5d7a426649ce7d3e87d33e447e09089f5e9b2bc))}} parent_hash=0x1091044e68e1e359c1518de03a776d│
0f14b81f7aed347fbc0fca7a49dfed17dc                                                                                                                │
│2024-03-11 18:57:48 Cluster has too many pending statements, something wrong with our connection to our group peers                              │
│                                                                                                                                                 │
│Restart might be needed if validator gets 0 backing rewards for more than 3-4 consecutive sessions pending_statements={ValidatorIndex(91): {(Vali│
datorIndex(91), CompactStatement::Seconded(0x43bd5e82bdc8ce77c77d3e68f983719bb94dd4e2e6b98af5d6e42523d9a1de5a))}, ValidatorIndex(92): {(ValidatorI│
ndex(91), CompactStatement::Seconded(0x43bd5e82bdc8ce77c77d3e68f983719bb94dd4e2e6b98af5d6e42523d9a1de5a))}, ValidatorIndex(90): {(ValidatorIndex(9│
1), CompactStatement::Seconded(0x43bd5e82bdc8ce77c77d3e68f983719bb94dd4e2e6b98af5d6e42523d9a1de5a))}} parent_hash=0xd5e2b78489e5eda3fabe2845417bce│
90215b94a7685d1419af1a75b6eacb0382                                                                                                                │
│2024-03-11 18:57:48 Cluster has too many pending statements, something wrong with our connection to our group peers                              │
│                                                                                                                                                 │
│Restart might be needed if validator gets 0 backing rewards for more than 3-4 consecutive sessions pending_statements={ValidatorIndex(90): {(Vali│
datorIndex(91), CompactStatement::Seconded(0x927a5f537c174e8d4790b7e2708a242bebba6572db71424dbb4ace9511b8d1c3))}, ValidatorIndex(91): {(ValidatorI│
ndex(91), CompactStatement::Seconded(0x927a5f537c174e8d4790b7e2708a242bebba6572db71424dbb4ace9511b8d1c3))}, ValidatorIndex(92): {(ValidatorIndex(9│
1), CompactStatement::Seconded(0x927a5f537c174e8d4790b7e2708a242bebba6572db71424dbb4ace9511b8d1c3))}} parent_hash=0x41d1cf34c40cdda18adf75b22ebf37│
bad22b623fad674d0d050d3531865ddd49                                                                                                                │
│2024-03-11 18:57:49 💤 Idle (12 peers), best: #22250631 (0xf38f…2bce), finalized #22250628 (0x1091…17dc), ⬇ 3.4MiB/s ⬆ 3.4MiB/s                  │
│2024-03-11 18:57:54 ✨ Imported #222506
  1. Any event around that time, e.g we upgraded/restart the validator, we observed validator entered the active set, anything that you think might be relevant.
    No intervention

  2. Anything relevant/special about your network setup
    Running in docker

  3. OS, does the validator passes the hardware requirement checks.
    yes

  4. Any action take to get the validator out of this state or did it recover by itself.
    recovered by itself, next session was A+

@eskimor
Copy link
Member

eskimor commented Mar 12, 2024

Running in docker

I remember we had issues with Docker in the past. Are other affected validators also running on Docker?

@dcolley
Copy link
Contributor

dcolley commented Mar 12, 2024

These are the startup logs:

│2024-03-04 20:09:38 This chain is not in any way                                                                                                 ┤
│2024-03-04 20:09:38       endorsed by the                                                                                                        ┤
│2024-03-04 20:09:38      KUSAMA FOUNDATION                                                                                                       ┤
│2024-03-04 20:09:38 ----------------------------                                                                                                 ┤
│2024-03-04 20:09:38 Parity Polkadot                                                                                                              ┤
│2024-03-04 20:09:38 ✌️  version 1.8.0-ec7817e5adc                                                                                                 ┤
│2024-03-04 20:09:38 ❤️  by Parity Technologies <admin@parity.io>, 2017-2024                                                                       ┤
│2024-03-04 20:09:38 📋 Chain specification: Kusama                                                                                               ┤
│2024-03-04 20:09:38 🏷  Node name: METASPAN3 (ALSO TRY POOL #50)                                                                                  ┤
│2024-03-04 20:09:38 👤 Role: AUTHORITY                                                                                                           ┤
│2024-03-04 20:09:38 💾 Database: ParityDb at /data/chains/ksmcc3/paritydb/full                                                                   ┤
│2024-03-04 20:09:40 🚀 Using prepare-worker binary at: "/usr/lib/polkadot/polkadot-prepare-worker"                                               ┤
│2024-03-04 20:09:40 🚀 Using execute-worker binary at: "/usr/lib/polkadot/polkadot-execute-worker"                                               ┤
│2024-03-04 20:09:40 Can't use warp sync mode with a partially synced database. Reverting to full sync mode.                                      ┤
│2024-03-04 20:09:40 🏷  Local node identity is: 12D3KooWPVSft536jRzKSGdiYLx4FQ2XekXrjkNjaSfxpxDCM1LW                                              ┤
│2024-03-04 20:09:40 Warp sync failed. Continuing with full sync.                                                                                 ┤
│2024-03-04 20:09:40 💻 Operating system: linux                                                                                                   ┤
│2024-03-04 20:09:40 💻 CPU architecture: x86_64                                                                                                  ┤
│2024-03-04 20:09:40 💻 Target environment: gnu                                                                                                   ┤
│2024-03-04 20:09:40 💻 CPU: 12th Gen Intel(R) Core(TM) i9-12900K                                                                                 ┤
│2024-03-04 20:09:40 💻 CPU cores: 16                                                                                                             ┤
│2024-03-04 20:09:40 💻 Memory: 128584MB                                                                                                          ┤
│2024-03-04 20:09:40 💻 Kernel: 5.15.0-89-generic                                                                                                 ┤
│2024-03-04 20:09:40 💻 Linux distribution: Ubuntu 22.04.3 LTS                                                                                    ┤
│2024-03-04 20:09:40 💻 Virtual machine: no                                                                                                       ┤
│2024-03-04 20:09:40 📦 Highest known block at #22150878                                                                                          ┤
│2024-03-04 20:09:40 〽️ Prometheus exporter started at 0.0.0.0:9615                                                                               ┤
│2024-03-04 20:09:40 Running JSON-RPC server: addr=127.0.0.1:9944, allowed origins=["http://localhost:*", "http://127.0.0.1:*", "https://localhost┤
│2024-03-04 20:09:40 👶 Starting BABE Authorship worker                                                                                           ┤
│2024-03-04 20:09:40 🥩 BEEFY gadget waiting for BEEFY pallet to become available...                                                              ┤
│2024-03-04 20:09:40 🚨 Some security issues have been detected.                                                                                  ┤
│Running validation of malicious PVF code has a higher risk of compromising this machine.                                                         ┤
│  - Optional: Cannot unshare user namespace and change root, which are Linux-specific kernel security features: not available: unshare user and m┤
│  - Optional: Cannot call clone with all sandboxing flags, a Linux-specific kernel security features: not available: could not clone, errno: EPER┤
│2024-03-04 20:09:40 👮‍♀️ Running in Secure Validator Mode. It is highly recommended that you operate according to our security guidelines.        ┤
│More information: https://wiki.polkadot.network/docs/maintain-guides-secure-validator#secure-validator-mode                                      ┤
│2024-03-04 20:09:40 Sending fatal alert BadCertificate                                                                                           ┤
│2024-03-04 20:09:40 🔍 Discovered new external address for our node: /ip4/195.144.22.130/tcp/30333/p2p/12D3KooWPVSft536jRzKSGdiYLx4FQ2XekXrjkNjaS┤
│2024-03-04 20:09:41 ✨ Imported #22150879 (0xa7de…1dc5)

@tunkas
Copy link

tunkas commented Mar 12, 2024

I remember we had issues with Docker in the past. Are other affected validators also running on Docker?

I'm not running on Docker 🤷🏻‍♂️

@eskimor
Copy link
Member

eskimor commented Mar 12, 2024

One interesting metric would be: "polkadot_parachain_peer_count" for the validation peerset. To be queried like this:

polkadot_parachain_peer_count{protocol=~"validation/.*"}

How do these numbers behave with regards to era points? Once you become a para validator this should be around 1k.

@xxbbxb
Copy link

xxbbxb commented Mar 12, 2024

yes I can confirm we do experience significant drop of connected peers after updating/restarting the node until that moment we rotate session keys
image

@eskimor
Copy link
Member

eskimor commented Mar 12, 2024

We found the cause and are working on a fix. What operators can do right away: Make sure the node can persist all data, otherwise it will generate a new PeerId on each restart and this is causing issues.

@Polkadot-Forum
Copy link

This issue has been mentioned on Polkadot Forum. There might be relevant details there:

https://forum.polkadot.network/t/async-backing-development-updates/6176/5

@alexggh
Copy link
Contributor

alexggh commented Mar 13, 2024

Conclusions

There are 3 failure-modes going-on kusama, all create validators with 0 backing rewards:

  1. Validators not updated to versions supporting async-backing, they will get 0 points until they upgrade -> Nothing to be done here.

  2. Validators that change their Peer-ID during restart because it is not persisted, fix should come with Make polkadot behave correctly if node changes peer-id at restart #3673, till then there are ways to workaround this problem by persistenting the PeerID. They will get 0 points after restart until workaround is applied.

  3. Validators reporting that the first session when they enter the active set they get 0 backing points, reported several times in this ticket and here Node massively misses paravalidation votes. #3613. This is not as problematic as the other, because validators recover next session, but a problem nonetheless.

Also thank you @xxbbxb for providing logs and helping me root cause item number 2.

Next priorities in order:

  • Root-cause what is wrong with validators entering the active set.
  • Provide fixes for validators changing their PeerID during restart.

fellowship-merge-bot bot added a commit to polkadot-fellows/runtimes that referenced this issue Mar 13, 2024
This will enable async-backing subsystems on polkadot, the relaychain
would work in legacy mode where only candiddates built on top of
pervious relay chain are activated.

## Notes
- Had to bring an unrelated change as well `minimum_backing_votes`
because that was the v6, we can't skip versions, the value of that
configuration on polkadot is 2, so that's what is going to be used after
this runtime update. Changes have been running on kusama.
- `disabled_validators`, `node_features`, `approval_voting_params` are
not related with async-backing, but given they are low risk, we decided
to include them as well, see comments.

## Known risks

Async backing subsytems is a major change in the way collator-protocol
and the backing work, so there are still some unknowns if this is
completely bug free. It has been running on kusama for a month already,
but not without issues:
- Validators that did not upgrade to compatible versions will not be
able to participate in backing, so if enough of those are randomly
picked that group won't be able to back anything. With
backing_group_size = 5 and minimum_backing_votes = 2, 10% validator not
upgraded, that chance is about 2.5%.

- Additionally, same un-upgraded groups won't be able to include the
backing votes on chain when they author blocks, so 10% of the blocks
won't back any candidates.

- We are still not sure if item 2) from here
paritytech/polkadot-sdk#3314 (comment)
is caused by async backing, the observable issue is sometimes after
restart/upgrade some validators are getting 0 rewards and apparently
they are not backing anything.


<!-- Remember that you can run `/merge` to enable auto-merge in the PR
-->

<!-- Remember to modify the changelog. If you don't need to modify it,
you can check the following box.
Instead, if you have already modified it, simply delete the following
line. -->

---------

Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
Co-authored-by: fellowship-merge-bot[bot] <151052383+fellowship-merge-bot[bot]@users.noreply.github.com>
@alexggh
Copy link
Contributor

alexggh commented Mar 17, 2024

Root-cause what is wrong with validators entering the active set.

Ok, I think I figure it out why this happens for new nodes entering the active set, long story-short, the node will start advertising its AuthorthyId on the DHT only after it becomes active, so the other nodes won't be able to detect it at the beginning of the session. More details about this in: https://github.com/paritytech/polkadot-sdk/pull/3722/files

github-merge-queue bot pushed a commit that referenced this issue Apr 15, 2024
As discovered during investigation of
#3314 and
#3673 there are active
validators which accidentally might change their network key during
restart, that's not a safe operation when you are in the active set
because of distributed nature of DHT, so the old records would still
exist in the network until they expire 36h, so unless they have a good
reason validators should avoid changing their key when they restart
their nodes.

There is an effort in parallel to improve this situation
#3786, but those changes
are way more intrusive and will need more rigorous testing, additionally
they will reduce the time to less than 36h, but the propagation won't be
instant anyway, so not changing your network during restart should be
the safest way to run your node, unless you have a really good reason to
change it.

## Proposal
1. Do not auto-generate the network if the network file does not exist
in the provided path. Nodes where the key file does not exist will get
the following error:
```
Error: 
   0: Starting an authorithy without network key in /home/alexggh/.local/share/polkadot/chains/ksmcc3/network/secret_ed25519.
      
       This is not a safe operation because the old identity still lives in the dht for 36 hours.
      
       Because of it your node might suffer from not being properly connected to other nodes for validation purposes.
      
       If it is the first time running your node you could use one of the following methods.
      
       1. Pass --unsafe-force-node-key-generation and make sure you remove it for subsequent node restarts
      
       2. Separetly generate the key with: polkadot key generate-node-key --file <YOUR_PATH_TO_NODE_KEY>
```

2. Add an explicit parameters for nodes that do want to change their
network despite the warnings or if they run the node for the first time.
`--unsafe-force-node-key-generation`

3. For `polkadot key generate-node-key` add two new mutually exclusive
parameters `base_path` and `default_base_path` to help with the key
generation in the same path the polkadot main command would expect it.
 
4. Modify the installation scripts to auto-generate a key in default
path if one was not present already there, this should help with making
the executable work out of the box after an instalation.

## Notes

Nodes that do not have already the key persisted will fail to start
after this change, however I do consider that better than the current
situation where they start but they silently hide that they might not be
properly connected to their peers.

## TODO
- [x] Make sure only nodes that are authorities on producation chains
will be affected by this restrictions.
- [x] Proper PRDOC, to make sure node operators are aware this is
coming.

---------

Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
Co-authored-by: Dmitry Markin <dmitry@markin.tech>
Co-authored-by: s0me0ne-unkn0wn <48632512+s0me0ne-unkn0wn@users.noreply.github.com>
Co-authored-by: Bastian Köcher <git@kchr.de>
@alexggh
Copy link
Contributor

alexggh commented Apr 29, 2024

Closing this is not happening anymore and the remaining open action item is being tracked in #3673

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
I2-bug The node fails to follow expected behavior. I10-unconfirmed Issue might be valid, but it's not yet known.
Projects
None yet
Development

No branches or pull requests